
Transfer Learning for Biomedical Named Entity Recognition with BioBERT

Submitted in partial fulfillment for the degree of Master of Science

Anthi Symeonidou

12296082

Master Information Studies: Data Science

Faculty of Science, University of Amsterdam

2019-06-27

Internal Supervisor: Prof. Paul Groth (UvA, INDE Lab)
External Supervisor: Dr. Viachaslau Sazonau (Elsevier BV)


Transfer Learning for Biomedical Named Entity Recognition with BioBERT

Anthi Symeonidou

an_syme@hotmail.com
University of Amsterdam
Amsterdam, The Netherlands

ABSTRACT

We apply a transfer learning approach to biomedical named entity recognition and compare it with traditional approaches (dictionary, CRF, BiLSTM). Specifically, we build models for adverse drug reaction recognition on three datasets. We fine-tune a pre-trained transformer model, BioBERT, on these datasets and observe absolute F1-score improvements of 6.93, 10.46 and 13.31 points. This shows that, with a relatively small amount of annotated data, transfer learning can help in specialized information extraction tasks.

KEYWORDS

Named Entity Recognition, Drug Safety, Text Mining, Adverse Drug Reaction, BIO Tagging, Dictionary-based Approach, CRFs, BiLSTM, Transfer Learning, BioBERT

1 INTRODUCTION

Biomedical literature is growing tremendously, and text analysis is of great importance. The automation of text analysis is accomplished by text mining techniques [6, 22]. These automated methods aim to obtain useful information through tasks such as entity identification and Information Extraction (IE). In the Pharmaceutical and Drug Safety domain, a lot of money and time is invested in drug discovery and health treatment every year. Adverse Drug Reactions (ADRs) are among the most predominant causes of death: ADRs cause a significant number of deaths worldwide, and billions of dollars are spent yearly to treat people who had an ADR from a prescribed drug [19]. ADR recognition is a challenging IE task because of the importance of context and multi-token phenomena such as discontinuity and overlaps. In recent years, clinical records, spontaneous reports, electronic health records and biomedical literature have become invaluable sources which can be used to facilitate IE and automatically identify drugs and ADRs [2, 16, 21, 37]. This can lead to the development of automated systems that reduce manual reading and support pharmacovigilance and pharmaceutical research [10].

The two main subtasks of IE from Drug Safety text are Named Entity Recognition (NER), which is used to identify entities such as proteins, drugs and diseases, and Relation Extraction (RE), which identifies and extracts relations between entities described in an annotated text (e.g. drug-ADR). The annotations are manually curated by domain experts in a two-step process: entity identification and identification of relations between entities [14]. Several annotated gold standards have been released which can be used to evaluate these systems [14, 17].

Simple techniques have been applied to NER, such as dictionary-based approaches, where a string-matching procedure is used to identify the entities in the text [6, 24]. Feature-based methods such as Conditional Random Fields (CRFs) have also been used for biomedical and chemical entity recognition, where not only the word itself but also the contextual information is taken into account; they dominated until the emergence of more sophisticated and accurate machine-learning approaches in the last few years [3, 9, 34, 36]. However, these kinds of machine-learning techniques require a substantial amount of training data, which is expensive and time-consuming to produce. Due to the lack of such annotated training corpora, state-of-the-art machine learning approaches that do not require much annotated data merit attention.

Transfer Learning is a state-of-the-art machine learning method where a model developed for one task is exploited to improve generalization on another task [25]. This method is suitable when there are not enough data to train a deep learning model [33]. Giorgi and Bader have found that a transfer learning approach is beneficial for biomedical NER, improving upon state-of-the-art results, especially for datasets with fewer than 6000 labels [11]. A transformer model called BERT (Bidirectional Encoder Representations from Transformers) was published recently and achieved outstanding performance on 11 natural language processing (NLP) tasks, including NER. It is a pre-trained language model suitable for transfer learning [7]. A promising model, BioBERT, which has a similar architecture to BERT and was further trained over 18B biomedical words (from PubMed and PMC), achieved high performance in biomedical NER on many benchmarks.

The hypothesis of the current paper is that transfer learning can outperform traditional methods in solving the NER task in biomedical text using only a small amount of annotated data. To examine our hypothesis, we formulated the following main research question and three subquestions:

Research question: Can Transfer Learning improve the quality of biomedical NER in drug safety text as compared to traditional methods?

Subquestions

• What are the traditional methods for biomedical NER in drug safety text?

• What results do these methods achieve?

• What is the performance of pre-trained models, such as BioBERT, compared to traditional methods?

In this paper, we investigate whether transfer learning can outperform traditional approaches on three different datasets. The pre-trained BioBERT model was fine-tuned on three datasets (the TAC2017, ADE and Elsevier's gold set corpora) and compared to a dictionary-based method, CRF, and BiLSTM. Our main contribution is that we show that a transfer learning method based on BioBERT can achieve considerably higher performance in recognizing ADRs than


traditional methods. Our results suggest that transfer learning based on transformer architectures is a promising approach to addressing the lack of training data in biomedical NER.

2 RELATED WORK

Rule-based, dictionary-based and co-occurrence statistical approaches are traditional methods used to identify and extract entities and relations between drugs and ADRs and protein-protein interactions [4, 15, 26]. In recent years, machine learning and Natural Language Processing (NLP) approaches have been developed to facilitate biomedical NER; these can be distinguished into kernel-based and feature-based approaches (Support Vector Machines and Conditional Random Fields) [8, 13, 32]. More sophisticated deep learning approaches, such as Convolutional Neural Networks and LSTM models, have emerged and show high potential for accomplishing NER tasks [9, 36].

The main bottleneck in using deep learning models for biomedical NER is the lack of annotated data to train the model. The manual annotation of text documents is a time-consuming and error-prone task, as it requires high intellectual effort from domain experts, such as biologists or pharmacists [5]. Transfer learning approaches could be applied to overcome this limitation [12, 33]. A transfer learning approach in which both character- and word-level representations are shared between different datasets has been compared with state-of-the-art BioNER systems and baseline neural network models, outperforming them on 15 benchmark BioNER datasets [35]. Another recent study has shown the benefit of a transfer learning approach for biomedical NER, particularly in cases with fewer than 6000 labeled examples [11].

Although these studies reveal the benefit of transfer learning for biomedical NER, this field is not yet fully explored. Further efforts would clarify the effectiveness of transferring knowledge from a source task to a target task and help develop a robust system that can be applied to different domains.

3 METHODOLOGY

The approaches, datasets and evaluation metrics that were used are described in this section, together with the tagging scheme employed and the exploratory data analysis.

3.1 Models

Four different types of approaches were examined on the biomedical NER task. A dictionary-based model, CRFs and a BiLSTM model were the traditional approaches used as baselines and compared with a transformer model, BioBERT.

3.1.1 Dictionary-Based Model. In this approach, the Aho-Corasick string-searching algorithm is applied to NER. Documents are fed as input to the model. Given an entity type, a dictionary of keywords is used to build a finite state machine, which is then used to search the text [1].

3.1.2 CRFs. Conditional Random Fields is a probabilistic graphical model which predicts a sequence of labels for a sequence of input samples (sentences in our case). The output is a probability between 0 and 1 which denotes how well the model predicts and assigns the correct labels to the entities based on a set of features and weights. For a given input sequence o and a sequence of states s corresponding to the labels assigned to the words in o, the conditional probability is given by the following formula:

P(s|o) = \frac{1}{Z_o} \exp\left( \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(s_{i-1}, s_i, o, i) \right)   (1)

where Z_o is a normalization factor over all state sequences, f_j(s_{i-1}, s_i, o, i) is a feature function which expresses some sequence characteristic that the data point represents, and \lambda_j is a learned weight for each such feature function [18].

3.1.3 BiLSTM. A deep learning approach which uses a bidirectional LSTM model. An LSTM is a type of recurrent neural network consisting of cells that keep track of dependencies between the elements of the input sequence through the gates they contain. The bidirectional variant captures contextual information from both text directions [31].
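To make the architecture concrete, below is a minimal Keras sketch of a BiLSTM tagger. The study itself used the anaGo library (Section 4.2.3); the layer sizes follow Table 7, but the exact architecture, vocabulary size and tokenization here are assumptions.

```python
# Minimal BiLSTM sequence-tagger sketch (Keras); an illustration, not the
# anaGo model used in the thesis. The vocabulary size is an assumed placeholder.
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed vocabulary size
NUM_TAGS = 3         # B, I, O
MAX_LEN = 130        # sentences longer than 130 words were filtered out

def build_bilstm_tagger():
    inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, 128, mask_zero=True)(inputs)          # word embeddings
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)   # forward + backward states
    x = layers.Dropout(0.5)(x)
    outputs = layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax"))(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

model = build_bilstm_tagger()
model.summary()
```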

3.1.4 BioBERT. BioBERT is a pre-trained attention model over a large-scale biomedical corpus, whose architecture is based on BERT and consists only of encoder layers. The sequences go through the encoder layers, and predictions are made by a classification layer on top of the encoder output. BERT is a bidirectionally pre-trained model which has achieved outstanding performance on several language tasks. It comes with a tokenizer (WordPiece) and a fixed vocabulary. The input representation is the sum of token embeddings (the token itself), segment embeddings (which sentence the token belongs to) and position embeddings (the token's position in the sequence). This model, with minimal task-specific modifications (fine-tuning), was applied to the biomedical NER task. A NER model is trained by feeding the output vector of each token into a classification layer which predicts the label. A few thousand annotated examples are required for fine-tuning [7, 20].
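As a sketch of this setup, the snippet below loads a BERT-style encoder with a token-classification head and runs one fine-tuning step. The thesis used the official BioBERT TensorFlow implementation; the HuggingFace API and the checkpoint name below are assumptions used for illustration only.

```python
# Sketch: encoder output vectors feed a per-token classification layer (B/I/O).
# The thesis used the official BioBERT TensorFlow code; this HuggingFace-based
# version and the checkpoint name are assumptions.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

CHECKPOINT = "dmis-lab/biobert-base-cased-v1.1"   # assumed BioBERT checkpoint
tokenizer = BertTokenizerFast.from_pretrained(CHECKPOINT)
model = BertForTokenClassification.from_pretrained(CHECKPOINT, num_labels=3)  # B, I, O

# One illustrative training step on a toy sentence labeled all "O".
enc = tokenizer("Nausea and headache were reported.", return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])       # label id 0 stands for "O" here
out = model(**enc, labels=labels)                 # cross-entropy loss over token labels
out.loss.backward()                               # fine-tuning updates the full encoder
```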

3.2 Data

Three different biomedical annotated corpora were used to examine the NER task and test the model performance.

TAC2017: The data are spontaneous adverse event reports submitted to the FDA Adverse Event Reporting System (FAERS) by drug manufacturers in the Structured Product Labeling (SPL) format from 2009 [29]. These XML documents are converted to the brat standoff format1. For each text document, there is a corresponding annotation file (i.e. text1.txt and text1.ann). There are around 10,000 manually annotated sentences. All the annotation files contain one annotation per line, where an ID is given and separated from the rest of the annotation by a single TAB character.

ADE corpus: ADE is an open-source corpus which consists of information extracted from PubMed and contains annotated entities (ADRs, drugs, doses) as well as drug-ADR and drug-dose relations. The ADR annotations were used in the current research. The documents contain information such as PubMed-ID, Sentence, Adverse-Effect, and Begin and End offsets for all entity types at 'document level'. This information is organized in a column format separated with a pipe delimiter [14].

1 http://brat.nlplab.org/standoff.html


Elsevier gold set: These data are XML text files which come from DAILYMED2 in SPL format and contain information on human drugs from 2015 to the present. They are similar to the TAC2017 SPL documents and follow the same annotation guidelines.

3.3 Exploratory Data Analysis

The first step of the analysis examined the characteristics of the different datasets. Firstly, all the different entity types per dataset were identified; they are shown in Figures 6, 7 and 8 in the Appendix. Briefly, the TAC2017 corpus has six entity types: animal, negation, drug class, factor, sensitivity and ADR; the ADE corpus has three entity types: dosage, drug and ADR; and the Elsevier's gold set has nine entity types: animal, severity, drug combination, drug administration route, drug class, dosage, drug, ADR and disease. We focused only on ADRs, since they are of great importance, being one of the biggest causes of mortality worldwide. Additionally, ADRs are one of the predominant labels, representing 9-14% of the total number of words in all datasets. We further examined the sentence length distribution for each dataset, revealing a normal distribution in all datasets (see Figures 9, 10 and 11 in the Appendix). As shown in Table 1, the mean sentence length in all datasets is comparable, at around 15 words per sentence.

Table 1: Sentence Word Length

Dataset               Mean    Standard deviation
TAC2017               15.88   14.2
ADE                   14.95   8.89
Elsevier's gold set   14.65   9.61

The number of annotated sentences and the number of ADR entities for all datasets are shown in Table 2. In terms of data size, the TAC2017 corpus has the most sentences and labeled ADRs, while the Elsevier's gold set corpus has the fewest.

3.4 BIO Tagging Scheme

One of the standard tagging schemes in NER tasks is the BIO format, which encodes the (B)eginning, (I)nside and (O)utside position of the token in the entity (e.g. for an entity type X, B_X and I_X) [28]. Each token of the annotated documents was assigned one of the above-mentioned tags. According to the annotation guidelines, the predicted tags for the ADE and TAC2017 datasets were B_AdverseReaction, I_AdverseReaction and O, while for the Elsevier's gold set the predicted tags were B_Adverse_drug_reaction, I_Adverse_drug_reaction and O. When fine-tuning the BioBERT model, these tags were converted to "B",

2https://dailymed.nlm.nih.gov/dailymed/

Table 2: Biomedical Datasets

                       TAC2017   ADE     Elsevier's gold set
Nr. of sentences       7,934     4,271   3,794
Nr. of ADR entities    13,795    5,742   5,328

"I" and "O" to be supported by the algorithm. These BIO tags are assigned to the words based on their known offsets are available in the annotation files. An example is given in Figure 1.

Figure 1: An annotated example. The plain text of a sentence and the corresponding marked entities are displayed. For better interpretation of the BIO tagging scheme, the corresponding tokens, based on the Regexp Tokenizer, and the related "B", "I" and "O" labels are shown afterwards.
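To make the offset-to-tag conversion concrete, the sketch below assigns BIO tags to tokens from character-offset annotations. The tokenization pattern and helper names are illustrative assumptions, not the exact code used in the thesis.

```python
# Illustrative sketch: derive BIO tags from character-offset annotations.
# The regular expression stands in for the RegexpTokenizer pattern, which the
# thesis does not specify.
import re

def bio_tags(sentence, spans):
    """spans: list of (start, end) character offsets of ADR entities."""
    tagged = []
    for match in re.finditer(r"\w+|[^\w\s]", sentence):
        tag = "O"
        for start, end in spans:
            if match.start() >= start and match.end() <= end:
                tag = "B" if match.start() == start else "I"
                break
        tagged.append((match.group(), tag))
    return tagged

# "JC virus infection" is annotated as an ADR at character offsets (22, 40).
print(bio_tags("The patient developed JC virus infection.", [(22, 40)]))
# [('The','O'), ('patient','O'), ('developed','O'), ('JC','B'), ('virus','I'), ('infection','I'), ('.','O')]
```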

3.5 Evaluation

The model performance was assessed based on both evaluation metrics and model training runtime. A descriptive explanation about the metrics calculation is given below.

3.5.1 Evaluation metrics. The typical evaluation metrics for NER tasks are precision, recall and F1-score. When the predicted named entities are going to be used for downstream tasks, such as RE, it is crucial to have reliable named entity predictions. Thus, it is more useful to evaluate the system based only on exact matches of the full entity, namely exact-match evaluation. In particular, when an entity consists of more than one word and "B" and "I" tags have been assigned to them, the entity is considered correctly predicted only when all of these tags have been predicted correctly. Entities that are marked as "O" are not taken into account. In this paper, an exact-match entity-level evaluation method has been applied. Based on the example in Figure 1, the ADR "JC virus infection" counts as a true positive only if all tags are predicted correctly (i.e. for the tokens "JC", "virus", "infection" the correct tags should be "B_AdverseReaction", "I_AdverseReaction", "I_AdverseReaction"). The metrics were calculated based on the following equations:

Precision = \frac{tp}{tp + fp}   (2)

Recall = \frac{tp}{tp + fn}   (3)

F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}   (4)


Table 3: Confusion matrix for binary classification (entity-level)

               Predicted "ADR"   Predicted "O"
Actual "ADR"   tp                fn
Actual "O"     fp                tn

Table 4: Processed Data

                                 TAC2017   ADE      Elsevier's gold set
Nr. of sentences                 7,934     4,271    3,794
Nr. of sentences < 130 words     7,408     4,271    3,675
Nr. of tokens                    207,241   85,832   99,054
Nr. of tokens labeled as ADRs    16,469    12,226   7,123

where tp = true positives, fp = false positives, fn = false negatives (see Table 3). The F1-score is the harmonic mean of precision and recall and is a suitable metric for imbalanced data. Furthermore, token-level evaluation metrics are also reported in order to examine how good the models are at predicting the different tags ("B" and "I") of an entity.
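The sketch below illustrates this exact-match entity-level scoring over BIO sequences; libraries such as seqeval implement equivalent logic, and the helper names here are illustrative.

```python
# Minimal sketch of exact-match entity-level precision/recall/F1 over BIO tags;
# an illustration of the evaluation described above, not the thesis code.
def entities(tags):
    """Return the set of (start, end) spans of complete B/I entities."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes a trailing entity
        if tag == "B":
            if start is not None:
                spans.add((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                spans.add((start, i))
            start = None
    return spans

def entity_scores(gold, pred):
    g, p = entities(gold), entities(pred)
    tp = len(g & p)                                 # exact span matches only
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(entity_scores(["B", "I", "I", "O"], ["B", "I", "O", "O"]))  # partial match -> 0.0 F1
```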

3.5.2 Model Training Runtime. An important aspect of a successful algorithm is the training runtime. For this reason, the required training runtime for the various models was measured using the time library, and relevant comparisons were made.

4 EXPERIMENTS

In this section, the data preprocessing is described, followed by the experimental setup for all traditional methods and the transfer learning approach.

4.1 Data preparation

The same data preprocessing was applied for all approaches and all datasets. As described in Section 3.2, all datasets consist of a collection of documents with a plain text file and an annotation file. The sentences and the corresponding ADR labels were extracted. Due to computational constraints, only sentences with fewer than 130 words were selected. This filtered out between 0-9% of the sentences, depending on the dataset. Duplicate sentences were removed, and overlapping entities were resolved by retaining only the first one. The number of sentences, words and labeled entities of all datasets is shown in Table 4. For the three traditional methods, the text was tokenized using the RegexpTokenizer from the NLTK library3, while for BioBERT we used the WordPiece Tokenizer. We used 5-fold cross validation to randomly split the processed data and evaluate the models.
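A minimal sketch of this preprocessing and the 5-fold split is given below; the tokenization pattern is an assumed stand-in, and the dataset-specific parsing is omitted.

```python
# Sketch of the filtering, deduplication, tokenization and 5-fold split
# described above; the regular expression is an assumption, not the exact pattern used.
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import KFold

tokenizer = RegexpTokenizer(r"\w+|[^\w\s]")

def preprocess(sentences):
    """Keep unique sentences shorter than 130 words and tokenize them."""
    seen, processed = set(), []
    for sent in sentences:
        tokens = tokenizer.tokenize(sent)
        if len(tokens) < 130 and sent not in seen:
            seen.add(sent)
            processed.append(tokens)
    return processed

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
# for train_idx, test_idx in kfold.split(processed_sentences):
#     train on the train fold, evaluate on the held-out fold
```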

4.2 Traditional Models

Different algorithms have been examined in solving this Named Entity Recognition task.

3https://www.nltk.org

4.2.1 Dictionary-based approach. Initially, a dictionary-based approach was applied with a simple look-up of ADRs in all datasets. The dictionary for each dataset comprises the case styles shown in Table 5; a sketch of this look-up is given after the table.

Table 5: Case Styles in Dictionary-based Approach

Case Style
ALL UPPERCASE
all lowercase
First letter uppercase
First Letter Of Each Word Uppercase
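Below is a minimal sketch of such a look-up with the Aho-Corasick algorithm, using the pyahocorasick package; the package choice and the toy dictionary are assumptions, as the thesis does not name a specific implementation.

```python
# Sketch of the dictionary look-up with Aho-Corasick string matching.
# pyahocorasick is an assumed implementation choice; the ADR terms are toy data.
import ahocorasick

def build_automaton(adr_terms):
    automaton = ahocorasick.Automaton()
    for term in adr_terms:
        # index the case styles listed in Table 5
        for variant in {term.upper(), term.lower(), term.capitalize(), term.title()}:
            automaton.add_word(variant, term)
    automaton.make_automaton()
    return automaton

automaton = build_automaton(["nausea", "heart failure"])
text = "The patient reported Nausea and congestive HEART FAILURE."
for end_index, term in automaton.iter(text):
    start_index = end_index - len(term) + 1
    print(term, (start_index, end_index))
```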

4.2.2 Conditional Random Fields. For the Conditional Random Fields model, sklearn-crfsuite4 was used in order to predict the correct entity labels. This model takes into account not only the word itself but also the contextual information around it. The context window size was set to 4 (4 words before and 4 words after), and the corresponding features used within this window are displayed in Table 6. After obtaining the features, a linear-chain CRF model was trained for 100 iterations, and c1 (L1-regularization) and c2 (L2-regularization) were set to 0.1 to control overfitting.

Table 6: CRF Features

Feature                      Value
word itself in lowercase
text is uppercase            Boolean
text is lowercase            Boolean
text in title                Boolean
text is digit                Boolean
text is punctuation          Boolean
POS tags5
BOS tags6
EOS tags7
bi-/tri-/four-grams
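The snippet below is a reduced sketch of this setup with sklearn-crfsuite, using the regularization and iteration settings stated above; the feature function covers only a subset of Table 6 and omits the 4-word context window.

```python
# Reduced sklearn-crfsuite sketch of the linear-chain CRF described above;
# only a subset of the Table 6 features is shown, and the data are toy examples.
import sklearn_crfsuite

def token_features(tokens, i):
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
    }

def sentence_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

X_train = [sentence_features(["Nausea", "was", "reported", "."])]
y_train = [["B", "O", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```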

4.2.3 Bidirectional-LSTM model. The features used are word and character embeddings in both directions (backward and forward), with the two hidden states combined. This kind of model can capture more contextual information, providing fuller learning for NER. The anaGo8 library was used for the NER task.

The hyperparameter values are shown in Table 7.

4 https://sklearn-crfsuite.readthedocs.io/en/latest/
5 part-of-speech tagging
6 Beginning of Sentence
7 End of Sentence
8 https://pypi.org/project/anago/


Table 7: BiLSTM Hyperparameters

Hyperparameter               Value
character embeddings size    32
character lstm units         32
word embeddings size         128
word lstm units              128
predict batch size           8
dropout                      0.5
batch size                   8
number of epochs             10

Table 8: BioBERT Hyperparameters

Hyperparameter               Value
lower case                   False
maximum sequence length      150
train batch size             32
evaluation batch size        8
predict batch size           8
learning rate                5·10^-5
number of training epochs    10
warmup proportion            0.1

4.3 BioBERT

A transfer learning approach was examined using the BioBERT model. The WordPiece Tokenizer was used to tokenize the words and alleviate the out-of-vocabulary issue; new words can thus be represented by frequent subwords (e.g. neurologic → ne ##uro ##log ##ic). The corresponding labels were mapped to the new tokens created by the tokenization, and subwords starting with "##" were marked as "X", for which no prediction is made. This was done for the test set as well, where the tokens were converted to WordPiece tokens and the corresponding labels were mapped accordingly. The pre-trained BioBERT-Base, Cased model was fine-tuned on an NVIDIA Tesla T4 16GB GPU using the default hyperparameter values, setting only the maximum sequence length to 150 tokens (Table 8). The official BioBERT implementation was used.9
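The sketch below shows this label mapping onto WordPiece subwords, with "X" assigned to "##" continuation pieces. The HuggingFace tokenizer and the checkpoint name are assumptions; the thesis used the tokenizer shipped with the official BioBERT code.

```python
# Sketch of mapping word-level BIO labels onto WordPiece subwords; "##" pieces
# receive the dummy label "X" for which no prediction is made. The tokenizer
# and checkpoint name are assumptions for illustration.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

def align_labels(words, labels):
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        for i, piece in enumerate(tokenizer.tokenize(word)):
            pieces.append(piece)
            piece_labels.append(label if i == 0 else "X")
    return pieces, piece_labels

words = ["neurologic", "symptoms", "occurred"]
labels = ["B", "I", "O"]
print(align_labels(words, labels))
```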

5 RESULTS AND DISCUSSION

The results of the model performance are presented according to the three research subquestions described in Section 1. The model performance has been evaluated for correctly predicting the "ADR" label at two levels: token and entity level. Precision, Recall and F1-score are reported for entity-level evaluation, while only F1-score is reported for token-level evaluation. The full table can be found in the Appendix (Table 12). The 95% confidence intervals derived from the 5-fold cross validation are also reported, supporting the significance of the results. Additionally, an error analysis has been conducted and is reported at the end of this section.
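As a minimal sketch, a 95% interval can be derived from the five fold scores under a normal approximation as shown below; the exact interval formula used in the thesis is not stated, so this is an assumption.

```python
# Sketch: 95% confidence interval from 5-fold scores via a normal approximation.
# The fold scores are illustrative; the thesis does not state its exact formula.
import statistics

fold_f1 = [92.1, 92.8, 91.9, 93.0, 92.2]
mean = statistics.mean(fold_f1)
half_width = 1.96 * statistics.stdev(fold_f1) / len(fold_f1) ** 0.5
print(f"{mean:.2f} (+/- {half_width:.2f})")
```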

9https://github.com/dmis-lab/biobert

5.1 Model Performance Comparison

The models' performance was assessed based on both the evaluation metrics and the model training runtime. Entity- and token-level evaluations are discussed below.

The entity-level performance achieved by the three traditional methods is shown in Table 9. Clearly, the Dictionary approach gave poor performance on this biomedical NER task across all datasets, achieving 58.91%, 63.96% and 73.20% F1-score on the ADE, Elsevier's gold set and TAC2017 corpora, respectively. Between the two machine learning approaches, the BiLSTM model outperformed CRFs on the TAC2017 and ADE corpora, achieving 85.47% and 73.65%, respectively. On the other hand, CRFs achieved 74.39% on Elsevier's gold set, compared to 73.84% achieved by the BiLSTM model. However, the confidence intervals reveal some variation in the performance of these two models on this particular dataset, which probably comes from the dataset itself. These performances confirm that the contextual information captured by the CRF and BiLSTM models is essential for more accurate entity predictions.10

Although quite good results were achieved by the machine learning methods, BioBERT outperformed all other models on all datasets. BioBERT achieved an F1-score of 92.40% on TAC2017, 84.11% on ADE and 87.70% on the Elsevier's gold set corpus. The absolute improvement compared to the second best results was 6.93, 10.46 and 13.31 units on TAC2017, ADE and Elsevier's gold set, respectively. The best F1-score reported on TAC2017 for the ADR label at entity-level is 85.2% and at token-level 87%, both achieved using a BiLSTM-CRF model [23, 30], and 86.78% for ADRs on the ADE corpus [27]. Our approach outperforms previously reported scores on both entity- and token-level evaluation (described below) on TAC2017, while the score we achieved on the ADE corpus is slightly lower than what is currently reported in the literature.

Interestingly, the results show that BioBERT achieved much higher recall improvements than precision improvements. This means that the model missed far fewer ADR entities than the other methods. The simultaneous capture of bidirectional contextual information, an important characteristic of BioBERT, seems to be beneficial for model performance.

Differences in performance between the datasets can be explained by their size and variability. In particular, TAC2017 is the biggest dataset in terms of both the number of sentences and the number of ADRs, exposing the model to more variability and making it perform better. Regarding the performance on the ADE corpus, BioBERT underperforms compared to the other datasets. This seems to be explained by the fact that the ADE dataset consists of documents of individual sentences lacking cohesion, so no extra contextual information can be captured from the surroundings, affecting the text representations.

The token-level evaluation results are comparable with the entity-level evaluation results. Looking at the averaged performance of both "B" and "I" tags, BioBERT’s performance exceeds all other models’ performance, with an absolute improvement of 6.59, 8.09 and 10.52 units for TAC2017, ADE and Elsevier’s gold set corpus,

10 Under limited time and computational power, we did not perform full-scale hyperparameter tuning and only experimented with several hyperparameter configurations. We observed that the scores fluctuate around the reported ones with deviations in F1-score no higher than 1%.


Table 9: Models Performance in entity-level ADR Named Entity Recognition. Precision (P), Recall (R) and F1-score (F). Best scores are in bold, while second best are underlined. The 95% confidence intervals are displayed in the parentheses.

Dataset               Metric   Dictionary      CRFs            BiLSTM          BioBERT
TAC2017               P        65.57 (±1.38)   83.62 (±1.77)   85.84 (±1.27)   90.90 (±0.97)
                      R        82.89 (±1.25)   80.04 (±1.18)   85.13 (±1.59)   93.98 (±0.91)
                      F        73.20 (±0.69)   81.77 (±0.85)   85.47 (±0.73)   92.40 (±0.67)
ADE                   P        59.02 (±1.05)   74.50 (±0.83)   72.75 (±2.86)   82.00 (±0.95)
                      R        58.80 (±0.95)   69.77 (±1.21)   74.65 (±1.83)   86.33 (±0.70)
                      F        58.91 (±0.99)   72.05 (±0.91)   73.65 (±1.81)   84.11 (±0.78)
Elsevier's gold set   P        64.87 (±2.25)   80.25 (±1.12)   75.42 (±3.20)   86.15 (±2.09)
                      R        63.19 (±2.71)   69.38 (±3.09)   72.48 (±2.20)   89.27 (±0.97)
                      F        63.96 (±1.53)   74.39 (±1.94)   73.84 (±1.20)   87.70 (±1.39)

Table 10: Token-level F1-score for ADR Named Entity Recognition. Best scores are in bold, while second best are underlined for "B" (B_ADR) and "I" (I_ADR) separately, as well as the average of both of them. The 95% confidence intervals are displayed in the parentheses.

B_ADR
Dataset               Dictionary   CRFs    BiLSTM   BioBERT
TAC2017               74.82        85.42   88.97    91.63
ADE                   61.73        76.54   78.69    86.38
Elsevier's gold set   65.86        80.23   81.09    90.20

I_ADR
Dataset               Dictionary   CRFs    BiLSTM   BioBERT
TAC2017               65.55        76.16   81.62    92.19
ADE                   55.92        74.49   78.02    86.52
Elsevier's gold set   45.72        69.51   72.73    84.65

Average (B_ADR and I_ADR)
Dataset               Dictionary      CRFs            BiLSTM          BioBERT
TAC2017               70.19 (±0.79)   80.79 (±0.84)   85.30 (±0.89)   91.89 (±0.79)
ADE                   58.83 (±1.08)   75.52 (±0.71)   78.36 (±1.01)   86.45 (±0.97)
Elsevier's gold set   55.79 (±2.24)   74.87 (±2.55)   76.91 (±2.30)   87.43 (±1.41)

respectively, in comparison with the second best performance, accomplished by the BiLSTM model (Table 10). From an individual token perspective, interestingly, all traditional models are able to predict the "B_ADR" tag more successfully than the "I_ADR" tag. In contrast, BioBERT appears to be more effective at predicting the "I_ADR" than the "B_ADR" label on the TAC2017 and ADE corpora, whilst the opposite is observed on Elsevier's gold set. The range of confidence intervals on Elsevier's gold set for all methods used reveals a distinct characteristic of the dataset itself compared to the other two datasets.

An important general remark concerns the number of annotated examples used and the corresponding model performance. In particular, only around 7000 labeled examples were needed in the case of Elsevier's gold set to achieve results above 80% F1-score. Additionally, according to the required training times in Table 11, no significant differences were observed between the BiLSTM and BioBERT models, considering the boost in performance. Fine-tuning BioBERT requires about 30 minutes, while the BiLSTM needs about 18 minutes on the TAC2017 dataset. However, a GPU is required for the pre-trained BioBERT model due to its high number of parameters (110M). Overall, the BioBERT-based approach clearly outperformed the traditional approaches for ADR recognition.

5.2 Error Analysis

We conducted an investigation of the different types of BioBERT classification errors. A general observation across all datasets is that BioBERT makes the fewest mistakes from "B" to "I" tags ("B" is the

11training runtime to fine-tune BioBERT-Base, Cased

Table 11: Model Training Runtime in minutes (m) and seconds (s).

Model        TAC2017   ADE     Elsevier's gold set
Dictionary   ≈1s       ≈1s     ≈1s
CRFs         2m09s     1m2s    59s
BiLSTM       17m56s    9m45s   9m30s
BioBERT11    31m       24m     25m

true and "I" the predicted tag) and vice versa, meaning that the model is able to distinguish these two tags quite well. In particular, these mistakes represent 4.1-8.6% of all mistakes on TAC2017 and Elsevier’s gold set, while the model is more prone to these mistakes ( ≈ 15.5%) on ADE corpus. This difference between the different datasets could be very likely due to the lack of cohesion between annotations in the sentences in ADE corpus.

On the other hand, the most common mistakes happen from "O" to either of the two other tags, "B" or "I". This can be explained by the fact that some words in all corpora are labeled sometimes as "O" and sometimes as "I" or "B", depending on the context, making the model prone to interchanging these labels more often. Common types of errors are the following:

• Error type 1: the model hypothesized an entity, but there is none (Fig. 2). The tokens "congestive" "heart" "failure" are predicted as "B" "I" "I" and the token "cardiomyopathy" as "B", while the true tokens are labeled as "O".

• Error type 2: the model completely missed an entity (Fig. 3). The tokens "ataxia" "and" "gum" "hypertrophy" are labeled


Figure 2: Error type 1 - The model hypothesized an entity, but there is none

Figure 3: Error type 2 - The model completely missed an en-tity

as "B" "O" "B" "I" and the model predicts them as "O" "O" "O" "O", meaning that it misses both entities.

• Error type 3: the model noticed an entity but achieved only partial matching (Fig. 4). The tokens "Aspartate" "aminotransferase" ">" "5" "." "0x" "UNL" are labeled as "B" "I" "I" "I" "O" "O" "I", while they are predicted as "B" "I" "I" "I" "O" "O" "O", and the tokens "Alanine" "aminotransferase" ">" "5" "." are labeled as "B" "I" "I" "I" but are predicted as "B" "I" "I" "O".

• Error type 4: the model noticed an entity but gave the wrong label (Fig. 5). The token "bleeding" is labeled as "B", while its predicted label is "I". The model has identified an entity in this region but has also predicted other tokens from the surroundings, giving the wrong label.

An interesting observation in the example of Error type 3 (Fig. 4) is the presence of a discontinuous entity, where one or two "O" labels separate the "I" tags of an entity. With an exact-match evaluation system, these entities do not count in the computation of the metrics, even though the model captured them partially. This type of error, where some tokens labeled as "O" lie in between "I" labels, was found to be one of the common mistakes. A similar discontinuous-entity case is when two words are connected with a hyphen but treated as separate tokens by BioBERT. Interestingly, BioBERT misses such an entity completely, something which was observed to

Figure 4: Error type 3 - The model noticed an entity but achieved only partial matching

Figure 5: Error type 4 - The model noticed an entity but gave the wrong label

happen quite often. In general, this discontinuity arises because of special characters such as dots, digits and hyphens, which need to be treated more carefully by the tokenizer in order to build a stronger algorithm.

Another observation worth mentioning is that the model predicts as "B" and "I" tokens that are not labeled as such but are biomedical terms that could potentially be ADRs. An example is shown in Figure 2, where the tokens "congestive" "heart" "failure" are predicted as an ADR.

Based on the above-mentioned cases, a model which treats text tokenization differently, taking special characters into account, could show whether these cases play a role in the overall performance. Additionally, a careful inspection of the given annotations by domain experts would elucidate the ambiguity in the way some words are labeled. Regarding the case of the same tokens being labeled differently depending on the surrounding context, training the model for more epochs could be beneficial and might capture more of this contextual variability.

6 CONCLUSION

In this paper, a transfer learning approach for ADR recognition was tested using a domain-specific language model, BioBERT. Three traditional methods, a Dictionary-based approach, CRFs and a BiLSTM model, were used as baselines, and their performance was evaluated on three different biomedical datasets: the TAC2017, ADE and Elsevier's gold set corpora. Fine-tuning BioBERT achieved significantly better performance compared to the three


traditional methods on all biomedical corpora. An interesting observed property of this model is its ability to find more entities compared to existing methods, using only a few thousand examples and requiring a comparable amount of training time.

Differences in performance between the datasets were observed, demonstrating that the size and the nature of the dataset play an important role in model performance. However, in the case of the Elsevier's gold set corpus, only around 7000 labeled examples were sufficient to achieve results above 80% F1-score, supporting the view that transfer learning is a machine learning approach that can overcome the lack of annotated data and achieve outstanding results compared to other machine learning methods.

Our research has highlighted that transformer-based neural models are a promising approach for complex biomedical NER problems, such as ADR recognition. Error analysis revealed some common mistakes that our approach makes due to the text tokenization system used (the special-characters issue) and the ambiguity in the annotation itself. A tokenization system that treats special characters in a better way and a more thorough investigation of the annotation would clarify whether the above-mentioned cases are a source of common mistakes. Another remark is the presence of different labels for the same token depending on the context, something that leads to interchanges of these labels and has a negative impact on model performance. More training time could reveal whether the model is capable of capturing this variability and avoiding those mistakes.

7 LIMITATIONS AND FUTURE WORK

Due to computational constraints and BioBERT limitations, the current study was limited to short sentences (<130 words) with resolved overlapping entities for all datasets. Special caution should be given to the nature of the datasets, meaning that the outcome cannot be generalized to a biomedical NER task on any other dataset. Further studies, which take the error analysis into account, will need to be performed.

ACKNOWLEDGMENTS

I would like to express my gratitude to Prof. Paul Groth for his considerate guidance, the useful critiques of this research work and the chance he gave me to publish it, and to Dr. Viachaslau Sazonau, my day-to-day supervisor, for his support, inspiration and the time he spent on the accomplishment of this project. I would also like to thank all the group members at Elsevier for their kind and valuable feedback throughout my internship in the company. Finally, I would like to express my gratitude to my partner and my family for inspiring and supporting me, and to all my classmates for their feedback.

REFERENCES

[1] Alfred V. Aho and Margaret J. Corasick. 1975. Efficient String Matching: An Aid to Bibliographic Search. Commun. ACM 18, 6 (1975), 333–340.

[2] Eiji Aramaki, Yasuhide Miura, Masatsugu Tonoike, Tomoko Ohkuma, Hiroshi Masuichi, Kayo Waki, and Kazuhiko Ohe. 2010. Extraction of adverse drug effects from clinical records. Studies in health technology and informatics 160 (01 2010), 739–43. https://doi.org/10.3233/978-1-60750-588-4-739

[3] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, and Hans-Peter Kriegel. 2008. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 9, 1 (23 Apr 2008), 207. https: //doi.org/10.1186/1471-2105-9-207

[4] Razvan Bunescu, Raymond Mooney, Arun Ramani, and Edward Marcotte. 2006. Integrating co-occurrence statistics with information extraction for robust re-trieval of protein interactions from Medline. (07 2006), 49–56. https://doi.org/10. 3115/1654415.1654424

[5] K. Bretonnel Cohen, Lynne Fox, Philip V. Ogren, and Lawrence Hunter. 2005. Corpus Design for Biomedical Natural Language Processing. In Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. Association for Computational Linguistics, Detroit, 38–45. https://www.aclweb.org/anthology/W05-1306

[6] K. Bretonnel Cohen and Lawrence Hunter. 2008. Getting Started in Text Mining. PLOS Computational Biology4, 1 (01 2008), 1–3. https://doi.org/10.1371/journal. pcbi.0040020

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRRabs/1810.04805 (2018).

[8] Ian Donaldson, Joel Martin, Berry de Bruijn, Cheryl Wolting, Vicki Lay, Brigitte Tuekam, Shudong Zhang, Berivan Baskin, Gary D. Bader, Katerina Michalickova, Tony Pawson, and Christopher WV Hogue. 2003. PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4, 1 (27 Mar 2003), 11. https://doi.org/10.1186/ 1471-2105-4-11

[9] X. Dong, L. Qian, Y. Guan, L. Huang, Q. Yu, and J. Yang. 2016. A multiclass classification method based on deep learning for named entity recognition in electronic medical records. In 2016 New York Scientific Data Summit (NYSDS). 1–10. https://doi.org/10.1109/NYSDS.2016.7747810

[10] Ramón A-A. Erhardt, Reinhard Schneider, and Christian Blaschke. 2006. Status of text-mining techniques applied to biomedical text. Drug Discovery Today 11, 7 (2006), 315 – 325. https://doi.org/10.1016/j.drudis.2006.02.011

[11] John Giorgi and Gary D Bader. 2018. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics (Oxford, England) 34 (06 2018). https://doi.org/10.1093/bioinformatics/bty449

[12] John M Giorgi and Gary D Bader. 2018. Transfer learning for biomedical named entity recognition with neural networks. Bioin-formatics 34, 23 (06 2018), 4087–4094. https://doi.org/10.1093/ bioinformatics/bty449 arXiv:http://oup.prod.sis.lan/bioinformatics/article-pdf/34/23/4087/26676581/bty449.pdf

[13] Claudio Giuliano, Alberto Lavelli, and Lorenza Romano. 2006. Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature. In 11th Conference of the European Chapter of the Association for Computational Linguistics. http://aclweb.org/anthology/E06-1051

[14] Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics 45, 5 (2012), 885 – 892. https: //doi.org/10.1016/j.jbi.2012.04.008 Text Mining and Natural Language Processing in Pharmacogenomics.

[15] Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck. 2005. ProMiner: rule-based protein and gene entity recogni-tion. BMC Bioinformatics 6, 1 (24 May 2005), S14. https://doi.org/10.1186/ 1471-2105-6-S1-S14

[16] Rave Harpaz, Santiago Vilar, William DuMouchel, Hojjat Salmasian, Krystl Haerian, Nigam H Shah, Herbert S Chase, and Carol Friedman. 2012. Combing signals from spontaneous reports and electronic health records for detection of adverse drug reactions. Journal of the American Medical Informatics Association 20, 3 (10 2012), 413–419. https://doi.org/10.1136/amiajnl-2012-000930
[17] María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, and Thierry Declerck. 2013. The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions. Journal of Biomedical Informatics 46, 5 (2013), 914–920. https://doi.org/10.1016/j.jbi.2013.07.011

[18] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Con-ditional Random Fields: Probabilistic Models for Segmenting and Labeling Se-quence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML ’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 282–289. http://dl.acm.org/citation.cfm?id=645530.655813

[19] Jason Lazarou, Bruce H. Pomeranz, and Paul N. Corey. 1998. Incidence of Adverse Drug Reactions in Hospitalized PatientsA Meta-analysis of Prospective Studies. JAMA279, 15 (1998), 1200–1205.

[20] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Kim Donghyeon, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: pre-trained biomedical language representation model for biomedical text mining. (2019).

[21] Ulf Leser and Jörg Hakenberg. 2005. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics 6, 4 (12 2005), 357–369. https://doi.org/10.1093/bib/6.4.357
[22] Zhiyong Lu. 2011. PubMed and beyond: a survey of Web tools for searching

biomedical literature. Database : the journal of biological databases and curation


2011 (01 2011), baq036. https://doi.org/10.1093/database/baq036

[23] Nikola Milosevic, Goran Nenadic, Maksim Belousov, and William Dixon. 2018. Extracting adverse drug reactions and their context using sequence labelling en-sembles. http://healtex.org/healtac-2018/ UK Health Text Analytics Conference, HealTAC ; Conference date: 18-04-2018 Through 19-04-2018.

[24] Toshihide Ono, Haretsugu Hishigaki, Akira Tanigami, and Toshihisa Takagi. 2001. Automated extraction of information on protein–protein interactions from the biological literature . Bioinformatics 17, 2 (2001), 155–161.

[25] S. J. Pan and Q. Yang. 2010. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering22, 10 (Oct 2010), 1345–1359. https://doi.org/ 10.1109/TKDE.2009.191

[26] Alexandra Pomares Quimbaya, Alejandro Sierra Múnera, Rafael Andrés González Rivera, Julián Camilo Daza Rodríguez, Oscar Mauricio Muñoz Velandia, Angel Alberto Garcia Peña, and Cyril Labbé. 2016. Named Entity Recognition Over Electronic Health Records Through a Combined Dictionary-based Approach. Procedia Computer Science100 (2016), 55 – 61. https://doi.org/10.1016/j.procs. 2016.09.123

[27] Suriyadeepan Ramamoorthy and Selvakumaran Murugan. 2018. An Attentive Sequence Model for Adverse Drug Event Extraction from Biomedical Text. (2018).
[28] Lance A Ramshaw and Mitchell P Marcus. 1999. Text chunking using transformation-based learning. In Natural language processing using very large corpora. Springer, 157–176.

[29] Kirk Roberts, Dina Demner-Fushman, and Joseph M. Tonning. 2017. Overview of the TAC 2017 Adverse Reaction Extraction from Drug Labels Track. In TAC.
[30] Kirk Roberts, Dina Demner-Fushman, and Joseph M. Tonning. 2017. Overview of the TAC 2017 Adverse Reaction Extraction from Drug Labels Track. In Proceedings of the 2017 Text Analysis Conference, TAC 2017, Gaithersburg, Maryland, USA, November 13-14, 2017. https://tac.nist.gov/publications/2017/additional.papers/TAC2017.ADR_overview.proceedings.pdf

[31] M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing45, 11 (1997), 2673–2681.

[32] Burr Settles. 2004. Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications (JNLPBA ’04). Association for Computational Linguistics, Stroudsburg, PA, USA, 104–107. http://dl.acm.org/citation.cfm?id=1567594.1567618

[33] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chun-fang Liu. 2018. A Survey on Deep Transfer Learning. CoRR abs/1808.01974 (2018). arXiv:1808.01974 http://arxiv.org/abs/1808.01974

[34] Mert Tiftikci, Arzucan Özgür, Yongqun He, and Junguk Hur. 2017. Extracting Adverse Drug Reactions using Deep Learning and Dictionary Based Approaches. In TAC.

[35] Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis Langlotz, and Jiawei Han. 2018. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics 35, 10 (10 2018), 1745– 1752. https://doi.org/10.1093/bioinformatics/bty869

[36] David Luis Wiegandt, Leon Weber, Ulf Leser, Maryam Habibi, and Mariana Neves. 2017. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33, 14 (07 2017), i37–i48. https://doi.org/10. 1093/bioinformatics/btx228 arXiv:http://oup.prod.sis.lan/bioinformatics/article-pdf/33/14/i37/25157154/btx228.pdf

[37] Elad Yom-Tov and Evgeniy Gabrilovich. 2013. Postmarket Drug Surveillance Without Trial Costs: Discovery of Adverse Drug Reactions Through Large-Scale Analysis of Web Search Queries. Journal of medical Internet research 15 (06 2013), e124. https://doi.org/10.2196/jmir.2614


APPENDIX

Figure 6: Labels frequency on TAC2017 corpus

Figure 7: Labels frequency on ADE corpus

Figure 8: Labels frequency on Elsevier’s gold set corpus

Figure 9: Text Length Distribution on TAC2017 corpus

Figure 10: Text Length Distribution on ADE corpus

Figure 11: Text Length Distribution on Elsevier's gold set corpus


Table 12: Token-level evaluation performance of all models (Dictionary, CRFs, BiLSTM and BioBERT) on biomedical NER task. Precision (P), Recall (R) and F1-score (F) are reported for "B" (B_ADR) and "I" (I_ADR) separately, as well as the average of both of them. The 95% confidence intervals are displayed in the parentheses.

Dictionary
Dataset               Metric   B_ADR           I_ADR           Average
TAC2017               P        67.33 (±1.11)   71.78 (±0.90)   69.56 (±1.01)
                      R        84.23 (±1.27)   60.32 (±1.34)   72.27 (±1.30)
                      F        74.82 (±0.58)   65.55 (±1.00)   70.19 (±0.79)
ADE                   P        62.06 (±1.15)   83.81 (±1.44)   72.94 (±1.29)
                      R        61.39 (±1.07)   41.98 (±1.18)   51.69 (±1.13)
                      F        61.73 (±1.09)   55.92 (±1.07)   58.83 (±1.08)
Elsevier's gold set   P        66.79 (±2.20)   67.07 (±5.53)   66.93 (±3.86)
                      R        65.06 (±2.37)   34.87 (±3.35)   49.97 (±2.86)
                      F        65.86 (±1.42)   45.72 (±3.05)   55.79 (±2.24)

CRFs
Dataset               Metric   B_ADR           I_ADR           Average
TAC2017               P        86.81 (±1.72)   81.58 (±1.50)   84.20 (±1.61)
                      R        84.10 (±0.97)   71.45 (±1.37)   77.77 (±1.17)
                      F        85.42 (±0.83)   76.16 (±0.85)   80.79 (±0.84)
ADE                   P        79.14 (±0.79)   79.76 (±1.56)   79.45 (±1.17)
                      R        74.12 (±1.10)   69.90 (±1.05)   72.01 (±1.08)
                      F        76.54 (±0.82)   74.49 (±0.60)   75.52 (±0.71)
Elsevier's gold set   P        85.69 (±1.11)   77.03 (±3.40)   81.36 (±2.26)
                      R        75.47 (±2.96)   63.37 (±3.60)   69.42 (±3.28)
                      F        80.23 (±1.82)   69.51 (±3.28)   74.87 (±2.55)

BiLSTM
Dataset               Metric   B_ADR           I_ADR           Average
TAC2017               P        88.84 (±1.35)   85.03 (±1.29)   86.93 (±1.32)
                      R        89.14 (±1.56)   78.49 (±1.32)   83.82 (±1.44)
                      F        88.97 (±0.82)   81.62 (±0.96)   85.30 (±0.89)
ADE                   P        77.69 (±2.09)   80.52 (±3.84)   79.11 (±2.96)
                      R        79.80 (±1.75)   75.85 (±1.63)   77.82 (±1.69)
                      F        78.69 (±0.90)   78.02 (±1.12)   78.36 (±1.01)
Elsevier's gold set   P        82.04 (±3.48)   76.73 (±6.07)   79.39 (±4.78)
                      R        80.34 (±2.64)   69.44 (±3.36)   74.89 (±3.00)
                      F        81.09 (±1.53)   72.73 (±3.08)   76.91 (±2.30)

BioBERT
Dataset               Metric   B_ADR           I_ADR           Average
TAC2017               P        92.97 (±0.90)   95.57 (±0.82)   94.25 (±0.60)
                      R        90.29 (±0.68)   88.80 (±1.45)   89.53 (±0.98)
                      F        91.63 (±0.79)   92.19 (±1.14)   91.89 (±0.79)
ADE                   P        84.78 (±1.48)   85.54 (±1.38)   85.16 (±1.43)
                      R        88.04 (±0.94)   87.55 (±0.88)   87.79 (±0.91)
                      F        86.38 (±1.20)   86.52 (±0.73)   86.45 (±0.97)
Elsevier's gold set   P        89.48 (±2.01)   85.47 (±2.11)   87.47 (±2.06)
                      R        90.96 (±0.64)   83.88 (±1.75)   87.42 (±1.19)
                      F        90.20 (±1.14)   84.65 (±1.67)   87.43 (±1.41)
