Transfer learning for Named Entity Recognition of Data sets names and their categories within scholarly text and source code

submitted in partial fulfillment for the degree of Master of Science

Pavlos Evangelatos

12760390

Master Information Studies

Data Science track

Faculty of Science

University of Amsterdam

2020-09-28

evangelatospavlos@gmail.com

Supervisor
Name: Assistant Professor (PhD) Maarten Marx
Affiliation: ILPS, Informatics Institute, Universiteit van Amsterdam
Email: m.j.marx@uva.nl


ABSTRACT

This project concerns the sophisticated, automated processing of large amounts of textual data to detect the mentioned and used data set names and data corpora in scientific articles. In recent years, academic research output has been rising and multiple procedures involving state-of-the-art data analysis, prediction and forecasting are deployed globally. Technological and scientific advancements occurring in the computer science field are based on numerous and various forms and sorts of data. Machine learning solutions for the exploration and grouping of those increasing loads of data sets, and for the extraction of insights and knowledge, could prove beneficial for the scientific community. The scope of this project was a contribution to the emergence of a method from which inferences can be derived for the fast and often chaotic research process. The anticipated results concerned the recognition of entities affiliated with data sources within text content. A transfer learning approach is proposed, and cutting-edge models that have led to a significant improvement of NER and other Natural Language Processing tasks are compared.

KEYWORDS

Named Entity Recognition, NER, Data set Identification, Data set Recognition, BERT, XLNet, RoBERTa, SciBERT

1 INTRODUCTION

1.1 Overview

Outline:

Thousands of scientific papers that reflect parallel and concurrent academic work are published every year in journals and presented at conferences worldwide. This immense volume of specialised text contains valuable information scattered in its body. The information extraction and text mining processes are extremely important and interact strongly with the continuous, ongoing development of breakthrough practices. Computer science scholarly articles comprise technical terms relevant to both the theoretical reasoning and the implementations. The growing, unstructured, huge amounts of text in articles provide a vast volume of information that needs to be manipulated and filtered so that insights and inferences can be extracted. Multiple repositories for scholarly resources exist that retain academic literature. ArXiv, for example, has gathered just under 2 million papers over the last 30 years, including the computer science and informatics fields.

Data have been characterized by economists as a new form of wealth. Data sets are important factors in the academic and research process and play a key role. While they used to be private and proprietary to corporations and institutions, there is lately an increasing trend towards their free disposal. An aid to their recognition and identification within an "ocean" of information could prove really considerable and be appreciated by the global scientific community. It is considered that it would create opportunities for new ideas to be deployed, while it could save valuable time that could be invested in a more constructive way, add value and lead to progress and growth.

1.2 Objective & Research Questions

The objective of this project was to provide a strategy to achieve Named Entity Recognition for data sets and data sources in an academic, computer science terminology context. The scope of the project was for the following sub-questions to be answered individually:

• Extension of the SpaCy framework's pre-trained NER model: addition and training of the entity type "data set category".

• Transfer learning for "data sets" Named Entity Recognition: detection of the labels of mentioned data set names, knowledge bases and data corpora in scientific articles.

• Comparison of SciBERT with baseline models like BERT base and with recently released models like XLNet and RoBERTa.

1.3 Project's Purpose & Project Structure

The project's main contribution is the fine-tuning of pre-trained state-of-the-art models on the Named Entity Recognition task and the comparison of their efficiency and performance. Programming practices were combined with data analysis techniques to discover an optimal, coherent strategy and to conclude with results that are valid and well grounded in the literature. The fine-tuning was essentially a method of adding extra layers of neurons on top of the neural networks and training the custom models on a specific classification task (NER). After this step the models were evaluated; new data coming from academic text could be fed as inputs and safe predictions could be produced regarding their context.

The task was challenging due to the shortage of necessary data. Furthermore, the data set names to be identified often exhibit punctuation, discontinuity, abbreviations and other complex syntactic characteristics, as well as variance across linguistic contexts such as polysemy [15]. Moreover, the unstructured textual data cannot easily be visualized in order for patterns and outliers to be observed and suitable tactics to be designed.

The structure of the project was a typical data science approach that defined the frame of the problem and followed current, powerful techniques to build an in-depth analysis. The development of a scheme able to generate predictions on unknown inputs was also part of it.1

2 RELATED WORK & LITERATURE

2.1 NER and Entity types

Named Entity Recognition, which identifies entities, and Relation Extraction, which discovers the relations between them, are both sub-tasks of Information Extraction. NER is critical for question answering, information retrieval, co-reference resolution and topic modeling [19]. It is the NLP task which, when applied to a piece of text, identifies that "James Brown" is an object referring to a Person and "New Orleans" is a Location.

2.2 Related Literature

Many relevant prior attempts regarding specific term detection have been made to mine key words and blocks of words from free text, with respect to the context and the tagging of the adjacent parts of speech. Significant progress has been noted in the bio-informatics and biomedical field, where models pre-trained on corresponding text have been available and transfer learning techniques have been adopted in order to discover entities like the commercial names of drugs, dosages, proteins, DNA or adverse drug reactions within the relevant literature and medical records [9], [4]. In addition, some research was oriented towards the social sciences [5].

In the computer science related bibliography, fewer attempts have been made to extract useful knowledge so that the related, and often overlapping, research can be made discrete and categorized. Google offers "Dataset Search", a search engine that helps interested individuals to locate and access open, available data sets. Up to the present time, Dataset Search concerns data uploaded on web pages and is highly dependent on a specific key that is pre-known and pre-determined and can be matched to the data set name, description or recorded metadata. The search engine also returns results when a data set is cited in a paper by the authors. The citations that a scientific paper receives are always a metric for researchers in order to distinguish from the mass a paper that deserves to be studied. Citation specifically of data sets has been proposed in the past [12]. Nevertheless, it is not widespread in the community, and its connection with the citations of the papers overall, to assist the generation of a ranking list according to impact and significance, is not applicable yet.

1 Repository: https://github.com/evangelatospavlos/MScThesis

2.3 Machine Learning Background

The legacy machine learning models were facing poor performance regarding the interpretation of the sequential information met in text. Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) were until recently the major solution for NLP problems. RNN characteristics are the outputs at every step and the shared layer parameters, and they focus on short-term transitions while ignoring the long-term ones. LSTMs are similar to RNNs but preserve information for a longer period, as they control which new inputs are important to be written down and what information will be placed in memory. Both of them had requirements for large volumes of data, wasting computation resources and training time in order to achieve adequate results, while they presented a performance deficit and were prone to overfitting [13]. Moreover, Recurrent Neural Networks and Long Short-Term Memory solved the problem only partially and showed deficiencies in parallelism. BiLSTM-CRF stands for Bidirectional LSTM, which can encode dependencies on elements in both directions, enhanced with a sequential Conditional Random Field layer, a probabilistic graphical model that predicts labels. The CRF receives as input the scores output from the LSTM to jointly model tagging decisions and denotes the correctly assigned labels [6] [16]. In 2018, groundbreaking releases like the ELMo and BERT models brought radical and thorough solutions to language-based problems, similar to the boom that happened a few years earlier in the computer vision field, and set milestones. In computer vision, ImageNet and similar data sets with labeled images were used for training models based on convolutional neural networks. The image data are not structured and the feature extraction is complicated. Lower layers of such networks are dedicated to edges and contours, and the upper ones are more specific to more complex patterns. Fully connected layers follow the convolutional ones, placed before the classification output. Pre-trained weights are later applied to multiple data sets and tasks. ELMo (Embeddings from Language Models) [15] can represent the words of a sentence with a context-sensitive embedding instead of traditional fixed embeddings (each token is assigned a vector representation as a function of the entire input sentence), which is bidirectional so that next and previous word meanings are captured. BERT (Bidirectional Encoder Representations from Transformers) was released in late 2018 and set a benchmark by breaking every score record and noting remarkable performance.

A customized corpus with 7 types of NER entities and a tagging scheme that extracts entities and the relations between them was constructed from 1900 chemical papers. After pre-training the BERT parameters with large-scale field data, this gold standard corpus was used for a downstream NER task. In addition, a BERT softmax layer and a CRF (Conditional Random Fields) layer were added to improve the performance [14]. BioBERT, with an architecture similar to base BERT and pre-trained over a large-scale biomedical corpus (18 billion words of abstracts and articles), has been used for adverse drug reaction detection. Therefore, with minimal task-specific modifications, it could be applied and fine-tuned for biomedical NER. The sequences went through the encoder layers and predictions were made by a classification layer on top of the encoder output [16].

3 PRE-PROCESSING PHASE & LIMITATIONS

3.1 Data

3.1.1 Annotated Data Set. Data preparation and pre-processing is a critical part of a thorough analysis. In order to accomplish this goal, a solid gold set containing annotation for this particular sort of technical entities was crucial. Despite our intense, rigorous, versatile online search, not a single sufficient, open, available, diligently annotated set could be found to be utilized for the downstream NER task. The most relevant potential set for computer science that we managed to discover was the "Cybersecurity NER corpus 2019", which was uploaded in late June 2020. It was noisy, as it was derived from tweets, and it was not coherent with data sets but with security issues and malicious software. A solution was given by a paper from the University of Washington published in 2018, which introduced a set-up for entity recognition, relation extraction and co-reference resolution. The scope of that research was to aid scientists and engineers in extracting and clustering essential information from large collections of documents [11] that search engines are unable to retrieve. In that way they would stay updated about recent international cutting-edge technological advancements. Within the mentioned information extraction approach, a data set named SCIERC was created and launched together with the article publication. It was produced exclusively from scientific text (500 abstracts from Artificial Intelligence conferences). The limitation here is that data sets are usually referred to in the body of the papers. 6 categories of scientific entities were highlighted (Task, Method, Evaluation Metric, Material, Generic & Other Scientific Terms). The "Material" category was composed of Data, data sets, resources, Corpus and Knowledge base [11]. "Image Data", "Speech Data", "CoNLL", "Penn Treebank" and "WordNet" constitute indicative examples. We utilized the above gold set, focused on that particular entity type, during the pre-processing stage and the data transformation. Hand-annotated data are hard to create because they require domain expertise and a lot of time.

Nevertheless, the gold set we had was too confined and unable to cope with the task, so we had to enrich it with more relevant terms concerning data set names. Kaggle, a subsidiary of Google, is a platform used by the Data Science community for predictive modelling and analytics competitions. Kaggle offers an API that was a reliable option for the enhancement. Data sets from the category 'computer science' were scraped. Complex names that were often separated with hyphens ('-') were split and the appropriate label was given to them before they were aggregated to the SCIERC set. Duplicate values were removed to prevent redundancy. Therefore, the final format of our gold set consisted of 1035 sentences/sequences of words and 9481 total tokens, which is considered poor but just adequate. The sentences, with their components grouped and consolidated, were shuffled into the data set.
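As an illustration of this enrichment step, the following minimal sketch shows how a hyphen-separated dataset name could be split into tokens and given BIO-style "Material" labels before aggregation into the SCIERC set; the names and the helper function are hypothetical, not the project's actual script.

# Hypothetical sketch: split a hyphen-separated Kaggle dataset name such as
# "imdb-movie-reviews" into tokens and assign BIO-style labels before merging
# it into the gold set. The label name follows the SCIERC "Material" type.

def name_to_labeled_tokens(dataset_name):
    """Split a complex dataset name on hyphens and assign B/I labels."""
    tokens = [t for t in dataset_name.split("-") if t]
    labels = ["B_Material"] + ["I_Material"] * (len(tokens) - 1)
    return list(zip(tokens, labels))

# Illustrative names; the real list would come from the Kaggle API.
names = ["imdb-movie-reviews", "credit-card-fraud", "imdb-movie-reviews"]
labeled = {tuple(name_to_labeled_tokens(n)) for n in names}   # set() drops duplicates

for entry in sorted(labeled):
    print(entry)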

Due to the deficiency of an extensive, high-quality annotation, an alternative way that was examined during our project was to create our own annotation for the Named Entity Recognition. We already had access to some collections of papers from the computer science field. SDM, from the "SIAM International Conference on Data Mining", is an established collection of 1536 papers, and SIGIR, which stands for the "Special Interest Group on Information Retrieval", contains 2690 papers. Each paper accumulates several hundreds of sentences. A very simplified solution would be the filtering of the sentences where a key word like "data set", "corpus" or "data" is met. That would denote a probability of a data set presence, but it would definitely be problematic and fuzzy. A vague "dataset" word could be mentioned at numerous points in the text, and the word "data" could receive a range of meaningless treatments within the reasoning of academic texts. The only possible way to get something useful from this process afterwards would be to annotate manually all the tokens in several thousands of sentences. Even if the returned sentences that contained an interesting entity were annotated/labeled by hand with a '1', and with a '0' otherwise, and sequence classification was applied for NER (e.g. BERT For Sequence Classification) in order to just recognise whole sentences rather than words (tokens and entities), it is extremely difficult or even not feasible for a single individual to execute such a case study work.

3.1.2 BIO Tagging Scheme. The gold set was a directory of plain text document files containing whole sentences, together with additional corresponding annotation files. The annotation files declare the exact position of each term that belongs to the interesting entity types in the related sentences. This format (Brat standoff) has to be transformed in order to be recognised by transformer-based models. After the conversion, every token of a sentence is associated with a label in the BIO tagging scheme. The labels of the sentences are represented as (Beginning, Inside, Outside). Every token of a sentence was labeled as B-label ('B_Material') if the token was the beginning of the named entity, I-label ('I_Material') if it was inside a named entity but not first, or O-label ('O') if it was not part of it [8]. We were exclusively interested in and concentrated on the Material entity, which contained the desired terminology and notation.
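The conversion can be illustrated with a short sketch that maps Brat standoff character offsets onto whitespace tokens and emits BIO labels; the function and the example sentence below are illustrative assumptions, not the exact conversion code used in the project.

# Minimal sketch (illustrative): map Brat standoff annotations, given as
# (start, end, type) character offsets, onto whitespace tokens of a sentence
# and emit BIO labels.

def to_bio(sentence, spans):
    """spans: list of (start, end, entity_type) character offsets."""
    labels, pos = [], 0
    for token in sentence.split():
        start = sentence.index(token, pos)
        end = start + len(token)
        pos = end
        label = "O"
        for s, e, etype in spans:
            if start >= s and end <= e:
                label = ("B_" if start == s else "I_") + etype
                break
        labels.append((token, label))
    return labels

# "CoNLL 2003" occupies characters 28-38 of this illustrative sentence.
print(to_bio("Experiments were run on the CoNLL 2003 corpus .",
             [(28, 38, "Material")]))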

3.2 Frameworks & Tools

SpaCy [7] is an open-source library utilized for advanced Natural Language Processing problems. It offers tokenization, lemmatization, vectorization, comparison of semantic similarity, part-of-speech tagging and named entity recognition, among others, with the option to enrich, or manually construct and train, the entity recognizer. SpaCy provides models like 'en_core_web_lg', trained on texts for different purposes. This fact affects their performance on the named entity recognition task. The package spacy-transformers is an integration wrapper that provides SpaCy model pipelines in which packages with PyTorch transformer-based models like BERT can be accessed and instantly used. After the transformation of the gold set into the appropriate format required by SpaCy and the enhancement of the library with the desired new entity type related to data sets and data resources, a barrier occurred. The limitation was that a PyTorch Transformer model could not be utilized to train a regular SpaCy recognizer. Currently there are no NER implementations added for spacy-transformers. Text classification with transformer models is not available yet due to their large size and their demand for large batch sizes.
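A minimal sketch of the SpaCy extension step is given below, assuming the spaCy 2.x API; the new label name "DATASET_CATEGORY" and the single training example are illustrative.

# Sketch of extending the pre-trained SpaCy NER with a new entity type
# (spaCy 2.x API assumed; label name and training example are illustrative).
import random
import spacy

nlp = spacy.load("en_core_web_lg")
ner = nlp.get_pipe("ner")
ner.add_label("DATASET_CATEGORY")               # register the new entity type

TRAIN_DATA = [
    ("We evaluate on the CoNLL 2003 corpus.",
     {"entities": [(19, 29, "DATASET_CATEGORY")]}),
]

# Train only the NER component, starting from the pre-trained weights.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for _ in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
        print(losses)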

Hence, the Huggingface Transformers library [17] was selected as the framework to accomplish the goal of the project's development and overcome the constraints. The Transformers library provides many variations of the sharp, state-of-the-art, pre-trained transformer-based NLP models that can be fine-tuned on particular data sets and tasks (e.g. classification, question answering). The two major deep learning environments, PyTorch by Facebook and TensorFlow by Google, are both supported. Our implementation was based on PyTorch. In contrast with traditional RNNs, transformer-based models, introduced in 2017, process sequential text data in an arbitrary manner, without keeping the order. This raises the parallelisation abilities, which makes them faster and reduces the algorithmic complexity costs. They managed to dominate over former techniques like LSTM. They contributed to a rise in the size of the sets utilized for the pre-training and designated the proper environment for the development of the next generation of models, designed primarily for NLP, like BERT. They are composed of layers of encoders and decoders which activate an attention mechanism that adds weights to the linearly transformed inputs and accumulates some informative elements to generate the output.

3.3 Feature Extraction

Every model is accompanied by its own tokenizer and its own vocabulary, which is essential for every token to be mapped to a unique code. The limit of the length of the sub-word sequences was set to 150 or 175. The largest sequence length of our data was in the interval 142 to 175, depending on the different tokenizers, while sequence lengths up to 512 are supported by BERT and for XLNet the length is unlimited. Longer sequences were truncated and shorter sequences were padded (post-tokens) to reach the fixed defined size. The tokenizers and the models are usually offered in cased and uncased variations. Although the uncased variants of the models are widely considered to achieve better records, our strategy was to focus on the case-sensitive versions, which is more suitable for Named Entity Recognition. Capitalization can be indicative of phrases referring to a data set and can exist in the involved words, so it was decided to take it into account. Tokenizers also break complex words into pieces that can be identified by the vocabulary. To manage this issue, the corresponding labels had to be multiplied accordingly during this split process. Neural network inputs are not text but numerical values. The tokenizers have core features that convert input tokens to ids (index numbers), encoding representations according to their vocabulary. BIO labels were also modified into integer numbers. At this stage special tokens are added. For the BERT family, sentences are discriminated like: [CLS] + Sentence_A + [SEP] + Sentence_B + [SEP], while in XLNet: Sentence_A + [SEP] + Sentence_B + [SEP] + [CLS]. [CLS] is like a mark for classifiers. Attention masks were also created as an additional input array to the input ids and labels. They follow their corresponding variables at the training-validation set separation. They are float numbers acting as signals that inform the model whether the respective token is an actual one (1.0) or a padding product to ignore (0.0). Then data loaders needed to be set. At training, data were shuffled using a random sampler, while at validation, data were loaded sequentially with the use of a sequential sampler.
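A minimal sketch of this feature-extraction step, assuming a cased BERT tokenizer from the Huggingface Transformers library, is shown below; the sentence, label set and variable names are illustrative, and the [CLS]/[SEP] special tokens are omitted for brevity.

# Sketch of the feature extraction: sub-word tokenization, label multiplication,
# conversion to ids, padding to a fixed length and attention masks.
from transformers import BertTokenizer

MAX_LEN = 150
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tag2idx = {"O": 0, "B_Material": 1, "I_Material": 2, "PAD": 3}

words  = ["Experiments", "were", "run", "on", "the", "CoNLL", "corpus", "."]
labels = ["O", "O", "O", "O", "O", "B_Material", "I_Material", "O"]

tokens, token_labels = [], []
for word, label in zip(words, labels):
    pieces = tokenizer.tokenize(word)              # sub-word split
    tokens.extend(pieces)
    token_labels.extend([label] * len(pieces))     # multiply the label over the pieces

input_ids = tokenizer.convert_tokens_to_ids(tokens)[:MAX_LEN]
label_ids = [tag2idx[l] for l in token_labels][:MAX_LEN]

# Pad to the fixed length and build the attention mask (1.0 = real token, 0.0 = padding).
pad = MAX_LEN - len(input_ids)
attention_mask = [1.0] * len(input_ids) + [0.0] * pad
input_ids += [0] * pad
label_ids += [tag2idx["PAD"]] * pad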

K-fold cross-validation with the number of splits equal to 5 was attempted, but it was abandoned due to inconsistencies that appeared. A train-test split of 85% - 15% was preferred, and the evaluation score metrics were calculated on unseen data.
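A sketch of the split and the data loaders described above follows; dummy stand-ins are used for the encoded gold set so the snippet is self-contained.

# Sketch of the 85% / 15% split and the data loaders (stand-in arrays for the
# encoded gold set; the real arrays come from the feature-extraction step).
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset

all_input_ids = np.random.randint(0, 30000, (1035, 150))   # stand-in encoded sentences
all_label_ids = np.random.randint(0, 3, (1035, 150))        # stand-in BIO label ids
all_masks     = np.ones((1035, 150), dtype=np.float32)      # stand-in attention masks

tr_x, va_x, tr_y, va_y, tr_m, va_m = train_test_split(
    all_input_ids, all_label_ids, all_masks, test_size=0.15, random_state=42)

train_data = TensorDataset(torch.tensor(tr_x), torch.tensor(tr_m), torch.tensor(tr_y))
valid_data = TensorDataset(torch.tensor(va_x), torch.tensor(va_m), torch.tensor(va_y))

# Random sampling during training, sequential order during validation.
train_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=16)
valid_loader = DataLoader(valid_data, sampler=SequentialSampler(valid_data), batch_size=16)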

4 METHODOLOGY

4.1 Transfer Learning

Transfer Learning uses the knowledge obtained from a task for which the model was previously trained for another purpose. It improves the generalization of the model, provides faster computations and reduces the amount of the demanding data set (gold standard annotation corpus) [6]. The former features (parameters and weights) learned and derived from the preceding training can be acquired and loaded again as the initial point. This shortcut tactic, along with ensemble learning (the combination of different models), are two dominant approaches to tackle deep learning issues at the moment. In computer vision, the neural networks learn simple edges and corners at the lower layers and the complexity of objects is increased gradually at the higher ones. The first layers, trained with the general image features, can be preserved. In NLP, something similar happens regarding generic word representations and language structures.

The standard versions of language models were trained on general domain corpora and can be utilized for tasks like mining in news articles. BERT is pre-trained on the Wikipedia corpus, which had 2500 million words at the time, and the Book corpus with 800 million words. Hence, the models obtain a deep perception of the language's structure and architecture. Hundreds of training hours on GPUs are needed to train from scratch decent language models that perform well with an acceptable accuracy. Time and computation costs are huge, and they consume large amounts of energy too. Once such models are available, analysts obtain a considerable benefit, as they can adapt them to the cases they are interested in. They fine-tune them by changing their upper layers using far smaller data sets. The practice of downloading and incorporating pre-trained weights of the bottom layers, for a range of tasks other than the initial direction and outline, rather than building an entirely new prototype model, became widespread within the community.

4.2 Models

ELMo, BERT and the subsequent models built around the transformer architecture need vast amounts of textual data during the pre-training phase. Their major advantage is that these data are not required to be labeled; their initial training happens in an unsupervised manner. Generic attributes of the text are captured during this stage. Only during fine-tuning is a relatively small supervised set required. Unsupervised pre-training of models steeply increases the performance of NLP implementations by returning contextualized word embeddings for each token, which can be passed into minimal task-specific neural architectures [2].

BERT [3], a 12-layer deep network (12 transformer blocks and 12 attention heads), adopts a multi-layer bidirectional transformer logic instead of the legacy left-to-right one, predicting randomly masked tokens and successive sentences. It makes predictions with respect to the left and right context at the same time. ELMo had separate models for the left and right context that were not interacting. In one step BERT randomly masks out 10% to 15% of the words in the training data, attempting to predict the masked words, and in the other step it takes in an input sentence and a candidate sentence and predicts a following match [13], [3]. Except for the masked language modeling, BERT optimizes a next sentence classification objective [18]. Transformer blocks reflect the high numbers of encoder layers. BERT introduced an unsupervised learning architecture for the prediction of words, which was deeper and included more parameters. The main additional attribute was that BERT could be integrated in a downstream task-specific architecture [1]. The lower layers encode local syntax (useful for part-of-speech tagging) and the higher layers can extract complex semantics (aspects of word meaning useful for word sense disambiguation tasks) [15]. The pooler performs a specific function to reduce the dimensionality of the network, and the dropout ignores units (neurons) during the training stage to prevent overfitting. A classifier is a linear upper layer.

BERT, released by Google AI Language, uses the WordPiece tokenizer and a defined vocabulary. The input representation is a sum of token (the token itself), segment (position in the sentence) and position embeddings (position in the sequence of sentences). A NER model is trained by feeding the output vector of each token into a classification layer which predicts the label. A set of several thousands of sentences of annotated data is enough, but necessary, for fine-tuning [16]. BERT's number of parameters can reach the scale of hundreds of millions. SciBERT [2] offers unsupervised pre-training on a large corpus of scientific papers and can subsequently be used for downstream Natural Language Processing tasks. SciBERT uses the SentencePiece library to create a new vocabulary named SCIVOCAB [2]. It includes words appearing in scientific literature instead of BERT's general domain, and it is trained on a corpus of 1.4 million papers from Semantic Scholar, 18% of them related to computer science. It is reported that it can outperform BERT by up to +3.55 F1 score (with fine-tuning) for CS-oriented tasks like NER [2]. Some arguments and questions about the disadvantages of BERT have been stated in the literature. The main one is the absence of the above-mentioned masks during the fine-tuning process on downstream tasks. If the input words do not include one that was masked out during training, defective predictions are derived. Another concerns the parallelisation of the derived predictions and the failure in the construction of dependencies between them that could affect each other and react on the outcome.

RoBERTa, released by Facebook [10], is a replication study of Google's BERT that executed multiple comparisons and presented some performance gains. It pointed out the importance of some key hyper-parameters and of the size of the training data, which can have a great impact on the final result. The research team declares that the original BERT is considerably under-trained and, if this drawback is addressed, it can perform equivalently to or even exceed posterior models that are alleged to be superior.

XLNet, released by Google/CMU [20], is one of the latest breakthrough models and is reported to outperform BERT on a range of NLP tasks. It was developed to cover some of BERT's cons and is considered to have been pre-trained in a generalized auto-regressive way, with respect to the bidirectional orientation, that surpasses BERT's limitations. It revives the Recurrent Network logic (segment recurrence mechanism) and integrates a policy denoted as "relative positional encoding", borrowed from the predecessor model named Transformer-XL. This is associated with the model's ability to learn how to estimate weights for the words preceding and following the temporary central one. The authors [20] demonstrate that XLNet records better accuracy than the BERT, RoBERTa and GPT models. They also claim that it has achieved a substantially reduced error rate on the test sets of several text classification data sets.

4.3 Fine-tuning

Fine-tuning of BERT, XLNet and similar transformer-based models includes the insertion of a single extra output layer on top, depending on the particular task. For named entity recognition a linear classification layer is added. In order to tackle our project's requirements, the most efficient approach is token-level classification. The weights of the last hidden state of the networks, derived from the pre-training pipeline, are passed as inputs to the token-level classifier. "Bert For Token Classification" and "XLNet For Token Classification", to name two, are appropriate modifications of the models in order to make a configuration that suits our purpose. Models for sequence classification classify entire sentences, in which the desired entities coexist with the rest of the words, and they would not be helpful.
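A minimal sketch of this configuration using the Huggingface Transformers token-classification classes follows; the label set matches the BIO scheme described earlier, and the equivalent classes for the other models (e.g. XLNetForTokenClassification) can be swapped in.

# Sketch of the token-level classification set-up: a linear classification
# layer is placed on top of the pre-trained encoder, sized to the BIO label set.
import torch
from transformers import BertForTokenClassification

tag2idx = {"O": 0, "B_Material": 1, "I_Material": 2, "PAD": 3}

model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(tag2idx))   # adds the linear classifier on top

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)                                  # moved to the GPU when available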

The AdamW optimiser was selected and weight decay was chosen as the regularization technique to penalize the weight matrices of the nodes. Regularization adds a term to the loss function that penalizes overfitting.
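A sketch of such an optimiser set-up is shown below; grouping the parameters so that bias and LayerNorm weights are exempt from decay, and the decay value of 0.01, are common choices assumed here rather than values reported by the thesis, while the learning rate and epsilon follow the BERT-family values in Table 1.

# Sketch of AdamW with weight-decay regularization (model from the sketch above).
from torch.optim import AdamW

no_decay = ["bias", "LayerNorm.weight"]
grouped_parameters = [
    {"params": [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},                        # assumed decay value
    {"params": [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]
optimizer = AdamW(grouped_parameters, lr=5e-6, eps=1e-12)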

Deep learning models comprise many hyper-parameters that need to be tuned in order to achieve the optimal outcome. The selection of a batch size of either 16 or 32 was recommended in the BERT paper [3]. The authors of BERT [3] propose a selection between 2 and 4 epochs during the fine-tuning stage. They are considered enough, and additional epochs add only slight gains.

In Table 1 the tuning hyper-parameters of the models are presented.

4.4 Running & Experiments

4.5 Training & Evaluation

Backpropagation follows the forward pass of the training data through the network. At each layer, combinations of linear weighted sums occur. Activation functions transform the outputs of the hidden neurons to achieve non-linearity.

                BERT    SciBERT  RoBERTa  XLNet
epochs          4       4        4        4
lower case      False   False    False    False
learning rate   5e-6    5e-6     5e-6     2e-5
batch size      16      16       16       32
seq. length     150     150      150      175
epsilon         1e-12   1e-12    1e-12    1e-12
max grad norm   1.0     1.0      1.0      1.0

Table 1: Parameters of the Fine-tuning.

The gradients were cleared every time before they were calculated at the backward pass, to avoid accumulation. No initial layers were frozen. Hence, the weights of the entire network were updated after the backpropagation and the complete feature space participated in the optimisation, which occurred by minimizing the cost function. Consequently, optimiser parameter and learning rate updates were taking place. Updated weights contribute to the minimization of prediction errors. After training, evaluation was executed with a forward pass of the validation data through the network.
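One fine-tuning epoch, as described above, can be sketched as follows; the model, optimizer and loaders are assumed to come from the earlier sketches, and the gradient norm is clipped at the max grad norm of 1.0 from Table 1.

# Sketch of one fine-tuning epoch followed by evaluation on the validation set.
import torch

model.train()
for batch in train_loader:
    input_ids, masks, labels = (t.to(device) for t in batch)
    model.zero_grad()                                    # clear gradients before the backward pass
    loss = model(input_ids, attention_mask=masks, labels=labels)[0]
    loss.backward()                                      # backpropagation through the whole network
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()                                     # update the weights

model.eval()
with torch.no_grad():                                    # forward pass only on validation data
    for batch in valid_loader:
        input_ids, masks, labels = (t.to(device) for t in batch)
        val_loss = model(input_ids, attention_mask=masks, labels=labels)[0]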

The main evaluation metrics taken into account to compare the models and measure their performance were Precision, Recall and F1 score of the predicted tokens and entities. Except for the token level, recognition at the entity level was examined too. A rational, fair policy is to allow only perfect matches of a full entity to be taken into consideration. Hence, the B-label and all consecutive, following I-label tags are required to match for them to be accepted as belonging to the same entity ('E'); otherwise they are marked as not entity ('NE'). Two entities of two or more words, where 'B-label' or 'I-label' tags have been assigned to each of them, are considered equal only when all their internal equivalent tags match each other exactly [16]. This policy was adopted and applied at the evaluation stage between the predicted entities and their true validation mapping labels. Every O-label met in the true validation tags was not ignored but registered as ('NE'). Its predicted equivalent was changed to ('NE') if it was also O-label, or ('E') otherwise.
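A small sketch of this strict entity-level mapping, under the assumption of the BIO label names used earlier, could look like this (illustrative, not the project's evaluation script).

# Sketch of the entity-level mapping: an entity counts as 'E' only when every
# one of its B/I tags matches the true tags; O-labels are registered as 'NE'.

def entity_level(true_tags, pred_tags):
    """Collapse token-level BIO tags into 'E' / 'NE' pairs per true entity or O token."""
    gold, pred = [], []
    i = 0
    while i < len(true_tags):
        if true_tags[i].startswith("B"):
            j = i + 1
            while j < len(true_tags) and true_tags[j].startswith("I"):
                j += 1
            exact = true_tags[i:j] == pred_tags[i:j]     # all internal tags must match
            gold.append("E")
            pred.append("E" if exact else "NE")
            i = j
        else:                                            # O-label: registered, not ignored
            gold.append("NE")
            pred.append("NE" if pred_tags[i] == "O" else "E")
            i += 1
    return gold, pred

print(entity_level(["O", "B_Material", "I_Material", "O"],
                   ["O", "B_Material", "O",          "O"]))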

The following equations represent the formulation of the metrics. Precision is the number of relevant items predicted divided by the total number of items predicted, and shows how many of the selected items are relevant. Recall is the number of relevant items predicted divided by the number of relevant items in the collection overall, and expresses how many of the relevant items are selected. F1 stands for the traditional F-measure or balanced F-score.

Pr = \frac{T_p}{T_p + F_p}    (1)

Rec = \frac{T_p}{T_p + F_n}    (2)

F_1 = \frac{2 \cdot Pr \cdot Rec}{Pr + Rec}    (3)

Validation Accuracy is also indicative of how the models are expected to perform on unseen data like the ones of the validation set.2 In addition, two other metrics that depict the erroneous behaviour of the models facilitated us in getting useful insights and proceeding to the decision making. Average Train Loss displays the error on the training set and Validation Loss displays the error after the model has been trained and run on the validation set. Their comparison and their progress during the epochs can often denote the presence of overfitting, when a model is not generalised well to new unseen data.

The training and the evaluation errors of the models throughout the epochs were also visualised. The plots have been inserted into the Appendix.

5 EXPERIMENTAL RESULTS

5.1 Experimental Setup

In this section the results are presented. They were derived through the series of steps in which the models mentioned above were fine-tuned, trained and evaluated on the downstream Named Entity Recognition task. A Tesla P100-PCIE-16GB GPU was utilized, so that the model parameters could be passed to it and the deep learning models run on it. GPUs (Graphics Processing Units) offer the necessary computational power that deep networks need to perform the large quantity of mathematical operations at their hidden layers, and they allow simultaneous computations and concurrent processing. Besides accelerating the execution, GPUs are appropriate for the loading of great amounts of data. The explanation of this fact is related to their greater number of cores, with arithmetic logic units as components. Consequently, they lead to less demand for allocated memory.

5.2 Running & Experiments

The tables below show the results of the analysis for the various models. For the token level, the average (macro/unweighted mean) Precision, Recall and F1-score were calculated. For the entity level, the metrics were calculated for each class separately.

Afterwards, the models were tested on the SDM and SIGIR paper collections. Random sentences from some papers were fed to the fine-tuned models, and the data/data set related entities were supposed to be identified. The sentences were prepared and encoded (tokenized and special tokens appended to them) to be fed as new inputs. The results were satisfying. Examples are given in the Appendix section.
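A sketch of this inference step is shown below, assuming the fine-tuned token-classification model, tokenizer and label mapping from the earlier sketches; the sentence is illustrative.

# Sketch of inference on a new sentence from a paper collection.
import torch

idx2tag = {v: k for k, v in tag2idx.items()}
sentence = "Savoy performed extensive experiments on CLEF data."

encoding = tokenizer.encode_plus(sentence, return_tensors="pt")   # adds the special tokens
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

model.eval()
with torch.no_grad():
    logits = model(input_ids, attention_mask=attention_mask)[0]

tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
predictions = torch.argmax(logits, dim=2)[0]
for token, pred in zip(tokens, predictions):
    print(token, idx2tag[int(pred)])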

2 Tp = True Positive, Fn = False Negative, Fp = False Positive

Token Level            BERT
Accuracy               0.838
F1-Score               0.771
Precision              0.760
Recall                 0.850
Average Train Loss     0.426
Validation Loss        0.406

Table 2: Validation Results for BERT base.

Token Level            RoBERTa
Accuracy               0.842
F1-Score               0.782
Precision              0.768
Recall                 0.825
Average Train Loss     0.483
Validation Loss        0.426

Table 3: Validation Results for RoBERTa.

Token Level            SciBERT
Accuracy               0.853
F1-Score               0.804
Precision              0.790
Recall                 0.836
Average Train Loss     0.398
Validation Loss        0.361

Table 4: Validation Results for SciBERT.

Token Level            XLNet
Accuracy               0.890
F1-Score               0.853
Precision              0.847
Recall                 0.862
Average Train Loss     0.267
Validation Loss        0.271

Table 5: Validation Results for XLNet.

6 CONCLUSION

6.1 Discussion

According to the comprehensive experiments, XLNet outperformed every other model in all the critical metrics at the token level. SciBERT, which is pre-trained on a scientific vocabulary, ranked next and exceeded the other BERT family models that were pre-trained on general content corpora, as was anticipated for this specific task. The entity level results were comparable with the token level ones. XLNet surpassed all the other models again.

(9)

Entity Level BERT      Not Entity   Entity
F1-Score               0.56         0.89
Precision              0.83         0.82
Recall                 0.42         0.97

Table 6: Results for the two classes Entity and Not Entity, for BERT base.

Entity Level RoBERTa   Not Entity   Entity
F1-Score               0.57         0.89
Precision              0.75         0.84
Recall                 0.46         0.95

Table 7: Results for the two classes Entity and Not Entity, for RoBERTa.

Entity Level SciBERT   Not Entity   Entity
F1-Score               0.67         0.91
Precision              0.87         0.86
Recall                 0.55         0.97

Table 8: Results for the two classes Entity and Not Entity, for SciBERT.

Entity Level XLNet     Not Entity   Entity
F1-Score               0.74         0.93
Precision              0.82         0.90
Recall                 0.67         0.95

Table 9: Results for the two classes Entity and Not Entity, for XLNet.

For the Entity tag, the differences between the models were not great, but a marked improvement was observed in the F1 score and Recall metrics for the Not Entity tag. SciBERT performed well and was very effective too. The identification of an Entity was generally predicted more successfully than the Not Entity tag, where Recall was broadly low, so the existence of many false negatives was obvious. XLNet also missed many, but fewer than the other models.

XLNet also demonstrated the lowest pair of losses/errors. The average train loss and the validation loss were high in general, but they decreased steeply when the number of epochs was raised. For example, when the number of epochs was set to 10 or to 15, the average train error ended up around 0.05 and the validation error was just slightly higher, which are definitely acceptable values.

Nevertheless, we decided to follow the authors' suggestions [3] and keep the epochs at a maximum of 4. However, it is important that the losses are similar and that neither is much lower or greater than the other. A limited underfitting, associated with the quality of the data set, is possible. Still, for XLNet it is illustrated that both losses are relatively low and the validation loss was slightly higher than the training loss, so the fit seems to be good enough.

6.2 Future Research

Further investigation is required in order to clarify the influence that a less confined, more competent data set could have, and how steeply the results could be affected. If a more extensive gold annotated data set were created or acquired, it could easily be applied and the models could be fitted, run, tested and optimised. The robust set-ups for the pre-processing, feature extraction, fine-tuning, training, evaluation and visualisation have already been built and adjusted.

GPT-3, by OpenAI, is a forthcoming next-generation pre-trained model which is reportedly immense compared to any existing model, as it is supposed to have a capacity of 175 billion trainable parameters. BERT's respective number is 110 million. It is the successor of previous models (GPT, GPT-2) and it will be able to address many Natural Language Processing tasks. At the moment it is in a beta testing stage. If such a model is released and accessible, it will be challenging for our project's idea and practice to be applied to it.

The extension of the project to include the source code of programming scripts is a further step. The inclusive identification of used "data sets" in Jupyter notebooks (e.g. run in Kaggle kernels), using indirect evidence like pd.read_csv(), df.head(), df.shape or df.info(), without the name or the filename being referred to, would also be a challenge. Finally, after the problem of the identification of the data sets has been resolved, a sort of clustering will be advantageous to the scientific community. Hence, the data sets would be distinguished into discrete classes with respect to their category and utility.
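As an illustration of this indirect-evidence idea, a simple sketch could scan the code cells of a notebook for pandas read_csv calls and collect the file arguments as candidate data set mentions; the notebook path and the pattern are hypothetical.

# Hypothetical sketch: collect read_csv arguments from a notebook's code cells
# as candidate data set mentions.
import json
import re

# Matches pd.read_csv("...") / pandas.read_csv('...') and captures the path argument.
PATTERN = re.compile(r"""(?:pd|pandas)\.read_csv\(\s*['"]([^'"]+)['"]""")

def candidate_datasets(notebook_path):
    """Return file arguments of read_csv calls found in a notebook's code cells."""
    with open(notebook_path, encoding="utf-8") as f:
        notebook = json.load(f)
    hits = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            hits.extend(PATTERN.findall("".join(cell.get("source", []))))
    return hits

# Example call (hypothetical notebook path):
# print(candidate_datasets("kernel.ipynb"))   # e.g. ['../input/train.csv']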

7 ACKNOWLEDGMENTS

I would like to express my gratitude to prof. Maarten Marx for the guidance and the trust. I would also like to thank my family for always supporting me in everything I do, my teachers at UvA, and my brilliant classmates for inspiring me. This report is dedicated to the memories of my beloved longtime comrade Nøla and my classmate Glen Cripps, who was always willing to help.

REFERENCES

[1] Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics.

[2] Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: Pretrained language model for scientific text. In EMNLP, 2019.

[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[4] Geraint Duck, Goran Nenadic, Andy Brass, David Robertson, and Robert Stevens. bioNerDS: Exploring bioinformatics' database and software use through literature mining. BMC Bioinformatics, 14:194, 06 2013.

[5] Behnam Ghavimi, Philipp Mayr, Sahar Vahdati, and Christoph Lange. Identifying and improving dataset references in social sciences full texts. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, 20th International Conference on Electronic Publishing, ELPUB, pages 105–114, Göttingen, Germany, 06/2016 2016.

[6] John Giorgi and Gary Bader. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics (Oxford, England), 34, 06 2018.

[7] Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.

[8] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California, June 2016. Association for Computational Linguistics.

[9] Ulf Leser and Jörg Hakenberg. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics, 6(4):357–369, 12 2005.

[10] Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019.

[11] Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2018.

[12] Brigitte Mathiak and Katarina Boland. Challenges in matching dataset citation strings to datasets in social science. D-Lib Magazine, 21, 01 2015.

[13] Derek Miller. Leveraging BERT for extractive text summarization on lectures, 06 2019.

[14] Na Pang, Li Qian, Weimin Lyu, and Jin-Dong Yang. Transfer learning for scientific data chain extraction in small chemical corpus with joint BERT-CRF model. In BIRNDL@SIGIR, 2019.

[15] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proc. of NAACL, 2018.

[16] Anthi Symeonidou. Transfer learning for biomedical named entity recognition with BioBERT. 06 2019.

[17] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771, 2019.

[18] Shijie Wu and Mark Dredze. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT, 04 2019.

[19] Vikas Yadav and Steven Bethard. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics.

[20] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc., 2019.


A APPENDIX

Each token of a new sentence is shown on the left. The label predicted by the model, which signifies the existence of interesting entities, is shown on the right.

Entities identification example for XLNet:

[(’_Our’, ’O’), (’_work’, ’O’), (’_shed’, ’O’), (’s’, ’O’), (’_light’, ’O’), (’_on’, ’O’), (’_the’, ’O’), (’_feasibility’, ’O’), (’_of’, ’O’), (’_incorporating’, ’O’), (’_social’, ’B_Material’), (’_attention’, ’I_Material’), (’_into’, ’O’), (’_traditional’, ’I_Material’), (’_text’, ’I_Material’), (’_mining’, ’I_Material’), (’_tasks’, ’O’), (’.’, ’O’), (’<sep>’, ’O’), (’<cls>’, ’O’)]

Entities identification example for BERT base:

[(’[CLS]’, ’O’), (’One’, ’O’), (’group’, ’O’), (’of’, ’O’), (’features’, ’O’), (’characterises’, ’O’), (’the’, ’O’), (’occurrences’, ’O’), (’of’, ’O’), (’the’, ’O’), (’target’, ’B_Material’), (’entity’, ’I_Material’), (’in’, ’O’), (’the’, ’O’), (’document’, ’I_Material’), (’:’, ’O’), (’the’, ’O’), (’number’, ’O’), (’of’, ’O’), (’oc’, ’B_Material’), (’-’, ’B_Material’), (’currences’, ’I_Material’), (’in’, ’O’), (’different’, ’O’), (’document’, ’B_Material’), (’fields’, ’I_Material’), (’;’, ’O’), (’first’, ’O’), (’and’, ’O’), (’last’, ’O’), (’positions’, ’B_Material’), (’in’, ’O’), (’the’, ’O’), (’document’, ’I_Material’), (’body’, ’I_Material’), (’;’, ’O’), (’the’, ’O’), (’-’, ’O’), (’[SEP]’, ’O’)]

Entities identification examples for SciBERT base:

[(’[CLS]’, ’O’), (’the’, ’O’), (’idea’, ’O’), (’dates’, ’O’), (’back’, ’O’), (’to’, ’O’), (’fox’, ’O’), (’and’, ’O’), (’shaw’, ’O’), (’,’, ’O’), (’who’, ’O’), (’conducted’, ’O’), (’experiments’, ’O’), (’on’, ’O’), (’trec’, ’B_Material’), (’2’, ’I_Material’), (’data’, ’I_Material’), (’.’, ’O’), (’[SEP]’, ’O’)]

[(’[CLS]’, ’O’), (’savoy’, ’O’), (’performed’, ’O’), (’extensive’, ’O’), (’experiments’, ’O’), (’on’, ’O’), (’clef’, ’B_Material’), (’data’, ’I_Material’), (’and’, ’O’), (’showed’, ’O’), (’that’, ’O’), (’combining’, ’O’), (’results’, ’O’), (’from’, ’O’), (’different’, ’O’), (’retrieval’, ’O’), (’models’, ’O’), (’improves’, ’O’), (’ir’, ’O’), (’performance’, ’O’), (’significantly’, ’O’), (’.’, ’O’), (’[SEP]’, ’O’)]

Figure 1: plot of the progress of training and validation loss between epochs 1 and 4 for XLNet.


Figure 2: plot of the progress of training and validation loss between epochs 1 and 4 for BERT base.


Figure 4: plot of the progress of training loss through the batches for RoBERTa.
