
Tracking Dataset use across Conference Papers


Layout: typeset by the author using LaTeX. Cover illustration: Pim Meerdink


Tracking Dataset use across Conference Papers

Dataset mention extraction and clustering to construct a bipartite knowledge graph

Pim Meerdink 11644095

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. M.J. Marx

Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam


Contents

1 Abstract
2 Introduction
3 Background
  3.1 Practical Background
  3.2 Technical Background
    3.2.1 Dataset Mention Extraction
    3.2.2 Entity Clustering
4 Literature Review
5 Method
  5.1 Dataset Mention Extraction
    5.1.1 sciBERT
    5.1.2 Data
    5.1.3 Evaluation
  5.2 Entity Clustering
    5.2.1 Pre-processing and normalisation
    5.2.2 Intra-document clustering
    5.2.3 Inter-document clustering
    5.2.4 Data
    5.2.5 Evaluation
6 Results
  6.1 Dataset Mention Extraction
    6.1.1 Hand annotated data
    6.1.2 Identifying difficult positive sentences
    6.1.3 Identifying difficult negative sentences
  6.2 Entity clustering
    6.2.1 Connected Components
    6.2.2 Normalisation
    6.2.3 Observing similarity in the lexical, semantic and article space
    6.2.4 Diameter
    6.2.5 Louvain
7 Conclusion


1 Abstract

We build a system to extract the information necessary for the construction of a bipartite article-dataset graph. In this graph, dataset node X and article node Y share an edge iff dataset X is used in article Y. We divide the task into two sub-tasks: dataset mention extraction and entity clustering. Dataset mention extraction is an instance of Named Entity Recognition (NER) and entails identifying dataset-referential phrases in scientific text. We train sciBERT, a BERT variant pre-trained on scientific text, to extract dataset-referential phrases on a dataset constructed for this project. sciBERT attains an F1 of 0.88 on the evaluation set and an F1 of 0.84 on our zero-shot test set. Entity clustering entails clustering the extracted dataset mentions so that mentions referring to the same real-life dataset are assigned to the same cluster; it is an instance of cross-document coreference resolution. We develop a task-specific graph-based algorithm that clusters based on lexical, semantic and document-level features. The algorithm attains a B-cubed F1 score of 0.86 on a self-constructed golden standard.


2 Introduction

In the last few years, the scientific community has experienced considerable growth in the number of articles published. As a result, it has become more difficult to find relevant articles quickly and efficiently. This project aims to identify the datasets used in large corpora of scientific articles. This information will then be used to construct a bipartite graph, with scientific articles as nodes on one side and datasets on the other. Dataset x and article y will have an edge between them if and only if dataset x was used in article y. In this way, it is possible to create a knowledge graph of datasets used in scientific literature, which can be applied by, for example, extending a scientific literature search engine with a feature that allows users to explore datasets.

It is important to formulate some conditions as to which entities should be included in the bipartite graph that will be constructed. A dataset node in the graph represents some actual, findable set of data: this definition is meant to exclude more abstract material such as the English language as a whole. It also excludes 'toy' data that is constructed for a particular research goal but is not released to be used by others (often the article will contain a description of how the data was constructed). The definition is meant to include established datasets such as CIFAR-10 (but not CIFAR itself, as that is a collection of datasets), as well as, for example, the Penn Treebank.

This project divides the problem into two sub-problems. The first is one of extraction, where the dataset mentions are extracted; this task is an example of Named Entity Recognition (NER). The second sub-problem is one of clustering, where the extracted dataset mentions are clustered in such a way that each cluster contains the dataset mentions referring to one real-life dataset; this task is an example of cross-document coreference resolution. The problem of scraping and parsing the required scientific articles does not fall within the scope of this project. The first research question that will be investigated is: can we produce a named entity recognition system that can reliably extract dataset mentions from scientific text? The second is: can we cluster dataset mentions from scientific text in such a way that those referring to the same real-world dataset share a cluster? Both of these questions will be answered with the overarching, end-to-end system in mind: the data selection and pre-processing will be done in a way that aligns with real-world application.

3 Background

In the following subsections, some background relevant to the research task at hand will be given. First, in the 'Practical Background' section, the relevance of this project will be explained further, and some current systems that index scientific articles and the datasets used therein will be discussed. Afterwards, relevant established NLP tasks and evaluation measures will be discussed in the 'Technical Background' section.

3.1 Practical Background

Extracting information from scientific articles has become a popular topic among NLP practitioners in recent years. This can, at least in part, be attributed to necessity: there is too much research being published for researchers within a certain domain to manually sift through and filter. Besides research towards information extraction systems for scientific articles, which will be discussed in the 'Literature Review' section, websites and search engines have started to offer related functionality. Google is at the frontier of these developments, having recently launched a dataset search engine. This allows researchers and amateur practitioners to find the relevant data for their endeavours. The Google Dataset Search engine streamlines research, allowing for a faster flow of information and providing a central knowledge base that can be queried. The search engine also allows users to find articles that cite a certain dataset.

Google Dataset Search is closely integrated with Kaggle, which is also owned by Google. Kaggle is an established platform where machine learning practitioners can connect and compete. Kaggle also provides dataset search functionality, allowing a user to find data with which to train and evaluate their models, as well as different competitions with shared datasets on which to optimize performance. Google Dataset Search links directly to Kaggle where possible, and datasets on Kaggle are searchable on Google Dataset Search. Besides Kaggle and Google Dataset Search, there are sites such as paperswithcode.com which provide leaderboards for a wide array of tasks on different datasets. These leaderboards list each algorithm's performance on these datasets, linking to the paper and, as the url suggests, the code.

With Google seemingly in line to monopolize dataset search, and several other sites offering similar functionality, the necessity of and demand for dataset extraction becomes clearer. While systems like Google Dataset Search are extremely useful, they require large knowledge bases to operate. Google Dataset Search allows users to navigate the underlying knowledge graph, in which datasets are connected to their website, download link and some basic information. This project aims to build a knowledge graph that is similar in nature but distinct: a graph that consists of datasets and articles, where use of dataset x in article y results in an edge between their representative nodes. Importantly, this project aims to do so in the absence of a hand-curated list, relying completely on context to extract dataset mentions. We will not look at citations, as Google Dataset Search already allows its users to do, but only at the text of the article. This amounts to a process of discovery: we will not utilize a hand-curated list of datasets that are to be identified, but instead seek to identify used datasets through analysis of the text.

3.2 Technical Background

The task at hand can essentially be divided into two sub-tasks. First, there is dataset mention extraction, which entails identifying the entities in the text that refer to a dataset. Second, there is entity clustering, which entails grouping the identified dataset mentions so that each group contains all the dataset mentions corresponding to one real-world dataset. The text extraction and the scraping of the scientific articles were performed and explored in other projects, and as such will not be discussed in this paper or explored in this project. Constructing the actual graph, given the completion of the first two tasks, is trivial and will therefore not be discussed either.

3.2.1 Dataset Mention Extraction

Named Entity Recognition   The desired input of the algorithm is some scientific text, partitioned into articles. The dataset mention extraction task is in essence a named entity recognition task. Named entity recognition (NER) is an established NLP task in which named entities are extracted from some text (Yadav and Bethard (2019)). These named entities are often classified into pre-defined categories. Named entities, then, are real-world objects that can be identified using some proper noun. Examples include cities (New York, Amsterdam), people (Barack Obama, the Dalai Lama) or organizations (the World Health Organization, Google). Given the sentence 'Google has published another search engine', 'Google' qualifies as a named entity, while 'search engine' does not: it can refer to many different entities, given different context. The named entities to be recognized within this project are datasets. Since we limit our information extraction task to datasets, we do not need to classify our extracted entities further: they are implicitly classified into the sole category that is present.

Despite the clear similarity between the more general task of named entity recognition and dataset mention extraction, dataset mention extraction poses specific challenges. One of the most important and difficult to overcome is sparsity: relatively few sentences within scientific text refer to a dataset. This makes extracting them akin to finding a needle in a haystack, both for a human annotator and for a trained machine learning model. Another specific challenge is the more general difficulty for a network to, devoid of context, distinguish findable datasets from, say, toy data given a name. Often this requires not only the context of the rest of the article, but also some inference from the, often expert, reader.

Zero-shot extraction   We can distinguish different types of targets for our dataset mention extractor. Given some training data X and a dataset to be extracted y, y may or may not have occurred in X. If the network has already 'seen' a certain dataset in its training data, this will clearly impact the network's prediction. Henceforth, settings where the training and test data are split in such a way that the sets of datasets in the training data and in the test data are disjoint will be referred to as 'zero-shot' classification. In zero-shot classification the network can only use syntactic and semantic cues to infer that some entity refers to a dataset, and cannot rely on having 'seen' a particular dataset before. It is important to note that in any real-world application the network will inevitably encounter datasets it has seen before, and zero-shot classification is, as a result, slightly more 'difficult' than real-life application.
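To make the notion of a zero-shot split concrete, the sketch below shows one way such a split could be constructed; it is an illustration only, and the names `sentences` and `mentions_of` are assumptions rather than the project's actual code.

```python
# A sketch (not the project's code) of constructing a zero-shot split: the
# dataset names occurring in test sentences are disjoint from those in training.
import random

def zero_shot_split(sentences, mentions_of, test_fraction=0.1, seed=42):
    """`mentions_of(s)` is assumed to return the dataset names annotated in sentence s."""
    random.seed(seed)
    all_names = sorted({name for s in sentences for name in mentions_of(s)})
    random.shuffle(all_names)
    held_out = set(all_names[: int(len(all_names) * test_fraction)])

    train, test = [], []
    for s in sentences:
        names = set(mentions_of(s))
        if names and names <= held_out:
            test.append(s)        # mentions only held-out datasets
        elif names & held_out:
            continue              # mixed sentence: dropped to keep the sets disjoint
        else:
            train.append(s)
    return train, test
```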

Evaluation   Thorough and in-depth evaluation of the NER task is important, as it serves as a preliminary task to our downstream task. Seeing as NER is an established NLP task, there is a large array of evaluation metrics used to evaluate NER taggers. It is important to choose an evaluation metric that gives us useful insight into the performance of the model and also reflects performance on the downstream task well.

Strict evaluation (such as that proposed for the CoNLL 2003 shared task (Tjong Kim Sang and De Meulder (2003))) requires that both the left and right entity boundaries are exactly correct for a match to count as 'correct' in the subsequent precision and recall scores. The category of the named entity is also required to be correct; however, this is not relevant for the task at hand as there is only one category. The advantage of this is that any match that is counted as positive is definitely usable for our downstream task, and we gain precise insight into what percentage of the dataset mentions are definitely identified and passed on to the next part of our pipeline. The downside is that, due to the nature of our tagging, we may miss relevant tags that could be used but are not identified as exactly correct (tagging 'CIFAR-10' instead of 'CIFAR-10 dataset' will yield a very similar result at the end of our pipeline, but will be counted as incorrect).

There are more complex measures such as the ones proposed for the ACE evaluations (ACE (2008)), which constitute a parameterized weighting of each possible type of match (i.e. boundaries correct, class correct, both correct, incorrect, etc.). These measures will, for our purposes, only serve to complicate things. The ACE measures constitute an unnecessary aggregation, as we can observe the to-be-aggregated measures directly and gain a better understanding of the performance of our network.

Observing these metrics directly is what the SemEval metrics do (Bedmar (2013)). These supply different recall/precision scores for the different types of matches (again, this means boundaries correct, boundary overlap, or incorrect, for example). This means that the NER tagger is not evaluated by, say, a single F1 score, but instead by a few different recall and precision scores that each correspond to an evaluation measure. During this project the following metrics were used:

• Exact: a match counts positively towards the final precision and recall if the boundaries are exactly correct.

• Partial: a match counts positively towards the final precision and recall if there is some overlap between the golden standard and the prediction.

• Beginning: a match counts positively towards the final precision and recall if the beginning of the golden standard and the prediction are the same.

For each of these measures the precision, recall and F1 are computed and considered in the evaluation. Finally, the Jaccard similarity of the set of words identified by the tagger and the set of words in the golden standard was also computed. This gives us an indication of how many of the datasets were 'discovered'. It roughly answers the question: given this text, how many of the datasets mentioned in it were found by our algorithm? This lines up well with the nature of this project: it is one of discovery. A short sketch of these measures follows below.
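As a concrete illustration of the measures above, the following sketch computes exact, beginning and partial matches over (start, end) token spans, together with the Jaccard score. The span representation is an assumption made for illustration, not the project's actual evaluation code.

```python
# Illustrative implementation of the exact / beginning / partial measures and
# the Jaccard score, with mentions represented as (start, end) token offsets.
def exact_match(gold, pred):
    return gold == pred

def begin_match(gold, pred):
    return gold[0] == pred[0]

def partial_match(gold, pred):
    return max(gold[0], pred[0]) < min(gold[1], pred[1])   # any overlap

def precision_recall(gold_spans, pred_spans, match):
    tp_pred = sum(any(match(g, p) for g in gold_spans) for p in pred_spans)
    tp_gold = sum(any(match(g, p) for p in pred_spans) for g in gold_spans)
    precision = tp_pred / len(pred_spans) if pred_spans else 0.0
    recall = tp_gold / len(gold_spans) if gold_spans else 0.0
    return precision, recall

def jaccard(gold_words, pred_words):
    gold_words, pred_words = set(gold_words), set(pred_words)
    union = gold_words | pred_words
    return len(gold_words & pred_words) / len(union) if union else 1.0
```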

It is important to note that not all mistakes are equal. When tagging the dataset mention 'MNIST dataset' our tagger can either (1) correctly identify both words as being part of a dataset mention, (2) only identify 'MNIST' and miss 'dataset', (3) only identify 'dataset' and miss 'MNIST', or (4) miss both. Cases 2, 3 and 4 are indistinguishable by the 'exact' measure, yet case 2 gives us much more information than cases 3 and 4, and could result in downstream performance very similar to case 1. We can sometimes distinguish these cases by using our other evaluation measures (partial and beginning), but with more complex dataset mention structures this quickly becomes unrealistic. Besides this, a false negative from our NER does not necessarily impact the eventual bipartite graph construction: if a dataset mention referring to that particular dataset has already been extracted earlier in the same article, the graph construction will be identical. This creates some disparity between the evaluation metrics and actual real-world performance. On all these accounts our evaluation metrics are stricter and reflect poorer performance than is actually the case.

3.2.2 Entity Clustering

The recognized named entities from the first task will be used for the second task: the grouping of the recognized entities into subsets that contain solely the entities that refer to the same real-life dataset. This task is similar to entity disambiguation, but is in essence one of coreference resolution.

Coreference Resolution   Coreference resolution describes the task of linking (or clustering) the different entities within some text that refer to the same real-life entity (i.e. they are coreferential). At first glance coreference resolution seems directly applicable to the entity clustering task necessary for this project, and strictly speaking we could say that our task falls within the aforementioned definition of coreference resolution. However, it is important to understand the ways in which our task differs from regular coreference resolution. Coreference resolution generally concerns itself with sentence structure and syntactic cues that could be indicative of coreference. Algorithms developed for coreference resolution are often developed to parse sentences like 'Mary liked John, and she would often talk to him', identifying that 'Mary' and 'she' are coreferential, just like 'John' and 'him'. For this project, coreferential entities referring to the same dataset (CIFAR-10 and CIFAR10, for example) which will need to be resolved will not be syntactically related: they will likely be in different articles altogether.

Despite the usual local nature of coreference resolution, research has been conducted on cross-document coreference resolution. Often these techniques concern themselves with persons or organizations. This research is relevant to our task, and serves as inspiration for the methods described in this paper. It is important to keep in mind that the entity clustering task tackled in this project focuses on a very specific type of entity, namely datasets. These names are structured in such a way that we can make unique assumptions, which allow us to construct specific heuristics. This is the primary reason that coreference resolution algorithms will generally not be directly applicable for our purposes.

Entity Disambiguation   Entity disambiguation is the task of linking named entities to some central knowledge base containing entities, and often serves as a downstream task to entity recognition. Again, the task is similar to ours, and we are disambiguating our identified entities in some sense, but our dataset clustering task lacks a central knowledge base to connect the identified entities to. Candidates for such a knowledge base for datasets include Wikidata and the Google Dataset Search engine. Wikidata is not extensive enough, and only has entries for well-known, established datasets. Google Dataset Search does not provide an API that could be used to query the system. Moreover, an essential characteristic of this project is that the task is one of discovery. This means that if we utilise any external knowledge base we will limit our discovery to entities within that knowledge base. For these reasons we will limit our use of external information to enhancing the heuristics for our clustering.


4 Literature Review

This section provides an overview of some of the relevant literature utilised and considered in this project. Amongst this literature are articles pertaining to models utilised in this project (sciBERT) as well as articles released by research teams tackling problems similar to ours.

Surveys   Named entity recognition and coreference resolution are established tasks within Natural Language Processing. Two survey articles (Yadav and Bethard (2019) and Zhang and Zhu (2007)) were consulted in order to gain an understanding of named entity recognition and coreference resolution, respectively. These articles provide an excellent overview of established methods as well as the error metrics that are utilised throughout this project.

BERT and sciBERT   BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model released by the Google research team (Devlin et al. (2018)). As the name states, it makes use of a Transformer, an attention-based model that can learn relations between entities in a text. What distinguishes BERT from previous methods is its bidirectionality: it does not assume a 'direction' in which it reads the text, instead reading a complete sequence at once (so while it is not one-directional, it would be more accurate to say that it is non-directional rather than bi-directional). The paper also introduced a novel method of training by masking words randomly and letting the model predict the masked words based on their context.

BERT comes pre-trained on a corpus of approximately 3.3 billion English words. It can then be trained further for a downstream task, making use of the word context and structures it has learnt previously. BERT achieved state-of-the-art results in a variety of NLP tasks, importantly also in our task of interest, named entity recognition. It does so with relatively little training data, which makes it extremely relevant as this project requires the construction of training data. What is more, there also exists sciBERT, a pre-trained model with the same architecture and training procedure as BERT, except that it is trained on scientific articles (Beltagy et al. (2019)). It seems the logical choice for identifying datasets within our articles.

Rich Context Coleridge Initiative   In 2019 the Coleridge Initiative launched a competition that challenged competitors to identify datasets in scientific articles (Coleridge (2019)). Not all submissions have a scientific article attached, but the submission by Animesh Prasad, Chenglei Si and Min-Yen Kan does (Prasad et al. (2019)). The researchers attempted to tackle the problem by dividing it into two sub-problems, and only addressed the first sub-problem of dataset mention extraction. They used a neural approach; amongst others, a Bidirectional Long Short-Term Memory model using pre-trained word embeddings was used. The team struggled to attain high scores, achieving an F1 of 0.23 on zero-shot classification. They attribute this in part to the extremely skewed ratio of positive to negative examples. The best results for zero-shot classification were obtained using a Convolutional Neural Network over character embeddings ('CNN-BiLSTM'), and adding a CRF layer over the CNN-BiLSTM outputs ('CNN-BiLSTM-CRF').

spERT   Researchers from the RheinMain University of Applied Sciences extended the BERT architecture in 2019; the result is spERT (Span-based Joint Entity and Relation Extraction with Transformer Pre-training) (Eberts and Ulges (2019)). spERT is currently listed as the top-performing model on a number of datasets and tasks, including the sciERC NER task (discussed below) and sciERC entity and relation extraction. This makes it the current state of the art for this task, and as a result much of the information in this article is relevant to us and up to date. Besides containing a valuable 'previous research' section with explanations and comparisons of the approaches of different research groups, the article outlines the spERT algorithm. It is an attention model for joint span-based entity and relation extraction; primarily the entity extraction is relevant to us. The researchers also use a transformer architecture, more specifically BERT. Besides this, they stress the value of negative samples for the relation extraction task.

sciERC   In 2018 Luan et al. attempted to extract knowledge graphs from scientific articles (Luan et al. (2018)). Nodes in this graph represent, for example, algorithms, conferences or other (scientific) entities. These nodes are linked by edges that denote relationships, e.g. 'sciBERT → NER' could mean that sciBERT is 'used for' NER. Identifying entities is an essential step in information extraction, and as such identifying entities was a key task. The algorithm that was built identifies and classifies entities into groups such as 'Material' or 'Algorithm'. However, due to the more general nature of their approach, the definitions used for these classes do not line up very well with ours. This project employs a rather strict definition (it must be a findable, single set of data), while the researchers from the University of Washington chose a more abstract notion. They included entities like 'the English language', for example, as some material that research is conducted on. Another difference between sciERC and our approach is that the data of sciERC is restricted to abstracts of articles, while we use the complete text.


Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores   In June 2019 Yufang Hou et al. from the IBM research labs in Ireland worked towards an information extraction system designed to, given a scientific article, extract tasks, datasets, metrics and scores for automatic leaderboard construction (Hou et al. (2019)). This is a broad and intricate task, and fittingly the researchers chose a complex approach. They first extract a self-defined representation (DocTAET) of each scientific article containing information on the paper's title, abstract, experimental setup and extracted table information such as headers and captions. Afterwards, for all bold scores in an article the score context is defined, which includes the score's column header and table caption. They continue by training two models to independently predict whether a TDM (task, dataset, metric) triple can be extracted from a particular article's DocTAET representation, and whether a <dataset, metric> tuple can be inferred from a score context. These predictions are combined when inferring TDM triples by associating with each TDM triple the score whose context has the highest confidence of 'entailing' the <dataset, metric> tuple. The researchers were able to outperform baselines by a large margin, but note that more effort is needed to extract the best score within a certain paper. The developed system is intricate and complex, and while the task is similar to ours, the scope is larger, which decreases the direct applicability of the experimental results of this paper to ours. The paper focuses on the extraction of scores from tables, and uses the text as context with which to justify confidence in predictions. We use exclusively the text to infer the mentioned datasets.

Cross Document Coreference Resolution   In a 2009 paper, James Mayfield et al. describe their approach to the problem of cross-document coreference resolution (Mayfield et al. (2009)). This approach focuses on filtering and featurizing the referential entities, which the authors consider the second and third step in the process of cross-document coreference resolution. Most important is their investigation into possibly useful features for a machine-learning-based approach to the problem. The paper concludes that the most important features describe lexical similarity, but that other features based on prior and external knowledge can also be efficacious.

5 Method

This section describes the method for both sub-problems identified in the 'Technical Background' section. First, the training and data of our NER model are described. The second subsection contains an in-depth explanation of the dataset mention clustering algorithm.


5.1 Dataset Mention Extraction

5.1.1 sciBERT

The dataset mention extraction task was performed using Google's BERT architecture. As mentioned previously, BERT is available pre-trained on 1.14M scientific articles; this variant of the model is called sciBERT. This meant that relatively little training data was needed, as the latent space that sciBERT learned during this pre-training is relevant to our downstream task of named entity recognition.

The scibert-scivocab-cased variant of the model was used, meaning that (1) the model was trained to distinguish capitalized and non-capitalized letters and (2) the model was trained using a vocabulary based on scientific texts, which is distinct from that used for regular BERT, pre-trained on Wikipedia articles. While the uncased variant of the model generally performs better, the choice was made to utilise a case-sensitive model, as capitalisation can be indicative of whether a phrase refers to a dataset: words referring to datasets are often capitalized. The scivocab variant was used because we will also be applying our model to scientific articles, and this vocabulary was therefore expected to be more relevant to the task.

The model was downloaded as a PyTorch model from the AllenAI sciBERT GitHub repository (Beltagy et al. (2019)). The huggingface transformers framework (Wolf et al. (2019)) was used to interact with the model. The model configuration and architecture were not altered; they can be found in the sciBERT paper (Beltagy et al. (2019)). Unless explicitly mentioned otherwise in the results, the following hyperparameters were used for training: a learning rate of 5e-6 for the Adam optimizer, with a batch size of 16. Training lasted 15 epochs, and checkpoints were saved every 300 training steps. Gradient clipping was used, with a maximum gradient norm of 1.
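A minimal sketch of a training setup matching these hyperparameters is given below, using the huggingface transformers and PyTorch APIs. `train_loader` and the BIO label set are assumed to come from the data described in the next subsection; this is an illustration rather than the exact training script used in the project.

```python
# Sketch of the fine-tuning setup with the hyperparameters above: Adam with a
# learning rate of 5e-6, batch size 16 (set in the data loader), 15 epochs,
# gradient clipping at 1.0 and a checkpoint every 300 steps.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_cased", num_labels=3)      # B / I / O tags
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

model.train()
step = 0
for epoch in range(15):
    for batch in train_loader:                           # dicts with input_ids, attention_mask, labels
        loss = model(**batch).loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step % 300 == 0:
            model.save_pretrained(f"checkpoints/step-{step}")
```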

5.1.2 Data

In order to train sciBERT to perform named entity recognition, sentences from scientific articles containing references to real-life datasets were required. Since this data is quite specific and niche, it was necessary to label our own data by hand. Multiple steps were involved in this process. First, around 15 000 scientific articles from NIPS, VISION, SIGIR and SDM were scraped and converted to text. More information about this process and these datasets can be found in Heddes (2020). This text was tokenized into sentences using spaCy's sentence splitting functionality. Afterwards, certain trigger words were identified that are indicative of a sentence containing a dataset; examples are 'data', 'treebank' and 'test'. Sentences containing these words were lifted from the text. This search was done with a regular expression, which can be found in the appendix. An annotation guideline was written; this can also be found in the appendix. This annotation guideline was then distributed to six annotators. First, a training round of annotation was performed, in which all annotators annotated the same data; the differences in annotations were discussed and conflicts in the annotation resolved. Finally, each annotator annotated 500 sentences containing a trigger word, labeling the datasets in each sentence. This was done through doccano, a document annotation tool. Fifty of the sentences annotated by each annotator overlapped with another annotator's sentences. This allowed for the calculation of a kappa score, indicative of agreement between annotators. During this round of annotation 651 dataset mentions were identified. The second round of annotation was performed in semi-supervised fashion: using the first round of annotations, a rule-based model was constructed to identify datasets within scientific text. This rule-based model was used to find probable matches in scientific articles, and these were manually corrected. More information about this process can be found in Heddes (2020). The process resulted in 6000 labeled sentences, containing 2864 dataset mentions. The kappa score for these sentences was 0.72, indicative of a good level of agreement between annotators. The annotated sentences were converted to BIO format.
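For illustration, a sketch of the candidate-sentence selection step is given below: spaCy sentence splitting followed by a trigger-word filter. The regular expression shown is a hypothetical stand-in; the actual expression used in the project is given in the appendix.

```python
# Illustrative candidate-sentence selection: spaCy sentence splitting followed
# by a trigger-word filter. The regex below is a placeholder, not the real one.
import re
import spacy

nlp = spacy.load("en_core_web_sm")
TRIGGER = re.compile(r"\b(data(set)?|corpus|treebank|benchmark|test)\b", re.IGNORECASE)

def candidate_sentences(article_text):
    doc = nlp(article_text)
    return [sent.text for sent in doc.sents if TRIGGER.search(sent.text)]
```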

5.1.3 Evaluation

Train, Evaluation, Test split   In order to evaluate the performance of our model, a train, evaluation and zero-shot test split was made. The sentences in the data were split into three disjoint sets: first, a training set on which the model was trained; second, an evaluation set on which all saved checkpoints were evaluated in order to identify the best performing model. Finally, the best performing model on the evaluation set was evaluated on the zero-shot test set. As mentioned previously, zero-shot classification entails that no dataset mentions from the training data are present in the testing data. This means the model cannot 'recognize' dataset names, but must instead infer from lexical, semantic and contextual cues whether a given sentence contains a dataset mention or not. More information on how this data was constructed can be found in Heddes (2020). Our training set consisted of 4640 sentences, of which 2168 contained a dataset mention; our evaluation set of 1160 sentences, of which 543 contained a dataset mention. The zero-shot test set contained 200 sentences, all of which contained a dataset mention.

Difficult positive sentences   To understand the performance of our model, the sentences within our training data were split into a 'difficult' and an 'easy' category, on which the model was separately trained and evaluated. This allowed insight into the model's performance with regard to identifying true positives. The distinction between difficult and easy sentences was made on the basis of the number of positively tagged words in a sentence: sentences with more than 4 positively tagged words were classified as difficult, and the rest as easy. The reason behind this distinction lies largely in the annotation decision to consider an 'ellipsis' one entity. This means that, for example, in the sentence 'We used the VOC 2007, 2008 and 2009 collections' the phrase 'VOC 2007, 2008 and 2009 collections' was tagged as one dataset entity mention, as individual elements of the enumeration are nonsensical without the context provided by the other elements. If '2008' were labeled as an individual entity it would provide inadequate information for the downstream task, as the entity '2008' alone does not allow us to infer that it refers to VOC 2008. This is the same method of ellipsis labeling as utilized in Luan et al. (2018). For this reason, the distinction between harder and easier sentences was based on the length of a particular entity, as longer entities were likely ellipses, and these were predicted to be harder for the model to infer. Besides this, it is of course possible that a sentence contains multiple shorter dataset mentions. This type of sentence was also considered 'difficult', as the sentence structure was assumed to be more complex.

Positive to negative ratio   Finally, research such as Prasad et al. (2019) has noted that a key challenge in dataset mention extraction is the extremely skewed ratio of positive to negative examples. In order to investigate our model's performance under more realistic conditions, the ratio of positive to negative examples was altered by systematically adding negative sentences to our data. This was done by extracting sentences that did not contain the aforementioned trigger words from 500 Association for Computational Linguistics (ACL) papers. The model was then trained and evaluated on 4 sets of data. First, the model was trained on only sentences containing dataset mentions. Next, the model was trained on data with negative sentences containing trigger words (this data was identical to the data used for the train/evaluation/zero-shot split). Afterwards, the model was trained on data with negative sentences extracted from ACL, i.e. not containing trigger words. Finally, the model was trained on data containing negative sentences with and without trigger words. It was expected that the model would struggle to correctly label negative sentences with trigger words, and have little difficulty correctly labeling negative sentences without trigger words.

5.2 Entity Clustering

For the entity clustering, a task-specific algorithm was developed, loosely based on the 4-step process described in Dutta and Weikum (2015). The choice to diverge from established practice and implement a custom solution was made in large part due to the specific nature of the entities to be clustered (i.e. they all describe datasets). This aspect of the problem allows for specific assumptions and steps that improve performance significantly. The developed algorithm uses lexical as well as contextual features to define a similarity between any two entities within its input data; this allows for the construction of a graph to which clustering techniques can be applied.

5.2.1 Pre-processing and normalisation

Certain pre-processing steps are essential to the performance of our algorithm, as the same dataset can be identified by slightly differing phrases. These pre-processing steps are taken to ensure that these differences can be resolved. The first step involves splitting the input data. As this algorithm is designed to solve the downstream task of the dataset mention extractor, the input data complies with the annotation guidelines developed for the dataset mention extraction task, as this best describes the distribution that our mention extractor has learned. As such, the input will label ellipses and certain summations as one entity (e.g. 'the VOC 2007, 2008 datasets'). It is therefore necessary to split our entities on certain characters (e.g. ',' or 'and'), as there are sometimes multiple entities within one labeled entity in the input data.[1]

Next, normalisation is performed. This normalisation consists of:

• Lowercasing all letters; this removes the distinction between CIFAR-10, Cifar-10 and cifar-10.

• Removing dashes; this removes the distinction between cifar-10 and cifar10.

• Removing spaces before numbers; this removes the distinction between cifar10 and cifar 10.

• Removing certain words like 'dataset' and 'corpus'; this removes the distinction between cifar-10 and cifar-10 dataset.

• Removing citations, implemented as removing everything between brackets; this removes the distinction between VOC 2007 [10] and VOC 2007.

These pre-processing steps allow for the creation of a dataframe indexed by an entity id, where each row describes an entity. Furthermore, this dataframe includes information that, for each entity, identifies the sentence from which the mention was extracted, the article from which the mention was extracted, and the original entity it was 'split' from. Finally, a column containing the numerical characters within the dataset mention string is added to the dataframe. This column is used during the clustering steps.

[1] No extra steps were taken to split ellipses in such a way that each split contains all information that is implicit in the text (i.e. the entity 'VOC 2007, 2008 datasets' is not split into 'VOC 2007' and 'VOC 2008'); expanding the algorithm with this functionality could be a focus for future research.


Numbers are an essential part of dataset names, as they can denote years or version numbers. This means they can disambiguate non-coreferential entities that are lexically extremely similar (CIFAR-100, CIFAR-10), and as such they are considered separately in our algorithm.
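A minimal sketch of these normalisation rules, together with the extraction of the numeric characters used for the number exclusion step, could look as follows; the stop-word list is an illustrative assumption.

```python
# Illustrative normalisation and number extraction for dataset mentions.
import re

STOP_TOKENS = {"dataset", "datasets", "corpus", "collection"}   # assumed stop list

def normalise(mention: str) -> str:
    mention = re.sub(r"[\(\[].*?[\)\]]", "", mention)   # drop citations between brackets
    mention = mention.lower().replace("-", "")           # lowercase, remove dashes
    mention = re.sub(r"\s+(?=\d)", "", mention)          # remove spaces before numbers
    tokens = [t for t in mention.split() if t not in STOP_TOKENS]
    return " ".join(tokens).strip()

def numbers_of(mention: str) -> str:
    return "".join(re.findall(r"\d", mention))           # e.g. "cifar 10" -> "10"

print(normalise("CIFAR-10 dataset"), normalise("cifar 10"))   # both -> "cifar10"
```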

In order to evaluate the quality of the pre-processing and understand its effect on the intra-cluster variance in string similarity, the distribution of the mean Levenshtein similarity ratio within clusters was observed before and after normalisation. The Levenshtein similarity ratio of two strings a and b is defined as follows:

$$\mathrm{lev\_ratio}(a, b) = \frac{(|a| + |b|) - \mathrm{lev}(a, b)}{|a| + |b|}$$

where lev(a, b) denotes the Levenshtein distance between strings a and b. This gives a value between 0 and 1, reported here on a 0-100 scale, with 100 denoting maximal similarity.
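For reference, a small self-contained sketch of this ratio, using a plain dynamic-programming Levenshtein distance rather than any particular library:

```python
# Levenshtein distance and the similarity ratio above, on a 0-100 scale.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lev_ratio(a: str, b: str) -> float:
    total = len(a) + len(b)
    return 100.0 if total == 0 else 100.0 * (total - levenshtein(a, b)) / total

print(lev_ratio("cifar10", "cifar100"))   # roughly 93.3
```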

5.2.2 Intra-document clustering

The first clustering occurs at the intra-document level. This clustering partitions the entities within our data, where each subset M within this partition is a set of coreferential entities that occurred within the same article. This intra-document clustering is performed for two reasons. Primarily, it decreases the amount of more expensive computation necessary later in the algorithm. As we can identify coreferential entities at article level, we need not compute similarity for all entities with respect to all other entities. Instead, we can do so at M-group level, constructing feature vectors for each of our M groups. It is thus not necessary to compute similarities for all n entities within our data, but only for the m identified M groups. Besides this, with the preliminary knowledge that two entities originate from the same article, we are afforded more leniency with regard to the required similarity for two entities to be classified as coreferential.

The intra-document clustering conceptually involves a groupby operation on the article column of our dataframe, applying lexical clustering within each group. The lexical clustering applied depends on the character n-gram tf-idf cosine similarity between the entities. That is to say, the character n-gram tf-idf vectors over all of the entities within our data are computed. Considering the n-gram frequency over all of the entities discounts more common words within our dataset, and instead weights overlap of rare n-grams more heavily. The tf-idf vectors have 3708 dimensions and are extremely sparse. Our n-grams were constructed using n = 3.
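A sketch of this vectorisation step with scikit-learn is given below; the toy mention list stands in for the full set of extracted, normalised mentions.

```python
# Character 3-gram tf-idf vectorisation with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

all_mentions = ["cifar10", "cifar100", "penn treebank", "voc 2007"]   # toy stand-in
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(all_mentions)    # one sparse row per entity
print(tfidf.shape)
```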

We construct a matrix L which contains, for each entity within an article, the corresponding tf-idf vector, row-wise. We normalise this matrix row-wise, yielding $\bar{L}$, and multiply it with its transpose $\bar{L}^{T}$, resulting in S. S denotes, for each article, the lexical cosine similarity between each pair of entities within that article:

$$S = \bar{L}\,\bar{L}^{T}$$

Besides this, a number exclusion matrix N is constructed. N denotes, for column i and row j, whether entities i and j contain the same numbers. This matrix is constructed using the number column of our dataframe and plays an essential role in partitioning the entities correctly. Both S and N are of dimensionality h × h, where h is the number of dataset-referential entities within that article. We perform element-wise multiplication of the similarity matrix S and the binary number exclusion matrix N. This eliminates future edges between entities containing different numbers.

$$A = (S \ast N) > \theta$$

All elements below a certain threshold θ are dropped. This threshold was set to 0.75, and the resultant matrix is denoted by A. The now binary matrix A can be viewed as the adjacency matrix of an undirected graph G: we add edges to G as follows: $G_{ij} = 1$ iff $A_{i,j} = 1$. Each connected component of G now corresponds to an equivalence class of intra-document coreferring entities.
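The intra-document step can be sketched as follows, assuming a dense array of tf-idf rows and one number string per entity for a single article; this is an illustrative reconstruction, not the project's exact code.

```python
# Cosine similarities between tf-idf rows, masked by the number-exclusion
# matrix, thresholded at theta, and partitioned into connected components.
import numpy as np
from scipy.sparse.csgraph import connected_components

def intra_document_clusters(article_tfidf, numbers, theta=0.75):
    L = np.asarray(article_tfidf, dtype=float)                     # dense rows for this article
    L = L / np.maximum(np.linalg.norm(L, axis=1, keepdims=True), 1e-12)
    S = L @ L.T                                                    # pairwise cosine similarity
    N = np.array([[a == b for b in numbers] for a in numbers])     # same-number mask
    A = (S * N) > theta                                            # binary adjacency matrix
    _, labels = connected_components(A, directed=False)
    return labels                                                  # M-group id per entity
```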

5.2.3 Inter-document clustering

Next, the algorithm clusters the identified M-groups across articles. This clustering is performed by defining similarity between M groups as a function of lexical, semantic and document-level features.

Lexical   The primary feature in our clustering is lexical: we capture the structure and characters of our entities in n-gram tf-idf vectors between which we can calculate similarity. Average pooling of the n-gram tf-idf vectors of all entities within each group M is performed; the result is the vector that expresses the lexical features of that M group. Again, these vectors are of length 3708 and remain extremely sparse. In similar fashion to the intra-document clustering, cosine similarity is used to express the similarity between two lexical vectors describing two M-groups.

Semantic   In order to express the direct context and semantic features of the entities, we attempt to capture the semantic meaning of the sentence in which a given entity was identified. This is done through sciBERT sentence embeddings, which can be retrieved through the use of 'BERT as a service' (Xiao (2018)) using the sciBERT vocabulary and model weights. These embeddings are retrieved from the second-to-last hidden layer of all of the tokens in the sentence, to which average pooling is applied in order to obtain a fixed-length representation of the sentence. This representation reflects the latent space identified by sciBERT during its training for NER, which gives semantic cues about the sentence, allowing us to utilize the semantics of the sentence in the form of a vector of length 768. These vectors are retrieved for all entities within the M group, to which average pooling is applied in order to represent the semantic features of the M group in one vector.
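The sketch below shows an equivalent way to obtain such sentence vectors with the huggingface transformers library rather than the bert-as-a-service setup used in the project; the base sciBERT checkpoint is assumed here.

```python
# Mean-pooled second-to-last hidden layer of sciBERT as a 768-dimensional
# sentence vector.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_cased",
                                  output_hidden_states=True)

@torch.no_grad()
def sentence_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    hidden_states = model(**inputs).hidden_states   # embeddings plus one entry per layer
    second_to_last = hidden_states[-2][0]           # (sequence_length, 768)
    return second_to_last.mean(dim=0)               # average pooling over tokens
```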

Document   Finally, document-level features are constructed in the form of vectors that embed article content in a 64-dimensional space. These embeddings were generated by training gensim's doc2vec model (Řehůřek and Sojka (2010)) on 10 000 articles extracted from the NIPS, SDM, SIGIR and VISION conferences. This model is an implementation of the model described by Le and Mikolov (2014). The model uses paragraph-level embeddings to outperform other document embedding techniques, such as averaging of word embeddings. Words that occurred fewer than two times were discarded, the target vector size was set to 64, and the model was trained for 16 epochs.
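A sketch of this doc2vec setup with gensim is shown below; the two-sentence toy corpus merely stands in for the roughly 10 000 tokenised conference articles used in the project.

```python
# doc2vec document embeddings with gensim: vector size 64, min_count 2, 16 epochs.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

articles = [
    ["we", "train", "on", "the", "cifar10", "dataset"],
    ["results", "on", "the", "penn", "treebank", "corpus"],
]   # toy stand-in for the full tokenised article corpus
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(articles)]
model = Doc2Vec(corpus, vector_size=64, min_count=2, epochs=16)
doc_vector = model.dv[0]   # 64-dimensional embedding of the first article
```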

Linear Interpolation   Using these embeddings, we construct three similarity matrices, denoting the cosine similarities between different M groups in three different spaces: lexical, semantic and article. These matrices are referred to as L, S and D, respectively. We also consider a number exclusion matrix N that is constructed over all M groups. It is built by taking the number of the first entity in each M-group (which must necessarily be shared by all other entities in that group due to the number exclusion step during the intra-document clustering) and assigning this as the representative number for that M group. The number exclusion matrix then denotes, for $N_{i,j}$, whether M-groups i and j have the same representative number.

These matrices all have dimensions m × m, where m is the number of M groups identified during the intra-document clustering step. They are combined through linear interpolation in order to obtain a weight matrix W that will serve as the adjacency matrix for the graph to which our clustering will be applied. This is done as follows:

$$W = ((\lambda L + \sigma S + \delta D) \ast N) > \psi$$

where λ, σ and δ are the linear interpolation coefficients with which the matrices are multiplied before being summed. The result of this summation is then multiplied element-wise by the binary number exclusion matrix N, which ensures that edges between M-groups with different numbers are not possible. Finally, all values below a certain threshold ψ are set to zero in W.
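The interpolation and masking step can be sketched as follows; the coefficient values shown are illustrative placeholders, as the actual values were chosen by the grid search described below.

```python
# Linear interpolation of the lexical (L), semantic (S) and document (D)
# similarity matrices, masked by the number-exclusion matrix N and thresholded.
import numpy as np

def combine(L, S, D, N, lam=0.5, sigma=0.3, delta=0.2, psi=1.0):
    W = (lam * L + sigma * S + delta * D) * N   # element-wise number masking
    W[W < psi] = 0.0                            # drop edges below the threshold
    return W
```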

Graph clustering   In the final stage of the algorithm, we interpret the matrix W as an adjacency matrix and construct a graph that connects nodes representing M groups. In order to establish a baseline, a rigid and interpretable method was used to cluster the M groups: after each edge in W below a certain threshold (ψ, set to 1) was set to 0, the connected components algorithm was executed on the resultant graph. A grid search was performed across the 3 linear interpolation parameters λ, σ and δ, while ψ was kept constant at 1.

Besides this rigid baseline method, the Louvain graph clustering algorithm was executed on the non-thresholded graph. The Louvain algorithm is commonly used on social networks for identifying communities. It does so by optimising the modularity, a measure of the interconnectedness within communities (Blondel et al. (2008)).
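Both clustering variants can be sketched with networkx as follows; the project may have used a different Louvain implementation, and the one bundled with networkx is assumed here.

```python
# Connected components on the thresholded graph (baseline) and Louvain
# community detection on the weighted, non-thresholded graph.
import networkx as nx
from networkx.algorithms.community import louvain_communities

def cluster_m_groups(W_thresholded, W_weighted):
    G = nx.from_numpy_array(W_thresholded)
    G.remove_edges_from(nx.selfloop_edges(G))     # ignore self-similarity
    components = list(nx.connected_components(G))

    G_w = nx.from_numpy_array(W_weighted)
    G_w.remove_edges_from(nx.selfloop_edges(G_w))
    communities = louvain_communities(G_w, weight="weight")
    return components, communities
```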

5.2.4 Data

A golden standard was constructed from the labeled data for the dataset mention extraction task. In total, this golden standard contains 1042 dataset mentions, divided into 238 clusters. The golden standard was constructed in semi-supervised fashion: the described algorithm was executed on the dataset mention extraction dataset, and the output of the algorithm was manually corrected. In total the data contained 1405 entities, categorized into 238 golden standard categories; thus 238 unique datasets were included in our data. As can be seen in figure 1, the category size distribution is heavily skewed towards small clusters, with the mode of the distribution being 2 and the maximum 55.

5.2.5 Evaluation

Evaluation of clustering is famously challenging, and there is a vast array of literature on the topic. Amigó et al. published an extensive analysis of cluster evaluation metrics and their ability to satisfy certain mathematical requirements such as cluster homogeneity and cluster completeness (Amigó et al. (2009)). The paper concludes that the B-cubed metric is the only evaluation metric to satisfy all constraints, and that the output metrics are intuitive and clear. For these reasons, we will be using the B-cubed metric for the evaluation of our clustering.

[Figure 1: Distribution of category sizes in our entity clustering golden standard]

B-cubed   In short, the B-cubed algorithm (i.e. the algorithm executed in order to obtain the B-cubed scores) computes precision and recall independently. The recall and precision are computed for each element, and for both metrics the average is returned. Let us define the 'category' of an element as the set it was an element of in the golden standard. The 'cluster' of an element is the set it was assigned by the algorithm that we are evaluating. For each element, the precision is computed by identifying what percentage of the elements in its cluster are also in its category. Symmetrically, the recall for each element is computed by identifying what percentage of the elements in its category are also in its cluster. The harmonic mean of these scores can then be taken in order to compute an F1 score. Although B-cubed was originally described and is commonly considered as an algorithm, it can also be defined as a function. L(e) and C(e) denote the category and cluster of item e, respectively. Formally, we define the correctness of two items as follows:

$$\mathrm{Correctness}(e, e') = \begin{cases} 1 & \text{iff } L(e) = L(e') \longleftrightarrow C(e) = C(e') \\ 0 & \text{otherwise} \end{cases}$$

This amounts to two items being correctly related when they share a category if and only if they appear in the same cluster. The recall of an item is the proportion of items in its category that are assigned the item's cluster. The precision is symmetrical, in that we simply swap 'category' and 'cluster'. This can be formalised as follows:

$$\mathrm{Precision}_{\mathrm{BCubed}} = \mathrm{Avg}_{e}\big[\mathrm{Avg}_{e' : C(e) = C(e')}\,\mathrm{Correctness}(e, e')\big]$$
$$\mathrm{Recall}_{\mathrm{BCubed}} = \mathrm{Avg}_{e}\big[\mathrm{Avg}_{e' : L(e) = L(e')}\,\mathrm{Correctness}(e, e')\big]$$
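For the non-overlapping case (i.e. without the multiplicity extension discussed below), these definitions translate directly into the following sketch, where `clusters` and `categories` are assumed to map item ids to cluster and category labels.

```python
# B-cubed precision, recall and F1 for a non-overlapping clustering.
def bcubed(clusters: dict, categories: dict):
    def avg(values):
        values = list(values)
        return sum(values) / len(values)

    def item_precision(e):
        same_cluster = [x for x in clusters if clusters[x] == clusters[e]]
        return avg(categories[x] == categories[e] for x in same_cluster)

    def item_recall(e):
        same_category = [x for x in categories if categories[x] == categories[e]]
        return avg(clusters[x] == clusters[e] for x in same_category)

    precision = avg(item_precision(e) for e in clusters)
    recall = avg(item_recall(e) for e in clusters)
    return precision, recall, 2 * precision * recall / (precision + recall)
```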

Sentence Level   The evaluation of our clustering was performed using the B-cubed algorithm at sentence level. That is to say, after clustering our entities we map the clustered entities back to their sentence of occurrence, identified by a sentence id. These clustered sentences are then compared to our golden standard, which is composed of clustered sentence id's. This is done because the splitting, and subsequently the assigning of id's to our entities, occurs within our algorithm, and our golden standard should be independent of any steps taken by our algorithm. This means that any sentence can belong to multiple clusters; this characteristic of our clustering is dubbed multiplicity.

This sentence-level evaluation can result in some disparity. For example, it is possible that a sentence contains both 'pascal voc 2007' and 'pascal voc 2012', and that these are subsequently assigned each other's golden standard clusters. This means that 'pascal voc 2007' is assigned the cluster that is representative of the 'pascal voc 2012' dataset, and vice versa. In this situation, while our clustering is clearly erroneous, our evaluation is unable to identify this mistake. This is because the sentence will be assigned to both the 2012 and 2007 clusters, which is also the case in the golden standard. This is an unfortunate but necessary concession; the disparity between actual performance and reported performance is small, as these cases will be rare due to both the nature of our data and our algorithm.

Many clustering evaluation metrics struggle with multiplicity. B-cubed is no different; however, Amigó et al. proposed an extension to the B-cubed metric which allows for overlapping clusters. This is done through a redefinition of the correctness function so that it allows for multiplicity, as overlapping clusterings cannot be evaluated through a binary function. Details and analysis of this redefinition can be found in Amigó et al. (2009). This is the variant that we utilise to evaluate our clustering.

Graph Diameter   In order to understand the structure of our graph at different threshold values, the diameters of the connected components were calculated and visualised. The diameter of a graph is defined as the maximum eccentricity. The eccentricity of a node is defined as the longest shortest distance to any other connected node within the same graph. The eccentricity is discrete, in that it is a count of the edges within that shortest path, and not a summation of the weights.


For a general graph, the fastest known method to calculate the diameter of a graph G with vertices V and edges E is by calculating the all-pairs shortest paths (APSP) and subsequently taking the maximum value (Ancona et al. (2018)). The complexity of this algorithm is O(V^3), and as such there were computational limits to the thresholds for which we could compute the diameter: a low value (i.e. < 0.15) meant a nearly complete graph of high density, whose diameter we were unable to calculate due to computational constraints.
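A sketch of the per-component diameter computation with networkx:

```python
# Per-component diameter (maximum eccentricity, counted in edges) with networkx.
import networkx as nx

def component_diameters(G: nx.Graph):
    return [nx.diameter(G.subgraph(nodes)) for nodes in nx.connected_components(G)]
```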

6 Results

6.1 Dataset Mention Extraction

6.1.1 Hand annotated data

The primary method of evaluation for our NER model was a train, evaluation, test split, where the test set was 'zero-shot'. The results can be observed in table 1, where the precision, recall and F1 on our hand-annotated data are displayed for the exact and beginning evaluation measures.

                      Exact                  Begin
                      Prec.  Recall  F1      Prec.  Recall  F1
Train set             0.92   0.95    0.93    0.95   0.97    0.96
Eval set              0.90   0.86    0.88    0.96   0.90    0.93
Test set (Zero-shot)  0.82   0.88    0.84    0.88   0.93    0.90

Table 1: Precision, recall and F1 scores for the best performing model on the evaluation set. The scores for both the Exact and Beginning measures are reported.

In figure 2 the performance of the saved checkpoints during training on both the training and testing data is displayed. The network's peak performance on the testing data was achieved around training step 2400.


[Figure 2: Performance of the network on the test and train set during training (evaluation measure: exact)]

Below are the cross-validation results for each of the 5 folds that were constructed. The standard deviation for both the recall and the precision, and as a result the F1, is 0.01; the recall is consistently slightly higher than the precision. This cross-validation was performed on the evaluation set.

(27)

Fold      Prec.  Recall  F1
0         0.86   0.87    0.86
1         0.83   0.87    0.85
2         0.85   0.88    0.86
3         0.86   0.88    0.87
4         0.82   0.89    0.85
Mean      0.84   0.88    0.86
std. (σ)  0.01   0.01    0.01

Table 2: Precision, recall and F1 (evaluation measure: exact) for each of the folds of the cross-validation, with the mean and standard deviation

As can be seen in table 1, our network was able to learn the training set distribution well, obtaining an F1 of 0.93 with the evaluation measure 'exact', the recall being slightly higher than the precision. For the 'beginning' measure the network learned the training set distribution almost entirely, resulting in a very tight fit. In general, the scores for the beginning measure are consistently a few points higher than for the exact measure. This is intuitive, as the measure is slightly more lenient; in fact, the data that has been correctly classified according to the 'exact' measure is a subset of the data that is correct according to the 'beginning' measure.

The distribution of our training data carries over well to our evaluation and zero-shot test sets, where the F1 drops by no more than 9 points under the exact evaluation measure. The most important and telling scores are those for zero-shot classification, as this is both the most challenging evaluation and the most indicative of real-world performance. On this zero-shot task our network attained precision, recall and F1 scores that are not much lower than on our evaluation split. This shows that, despite the tight fit on the training data, the network has learned to identify dataset mentions through syntactic and semantic cues, rather than by recognising datasets it has previously encountered.

Figure 2 shows the network's performance on the test set at the checkpoints saved during training. We see the network attain high F1 scores early on: its F1, precision and recall curves rise quickly during the first 1200 training steps (approximately 4 epochs). While the network continues to develop a tighter fit to the training data in subsequent steps, this does not negatively impact its performance on the test data, i.e. the network does not overfit.

Finally, the 5-fold cross-validation scores displayed in table 2 show little variance, indicative of the homogeneity of the data and the general fit of our model. All measures have a standard deviation of 0.01. In terms of F1 the model performed slightly worse than the evaluation scores displayed in table 1, but slightly better than the zero-shot results. This is likely due to the explicit optimisation of the network on the evaluation set, which leads to slightly higher scores there, while the zero-shot classification is slightly more difficult due to the lack of overlap in datasets.

We can conclude that the network undergoes stable training, finding an optimum relatively quickly and deviating little from it afterwards. The network's tighter fit on the training data does not negatively impact its performance on the testing data. The data distribution is homogeneous, as we see little variance in performance across the 5 folds during cross-validation.

6.1.2 Identifying difficult positive sentences

Table 3 displays the scores obtained by our network on the easy/hard split data, for the evaluation measures exact and beginning. For reference, the scores of our network on all of the training data (i.e. both the easy and hard sets) are also displayed.

                             Exact                   Begin
Trained on   Evaluated on    Prec.  Recall  F1       Prec.  Recall  F1
Easy         Easy            0.84   0.89    0.86     0.90   0.90    0.90
Hard         Hard            0.84   0.87    0.85     0.89   0.90    0.90
Easy         Hard            0.51   0.52    0.51     0.73   0.72    0.72
Hard         Easy            0.83   0.83    0.83     0.85   0.87    0.86
Both         Both            0.90   0.86    0.88     0.96   0.90    0.93

Table 3: Precision, recall and F1 scores for the best performing models on the easy and hard data

When observing the performance of our model on the easy and hard positive-split data in table 3, it becomes apparent that the model has little added difficulty correctly classifying dataset mentions that occur in sentences with more than 4 positively labelled instances. This means that the network is also able to understand and interpret ellipses and summations: these more complex structures are not harder for the network to identify than simple one- or two-word dataset mentions. Such structures and patterns are difficult even for human annotators to parse and classify consistently, which makes the network's ability to understand the nuances of the labelling task significant.

While we observe no significant difference in F1 scores between the models trained and evaluated on the same set (easy or hard), we do see significant differences when we train on one set and evaluate on the other. Observing the scores for the model trained on the 'easy' data and evaluated on 'hard', we see a significant drop in performance: the model does not sufficiently understand the more complex sentence structures it is asked to classify, and as a result the F1 score for the 'exact' measure drops by about 35 points, to 0.51. Generally, the scores for the 'exact' and 'beginning' measures differ by around 5 points; for the model trained on easy data and evaluated on hard data, however, the gap is around 20 points. This indicates that the model can fairly consistently identify the beginnings of these more complex structures, but not the remaining words within the mention. It can identify that in 'the pascal voc 2012, 2017 sets' the word 'pascal' marks the beginning of a dataset mention, because this pattern also occurs in ellipses and simpler dataset mentions ('the pascal voc 2012 dataset', for example). It subsequently fails, however, to consistently label the rest of the mention correctly, as it does not recognise these types of sentence structures.

Furthermore, the model trained on hard data and evaluated on easy data performs well, obtaining F1 scores comparable to those achieved when training and evaluating on the same data. This shows that the patterns found in the hard data generalise well to those in the easy data, even though the reverse is not true. Finally, we see that the model performs best when trained and evaluated on all of the data, as can be seen by comparing the last row to the others. This is likely due to the larger amount of training data, as well as the fact that the patterns found in the hard data generalise well to the easy data.

6.1.3 Identifying difficult negative sentences

Table 4 shows the performance of our network with varying amounts and types of negative sentences. The row 'Negatives w/ TW' corresponds to the data as used for table 1. The number of positive sentences is constant, but varies relative to the number of negative sentences. In the table, "TW" refers to trigger words.


                                      Exact                   Begin
Data               Sents    Neg       Prec.  Recall  F1       Prec.  Recall  F1
Only positive       2 860    0%       0.90   0.93    0.91     0.94   0.97    0.95
Negatives w/ TW     6 000   52%       0.90   0.86    0.88     0.96   0.90    0.93
Negatives w/o TW   30 711   86%       0.91   0.92    0.91     0.96   0.95    0.95
Both negatives     33 851   92%       0.83   0.90    0.86     0.87   0.94    0.90

Table 4: Precision, recall and F1 scores for the best performing models on data with varying amounts and types of negative sentences. "TW" refers to trigger words. The row labelled 'Negatives w/ TW' corresponds to the data ratio in table 1

As noted in previous research, the ratio of positive to negative sentences has been found to be important for NER models trained on a dataset mention extraction task (Prasad et al. (2019)). Table 4 therefore displays the performance of our model on data with different amounts and types of added negative sentences. We see that, in general, the amount of negative sentences influences the model's scores: the model trained on only positive sentences achieved an F1 of 0.91, while the best performing model trained on data that was 92% negative attained an F1 that is 5 points lower, at 0.86. These two models received exactly the same positive sentences.

However, when inspecting the performance of our model on the datasets listed in table 4 more closely, the importance of the type of negative sentence becomes clear as well. We observe no real drop in performance between our 'only positive' model and our 'negatives w/o trigger words' model, even though the percentage of negative sentences increases from 0% to 86%. Meanwhile, when adding negative sentences with trigger words, performance drops by 3 points in terms of F1, while significantly fewer sentences were added (around 3 000 as opposed to around 30 000). This indicates that it is not so much the number of negative sentences that matters, but rather their type. Negative sentences with trigger words are more difficult for the network to label correctly than those without, and as such cause a much larger performance drop, despite being far fewer in number.

It is important to note that the effects of these two types of negative sentences compound: when both types are added, the largest drop in performance occurs, with the F1 dropping from 0.91 to 0.86. This drop can largely be attributed to the precision, which drops by 8 points. The model still succeeds in identifying a large share of the datasets, but produces more false positives in the process.
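For illustration, a minimal sketch of how the two types of negative sentences could be separated; the trigger word list shown here is hypothetical and the actual list used for data construction may differ.

    # Hypothetical trigger word list; the real list used in this project may differ.
    TRIGGER_WORDS = {"dataset", "datasets", "corpus", "corpora", "benchmark", "data"}

    def split_negatives(negative_sentences):
        """Split negative sentences (no labelled dataset mention) into those
        that contain a trigger word and those that do not."""
        with_tw, without_tw = [], []
        for sentence in negative_sentences:
            tokens = {token.lower().strip(".,;:()") for token in sentence.split()}
            (with_tw if tokens & TRIGGER_WORDS else without_tw).append(sentence)
        return with_tw, without_tw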


6.2 Entity clustering

6.2.1 Connected Components

Table 5 below displays the linear interpolation parameters corresponding to the top 5 B-cubed F1 scores found during grid search. The last row additionally corresponds to the linear interpolation parameters for which the graph is constructed using only lexical similarity.

λ      σ      δ      Sum    F1     Prec.  Recall
0.55   0.55   0.33   1.42   0.85   0.87   0.83
0.55   0.73   0.0    1.27   0.85   0.82   0.87
1.09   0.18   0.33   1.61   0.85   0.84   0.85
1.27   0.09   0.33   1.70   0.85   0.85   0.85
1.45   0.0    0.33   1.79   0.86   0.85   0.86
2.00   0.00   0.00   2.00   0.82   0.87   0.78

Table 5: The 5 highest F1 scores found with grid search, using the connected components algorithm on the constructed graph, along with the highest F1 using solely lexical similarity
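A minimal sketch of the clustering step being evaluated here, assuming the three pairwise similarity matrices (lexical, semantic and article-level) have already been computed; the function and variable names are illustrative rather than the exact implementation used in this project.

    import itertools
    import networkx as nx

    def cluster_mentions(mentions, lex_sim, sem_sim, doc_sim,
                         lam, sigma, delta, threshold=1.0):
        """Linearly interpolate the three similarity matrices and connect two
        mentions whenever the combined similarity reaches the threshold.
        Connected components of the resulting graph are the predicted clusters."""
        G = nx.Graph()
        G.add_nodes_from(range(len(mentions)))
        for i, j in itertools.combinations(range(len(mentions)), 2):
            score = lam * lex_sim[i][j] + sigma * sem_sim[i][j] + delta * doc_sim[i][j]
            if score >= threshold:
                G.add_edge(i, j, weight=score)
        return list(nx.connected_components(G))

The grid search then simply loops over candidate (λ, σ, δ) combinations and scores the resulting clustering with the B-cubed metric.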

Figure 3 contains surface plots of the distribution of the B-cubed F1 scores for different combinations of λ, σ and δ. The threshold was kept constant at 1, leaving 3 degrees of freedom. The parameters λ and δ are expressed on the axes of each surface plot, while σ is varied between the four panels.

The grid search results in table 5 demonstrate the importance of lexical similarity to our clustering. The best performing parameter combination relies primarily on lexical similarity, with a λ of 1.45 and a δ of 0.33, and notably does not use the semantic similarity from the sciBERT embeddings at all. We see quite a lot of variance in the top 5 parameter combinations: the σ parameter varies between 0 and 0.73. When observing λ in combination with σ, we see that as λ decreases (and with it the performance of the algorithm), σ increases. It seems to 'fill' the gap created by a lower λ, as any connections below the threshold of 1 are dropped. This suggests that the information in the semantic (i.e. sentence embedding) space serves as a slightly noisier proxy for the information in the lexical space. The document embeddings are, with the exception of one row, consistently interpolated with a parameter δ of 0.33. Generally, a high precision means that the clusters are more 'fragmented', i.e. there are more small clusters. Symmetrically, a high recall indicates larger clusters: one large cluster containing all entities would result in 100% recall, but a very low precision. With this in mind we can observe an increase in recall as the sum of the parameters grows, and the opposite holds for the precision, with the exception of the second row.


Figure 3: Surface plots illustrating the B-cubed F1 distribution when varying the linear interpolation parameters λ, σ and δ

This second row attains an unusually high recall. However, inspecting the similarities that these parameters correspond to, we see that the mean similarities in the lexical, semantic and document space are 0.007, 0.905 and 0.163, respectively. This accounts for the initially counter-intuitive values in the second row: because the semantic-level similarities are themselves high, the high σ results in much higher interpolated similarities and thus more edges in our graph.

Finally, the last row shows the performance of the algorithm when using only lexical similarity. The algorithm performs well, attaining high precision and recall. Intuitively, the lexical similarity remains the most important contributing factor within our clustering algorithm. We can conclude that, besides the lexical similarity, the document-level embeddings contribute to better performance, while the sentence-level embeddings seem to contain information similar to the lexical similarity, as the two remain relatively 'balanced'.

The surface plots in figure 3 show that lower values of σ lead to a larger 'high' area (i.e. more combinations of parameters that result in a high F1), and that for higher values of σ the set of well-performing parameter combinations becomes smaller, with the mode of the distribution shifting towards lower values of δ and λ.


There appears to be a minimum value that the interpolation parameters must sum to: when they sum to less than 1, the F1 remains low. Besides this, when looking at the marginal distribution of the F1 over λ and δ, we see that the variance is higher for λ, indicative of the relative importance of this parameter for the F1 score.

In general, these figures show that the model performs better with a lower σ, and they highlight the importance of λ.

6.2.2 Normalisation

Figure 4 displays the distribution of the pairwise Levenshtein similarity ratio within categories, along with a kernel density estimation curve. This distribution was obtained by taking the Levenshtein similarity ratio between all possible pairs within a cluster and computing the mean for each cluster. This was done both before and after normalisation.

Most notable in this figure is the pronounced shift of the distribution to the right, i.e. towards higher similarity scores. The mode of the distribution before normalisation is the bin from 65 to 70; afterwards the mode is the bin from 95 to 100, and the difference in the height of these modes is substantial. In addition, the mean of the distribution increases from 74.4 before normalisation to 88.4 afterwards. This shift shows the effectiveness of the normalisation steps performed: after the preprocessing steps in the algorithm, entities within the same category are lexically much more similar. This also explains the importance of λ, as we can observe clear lexical similarity between entities of the same category.
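A minimal sketch of how this statistic could be computed, using Levenshtein.ratio from the python-Levenshtein package and a simplified normalisation step; the normalisation rules shown here are illustrative and the exact preprocessing used in this project may differ.

    import itertools
    import re
    import Levenshtein

    def normalise(mention):
        """Simplified normalisation: lowercase, strip punctuation and a few
        generic head words such as 'dataset' or 'corpus' (an illustrative
        subset of the actual preprocessing steps)."""
        mention = re.sub(r"[^\w\s]", " ", mention.lower())
        stop = {"the", "dataset", "datasets", "corpus", "set", "benchmark"}
        return " ".join(w for w in mention.split() if w not in stop)

    def mean_pairwise_ratio(cluster):
        """Mean Levenshtein similarity ratio (0-100) over all pairs in a
        cluster; assumes the cluster contains at least two mentions."""
        pairs = list(itertools.combinations(cluster, 2))
        return 100 * sum(Levenshtein.ratio(a, b) for a, b in pairs) / len(pairs)

    cluster = ["the PASCAL VOC 2012 dataset", "Pascal VOC", "pascal voc dataset"]
    print(mean_pairwise_ratio(cluster))                          # before normalisation
    print(mean_pairwise_ratio([normalise(m) for m in cluster]))  # after normalisation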

Figure 4: Inter-category pairwise Levenshtein similarity ratio for all entities in the data, before and after normalisation


6.2.3 Observing similarity in the lexical, semantic and article space

Table 6 displays the mean pairwise inter-category cosine similarity in the article, semantic and lexical spaces. This was computed by calculating the cosine similarity between the representative vectors for all combinations of entities within our data and taking the mean. Afterwards, the same was done while only considering combinations of entities within the same category. Categories of size 1 were not considered.

                    Inter Category   All entities
Article vectors     0.34             0.07
Semantic vectors    0.20             0.13
Lexical vectors     0.32             0.01

Table 6: Mean pairwise cosine similarity of the entity vectors within categories versus over all entities in our data, in the article, semantic and lexical space
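A minimal sketch of this comparison, assuming each entity has a representative vector in the space under consideration and a gold category label; the names and data structures are illustrative.

    import itertools
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def mean_pairwise_similarity(vectors, labels=None):
        """Mean cosine similarity over all entity pairs; if labels are given,
        only pairs sharing a category are considered (categories of size 1
        contribute no pairs)."""
        sims = []
        for i, j in itertools.combinations(range(len(vectors)), 2):
            if labels is None or labels[i] == labels[j]:
                sims.append(cosine(vectors[i], vectors[j]))
        return sum(sims) / len(sims)

    # Repeated for the article, semantic and lexical vector spaces:
    # all_mean   = mean_pairwise_similarity(vectors)
    # intra_mean = mean_pairwise_similarity(vectors, labels)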

Figure 5: Pairwise cosine similarity within categories, and for all entities in our data, in the article space. The curves show kernel density estimates of the distributions


In the article, semantic and lexical space we observe significant differences between the inter- and outer-category pairwise similarities of the vectors representing entities. This indicates that in all of these spaces, vectors representing entities within the same category are generally closer to each other than to outer-categorical vectors. This is important as it shows the effectiveness of the embeddings and their distances: two vectors being similar in one of these spaces correlates with the entities belonging to the same category.

Besides this, we can see that the differences are more pronounced in the article and lexical space than in the semantic space. In the article and lexical space the average similarity increases by a factor of 5.15 and 32, respectively, whereas the factor for the semantic space is 1.53. This shows a significantly larger difference between inter- and outer-category distances in the article and lexical space, and helps explain the smaller contribution of the semantic distance.

The pairwise cosine similarity distributions in these three spaces are visualised in figures 5, 6 and 7. Figure 5 shows a continuous distribution, to which kernel density estimation was applied. The mean and mode of this distribution shift slightly to the right when we consider only entities within the same category, consistent with table 6.

Figure 6: Pairwise cosine similarity within categories, and for all entities in our data, in the lexical space
