MASTER THESIS

Creating financial specific models for passage re-ranking

by

Wout Kooijman (10749586)

Tuesday 12th January, 2021

48 ECs, March 2020 - January 2021
Supervisors: Dr. I. Markov, Dr. V. Sazonau
Assessor: Dr. W. Aziz

Abstract

Within the financial crime department of ING, passage re-ranking is used to search through large collections of domain-specific documents and retrieve the relevant passages and information given questions from analysts. This gives quick access to the vast amount of information about companies and private individuals that financial institutions have to deal with on a daily basis. Finance-specific datasets for passage ranking are currently not publicly available. In this work, we therefore show a method to create a financial domain-specific dataset from an open-domain passage re-ranking dataset and fine-tune a passage re-ranking model on it. We do so by first using weak labels to classify part of an open-domain dataset and then training topic classification models from the literature to classify the remaining passages and queries. Subsequently, we use the created datasets to fine-tune a passage re-ranking model and compare it to the current state-of-the-art (SOTA) in passage re-ranking as well as to baselines from academia and the current ING solution. These methods have been applied to open-domain datasets before, but not to the financial domain. We test the impact of fine-tuning the passage re-ranking models, in the form of BERT language models, on the financial domain using the mean reciprocal rank (MRR) and the mean average precision (MAP) as evaluation metrics, and show that several models significantly improve on the current ING re-ranking solution while achieving results similar to the current SOTA.

Table of contents

1 Introduction
    1.1 Research questions and contributions
    1.2 Thesis outline
2 Related work
    2.1 Topic classification
    2.2 Passage ranking
        2.2.1 Vector space models
        2.2.2 Probabilistic models
        2.2.3 State-of-the-art
        2.2.4 ING solution
3 Dataset creation for a financial domain
    3.1 MS MARCO dataset
    3.2 Creating a domain-specific subset of MS MARCO
        3.2.1 Weak labelling
        3.2.2 Classifying unlabeled data
    3.3 Domain-specific dataset
    3.4 Comparing the full and financial MS MARCO datasets
4 Evaluating state-of-the-art passage re-ranking on financial data
    4.1 State-of-the-art re-ranking method on financial data
        4.1.1 Fine-tuning BERT model
        4.1.2 Baselines
        4.1.3 Evaluation metrics
        4.1.4 Experimental protocol
    4.2 Results
        4.2.1 Model-specific results
5 Conclusion and future work
    5.1 Conclusion
    5.2 Limitations and future work
Bibliography
Appendices
Appendix A Additional background


Chapter 1

Introduction

Giving clients a feeling of safety and trust while remaining up to date with current affairs has become increasingly important for financial institutions. More and more focus has been put on detecting fraudulent practices by individuals and companies, as this is an important part of maintaining a clean organisation and staying ahead of criminal activities. To achieve this, most banks have an onboarding procedure for new customers and a reviewing procedure for existing customers known as 'Know Your Customer' (KYC). Remaining compliant with regulations, and detecting when this is not the case, means that all information regarding clients, contracts, chamber of commerce documents and other sources of information that enter the company has to be read thoroughly and filtered for the relevant, and possibly changing, information. This procedure helps identify and verify customers and helps ensure that no criminal activities are taking place that the financial institution should know about and act upon.

Most of these processes consist of manually reading through statements provided by customers and scanning for information that is not lawful or not consistent with earlier findings, cross-referencing these internal documents with government documents and earlier statements of said customers. This can concern natural persons, for instance when opening a regular bank account or taking out a loan or mortgage, or it can concern a company, where the annual report or quarterly numbers are presented and might be inconsistent with earlier information, or the board of directors is updated and it has to be checked that there is no conflict of interest. This information has to be saved and validated both with the government and with the customer. Revisiting this information periodically is important to remain compliant with current regulations and to be able to act quickly when regulations change or extra steps are required. Manual labour at this stage is very resource consuming: it requires many people to constantly read through long and monotonous documents to extract small scraps of information [2]. Analysts follow guidelines and procedures that individual employees interpret differently, and on top of that they need to reach out to customers for missing information. Since these documents are written for human understanding, the task of retrieving the relevant information is not a trivial one; the document has to be well understood both syntactically and semantically. Furthermore, more often than not multiple documents are necessary to answer the questions posed by an end-user.

To tackle this, ING is looking at automating parts of the KYC process to assist the analysts that do the manual work. This takes the form of a system that can understand and quickly look up the relevant information needed for the KYC process: an information retrieval system that one can ask questions and that quickly provides the information one is looking for, presented in a structured way. ING has put a pipeline in place to help KYC analysts in this process (see Fig. 1.1). It consists of a pre-processing step for the documents, in which the uploaded document or pack of documents is divided into blocks of text (paragraphs) and stripped of images and other formatting. The text is then sent to the 'vectorizer', the step that uses the Universal Sentence Encoder (USE) model [11], which captures semantic similarities, to encode both these blocks of text and the query into 512-dimensional vectors that are indexed and stored together with the textual representation in an Elasticsearch (ES) cluster [23]. This way the end user can return and ask more queries about already uploaded documents without having to go through this time-consuming process again. The blocks of text are then ranked against the query using the BM25 algorithm as a first step, and a top-N selection is re-ranked using the cosine similarity measure [71] on the USE embeddings. The outcome is presented back to the user in the form of the blocks of text that are most likely to contain the answer. This gives the analyst quick access to answers inside big documents and speeds up the reviewing process mentioned above. The current pipeline is an initial solution that often does not behave as expected: answers are not found and knowledge is not captured correctly. The hypothesis is that this is because the USE model is trained on general knowledge and not fine-tuned for the task at hand, and because USE is trained to embed sentences rather than whole paragraphs.

Figure 1.1: A simplified overview of the current solution for paragraph re-ranking

Recent research in Information Retrieval (IR), Natural Language Processing (NLP), word, sentence and document embeddings, Machine Reading Comprehension (MRC) and passage (re-)ranking tasks has given rise to techniques far more advanced than the word-matching used in a control-F search application. Both semantic and syntactic information in text can be captured and used to create a deeper understanding of the text and enhance search results [46]. Furthermore, tailored datasets have been created to fine-tune generic algorithms so that they capture more domain-specific information and excel in those domains [37], [6], [29]. Transformer language models like BERT [22], XLNet [86], GPT-1 [64] and GPT-2 [65] have led to higher accuracy in language understanding tasks and a more generalised approach to applying such models to different problems. By fine-tuning a pre-trained base model one can create problem- and domain-specific models for various downstream tasks such as the GLUE [78] and SQuAD [66] benchmarks, which was harder with models like GloVe and Word2Vec as they have difficulty capturing word order [60]. These models are large and very expensive to train in terms of compute, time and CO2 emissions [72], but they have raised most leaderboards on NLP-related tasks and gained a lot of traction. In industry, which is not focused on deploying big AI models, however, not many practical use cases have been seen, mainly due to their huge sizes and high training cost [12], [72].

1.1 Research questions and contributions

This master's thesis focuses on achieving a better understanding of these techniques and on developing a pipeline that can better help a financial institution quickly find the answers it is looking for across multiple documents, while simultaneously learning about the relevant IR and NLP techniques and how to combine the two. This is done by developing a model that can understand the documents that are presented and search through them in an easy and quick manner. Understanding in this sense means capturing semantic relevance and supporting the user with what they believe is important within the specific domain. This model should be easy to use for people who might not have a full understanding of automatic IR or MRC, while still giving enough evidence to help people understand its choices and, most importantly, trust the engine. One key point is that the pipeline has to achieve good results fast: searching through the available documents should not take too long if it is to have value for the analyst. The aim is to utilize the comprehension power of bigger models like BERT while keeping the solution domain-specific and small enough to run on limited resources and 'on-premises', which is important for companies like ING that handle a lot of sensitive data. A new pipeline will be introduced that captures the language understanding of tailored BERT-like models on domain-specific data, yet remains light enough to be put to practical use without huge amounts of compute.

To achieve this goal we want to answer the following questions:

RQ 1. How can we fine-tune passage re-ranking models to the financial domain in order to improve their search results in this domain?

The financial domain is a subset of the knowledge within the datasets that are currently used in most passage re-ranking and Q&A tasks, such as the MS MARCO [4], SQuAD and SQuAD 2 [67] datasets. To excel in this domain, fine-tuning of the existing models, using either transfer learning or public datasets, can prove helpful (see Howard and Ruder [29], Lee et al. [38] and Beltagy et al. [6] for topic-specific models) and has been shown to lead to better results within these domains. The hypothesis here is that fine-tuning language models on domain-specific data enables the model to capture deeper linguistic properties and to use these to perform better on that domain. The datasets need to be examined to come up with a solution that keeps the computing costs low while excelling in the financial domain.


RQ 1.1 How do we create a domain-specific dataset from a subset of existing datasets to fine-tune IR pipelines on?

Creating a domain-specific dataset is not a trivial task on its own. We will need to filter existing datasets based on the domain at hand in order to fine-tune existing language models on them.

RQ 1.2.1 How can we apply existing techniques in IR and NLP in real-world IR passage re-ranking problems?

It is expected that the language models as presented in most dataset challenges, like Bajaj et al. [4] and Rajpurkar et al. [66], are not one-to-one applicable in business. They require many graphics processing units (GPUs) or even tensor processing units (TPUs) to cope with the computations required to model the language. Most companies do not have access to these kinds of resources and, because they want their data to remain 'on-premise' due to privacy laws, cannot use cloud-based solutions. Because of this, modifications will have to be made to get comparable and competitive results in a lower-compute setting.

RQ 1.2.2 How can we evaluate the new passage re-ranking models and compare them with open-domain state-of-the-art passage re-ranking models in terms of quality and model size?

Where almost all challenges today aim at a qualitatively near-perfect language model, speed is of the utmost importance for most users, especially within business. The pipeline will have to balance quality with model size, and additional metrics will need to be put in place to evaluate the models in the current problem setting. Common evaluation metrics will be tested and compared against.

Furthermore, it is important to keep in mind what the important design choices are when dealing with expert end-users in presenting evidence and relevant answers, while keeping the pipeline simple, usable and easily upgradable. This entails that the building blocks of this thesis should be stand-alone and individually explainable.

We expect that balancing model size against precision and recall will require several passes, using a combination of state-of-the-art language models and lightweight IR models, to arrive at a robust and fast model.

1.2 Thesis outline

To achieve this goal, a literature study will be performed into Q&A, IR and MRC topics. Baselines will be implemented to test the current state-of-the-art, and a dataset will be selected that captures the domain of this project, so that results can be tested as close to a real-life scenario as possible. Finally, a tailored solution that balances performance and model size will be created for passage re-ranking in the financial field. It will be used by real users and will help in understanding the role of state-of-the-art models in a business setting and what the possibilities are.


In the following chapter, Chapter 2, a theoretical background is given in which current solutions are laid out and different techniques are discussed, focusing on classic and lightweight IR techniques and state-of-the-art language modelling techniques. In Chapter 3 the dataset is discussed as well as the choices that were made to build a financial domain-specific dataset, thus answering research question 1.1. Combining the previous chapters, the proposed methods for fine-tuning a passage re-ranking model are introduced in Chapter 4, where the proposed model is described in detail. Here we also present our evaluation metrics and baselines, answering research question 1.2.1. The results of both the benchmarks and the proposed solution are also presented in this chapter, answering research question 1.2.2. A conclusion and follow-up work are discussed in Chapter 5, where a summary of the findings and research questions is given along with potential future work.


Chapter 2

Related work

In this chapter, we will discuss relevant work that has been done in the fields of passage re-ranking and of combining language modelling with passage re-ranking. We will first briefly discuss topic classification methods in Section 2.1, as these are used for building our initial dataset. We will then advance to ranking methods that are commonly used in the problem area and related settings, arriving at the state-of-the-art methods that are used in passage re-ranking problems and similar domains in Section 2.2. We end with an overview of the techniques behind the current ING solution and how we want to improve on it.

2.1 Topic classification

To create a financial dataset for the task at hand we need to classify sentences as either relevant for our use-case or not. Topic classification is a well-researched field; as stated in Kowsari et al. [36], there are many techniques that can be used to build a proper topic classification model. As this is not the main theme of this research, we will focus on classic models and will not deep-dive into the current state-of-the-art. According to Nowak et al. [54] and Kowsari et al. [36], amongst many others, LSTM-based RNNs are a popular choice and perform well on the task of topic classification. An advancement of LSTMs is the introduction of bi-directionality [70]: training networks that take into account not only information from the past but also from further down the sequence, concatenating both directions to use as a 'teacher' signal. Using this technique, it has been shown that one can better capture context and achieve better results on downstream tasks that require this context [89]. As shown in Graves and Schmidhuber [24], Bi-LSTM methods work well on phoneme classification, and in Tang et al. [74] and Nowak et al. [54] they have been shown to work well on text classification as well.

The current state-of-the-art has moved to attention-based networks (see Section A.2.2) and pre-trained encoding solutions based on language modelling. In Sun et al. [73], a pre-trained BERT model fine-tuned for sentiment and topic classification achieves state-of-the-art results on different task-specific datasets for text classification and sentiment analysis. Similarly, different models improve on existing BERT models for this task, such as Yang et al. [86] and Liu et al. [43], which achieve competitive scores on the same topic classification tasks.

The choice was made not to work with the language modelling approaches, as we want to set up a dataset without having to fine-tune big language models, and thus save resources for ING, and because Bi-LSTM models have proven to work on this task.¹

2.2 Passage ranking

Ranking is a fundamental part of IR. It consists of ranking a set of documents² based on a given query. To do this, a standard, or information need, is necessary to evaluate a given ranking according to predefined metrics. Using this information need, documents D are ranked given a query Q from the user according to these predefined evaluation criteria. Ranking can be seen as sorting a list of documents so that the documents most relevant to the user end up at the top and the least relevant end up at the bottom of the list.

To achieve this, multiple types of models have been developed over time. These models can roughly be divided into three types: Boolean models, vector space models and probabilistic models. We will touch on both vector space models and probabilistic models as they are used in this thesis. We will then proceed to the current state-of-the-art within the passage ranking and re-ranking tasks, presenting multiple current solutions for these problems and comparing them.

2.2.1 Vector space models

With vector space models, documents and queries are represented using vectors. This way one can capture and weigh important concepts, keywords and other factors in a dense representation of a document and create matrices containing different documents with the same features [69]. A representation of the document is built by indexing the important content and weighing it, creating multi-dimensional vectors that represent the documents and the query, see Fig. 2.1.

¹ See https://www.tensorflow.org/tutorials/text/text_classification_rnn for implementation details in standardized libraries.
² Continuing this report, the term document will be used for actual documents, paragraphs or pieces of text that are being ranked.


Figure 2.1: Conceptual representation of documents and query in term vector space from Ogheneovo and Japheth [55]

To create these weighted representations of documents and queries, Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used method [5]. TF-IDF calculates the relative importance of a word with respect to a document or a collection of documents. The term frequency is often determined by counting how often a term t occurs in a document D, as in Eq. (2.1), sometimes divided by the total number of terms in D. The inverse document frequency is computed over the total number of documents N, with N_t the number of documents in the collection that contain term t, as in Eq. (2.2). The two are combined into the TF-IDF score as in Eq. (2.3).

\[ \mathrm{tf}(t, D) = \mathrm{freq}(t, D) \tag{2.1} \]
\[ \mathrm{idf}(N, t) = \log \frac{N}{N_t + 1} \tag{2.2} \]
\[ \mathrm{tfidf}(t, D, N) = \mathrm{tf}(t, D) \cdot \mathrm{idf}(N, t) \tag{2.3} \]
where:
t = the term
D = a document
N = the number of documents in the collection
N_t = the number of documents in the collection that contain term t

Queries can now be matched using techniques like cosine similarity [71], as per Eq. (2.4), which calculates the distance between query and document in vector space. This overcomes the limitation of the earlier Boolean models, which were not able to rank documents, since the similarity measure itself can be used to rank the documents.

\[ \cos\theta = \frac{\vec{q} \cdot \vec{D}}{\lVert \vec{q} \rVert \, \lVert \vec{D} \rVert} \tag{2.4} \]
where:
\vec{q} = the vector representation of the query
\vec{D} = the vector representation of a document, or the matrix representation of a collection of documents

Creating a numerical representation of terms in documents makes it easier to approach ranking problems numerically and to compare these vectors in a higher-dimensional space. A simple approach to comparing a given query with specific documents is then to combine TF-IDF with cosine similarity, so that we can easily calculate a score for each document with respect to the query and present a ranked list of documents back to the user. At this moment this is the foundation of the solution that ING uses, and as such it will be used as a benchmark model for the improvements proposed in Chapter 4.
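As a concrete illustration of this baseline, the sketch below combines scikit-learn's TfidfVectorizer with cosine similarity to rank a toy document collection against a query; the variable names and example texts are illustrative and not part of the ING implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy collection; in practice these would be the indexed passages.
documents = [
    "the annual report shows increased revenue",
    "the customer opened a new savings account",
    "fireflies produce light through bioluminescence",
]
query = "revenue in the annual report"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)          # TF-IDF weights, Eqs. (2.1)-(2.3)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors)[0]   # cosine similarity, Eq. (2.4)
for rank, idx in enumerate(scores.argsort()[::-1], start=1):
    print(rank, round(float(scores[idx]), 3), documents[idx])
```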

2.2.2 Probabilistic models

Another way to rank documents given a query is to use probability theory. The main benefit of this method is that uncertainty can be modelled and taken into account. With this method, a probability of relevance is calculated. An academic and industry-standard example of this approach is the BM25 ranking algorithm [68]. This algorithm assumes binary independence, meaning that the terms used to build the representation of the document are independent of each other, much like in the vector space models.
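For reference, the Okapi BM25 score of a document D for a query Q is commonly written as (the exact parameterisation varies between implementations):

\[ \mathrm{BM25}(D, Q) = \sum_{t \in Q} \mathrm{idf}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)} \]

where f(t, D) is the frequency of term t in D, |D| the length of D in tokens, avgdl the average document length in the collection, and k_1 and b free parameters (typically k_1 ≈ 1.2 and b ≈ 0.75).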

Although both of these statistical models provide an adaptable method for document ranking, they make an important assumption: the terms or stems used to create the multi-dimensional vector representation of both query and document are independent of each other. This means that sentence meaning is not captured by these solutions. To tackle this, academic research has shifted to linguistic techniques that overcome this assumption.

2.2.3 State-of-the-art

Multiple techniques have been proposed in Mitra and Craswell [48] to utilize the power of language modelling in information retrieval, such as using pre-trained embeddings or dual approaches that model both semantic and lexical properties and combine them into a single output. There has, however, been increased scepticism about these approaches [39], [83], mainly focused on whether they actually improve on baselines like BM25 or whether those baselines were simply not tuned properly (and thus the comparisons were unfair). The main problem stated in these papers is that the baselines were not fine-tuned to the task at hand: with naively implemented baselines, an increase in quality does not necessarily mean that the deep learning models actually outperform the proposed baselines. In response to this criticism, recent advances using pre-trained language models like GPT or BERT have raised the state-of-the-art on several question answering and machine reading comprehension datasets like GLUE and MS MARCO, while being easily validated against tuned baselines using several metrics.

Yang et al. [84] use the Anserini toolkit [82] to retrieve relevant documents from Wikipedia that are pre-indexed at the article, paragraph and sentence level. The question is then used as a 'bag-of-words' query to retrieve the top k relevant documents, whether this is an article, a paragraph or a sentence. The second stage re-ranks these gathered 'relevant' documents using the standard BERT-Base implementation. They report the best score for their model, dubbed 'BERTserini', on a paragraph ranking task using k = 100, reporting an F1 score of 46.1, a recall of 85.8 and an exact match (EM) score of 38.6 on the SQuAD (v1.1) benchmark. This research is later widened to include the TREC Microblog Tracks and the TREC 2004 Robust Track in Yang et al. [85].

Nogueira and Cho [51] focus solely on the re-ranking part of the problem. Starting from an already pre-ranked set of documents produced by a BM25 implementation, they truncate both query and passage to a maximum of 512 tokens (the maximum input length for BERT) and fine-tune a BERT-Large model on the re-ranking task using a cross-entropy loss on a final, fine-tuned classification layer. They report competitive scores on both the MS MARCO and the TREC-CAR datasets using the Anserini toolkit. They expand this research further in Nogueira et al. [52], where they propose a multi-stage re-ranker that uses BM25, a 'monoBERT' model and optionally a 'duoBERT' model (see Fig. 2.2). MonoBERT is a pointwise re-ranker that works as described in their previous research; duoBERT is a pairwise re-ranker that takes two paragraphs instead of one and ranks them against each other. At inference time, the pairwise scores are aggregated to create a single score per document, and the sum proved to perform best as an aggregator. This pipeline allows a trade-off between quality, using the full pipeline of BM25, monoBERT and duoBERT, and latency, using only BM25 and monoBERT.

Figure 2.2: The multi-stage ranking architecture of Nogueira et al. [52]: "In the first stage H0, given query q, the top-k0 (k0 = 5 in the figure) candidate documents R0 are retrieved using BM25. In the second stage H1, monoBERT produces a relevance score s_i for each pair of query q and candidate d_i ∈ R0. The top-k1 (k1 = 3 in the figure) candidates with respect to these relevance scores are passed to the last stage H2, in which duoBERT computes a relevance score p_{i,j} for each triple (q, d_i, d_j). The final list of candidates R2 is formed by re-ranking the candidates according to these scores."
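To make the pointwise ("monoBERT"-style) re-ranking step concrete, the sketch below scores query-passage pairs with a BERT cross-encoder using the Hugging Face transformers library. This is a generic, assumed implementation rather than the authors' original code; in practice a checkpoint fine-tuned for passage relevance would be loaded instead of the bare bert-base-uncased weights.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def relevance_score(query: str, passage: str) -> float:
    """Probability that the passage is relevant to the query (pointwise re-ranking)."""
    inputs = tokenizer(query, passage, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()   # index 1 taken as the "relevant" class

# Re-rank a BM25 candidate list by descending relevance score.
query = "what is a mortgage"
candidates = ["a mortgage is a loan secured by property", "fireflies glow at night"]
reranked = sorted(candidates, key=lambda p: relevance_score(query, p), reverse=True)
```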


In Han et al. [26] a model is proposed that uses BERT to encode documents and queries and trains a learning-to-rank (LTR) model using TensorFlow Ranking to optimize the ranking performance. The ranking loss is used to update the LTR model with a variety of pointwise, pairwise and listwise losses. This achieves the current state-of-the-art results on the MS MARCO passage full ranking and re-ranking tasks. Using an ensemble of the large BERT, RoBERTa [43] and ELECTRA [16] models, with a pre-ranking step using DeepCT [19], they currently (as of Tuesday 12th January, 2021) hold a top-five spot on the leaderboard of the MS MARCO full ranking task and the number one spot on the re-ranking task.

BERT

Since we will be using BERT as the re-ranking step of our method, we explain its inner workings in more depth below. BERT is a Transformer-based model (as mentioned in Section A.2.3) that has been widely used in various NLP tasks to raise the state-of-the-art on different benchmarks and problems. BERT is built using multiple bidirectional Transformer encoder layers. Two models are presented in the original paper by Devlin et al. [22]: BERT-Base, consisting of 12 layers (Transformer blocks), a hidden layer size of 768 and 12 self-attention heads for a total of 110 million parameters, which matches the model size of the OpenAI GPT model; and BERT-Large, consisting of 24 layers, a hidden layer size of 1024 and 16 self-attention heads for a total of 340 million parameters. The authors train BERT in two steps: a pre-training step and a fine-tuning step. The input that BERT requires is agnostic of whether a sequence of inputs is actually a linguistic sentence or an arbitrary span of text, which makes BERT easy to later train on different tasks. The input starts with a classification token [CLS] that is used in classification tasks. Sequence pairs are merged using a separator token [SEP] and further differentiated using a learned embedding that is added to every token, specifying whether it is part of sentence A or B.
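As an illustration of this input format, a Hugging Face tokenizer (used here purely for demonstration, not as part of the original work) packs a sentence pair as follows:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("what is a mortgage", "a mortgage is a loan secured by property")

# Roughly: ['[CLS]', 'what', 'is', 'a', 'mortgage', '[SEP]', 'a', 'mortgage', ..., '[SEP]']
# (the exact word pieces depend on the vocabulary)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens
```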

Figure 2.3: Pre-training and fine-tuning overview of the BERT architecture with the LM and NSP training steps as per Devlin et al. [22]


Training

The pre-training step of Devlin et al. [22] uses unlabeled data to perform two separate tasks, based on the English Wikipedia (2,500 million words) and the BooksCorpus (800 million words) [90]. First, a "masked LM" step is performed. This task is often referred to as the Cloze task [75]: a percentage (15%) of the input tokens is randomly masked and the model tries to predict those masked tokens. This way, a deep bidirectional representation can be created without letting the model "see" the token it is trying to predict, a problem that can arise when training a bidirectional language model with multiple layers. To reduce the mismatch with the fine-tuning step, in which no masked tokens occur, the 15% of selected tokens are handled as follows: 80% of the time the token is replaced by a [MASK] token, 10% of the time a random other token is chosen to serve as a mask, and 10% of the time the token itself is kept (creating no mask, but simply having the model predict its own token). The other pre-training task is "Next Sentence Prediction" (NSP). This step helps to increase the understanding of inter-sentence relationships, which is valuable for both downstream question answering and natural language inference tasks. It is a binary sentence predictor: 50% of the time a random sentence B follows sentence A, and the other 50% of the time the actual next sentence follows. Together, these two tasks build a pre-trained model that can then be used as a base for fine-tuning BERT on downstream tasks (see Fig. 2.3).
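The 80%/10%/10% masking rule can be sketched as follows; this is an illustrative re-implementation, not the original BERT pre-training code.

```python
import random

def mask_tokens(tokens, mask_prob=0.15):
    """Apply BERT's masking rule to a token sequence; returns the corrupted tokens and the targets."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:               # 15% of tokens are selected for prediction
            targets.append(tok)
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")               # 80%: replace with the [MASK] token
            elif r < 0.9:
                masked.append(random.choice(tokens))  # 10%: replace with a random token
            else:
                masked.append(tok)                    # 10%: keep the token, model still predicts it
        else:
            masked.append(tok)
            targets.append(None)                      # not predicted
    return masked, targets
```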

The authors have shown that their Base and Large models, fine-tuned on downstream tasks by adding a task-specific layer on top of the pre-trained model, consistently outperform the previous state-of-the-art models [22]. Recently, Turc et al. [76] released 24 new BERT models specifically meant for computationally restricted environments. These models differ in the number of layers, the layer sizes and the number of self-attention heads they use. All of them are at most as large as BERT-Base, giving them fewer parameters and making them computationally lighter to train. These models show that competitive scores against the state-of-the-art baselines are also possible with computationally lighter models.

2.2.4 ING solution

The current solution, as mentioned in Chapter 1 and pictured in Fig. 2.4, uses a combination of a pre-ranking step based on BM25 and a re-ranking step that encodes text with the Universal Sentence Encoder (USE) model [11] and ranks it using cosine similarity.


Figure 2.4: The current ING pipeline for ranking relevant passages from documents for KYC analysts

The USE model enables a user to encode greater-than-word-length text, sentences and short paragraphs, into 512-dimensional vectorized embeddings that can be used for several downstream tasks like question answering and semantic similarity matching. The model that ING uses is already fine-tuned for the question answering task and is meant for out-of-the-box inference. The model is built upon the Transformer model illustrated in Section A.2.2. It uses the sub-graph encodings of the Transformer model, which can capture context-aware word representations by using self-attention. These word representations are converted to fixed-length (512-dimensional) sentence encodings by computing the element-wise sum of the representations at each word position. This sum is divided by the square root of the sentence length to create a normalized sentence representation that reduces the penalty that shorter sentences would otherwise get.

The full pipeline first ranks the documents for a query using BM25. The top k results of BM25 are then encoded using the USE model and re-ranked using cosine similarity with the query Q, which is also embedded using the USE model, see Eq. (2.4). This re-ranked list is used as the output of the ranking model.
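A rough sketch of this two-stage pipeline using open-source components (the rank_bm25 package and the public USE module on TensorFlow Hub) is given below; the actual ING implementation runs on Elasticsearch and differs in its details, so this is an approximation rather than the production code.

```python
import numpy as np
import tensorflow_hub as hub
from rank_bm25 import BM25Okapi

passages = [
    "the board of directors was updated in the annual report",
    "the customer opened a new savings account last year",
    "fireflies produce light through bioluminescence",
]
query = "who is on the board of directors"

# Stage 1: BM25 pre-ranking over whitespace-tokenized passages.
bm25 = BM25Okapi([p.lower().split() for p in passages])
top_k = np.argsort(bm25.get_scores(query.lower().split()))[::-1][:2]   # keep the top-k candidates

# Stage 2: USE embeddings + cosine similarity re-ranking of the candidates, cf. Eq. (2.4).
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = use([query] + [passages[i] for i in top_k]).numpy()
q_vec, p_vecs = vectors[0], vectors[1:]
cosine = p_vecs @ q_vec / (np.linalg.norm(p_vecs, axis=1) * np.linalg.norm(q_vec))
reranked = [passages[top_k[i]] for i in np.argsort(cosine)[::-1]]
```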

This research focuses on finding a better way to re-rank passages in the financial domain by improving the language model and fine-tuning it on the specific task and domain. Language models like USE are not fine-tunable and thus provide a more static result. Models like BERT, on the other hand, are fine-tunable on domain-specific data and are expected to use this information to increase performance. This will be done by creating financial datasets for passage re-ranking and training several BERT models on these datasets. How the datasets are created is discussed in Chapter 3, and how they are used to create a fine-tuned language modelling solution for re-ranking is discussed in Chapter 4.


Chapter 3

Dataset creation for a financial domain

In this chapter, we will discuss the datasets that are used in this research. We aim to answer the first sub-question: How do we create a domain-specific dataset from a subset of existing datasets to fine-tune IR pipelines on? Since there is no passage ranking dataset that is specific to the domain we are interested in, we created a pipeline that builds a passage ranking dataset for the financial domain out of an already existing one. To create a domain-specific dataset we need a starting point to build on top of. As it is not feasible to annotate existing queries and documents by hand, which would be too time-consuming, and ING does not provide data of its own, we used an existing dataset to base the domain-specific subset on. First, the baseline dataset and its properties are discussed in Section 3.1, and the process of creating a domain-specific dataset from the standard MS MARCO set is discussed in detail in Section 3.2. Finally, an overview of the properties of the domain-specific dataset is presented in Section 3.3 and Section 3.4.

3.1 MS MARCO dataset

The original dataset that will be used is the Microsoft MAchine Reading COmprehension (MS MARCO) dataset [4] for passage ranking. This dataset is commonly used for ranking tasks in recent research, see Chapter 2, and as such will also serve as the open-domain reference dataset for this thesis. The purpose of the dataset is to enable models to learn to rank passages based on their relevance with regard to a query. The MS MARCO leaderboard for passage ranking and re-ranking uses the Mean Reciprocal Rank over the top 10 ranked results (MRR@10) as its evaluation metric on both a development and an evaluation set; the latter is withheld from standard users and used for evaluation on the official leaderboard. The two tasks proposed as part of this dataset are a full ranking task, where one ranks all passages for a query and gathers the most relevant results out of a collection of 8.8 million passages, and a re-ranking task, where a pre-selection of 1000 results ranked by a BM25 algorithm needs to be re-ranked based on the query.
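As a reference for this metric, MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage among the top 10 returned results; a minimal sketch:

```python
def mrr_at_10(results):
    """results: list of (ranked_passage_ids, set_of_relevant_ids) pairs, one per query."""
    total = 0.0
    for ranked_ids, relevant_ids in results:
        for rank, pid in enumerate(ranked_ids[:10], start=1):
            if pid in relevant_ids:
                total += 1.0 / rank   # only the first relevant result counts
                break
    return total / len(results)

# One query whose first relevant passage sits at rank 3 -> MRR@10 = 1/3
print(mrr_at_10([(["p7", "p2", "p9", "p1"], {"p9"})]))
```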


The dataset consists of 8,841,823 real passages out of 3,563,535 documents, 1,010,916 real anonymized questions gathered from the Bing search engine and an additional 182,669 human-written questions, and is split into separate train and test sets. The training set itself consists of roughly 500 thousand query-document pairs where the document is relevant and an additional 400 million query-document pairs where the document is not relevant to the query. MS MARCO provides a set of training triples for the purpose of training deep learning models; each triple consists of a query, a passage that is relevant to the query and a randomly sampled passage that is not relevant to the query. All of this was validated by humans when creating the dataset. The dataset also includes a separate development set of 6,980 queries with mostly one relevant document per query, see Table 3.4 for a precise overview. An evaluation set, also consisting of 6,980 queries, is made available as well, but its relevant documents are withheld; it is used by Microsoft to score submissions on the leaderboard.

Most algorithms do not use the full set of data in their training; Nogueira and Cho [51], for example, use less than 2% of the full training set. Creating a subset should therefore not influence the number of unique training points the algorithm can see. This makes it possible to train both the domain-specific model and the baseline model on the same number of data points and have a fair comparison.

3.2 Creating a domain-specific subset of MS MARCO

To create a domain-specific subset of the available MS MARCO dataset, we designed a robust pipeline that can extract a subset for any given sub-domain found within MS MARCO. This pipeline is shown in Fig. 3.1, and its purpose here is to create a financial domain-specific dataset out of the original MS MARCO passage re-ranking dataset. This is done by first labelling a subset of passages using an existing API. These labels are either used directly to create a financial dataset, if the number of labels is sufficient, or they are used to train a topic classifier that labels the remaining passages as either financially relevant or not, after which the financial domain dataset for passage re-ranking is built. These steps are explained in more detail in this section.

Figure 3.1: Overview of the steps necessary to create a domain-specific dataset, as explained in Chapter 3


3.2.1 Weak labelling

One of the main challenges in creating a financial subset of the MS MARCO dataset is that the queries are not labelled by topic; we have no knowledge of what a question is about unless we inspect it by eye. To create a financial subset of the existing dataset, initial labels are necessary to do this in a supervised way. The Google Natural Language API⁶ was used to label a small subset of paragraphs from the existing dataset. The API uses a pre-trained language model optimized for classifying text into 700+ categories and outputs a confidence score per topic for every snippet of text that is classified. This is a paid API trained on Google's own data that classifies a string of text into one or more categories. One can use 30,000 API requests for free each month, where one request consists of a string of at most 1,000 Unicode characters. With the help of the Google Natural Language API, we labelled 65,637 passages over the course of four months as either relevant (1) or not relevant (0) within the specified domain; this selection is made by hand and consists of 38 categories labelled as relevant for our domain (see Table B.1 in Appendix B for a full overview). The queries themselves are too short to be reliably labelled by the API, sometimes consisting of fewer than 5 words, so the passages are labelled instead. This way, one can identify the relevant queries by mapping each query to its relevant passage. We make the assumption here that a query is financially relevant whenever its associated passage is.

Funding from ING also made it possible to label over 3.8 million passages, roughly 40% of the full MS MARCO dataset, using the Google Natural Language API. This is used to create a second dataset that does not need to be fed into a classifier or processed further, as it already contains enough datapoints with good labels to fine-tune the language model. Without such funding, however, this would not have been possible. We therefore continue with both approaches and create two passage re-ranking models, one based on the topic-classifier dataset (Financial BiLSTM classifier based) and one based on the Google Language API dataset (Financial API based). In this way, we explore both routes and create a method for both use cases while highlighting the differences where they apply.

Table 3.1 shows the confidence scores that were tried and validated by hand on 150 test passages. Based on these results, a confidence score of 80% was used to generate the initial 65,637 weak labels, labelling a passage as relevant if the API assigned it one of the 38 categories marked as relevant for our domain. This gives a good balance between qualitatively well-labelled data and the number of data points: a higher confidence threshold means we might not have enough data to fine-tune the language model on, while a lower threshold results in qualitatively bad data. In the following sections, we will only focus on the dataset built with weak labels with a confidence threshold of 80% or higher.
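A sketch of this weak-labelling step with the google-cloud-language client library is shown below; the two category names are an illustrative subset of the 38 relevant categories in Table B.1, and the exact client call may differ between library versions.

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
RELEVANT_CATEGORIES = {"/Finance/Banking", "/Finance/Investing"}   # illustrative subset of the 38 categories

def weak_label(passage: str, threshold: float = 0.8) -> int:
    """Label a passage 1 (financially relevant) or 0 using the classify_text endpoint."""
    document = language_v1.Document(content=passage, type_=language_v1.Document.Type.PLAIN_TEXT)
    response = client.classify_text(request={"document": document})
    for category in response.categories:
        if category.confidence >= threshold and category.name in RELEVANT_CATEGORIES:
            return 1
    return 0
```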

⁶ https://cloud.google.com/natural-language/


Confidence score   Percentage labelled relevant (full corpus)   True positives   False positives   False negatives   True negatives   Accuracy   F1 score
50%                15.03%                                       29               37                1                 83                75%        0.60
60%                10.83%                                       39               38                4                 69                72%        0.65
70%                 7.11%                                       43               44                3                 60                68%        0.65
80%                 4.74%                                       45               26                3                 76                81%        0.76
90%                 2.35%                                       37               23                4                 86                82%        0.73

Table 3.1: Overview of the different confidence scores used when labelling 65,637 pieces of text with the Google text classification API. 150 paragraphs were picked and their labels evaluated by hand to determine the effectiveness of different thresholds.

3.2.2 Classifying unlabeled data

With these initially labelled paragraphs, three different binary classifiers were trained, specified below, to label the remainder of the dataset as either financially relevant (1) or not (0), ensuring we have enough data to train the passage re-ranking models comparably to the generic MS MARCO-trained models. As topic classifiers we use a BiLSTM classifier, a bi-directional LSTM topic classifier that has been shown to perform well on these tasks (see Section 2.1); an SVM, a standard baseline for binary classification due to its limited number of trainable parameters; and an out-of-the-box topic classifier from FastText [31], which comes with pre-trained word embeddings and provides a platform where one can load in the data and train a topic classification model without many other degrees of freedom. To train these classifiers, we first pre-process the textual data so it can be fed to each classifier in a uniform way.

Pre-processing

To train the classifiers we first prepare the data. This way we clean the noise out of the textual data and filter out words that bear little meaning in the given sentence [33].

1. Performing an exploratory dataset analysis, we see that the initial dataset is imbalanced, with an abundance of non-financial sentences. We first de-duplicate and downsample the data to create a set that consists of equal parts non-financial and financially relevant instances, see Fig. 3.2. This results in a dataset with a volume reduction of 90%, giving us 6,346 labelled datapoints, balanced between financial and non-financial instances, out of the initial 65,637. We also see that in terms of sentence length the distribution stays close to that of the full dataset (see Fig. 3.2b and Fig. 3.2d), meaning that when tokenizing with BERT in later steps we can apply existing methods without having to worry that the properties of the data change due to sampling; the syntactic properties of both datasets stay close to each other.

2. The text is lowercased to handle differences in capitalisation, and stop words are removed; these are words that are deemed unnecessary or so common that they provide no predictive features for the task at hand. The NLTK framework [8] is used for this, which bases its stop words on the Porter Stemmer [63], which is also used for stemming the sentences. This way we end up with a sentence that has no redundant information and only captures the essence of a passage (a minimal sketch of this step is shown after this list).

3. This balanced and pre-processed data is then embedded in different ways to train the three classifiers.
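A minimal sketch of step 2, using NLTK's standard English stop-word list and Porter stemmer (a simple regular expression stands in for a proper tokenizer), is:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(passage: str) -> str:
    """Lowercase, drop stop words and stem, keeping only the essence of a passage."""
    tokens = re.findall(r"[a-z]+", passage.lower())
    return " ".join(stemmer.stem(tok) for tok in tokens if tok not in stop_words)

print(preprocess("The quarterly numbers were presented to the board of directors."))
# e.g. 'quarterli number present board director'
```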

Figure 3.2: Data distribution and sentence lengths for both the unsampled and downsampled subsets of data after preprocessing. Panels: (a) data distribution of the unsampled subset; (b) distribution of sentence length in number of words of the unsampled dataset; (c) data distribution of the downsampled subset; (d) distribution of sentence length in number of words of the downsampled dataset.

BiLSTM

The first algorithm used as a topic classifier is a BiLSTM. This is a type of LSTM that uses both forward and backward passes to capture the available context in greater detail, giving higher accuracy on the classification task. To train the BiLSTM model we use the pre-processed data as a starting point and embed it using the FastText [32] word embeddings. These are pre-trained for text classification and, unlike GloVe and Word2Vec, provide encodings for out-of-vocabulary tokens out of the box.

The LSTM is trained and fine-tuned using PyTorch [58] and uses 70% of the roughly 6,500 labelled datapoints as the training set, while the remaining 30% is split equally into validation and test sets.


TorchText is used to create the input data for the BiLSTM model and to create the 300-dimensional embeddings using FastText. Using Bayesian hyperparameter optimisation with the Weights and Biases (WandB) optimization software, Biewald [7], we find the optimal parameters for the model as shown in Table 3.2, achieving an accuracy of 90.70% when classifying the test set. We use a single output node and a binary cross-entropy loss with a sigmoid function in the forward pass to map the outcome of the model to the predicted class and calculate the loss, see Table 3.2.
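A minimal PyTorch sketch of such a classifier, using the hyperparameters from Table 3.2, is given below; the actual implementation may differ in its details.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over FastText-embedded sentences with a single output node."""
    def __init__(self, embedding_dim=300, hidden_dim=256, num_layers=2, dropout=0.1):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, dropout=dropout, batch_first=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)

    def forward(self, embedded):                         # embedded: (batch, seq_len, 300)
        _, (hidden, _) = self.lstm(embedded)
        h = torch.cat((hidden[-2], hidden[-1]), dim=1)   # final forward and backward states
        return self.fc(h).squeeze(1)                     # raw logit; BCEWithLogitsLoss adds the sigmoid

# Training would use nn.BCEWithLogitsLoss() and torch.optim.Adam, as listed in Table 3.2.
logits = BiLSTMClassifier()(torch.randn(4, 20, 300))     # 4 sentences of 20 FastText vectors each
```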

Figure 3.3: Architecture of the BiLSTM text classifier, based on the image from Tensorflow

SVM

An SVM classifier was trained using Scikit-learn [59] to classify the data in the test dataset. For this, we first build the vocabulary features using TF-IDF techniques from Scikit-learn [59] and feed these into the SVM classifier. This is an easy way to validate the results of the other classifiers, as it is intuitive to set up and quick to train.
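A sketch of this baseline with scikit-learn is shown below; the toy texts stand in for the pre-processed, balanced passages and are not taken from the actual dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts = ["roth ira withdraw tax", "mortgag interest rate loan",
         "firefli light biolumines", "tattoo fixer appoint"]
labels = [1, 1, 0, 0]                        # 1 = financially relevant

svm_clf = make_pipeline(
    TfidfVectorizer(max_features=5000),      # TF-IDF features, capped at 5000 as in Table 3.2
    SVC(kernel="linear", C=1.0),             # linear kernel, C = 1.0 as in Table 3.2
)
svm_clf.fit(texts, labels)
print(svm_clf.predict(["retir incom need"]))
```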

FastText

Lastly, we also used an out-of-the-box FastText [31] model for validation purposes. This uses the same pre-trained embeddings as the BiLSTM and fine-tunes them using supervised learning. Each word is embedded as a bag of character n-grams, trained using a sliding window over all present words. It also uses embeddings for out-of-vocabulary words to make a complete representation of the data. Discarding words that occur very frequently, as per Eq. (3.1), gives a tailored vocabulary for the domain-specific dataset. Using these embeddings, a simple binary classifier is trained on the training data, much like the SVM.

\[ P(w) = \sqrt{\frac{t}{f(w)}} + \frac{t}{f(w)}, \qquad f(w) = \frac{\mathrm{count}_w}{\text{total number of tokens}} \tag{3.1} \]
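A sketch with the fasttext Python package, using the parameters from Table 3.2, is shown below; the training-file name and its label format are assumptions.

```python
import fasttext

# train.txt: one pre-processed passage per line, prefixed with its label,
# e.g. "__label__1 roth ira withdraw tax ..."
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.2,
    epoch=25,
    wordNgrams=2,
    loss="hs",    # hierarchical softmax, as in Table 3.2
    dim=300,
)
labels, probabilities = model.predict("how much income in retirement do i need")
```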

Table 3.2 shows the different models trained for text classification and their optimal parameters found using WandB; the BiLSTM achieves the highest accuracy on the test set. In addition, the BiLSTM outputs a number between 0 and 1 instead of a hard class like the SVM. We will continue using the BiLSTM as the classifier and set the confidence threshold at 80%, as this gives the best quality, measured by hand, and is the most robust to future changes: we can fine-tune this classifier for other domains and build on top of it to a higher degree than the SVM. Since training the classifier is a one-time operation, the computational cost of picking a BiLSTM over the SVM does not play a role.

Bi-LSTM
  Parameters: number of hidden layers = 2, hidden layer size = 256, dropout rate = 0.1, number of epochs trained = 5, loss = binary cross-entropy, optimizer = Adam
  Features: 300-dimensional embedded sentence using FastText embeddings
  Train/test/validation split: 70%/15%/15%
  Accuracy on test set: 90.70%

SVM
  Parameters: kernel type = linear, regularization parameter = 1.0
  Features: TF-IDF embedding with a maximum of 5000 features
  Train/test/validation split: 70%/15%/15%
  Accuracy on test set: 90.54%

FastText
  Parameters: learning rate = 0.2 (decayed every 100 steps), loss = hierarchical softmax [49], word n-grams = 2, number of epochs = 25
  Features: 300-dimensional embeddings using FastText's built-in word embeddings
  Train/test/validation split: 70%/15%/15%
  Accuracy on test set: 87.80%

Table 3.2: Overview of the models used to binary classify the textual data, along with their optimal parameters found using Bayesian parameter optimisation with WandB

3.3 Domain-specific dataset

Using the Bi-LSTM, the remaining 8.7 million passages are classified. These classified paragraphs are used to build the dataset in the same way as the original dataset is built, see Section 3.1. The result is a financial subset of the existing dataset that has exactly the same structure as the original MS MARCO; this enables us to apply existing fine-tuning methods to the data at hand and to use the original dataset to compare the models against. Keeping only the passages classified as relevant for our domain, we end up with a dataset reduction of approximately 93%, meaning that of the initial 8.8 million passages around 600,000 are relevant to the domain we are interested in.

To train and fine-tune, we also need a financial subset of the training triples that are used in most deep learning tasks. These triples consist of a query, a positive example and a randomly picked negative example of a returned paragraph, so every query has at least one example of a correctly returned passage and one incorrect example. This can in turn be fed to language models like BERT to fine-tune them for the task at hand. To create this training file, we cross-reference the identified relevant queries with the queries that are present in the original MS MARCO training file. This is done via the unique identification number that every query has in the original MS MARCO dataset. We can then use these identifiers to filter out the queries that are not represented in our own dataset, creating a training file that consists solely of financially relevant data points. Table 3.3 shows the number of training triples and the reduction in the size of the training sets used in this thesis with respect to the original triple sets.
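A sketch of this filtering step is shown below; the file names and the ID-based triples layout are assumptions (MS MARCO also ships a text-only triples file, which would require an extra query-to-ID lookup).

```python
# financial_query_ids.txt: one query ID per line, classified as financially relevant (hypothetical file)
with open("financial_query_ids.txt") as f:
    financial_qids = {line.strip() for line in f}

# qidpidtriples.train.tsv: <query id> \t <positive passage id> \t <negative passage id>
with open("qidpidtriples.train.tsv") as src, open("financial_triples.tsv", "w") as dst:
    for line in src:
        qid = line.split("\t", 1)[0]
        if qid in financial_qids:      # keep only triples whose query is financially relevant
            dst.write(line)
```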

The method described above creates a subset of the existing MS MARCO dataset and can be used with any topic that the Google Natural Language API, or any pre-trained classifier, supports. This also enables other readers to repeat the method with different labels and create datasets for other topics. With this, one can train and fine-tune domain-specific information retrieval models that fetch relevant passages for user-given queries and implement them in practice.

Financial API based dataset

As mentioned, a second, similar dataset was made using only the Google Natural Language API labels, which are not used as input for a classifier but as the dataset itself. The quality of these labels and the accompanying queries is substantially better than that of the BiLSTM-classified dataset, see Table 3.5. The difference is that no classifier was needed to create enough data points for a full dataset (see Fig. 3.1), as there are already enough datapoints to train our passage re-ranking model, eliminating an extra step that needs to be fine-tuned in the process. This dataset is also used to create a subset of MS MARCO in the same fashion as above. It consists of higher-quality queries that, upon inspection by eye, represent the financial domain in a way that eliminates more non-relevant queries than the Financial BiLSTM based dataset (see Section 3.4) and is thus valuable to test against. Since this route is not available without proper funding, both datasets will be used and tested.

Train data                                      Number of training triples   Size       Relative size w.r.t. complete full dataset
Complete triples (Tiny set)                     500,000                      0.4 GB     0.13%
Original triples (Small set)                    39,782,779                   27.1 GB    10.00%
Original triples (Full set)                     397,756,691                  292.2 GB   -
Financial API based triples (Full set)          7,577,749                    5.6 GB     1.91%
Financial classifier based triples (Full set)   43,402,153                   31.9 GB    10.91%

Table 3.3: Train triples and their corresponding sizes for both the original MS MARCO dataset and their financial counterparts

3.4 Comparing the full and financial MS MARCO datasets

In Table 3.3 and Table 3.4 we see that the Financial BiLSTM classified dataset has more data points to train on than the Financial API based dataset. We also see in Table 3.5, and upon closer inspection of both datasets by eye, that the quality of the queries labelled as financially relevant in the Financial API based dataset is better than in the BiLSTM classified dataset. There is a 7.3% overlap between the queries in the Financial API based dataset and the Financial BiLSTM classifier based dataset. We can conclude that, although we end up with a bigger dataset when we label passages and queries using a trained BiLSTM classifier, the quality of the queries deteriorates. This is another reason to keep both datasets when fine-tuning the models used for passage re-ranking in Chapter 4.

Number of relevant passages per query   Complete MS MARCO   Financial classifier based subset   Financial API based subset
1                                       6590                733                                 41
2                                       331                 8                                   -
3                                       51                  -                                   -
4                                       8                   -                                   -
Total queries                           6980                741                                 41

Table 3.4: Number of relevant passages per query, per dataset, for the test set

Dataset                      Sample queries
Complete                     tattoo fixers how much does it cost
                             what are the four major groups of elements
                             how does a firefly light up
Financial API based          are withdrawals from roth ira taxable
                             what is defamation harm
                             how long must you stay in the hospital before medicare will cover a nursing home stay
Financial classifier based   is autoimmune hepatitis a bile acid synthesis disorder
                             how much income in retirement do i need
                             which u.s. president is credited with inspiring the maxwell house slogan good to the last drop

Table 3.5: Sample of three queries per dataset: the normal MS MARCO, queries labelled using the Google Natural Language API, and queries labelled as financial using the BiLSTM classifier.

Following Loughran and McDonald [44], we also measure the readability scores of all queries and passages in the datasets. We use the Gunning Fog (FOG) index [35] and the Dale-Chall readability formula [20]. These scores measure the readability of passages in the English language and are used in American schooling to determine, on average, what reading level a text is suited for. The FOG score ranges from 6 to 17, where 6 corresponds to a sixth-grade student (similar to 8th grade in the Dutch schooling system) and 17 to a college graduate. In the Dale-Chall system, a grade of 4.9 or lower is easily understood by a fourth-grade student (6th grade in the Dutch system), whereas a 9.9 is easily understood by a college student. The difference between the two scores is that the FOG score rates a passage based on so-called complex words, words that are at least three syllables long, see Eq. (3.2), whereas Dale-Chall, which builds on the FOG score, measures readability against a list of roughly 3,000 difficult words, as per Eq. (3.3).

0.4 \left( \frac{\text{Words}}{\text{Sentences}} + 100 \cdot \frac{\text{Complex words}}{\text{Words}} \right)    (3.2)


0.1579 \left( \frac{\text{Difficult words}}{\text{Words}} \times 100 \right) + 0.0496 \left( \frac{\text{Words}}{\text{Sentences}} \right)    (3.3)

where:

Words = Number of words in the passage
Sentences = Number of sentences in the passage

Complex words = Number of words that consist of 3 or more syllables

Difficult words = Number of words that do not appear in a pre-determined list of familiar words
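As an illustration, a minimal implementation of Eq. (3.2) is given below; it assumes a naive regex tokenizer and a rough vowel-group syllable counter, whereas libraries such as textstat implement both readability scores more carefully.

import re

def count_syllables(word):
    # Rough estimate: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

# gunning_fog("are withdrawals from roth ira taxable")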

The results of these tests on all created datasets can be seen in Table 3.6. Although the Financial API based dataset is often categorised as more difficult than the others, the scores remain comparable, never exceeding the other datasets by more than one full point. These results are in line with the earlier findings in Fig. 3.2, where we showed that the syntactic properties of the passages show at worst minor deviations.

Dataset | FOG score | Dale-Chall score
Queries: Complete | 7.38 | 7.91
Queries: Financial API based | 8.22 | 7.83
Queries: Financial classifier based | 6.98 | 7.55
Passages: Complete | 12.89 | 7.85
Passages: Financial API based | 13.49 | 7.93
Passages: Financial classifier based | 12.92 | 7.86
Highest score possible | 17.00 | 9.90

Table 3.6: Readability scores for the queries and passages of each dataset; a higher score means the text is more difficult to understand.

With both the Financial API based dataset and the Financial BiLSTM classifier based dataset, we will continue to fine-tune the re-ranking model to learn to rank passages given a query. This answers the first research question: how do we create a domain-specific dataset from a subset of existing datasets to fine-tune IR pipelines on? With these two datasets we can test the effect of a smaller but qualitatively better dataset, the Financial API based dataset, as well as a bigger but less topic-relevant dataset, the Financial BiLSTM classified dataset. Because both datasets are formatted in the same way as the public MS MARCO dataset, we can make use of existing techniques and compare our results with other methods. In the next chapter, Chapter 4, we describe how we fine-tuned our re-ranking model, the choices that were made and the steps that are necessary.


Chapter 4

Evaluating state-of-the-art passage re-ranking on financial data

In this chapter, we discuss how the financial datasets created in Chapter 3 are combined with a current state-of-the-art re-ranking method, and we evaluate both the off-the-shelf state-of-the-art models and the models fine-tuned on the financial datasets. Here, we want to answer the following research questions: how can we apply existing techniques in IR and NLP to real-world IR passage re-ranking problems, and how can we evaluate the new passage re-ranking models and compare them with open-domain state-of-the-art passage re-ranking models in terms of quality and model size? In Section 4.1 we first explain how the current state-of-the-art method, presented in Section 2.2.3, is applied to these datasets and how it is fine-tuned, which baselines we use to evaluate the fine-tuned models, and which evaluation metrics are used to compare the models. In Section 4.2 we present and interpret the gathered results.

4.1 State-of-the-art re-ranking method on financial data

Below, we discuss the steps taken to create a passage re-ranking model. We use the datasets created earlier to fine-tune a BERT model for the specific task of re-ranking passages given a user query. We discuss the fine-tuning process and the accompanying choices in Section 4.1.1, followed by the chosen baselines and their implementation in Section 4.1.2. The evaluation metrics used to test the effectiveness of all models are described in Section 4.1.3. Lastly, Section 4.1.4 gives a full overview of the experimental protocol needed to replicate this thesis.

The full ranking pipeline can be seen in Fig. 4.1. Here, we primarily focus on the re-ranking step, the step that uses BERT. BM25 is used to pre-rank the documents as it has few tunable parameters and is lightweight [68]. This way, we do not have to evaluate the full list of candidates with a language model, which would slow the process down considerably (see Table 4.4). BM25 is used both as a baseline and to pre-rank the results, which are then fed to a language model to be re-ranked into the final results. The MS MARCO team already provides such pre-ranked 'top 1000' development and evaluation lists, focusing on recall. Within ING, the current setup re-ranks the top k BM25 results using USE and cosine similarity.

Figure 4.1: A schematic overview of the full multi-stage ranking pipeline: a domain-specific query is pre-ranked against the domain-specific documents with BM25, the top K (1000) results are re-ranked by a domain-specific BERT re-ranker that is fine-tuned on domain-specific train data (documents and queries), and the re-ranked result list is evaluated.
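The pipeline of Fig. 4.1 can be summarised in a few lines of code. This is a minimal sketch in which bm25_search and bert_score are placeholders for, respectively, the Anserini/ElasticSearch retriever and the fine-tuned BERT scorer described in Section 4.1.1.

def rank(query, bm25_search, bert_score, k=1000, top_n=10):
    # Stage 1: cheap lexical pre-ranking, tuned for recall.
    candidates = bm25_search(query, k=k)  # list of (passage_id, passage_text)
    # Stage 2: expensive BERT re-ranking of the candidate list only.
    scored = [(pid, bert_score(query, text)) for pid, text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]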

4.1.1 Fine-tuning BERT Model

With the domain-specific datasets, one can fine-tune a language model for the task of generating a ranked list of relevant passages for a given query in the given domain. The implementation to fine-tune the BERT model for passage re-ranking is based on Nogueira and Cho [51] and adds the use of the financial datasets to fine-tune BERT models for the re-ranking task in the financial domain. The BERT models are trained on query-passage pairs from the MS MARCO dataset, where a label states whether the passage is relevant (1) or not relevant (0). The BERT model is pre-trained as explained in Section 2.2.3 and uses the [CLS] vector as input to a single-layer neural network to obtain a binary classification score, i.e. the probability s_j that the passed passage d_j is relevant for a query q. These scores are collected and used to create a ranking. When we talk about fine-tuning a BERT model, we thus mean fine-tuning the last linear layer for the task at hand, as per Devlin et al. [22]; fine-tuning the full model would require additional data and increase the computational cost of the task. Devlin et al. [22] have shown that fine-tuning a task-specific last layer already raises state-of-the-art results on different NLP tasks. The resulting model is used to re-rank the top-k candidates gathered using BM25. The model is trained with a cross-entropy loss and has many similarities with the BiLSTM model used to classify the data. The loss function in Eq. (4.1) is optimized, and we end up with a ranked list R of the top-k most relevant passages for a query q. This can be seen as a pointwise learning-to-rank approach [42].
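A minimal sketch of this scoring step is shown below, using the Hugging Face transformers API; the checkpoint path is a placeholder for a model fine-tuned as described here, and a two-class (relevant / not relevant) head is assumed. The loss in Eq. (4.1) below is what such a classifier is trained with.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/financial-bert-reranker", num_labels=2)  # placeholder checkpoint
model.eval()

def bert_score(query, passage):
    # The pair is packed as [CLS] query [SEP] passage [SEP]; the [CLS]
    # representation feeds the single classification layer.
    inputs = tokenizer(query, passage, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # probability of relevance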

L = -\sum_{j \in J_{pos}} \log(s_j) - \sum_{j \in J_{neg}} \log(1 - s_j)    (4.1)

where:

J_{pos} = Set of indexes of relevant candidate passages

J_{neg} = Set of indexes of non-relevant candidate passages

s_j = Probability score for passage j to be relevant
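As a minimal sketch, assuming scores holds the predicted probabilities s_j for a batch of query-passage pairs and labels marks relevant passages with 1 and non-relevant ones with 0, the loss of Eq. (4.1) can be computed as:

import torch

def pointwise_loss(scores, labels):
    eps = 1e-8  # numerical safety for log(0)
    pos = -torch.log(scores[labels == 1] + eps).sum()
    neg = -torch.log(1.0 - scores[labels == 0] + eps).sum()
    return pos + neg  # an (unaveraged) binary cross-entropy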

We note that due to memory limitations of the graphical processing units (GPUs) used, an NVIDIA GeForce 1080 Ti and an NVIDIA Titan V, we cannot train on the full dataset. BERT Base models are already too big to fit into memory with a batch size larger than 2 on the available GPUs, and the BERT Large model, as used in Nogueira and Cho [51], cannot be trained at all on the available GPUs; tensor processing units (TPUs) are necessary for training such large models. Following Turc et al. [76], we chose three model sizes to train: BERT Base, BERT Small and BERT Tiny. The Tiny model is the smallest model offered in terms of the number of parameters and the quickest to train, while the Base model enables us to compare our results with most other research and our baselines in this field. The Small model is the largest model that comfortably fits on a GPU and is therefore the third model we use; see Table 4.1 for a full overview of the model specifications.

Model | Layers | Hidden layer size | Self-attention heads | Number of parameters (in millions)
BERT Tiny | 2 | 128 | 2 | 4.4
BERT Small | 4 | 512 | 8 | 29.1
BERT Base | 12 | 768 | 12 | 110
BERT Large | 24 | 1024 | 16 | 340
Universal Sentence Encoder (USE) | - | - | - | 149

Table 4.1: Size and number of layers along with the parameter count of the different models we use. BERT Large is listed as a reference model only and is not used in this thesis.

Another way to cope with the size and training speed restrictions of the BERT models is to limit the amount of training data used. It has been shown that models trained on a limited amount of data can still outperform various state-of-the-art models (see Fig. 4.2). Table 4.2 shows the train set sizes used per model.



Figure 4.2: Number of MS MARCO examples seen during training vs. MRR@10 performance. From Nogueira and Cho [51]

Dataset | Model | Batch size | Example pairs | Percentage of total dataset
Complete | BERT Tiny | 128 | 12,800,000 | 1.61%
Complete | BERT Small | 32 | 3,200,000 | 0.40%
Complete | BERT Base | 2 | 200,000 | 0.03%
Financial API based | BERT Tiny | 128 | 12,800,000 | 84.46%
Financial API based | BERT Small | 32 | 3,200,000 | 21.11%
Financial API based | BERT Base | 2 | 200,000 | 1.32%
Financial classifier based | BERT Tiny | 128 | 12,800,000 | 14.75%
Financial classifier based | BERT Small | 32 | 3,200,000 | 3.69%
Financial classifier based | BERT Base | 2 | 200,000 | 0.23%

Table 4.2: Dataset usage for the complete and both financial train sets; 100,000 train steps were used in all cases
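For clarity, the 'Example pairs' column follows directly from the batch size multiplied by the fixed number of training steps:

train_steps = 100_000
example_pairs = {
    "BERT Tiny":  128 * train_steps,  # 12,800,000
    "BERT Small":  32 * train_steps,  #  3,200,000
    "BERT Base":    2 * train_steps,  #    200,000
}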

By fine-tuning the three different models on the three datasets, we obtain nine tuned models that we can compare in terms of the evaluation metrics proposed in Section 4.1.3 and in terms of model size. In Section 4.1.2 we introduce the baselines against which we further compare our tuned models.

4.1.2 Baselines

Here we introduce the baselines used and how their results are gathered. We first briefly explain Anserini, a framework that is commonly used for evaluating IR systems. The remainder of the section discusses the baselines and the choices made with regard to architecture and parameters.



Anserini

Anserini [82] is an open-source information retrieval toolkit that enables users to set up baselines and test common models, such as BM25, against their own results. It provides pre-created notebooks with the most common algorithms on the most common information retrieval datasets as baselines, which users can build on to create more specific baselines. From this toolkit, we use a BM25 baseline tuned on the MS MARCO dataset. Anserini offers multiple evaluation criteria, making it convenient to use for academic purposes.

Baseline models

Using the Anserini toolkit, we use an MS MARCO tuned BM25 algorithm as a standard baseline, which is common practice in both industry and academia within information retrieval [82], [40], [81]. The tuning is done by the tuning algorithm provided by Anserini, a grid-search based optimizer that increments the tunable parameters by tenths until it finds an optimal solution. It is optimized across five different subsets of data to regularize the found optima, on an 8th generation 2.6 GHz 6-core Intel i7 CPU. We differentiate between an optimum tuned for recall, used in the pre-ranking step where we want all relevant results to be returned, and an optimum for the evaluation metrics, used in the baseline runs to achieve the highest performance using only BM25. See Table 4.3 for the values gathered.

Setting | MRR@10 | MAP | Recall@1000
Anserini default (k1=0.9, b=0.4) | 0.1840 | 0.1926 | 0.8526
Tuned to maximise recall (k1=0.82, b=0.68) | 0.1874 | 0.1957 | 0.8573
Tuned to maximise MRR@10 and MAP (k1=0.60, b=0.62) | 0.1892 | 0.1972 | 0.8555

Table 4.3: BM25 tuning values using the Anserini toolkit
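A minimal sketch of such a grid search is given below; evaluate(k1, b) is a placeholder that indexes and searches a development subset with the given BM25 parameters and returns the metric being maximised (recall@1000 or MRR@10/MAP).

def tune_bm25(evaluate, k1_values, b_values):
    best_k1, best_b, best_score = None, None, float("-inf")
    for k1 in k1_values:
        for b in b_values:
            score = evaluate(k1, b)
            if score > best_score:
                best_k1, best_b, best_score = k1, b, score
    return best_k1, best_b, best_score

# k1_grid = [round(0.1 * i, 1) for i in range(1, 21)]  # 0.1 .. 2.0
# b_grid = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1 .. 1.0
# tune_bm25(evaluate, k1_grid, b_grid)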

ING baseline

In addition to the BM25 algorithm, we use the current ING setup. This is also a multi-stage document ranker, similar to the proposed solution: BM25 is used to pre-rank the documents by means of ElasticSearch [23], after which the top N documents are re-ranked using the USE model. This combination of BM25 pre-ranking and USE re-ranking is used as a baseline to test whether the proposed solution outperforms the current industry standard at ING. The method is already optimized within ING, as per Table 4.4, and uses N = 20 to combine good results with an acceptable latency for the product it is used in.


Top N | MRR@10 | MAP | Latency (sec)
10 | 0.2134 | 0.2095 | 0.288
20 | 0.2289 | 0.2248 | 0.485
30 | 0.2370 | 0.2328 | 0.612
40 | 0.2391 | 0.2349 | 0.712
50 | 0.2405 | 0.2361 | 1.015

Table 4.4: Results of the USE re-ranker per number of documents to be re-ranked out of 1000 BM25 pre-ranked documents
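A minimal sketch of this re-ranking step is shown below; it loads the public Universal Sentence Encoder from TensorFlow Hub (assumed here to correspond to the model used within ING) and sorts the top-N BM25 candidates by cosine similarity to the query.

import numpy as np
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def use_rerank(query, passages, top_n=20):
    # Embed the query together with the top-N candidate passages.
    embeddings = use([query] + list(passages[:top_n])).numpy()
    q_vec, doc_vecs = embeddings[0], embeddings[1:]
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-8)
    order = np.argsort(-sims)
    return [(passages[i], float(sims[i])) for i in order]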

To test the academic relevance and the benefit for ING, we also test our models against the non-fine-tuned BERT models shown in Table 4.1, as done in Nogueira et al. [53]. We compare all the different models that we use as well as the effect of having a domain-specific dataset.

4.1.3 Evaluation metrics

All models presented are tested using two evaluation metrics: the mean reciprocal rank over the top 10 re-ranked passages (MRR@10) and the mean average precision (MAP). Both metrics are commonly used in information retrieval tasks and make it easier to compare the gathered results with baselines, while maintaining a clear qualitative measure for the trained models [53], [51], [4], [26], [17].

Mean Reciprocal Rank (MRR)

Reciprocal rank (RR) of a query q is measured by taking the inverse of the rank of the first relevant answer of q, see Eq. (4.2).

RR(q) = \frac{1}{\text{rank}(q)}    (4.2)

MRR, as seen in Eq. (4.3), is the average RR over a set of queries Q. When we measure the reciprocal rank at 10, we only take the first 10 returned passages per query into account. This is done both because in a real-life situation we will initially not present more than 10 results, and because it is the evaluation metric of the MS MARCO passage re-ranking challenge, making our results comparable to results found there and to research that focuses on this dataset.

MRR(Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} RR(q_i)    (4.3)

where:

q_i = A query

Q = A set of queries
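As a minimal sketch, assuming run maps each query id to its ranked list of passage ids and qrels maps each query id to the set of relevant passage ids, MRR@10 can be computed as:

def reciprocal_rank(ranked, relevant, cutoff=10):
    for rank, pid in enumerate(ranked[:cutoff], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(run, qrels, cutoff=10):
    return sum(reciprocal_rank(ranked, qrels.get(qid, set()), cutoff)
               for qid, ranked in run.items()) / len(run)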



Mean Average Precision (MAP)

A downside of MRR is that it only measures the position of the first relevant result. To tackle this and to get a clearer picture of the full returned passage ranking, we also measure MAP. MAP represents the area under the precision-recall curve, providing the average precision per query and per model.

MAP consists of calculating the average precision (AveP, see Eq. (4.4)). AveP combines precision and recall into a mean precision score over the relevant documents in a returned set. This, in turn, is averaged over all queries to get the MAP for a full dataset, as in Eq. (4.5).

AveP(q) = \frac{\sum_{r \in R(q)} P@r}{|R(q)|}    (4.4)

where:

P@r = Precision at rank r

R(q) = Set of ranks at which a relevant passage is returned for query q

MAP(Q) = \frac{\sum_{i=1}^{|Q|} AveP(q_i)}{|Q|}    (4.5)

where:

Q = Set of queries

|Q| = Size of set Q

AveP(q_i) = Average precision of query q_i

We note, as can be seen in Table 3.4, that few queries have more than one relevant passage, which causes MAP to be close to MRR. MAP is nevertheless a valid metric to compare with other research and with the baseline used at ING, and it can in turn provide a baseline metric for future research on models trained on queries with more relevant passages per query. Additionally, Normalized Discounted Cumulative Gain (NDCG) is a commonly used evaluation metric in passage re-ranking. We chose not to use it because NDCG extends MAP to graded relevance labels (think of five-star reviews); since we work with binary relevance, where a passage is either relevant or not, NDCG would not add anything here.
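Under the same run/qrels layout as the MRR sketch above, MAP over binary relevance judgements can be computed as follows; this is a minimal sketch of Eqs. (4.4) and (4.5).

def average_precision(ranked, relevant):
    if not relevant:
        return 0.0
    hits, precisions = 0, []
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            hits += 1
            precisions.append(hits / rank)  # P@r at each relevant rank
    return sum(precisions) / len(relevant)

def mean_average_precision(run, qrels):
    return sum(average_precision(ranked, qrels.get(qid, set()))
               for qid, ranked in run.items()) / len(run)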

4.1.4 Experimental protocol

Looking at Fig. 4.1, we summarise the full experimental protocol here. Using an MS MARCO formatted dataset, either a domain-specific dataset from Chapter 3 or the complete open-domain MS MARCO dataset, we train the BERT based passage re-ranker, fine-tuning it for the re-ranking task at hand. Fine-tuning a BERT model of choice is done by taking the raw weights and model structure and adding a fully connected linear layer that uses the [CLS] output node of the BERT model to classify the proposed passage as either relevant or not given the query.
