Using Sparse Latent Representations for Document Retrieval and its impact on Question Answering


Tijmen M. van Etten (11781289)

Bachelor thesis
Credits: 18 EC
Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: dhr. dr. ir. J. Kamps
Institute for Logic, Language and Computation (ILLC)
Faculty of Science
University of Amsterdam
Science Park 107, 1098 XG Amsterdam


Abstract

Neural approaches in both Information Retrieval (IR) and Question Answering (QA) increasingly rely on large amounts of data and computing power. In response, new approaches have been proposed to tackle this issue, such as the Standalone Neural Ranking Model (SNRM) proposed in [Zamani et al., 2018]. By using Sparse Latent Representations (SLRs) to represent queries and documents, rather than the computationally expensive dense representations used in most neural ranking approaches, the need for computing power is significantly decreased. In this thesis, a demonstration of the SNRM is implemented and extended to the Question Answering task. The result is a QA system that can efficiently rank documents using Sparse Latent Representations, extract several candidate answers from the ranked documents, and select a final target answer based on the scores of both stages. During evaluation of the QA system, the SLR-based approach is compared to a traditional word-based approach. Although the Question Answering performance of the SLR-based approach is significantly lower than that of the word-based approach, it provides a solid starting point for further research into the development of Sparse Latent Representations fit for this task.


Contents

1 Introduction
  1.1 Research Questions
  1.2 Thesis Overview

2 Theoretical Background
  2.1 Document Ranking
    2.1.1 Indexing and Searching
    2.1.2 Standard Ranking Functions
    2.1.3 Neural Ranking
    2.1.4 Standalone Neural Ranking Model
  2.2 Question Answering
    2.2.1 QA Approaches
    2.2.2 BERT

3 Anserini: The (Neural) Search Engine
  3.1 Standard Ranking
    3.1.1 Standard Indexing
    3.1.2 Standard Querying
  3.2 Sparse Neural Re-Ranking
    3.2.1 Sparse Neural Re-Ranking Indexing
    3.2.2 Sparse Neural Re-Ranking Querying
  3.3 Standalone Neural Ranking
  3.4 Summary

4 Method
  4.1 The SQuAD Dataset
  4.2 Context Retrieval
  4.3 Answer Extraction
  4.4 Answer Selection
  4.5 Evaluation Methods
    4.5.1 Retrieval Evaluation
    4.5.2 Answer Selection Evaluation

5 Results
  5.1 Context Retrieval
  5.2 Answer Selection

6 Discussion
  6.1 Limitations and Future Research
  6.2 Conclusion

A Answer Extraction Code


Chapter 1

Introduction

The development of Neural Networks across a broad range of A.I. research fields has resulted in an increasing need for data and computational power. Modern Information Retrieval systems especially often rely on non-neural first-stage rankers to efficiently narrow down the search space and to reduce computation for the 'dense' neural ranking models in a later stage. Using an end-to-end neural approach could potentially remove the restrictive gate-keeping effect that such non-neural first stages have. An attempt to solve this problem was made in [Zamani et al., 2018], with the proposal of the Standalone Neural Ranking Model (SNRM). In this model, Sparse Latent Representations (SLRs) are used to drastically reduce neural text representation sizes (as opposed to dense representations), allowing for efficient indexing and ranking. This approach offers a full neural stack, using an offline index to query whole corpora without the need for a first-stage word-based ranker. As stated in the paper, this approach holds 'the potential to improve the efficiency without a loss of effectiveness'.

Additionally, the SNRM can be optimized for specific NLP tasks, allowing for wide applicability in various active NLP research fields. One such active field that depends on the retrieval of relevant documents is open-domain Question Answering (QA). Like the dense neural ranking models, the current state-of-the-art QA systems require immense computational power due to their complexity and size. To illustrate, Google's BERT_LARGE has a total of 340 million parameters [Devlin et al., 2019]. Due to the increasing size of these QA models, whole-corpus answer extraction has become close to infeasible without a first-stage document retriever.

For this thesis, a live demonstration of the Standalone Neural Ranking Model is developed and extended into a Question Answering system, in order to evaluate the performance of the SNRM on document retrieval and its impact on the downstream Question Answering task. The full QA system pipeline consists of three stages: a retrieval stage, an answer extraction stage and an answer selection stage. The retrieval stage will be evaluated for both a standard word-based ranker and a ranker based on Sparse Latent Representations. After document retrieval in the first stage, answers are extracted from the retrieved documents using the current state-of-the-art answer extraction model, BERT. Lastly, the final target answers are selected using a combination of scores from the first two stages. The QA evaluation is done on the SQuAD dataset [Rajpurkar et al., 2018].

1.1 Research Questions

As the goal of this thesis is to implement the Standalone Neural Ranking Model and extend its use to a working Question Answering system, the research question can be stated as follows:

How can a live demonstration of the Standalone Neural Ranking Model be created and extended to the Question Answering task?

To further evaluate the full system's retrieval and Question Answering performance, two additional sub-questions can be stated as follows:

What is the impact of using Sparse Latent Representations for document retrieval?

What is the overall impact of using Sparse Latent Representations for a first-stage answer context retriever in a Question Answering pipeline?

1.2 Thesis Overview

The implementation of the Standalone Neural Ranking Model is part of a co-authored project, carried out in collaboration with Lodewijk van Keijzerswaard, Felix Rustemeyer and Roel Klein. The co-authored thesis parts consist of Section 2.1, to which each of us contributed equally, and the entirety of Chapter 3, during which our paths split.

An overview of the thesis' contents is provided here. Firstly, Section 2.1 describes the theoretical background on document ranking. Then, some background information on Question Answering is given in Section 2.2. Chapter 3 describes the construction of different types of indices using Anserini. The four of us worked together on exploring the standard word-based ranking in Anserini and documented the relevant findings in Section 3.1. After creating a first working implementation of the neural search engine, described at the beginning of Section 3.3, our paths split. Important to note is that although the SLR index was initially intended to be a part of the shared component, it was along the way taken over and completed by Lodewijk as his individual project. Chapter 4 describes the full QA system pipeline, as well as the methods used to evaluate the system on the SQuAD dataset. Results are presented in Chapter 5, after which the discussion and conclusions can be found in Chapter 6.


Chapter 2

Theoretical Background

2.1 Document Ranking

Information Retrieval is the process of retrieving information from a resource given an information need. Areas of IR research include text-based retrieval, image or video retrieval, audio retrieval and some other specialized areas, but text-based retrieval is by far the most active, driven by the rise of the World Wide Web. Today many systems are able to retrieve documents from enormous collections consisting of multiple terabytes or even petabytes of data. These systems have existed for some time, and continue to improve on traditional IR methods. This section establishes the basics of text-based IR, in order to compare different IR methods and to motivate the design choices of the neural search engine.

This section first introduces the concepts of the inverted index and ranking functions, after which two traditional ranking functions are introduced as a point of reference. The remainder of this chapter discusses the use of neural ranking functions in IR, their current computational challenges, and the way that sparse latent representations promise to deal with those problems.

2.1.1 Indexing and Searching

In IR a distinction is generally made between indexing time and query time. The query time is defined as the time it takes the algorithm to retrieve a result (in real time). This time depends on the algorithm that is used as well as the hardware that it runs on. The goal of IR is then to minimize the query time, which is achieved by moving as much computation as possible from query time to indexing time. What to compute at indexing time and how to search this computed data is the question that IR research is concerned with. A general description of these two stages is given below. The following information is heavily based on [Manning et al., 2008, Ch. 2-4, 6].

Indexing

At indexing time an inverted index is computed from a collection of documents. An inverted index is a data structure that maps content such as words and numbers, sometimes also called tokens, to the locations of these tokens in that document collection. Inverted indexes are widely used in information retrieval systems, for example on a large scale in search engines [Zobel et al., 1998]. The computation of an inverted index can be split into three different parts: the tokenization of the documents in the collection, the stemming of the extracted tokens and the mapping of each token to each document. The inverted index can optionally be appended with extra data, which can be attached to either tokens or documents. These stages are discussed below.

Tokenization Tokenization is the extraction of tokens from a document. In traditional IR the extracted tokens are usually words. However, it is also possible to use other tokens such as word sequences, also called n-grams, vector elements or labels. There is a wide range of tokenization algorithms since different applications and languages ask for specialized algorithms.

A rough token extraction algorithm would split a document on whitespace and punctuation. For example, when using words as tokens the sentence

"He goes to the store everyday." (2.1)

would be split into:

{"He", "goes", "to", "the", "store", "everyday"}.

When using n-grams, the result of token extraction with n = 2 is:

{"He goes", "goes to", "to the", "the store", "store everyday"}.
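To make the two variants concrete, the following Python sketch splits on whitespace and punctuation and builds the n-grams for Sentence 2.1. It is a toy illustration only, not the tokenizer actually used by Anserini or the SNRM.

import re

def tokenize(text):
    # Split on whitespace and simple punctuation; keep only non-empty word tokens.
    return [tok for tok in re.split(r"[\s.,;:!?]+", text) if tok]

def ngrams(tokens, n=2):
    # Slide a window of size n over the token sequence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "He goes to the store everyday."
tokens = tokenize(sentence)
print(tokens)          # ['He', 'goes', 'to', 'the', 'store', 'everyday']
print(ngrams(tokens))  # ['He goes', 'goes to', 'to the', 'the store', 'store everyday']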

Stemming These words are next stemmed. Stemming is the process of reducing each word to its stem. In the example given above the word "goes" would be changed into its stem: "go". The remaining words are already in their stemmed form. An example of a very popular stemming algorithm is Porter's algorithm [Porter et al., 1980]. This step usually also includes case-folding, the lowercasing of each word. For more information on stemming algorithms and case-folding, see [Manning et al., 2008, Ch. 2]. The stemmed token set of Sentence 2.1 would be

{"he", "go", "to", "the", "store", "everyday"}.

Token Mapping Given the set of stemmed tokens for a document, the tokens can now be mapped to each document. A simple algorithm would only indicate whether a token is present in a document. This mapping can be represented by a dictionary with the token as key and the set of documents containing the token as value: ["token" → {docID, docID, ..., docID}]. Using Sentence 2.1 as content for document 1, and

"We go to the store today." (2.2)

as content for document 2, the following mapping would be acquired after word tokenization and stemming:

Token Mapping: (2.3)

"he" → {1}
"we" → {2}
"go" → {1, 2}
"to" → {1, 2}
"the" → {1, 2}
"store" → {1, 2}
"everyday" → {1}
"today" → {2}

Inverted Index Construction After the token mapping is complete, the inverted index can be constructed. In the example above the inverted index only indicates the presence of a word in a document by saving the document ID. For small collections with simple search algorithms this can be sufficient, but with bigger collection sizes it quickly becomes insufficient. To improve the search functionality, several other statistics can be included in or appended to the inverted index. For example, the location of each term in a document can be saved by storing a list of positions for each mapping from a term to a document. A mapping of token t to a collection of n documents d ∈ D would then be of the form "t" → {..., d_i : [p_1, p_2, ..., p_k], ...}, where the positions p are the places where t occurs in d_i and k is equal to the number of occurrences of t in d_i, for any i. Other examples of data that can be included are the token frequency per document (storing the value of k explicitly for each token) and the number of tokens in each document. These are commonly used in traditional search algorithms such as TF-IDF and BM25, which will be introduced in Section 2.1.2.
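As an illustration, a minimal positional inverted index for the two example documents can be built with a plain dictionary. The token lists below are assumed to be the already-stemmed output of the previous steps, and the structure mirrors the mapping described above.

from collections import defaultdict

# Already-stemmed token lists for the two example documents.
docs = {
    1: ["he", "go", "to", "the", "store", "everyday"],
    2: ["we", "go", "to", "the", "store", "today"],
}

# token -> {doc_id -> [positions]}; the term frequency k is len(positions).
index = defaultdict(lambda: defaultdict(list))
for doc_id, tokens in docs.items():
    for pos, token in enumerate(tokens):
        index[token][doc_id].append(pos)

print(dict(index["go"]))    # {1: [1], 2: [1]}
print(sorted(index["he"]))  # [1] -> only document 1 contains "he"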


Searching

Given an inverted index, it is possible to perform search operations using queries. There are two categories of search algorithms: unranked search algorithms and ranked search algorithms. The first category is a simple retrieval of all document IDs that contain (a part of) the query. This is, as mentioned in the previous paragraph, often insufficient, so this category will not be discussed any further. Ranked search algorithms rank each document for a query. A document with a high score should have a high relevance for the given query. Section 2.1.2 will discuss two efficient traditional ranking functions.

2.1.2 Standard Ranking Functions

As mentioned earlier, ranking is the task of sorting given documents in order of relevance to a given query. Ranking can be conducted with the use of standard ranking functions like TF-IDF and BM25. The information about these ranking functions is from [Christopher et al., 2008] and [Robertson and Zaragoza, 2009].

TF-IDF

The TF-IDF ranking approach consists of two phases. First, the given query and documents are converted to a TF-IDF representation. Between these query and document vector representations, a similarity score is calculated using a scoring function like cosine similarity. After the documents are ranked according to the similarity score, the ranking task is completed.

Term frequency - inverse document frequency (TF-IDF) is a vector-based model: for each document a vector representation can be built. This vector consists of the TF-IDF values of all the terms in the collection. TF-IDF is a factor weighting the importance of a term to a document in a collection of documents. For an important term in a document, a term that occurs frequently in the document and rarely in other documents, a high TF-IDF value is calculated. A low TF-IDF value is calculated for unimportant terms, such as terms with a very high document frequency.

Given term t, document d and a collection of documents D, the TF-IDF of term t in document d can be calculated by multiplying two statistics of term t: the term frequency and the inverse document frequency.

\[ \text{TF-IDF}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D) \tag{2.4} \]

Term Frequency The term frequency of a term is calculated by dividing the frequency of term t in document d by the number of occurrences of the most frequent term in document d ($\max_k f_{kd}$):

\[ \mathrm{TF}_{td} = \frac{f_{td}}{\max_k f_{kd}} \tag{2.5} \]

Hence, a TF of 1 is calculated for the most frequent term in a document, and fractions are calculated for the other terms in the document.

Inverse Document Frequency Solely using TF as a ranking function is not an accurate weighting scheme, since TF does not account for the fact that the most frequent words, "stopwords", are not important, because these words are included in a lot of documents. Therefore, documents are not distinguishable by solely using TF.

Hence, TF is multiplied with IDF, decreasing the weight of very frequent terms and increasing the weight of rarely occurring terms. The IDF of a term t is calculated as the binary logarithm of the number of documents N divided by the number of documents containing the term, $N_t$:

\[ \mathrm{IDF}_{t} = \log_2\!\left(\frac{N}{N_t}\right) \tag{2.6} \]

By multiplying a term's TF and IDF for all the terms in a document, the TF-IDF representation of a document is constructed.

Cosine similarity After the TF-IDF vector representations are built, the similarity between documents and queries can be calculated with a scoring function. A standard scoring function, often used for calculating similarity scores between TF-IDF vector representations, is cosine similarity. The cosine similarity between two TF-IDF vectors ranges between zero and one. Because TF-IDF values cannot be negative, the angle between two TF-IDF vectors cannot be greater than 90°. The cosine similarity between the query Q and document D of length N is derived by the following formula:

\[ \cos(\theta) = \frac{\sum_{i=1}^{N} Q_i D_i}{\sqrt{\sum_{i=1}^{N} Q_i^2}\,\sqrt{\sum_{i=1}^{N} D_i^2}} \tag{2.7} \]

Now the documents can be ranked based on the cosine similarity score between the documents and the query.
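The following Python sketch puts Equations 2.4-2.7 together for the two toy documents from Section 2.1.1. It only illustrates the formulas and is not the implementation used by Lucene or Anserini.

import math
from collections import Counter

def tf(tokens):
    counts = Counter(tokens)
    max_f = max(counts.values())
    # Equation 2.5: frequency divided by the most frequent term's frequency.
    return {t: f / max_f for t, f in counts.items()}

def idf(docs):
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    # Equation 2.6: binary logarithm of N over the document frequency.
    return {t: math.log2(n / df_t) for t, df_t in df.items()}

def tfidf_vector(tokens, idf_weights):
    weights = tf(tokens)
    return {t: weights[t] * idf_weights.get(t, 0.0) for t in weights}

def cosine(q_vec, d_vec):
    # Equation 2.7, written over the shared terms of the two sparse vectors.
    dot = sum(q_vec[t] * d_vec.get(t, 0.0) for t in q_vec)
    norm_q = math.sqrt(sum(w * w for w in q_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

docs = {
    1: ["he", "go", "to", "the", "store", "everyday"],
    2: ["we", "go", "to", "the", "store", "today"],
}
idf_w = idf(docs)
query = ["go", "to", "the", "store", "today"]
q_vec = tfidf_vector(query, idf_w)
ranking = sorted(docs, key=lambda d: cosine(q_vec, tfidf_vector(docs[d], idf_w)), reverse=True)
print(ranking)  # document 2 ranks above document 1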

BM25

Another widely used ranking function is BM25, which ranks documents on the basis of the occurrences of query terms in the documents. In contrast to the TF-IDF approach, where a separate scoring function is employed for calculating the similarity scores between the document vectors and the query vector, BM25 is a single function that calculates a BM25 score between document D and query Q. Lucene has switched from TF-IDF to BM25 as its default ranking function because of several advantages that are incorporated in BM25.

Similar to TF-IDF, the formula of BM25 contains TF and IDF. However, the versions of these statistics used in BM25 differ from the traditional forms.

Asymptotic TF Under the standard TF, a term occurring 1000 times is twice as important as a term with 500 occurrences. Due to this type of TF, the TF-IDF values sometimes do not match the expectations of users. For example, an article containing the word "lion" 8 times is considered twice as relevant as an article containing it 4 times. Users do not agree with this: according to them the first article is more relevant, but not twice as relevant. BM25 handles this problem by adding a maximum TF value, the variable k. Due to this maximum, the TF curve of BM25 is an asymptote that never exceeds k. In Equation 2.8, term frequency is indicated with f(q_i, D).

Smoothed IDF The only difference between the standard IDF, utilized in the calculation of TF-IDF, and the BM25 IDF is the addition of one before taking the logarithm. This is done because the IDF of terms with a very high document frequency can otherwise be negative. This is not the case with the IDF in the BM25 formula, due to the addition of one.

Document length Another distinction between TF-IDF and BM25 is the fact that BM25 includes document length. TF-IDF does not take document length into account, which leads to unexpected TF-IDF values. For example, the TF-IDF value of the term "lion" appearing two times in a document of 500 pages is equal to the TF-IDF value of the same term appearing twice in a document of one page. Thus, according to TF-IDF the term "lion" is equally relevant in a document of 500 pages and in a document consisting of one page. This is not intuitive, and BM25 handles it by normalizing TF with the document length, using the factor (1 − b + b · |D|/avgdl), where |D| is the length of document D and avgdl is the average document length. The influence of document length on the calculation of BM25 is controlled by the variable b.

\[ \text{BM25 Score}(D, Q) = \sum_{i}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k + 1)}{f(q_i, D) + k \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)} \tag{2.8} \]

After the BM25 scores have been calculated for all the documents in the collection, the documents can immediately be ranked based on these scores.
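A sketch of Equation 2.8 in Python is given below. The parameter values k = 1.2 and b = 0.75 are the commonly used defaults and are assumed here purely for illustration; the smoothed IDF follows the description above.

import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, docs, k=1.2, b=0.75):
    """Sketch of Equation 2.8 for a single document; k and b are assumed defaults."""
    n_docs = len(docs)
    avgdl = sum(len(toks) for toks in docs.values()) / n_docs
    doc_counts = Counter(doc_tokens)
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for toks in docs.values() if q in toks)
        if n_q == 0:
            continue
        idf = math.log2(1 + n_docs / n_q)           # smoothed IDF, never negative
        f = doc_counts[q]                           # term frequency f(q, D)
        norm = 1 - b + b * len(doc_tokens) / avgdl  # document length normalization
        score += idf * (f * (k + 1)) / (f + k * norm)
    return score

docs = {
    1: ["he", "go", "to", "the", "store", "everyday"],
    2: ["we", "go", "to", "the", "store", "today"],
}
query = ["go", "store", "today"]
for doc_id, tokens in docs.items():
    print(doc_id, round(bm25_score(query, tokens, docs), 3))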


2.1.3 Neural Ranking

In recent years, neural networks have been successfully applied to a variety of information retrieval tasks, including ad-hoc document ranking [Dehghani et al., 2017a] [Guo et al., 2017] [Xiong et al., 2017]. Neural ranking can in general be understood by dividing the ranking process into three steps: generating a document representation, generating a query representation and relevance estimation. Neural ranking models can be classified based on the step where a neural network is applied.

In general, there are two types of neural ranking models: early combination models and late combination models [Dehghani et al., 2017a]. Early combination models often use text-based representations of the document and the query, after which interactions between the document and the query are captured in a single representation. This interaction can for example be manually created features or exact query term matches. The document-query representation is then given as input to a neural model to compute a similarity score. Examples of early combination models include DRRM, DeepMatch and ARC-II.

On the other hand, the late combination models use neural networks to create separate query and document embeddings. The document relevance is then computed using a simple similarity function, such as the cosine similarity or a dot product. Examples include DSSM, C-DSSM and ARC-I.

One advantage of late combination is that all document embeddings are query independent, which means that they can be precomputed. This way only the query embedding and a simple similarity function have to be calculated at query-time, whereas early combination models have a more complex neural network for relevance estimation.

However, most of the time late combination models are still not efficient enough to provide close to real-time retrieval. This is because the document and query embeddings lack the sparsity of traditional word-based representations. Creating an inverted index using document embeddings is still technically possible, but will not give a speedup unless they are sparse vectors. Document embeddings are usually so dense that there are hardly any zero values. As a result, the posting lists for each latent term would contain all documents and no efficiency is gained. This is one of the main reasons that standalone neural ranking models are not used in practice. However, there are methods to still include neural ranking models in the ranking process.

The most popular way is to use a multi-stage ranker. Here a traditional first stage ranker passes the top documents to a second stage neural ranking model. These stacked ranking models seem to take advantage of both the efficiency of traditional sparse term based models and the effectiveness of dense neural models. However, this first-stage ranker acts as a gatekeeper, possibly reducing the recall of relevant documents.

Figure 2.1: The document frequency for the top 1000 dimensions in the actual term space (blue), the latent dense space (red), and the latent sparse space (green) for a random sample of 10k documents from the Robust collection.

2.1.4 Standalone Neural Ranking Model

In a recent paper, [Zamani et al., 2018] took a different approach to neural ranking in IR and proposed the Standalone Neural Ranking Model (SNRM). According to the paper, the original multistage approach leads to inefficiency due to the stacking of multiple rankers that all work at query time, and also to a loss in retrieval quality due to the limited set of documents continuing to the neural stage. They propose to use a sparse latent representation (SLR) for documents and queries instead of a dense representation, to achieve similar levels of efficiency and quality as a multistage ranker. These representations consist of significantly fewer non-zero values than the dense representations, but are still able to capture meaningful semantic relations due to the learned latent dimensions of the sparse representations. Figure 2.1 shows the distributions of different representations and illustrates the Zipfian nature of the SLR, matching far fewer documents than the dense representation that is used in multistage rankers. This approach with sparse latent representations allows for a fully neural standalone ranker, since a first stage ranker is not required for efficiency reasons. The following sections provide an overview of how the SNRM works and how it is trained, as explained in [Zamani et al., 2018].


Model

The efficiency of the SNRM lies in the ability to represent the query and documents using a sparse vector, rather than a dense vector. For ranking purposes, these sparse representation vectors for documents and queries should be in the same semantic space. To achieve this desideratum, the parameters of the representation model remain the same for both the document and the query. Simply using a fully-connected feed-forward network for the implementation of this model would result in similar sparsity for both query and document representations. Instead, since queries are intuitively shorter than the actual documents, a desirable characteristic for efficiency reasons is that queries obtain a relatively sparser representation. To achieve this, a function based on the input length is used for the sparse latent representation. This function, which obtains the representation for a document and similarly for a query, is provided below.

\[ \phi_D(d) = \frac{1}{|d| - n + 1} \sum_{i=1}^{|d| - n + 1} \phi_{\mathrm{ngram}}(w_i, w_{i+1}, \dots, w_{i+n-1}) \tag{2.9} \]

As can be seen, the final representation is averaged over a number of n-grams that is directly controlled by the length of the input. The larger the input, the more n-grams the representation is averaged over and the more likely it is that different activated dimensions are present. Another advantage of this function is that it captures term dependencies through the sliding window over a set of n-grams. These dependencies have been proven to be helpful for information retrieval [Metzler and Croft, 2005].

The $\phi_{\mathrm{ngram}}$ function is modelled by a fully-connected feed-forward network; this does not lead to density problems because the input is of fixed size. Using a sliding window, embedding vectors are first collected for the input using a relevance-based embedding matrix. After embedding the input, the vectors are then fed into a stack of fully-connected layers with an hourglass structure. The middle layers of the network have a small number of units that are meant to learn the low-dimensional manifold of the data. In the upper layers the number of hidden units increases, leading to a high-dimensional output.
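The sketch below illustrates the shape of this computation with NumPy: each n-gram window is embedded, passed through a small fully-connected 'hourglass' stack with ReLU activations, and the window outputs are averaged as in Equation 2.9. All sizes, weights and the vocabulary are made-up placeholders, not the trained model of [Zamani et al., 2018].

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned components: a word embedding matrix and an
# "hourglass" stack of fully-connected layers (wide -> narrow -> wide).
vocab = {"he": 0, "go": 1, "to": 2, "the": 3, "store": 4, "everyday": 5}
emb_dim, n, out_dim = 8, 3, 20
embeddings = rng.normal(size=(len(vocab), emb_dim))
layers = [rng.normal(size=(n * emb_dim, 12)),
          rng.normal(size=(12, 4)),
          rng.normal(size=(4, out_dim))]

def phi_ngram(window_ids):
    # Concatenate the n embedding vectors and pass them through the MLP.
    h = embeddings[window_ids].reshape(-1)
    for w in layers:
        h = np.maximum(h @ w, 0.0)  # ReLU keeps the output non-negative
    return h

def phi_doc(token_ids):
    # Equation 2.9: average the n-gram representations over a sliding window.
    windows = [token_ids[i:i + n] for i in range(len(token_ids) - n + 1)]
    return np.mean([phi_ngram(w) for w in windows], axis=0)

doc_ids = [vocab[t] for t in ["he", "go", "to", "the", "store", "everyday"]]
slr = phi_doc(doc_ids)
print(slr.shape, int((slr == 0).sum()), "zero dimensions")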

Training

Figure 2.2: A general overview of the model learning a sparse latent representation for a document.

The SNRM is trained with a set of training instances consisting of a query, two document candidates and a label indicating which document is more relevant to the query. In the training phase there are two main objectives: the Retrieval Objective and the Sparsity Objective. In order to achieve the Retrieval Objective, hinge loss is employed as the loss function during the training phase, since it has been widely used in the ranking literature for pairwise models.

\[ L = \max\big(0,\ \epsilon - y_i[\psi(\phi_Q(q_i), \phi_D(d_{i1})) - \psi(\phi_Q(q_i), \phi_D(d_{i2}))]\big) \tag{2.10} \]

For the Sparsity Objective, the minimization of the $L_1$ norm is added to this loss function, since minimizing $L_1$ has a long history in promoting sparsity [Kutyniok, 2013]. The final loss function can be computed as follows:

\[ L(q_i, d_{i1}, d_{i2}, y_i) + \lambda L_1\big(\phi_Q(q_i) \parallel \phi_D(d_{i1}) \parallel \phi_D(d_{i2})\big) \tag{2.11} \]

where $q_i$ is the query, $d_{i1}$ document 1, $d_{i2}$ document 2 and $y_i$ indicates the relevant document using the set {1, -1}. Furthermore, $\phi_Q$ indicates the SLR function for a query and $\phi_D$ for a document. Lastly, $\psi(\cdot)$ is the matching function between the query and document SLRs, explained in the next section.
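For one training triple, the combined objective of Equations 2.10 and 2.11 can be sketched as follows. The margin, the λ value and the random vectors are illustrative placeholders, and ψ is taken to be the dot product described in the next section.

import numpy as np

def hinge_l1_loss(q_repr, d1_repr, d2_repr, y, margin=1.0, lam=1e-5):
    """Sketch of Equations 2.10 and 2.11 for one training triple.

    y is +1 if d1 is the more relevant document, -1 otherwise;
    psi is the dot product between the query and document vectors.
    """
    psi1 = float(q_repr @ d1_repr)
    psi2 = float(q_repr @ d2_repr)
    hinge = max(0.0, margin - y * (psi1 - psi2))                   # retrieval objective
    l1 = np.abs(np.concatenate([q_repr, d1_repr, d2_repr])).sum()  # sparsity objective
    return hinge + lam * l1

rng = np.random.default_rng(1)
q, d1, d2 = (np.maximum(rng.normal(size=20), 0.0) for _ in range(3))
print(hinge_l1_loss(q, d1, d2, y=1))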

An additional method called 'weak supervision' is used to improve the model during training. This unsupervised learning approach works by obtaining 'pseudo-labels' from a different retrieval model (a 'weak labeler'), and then uses the obtained labels to create new training instances. For each instance $(q_i, d_{i1}, d_{i2}, y_i)$, either two candidate documents are sampled from the result list, or one is sampled from the result list along with one random negative sample from the collection. $y_i$ is defined as:

\[ y_i = \mathrm{sign}\big(p_{QL}(q_i \mid d_{i1}) - p_{QL}(q_i \mid d_{i2})\big) \tag{2.12} \]

where $p_{QL}$ denotes the query likelihood probability, since in this case a query likelihood retrieval model [Ponte and Croft, 1998] with Dirichlet prior smoothing [Zhai and Lafferty, 2017] is used as the weak labeler.

This method has been shown to be an effective approach to train neural models for a range of IR and NLP tasks, including ranking [Dehghani et al., 2017b], learning relevance-based word embeddings [Zamani and Croft, 2017] and sentiment classification [Deriu et al., 2017].

Retrieval

The matching function between a document and a query is the dot product over the intersection of the non-zero elements of their sparse latent representations. The score can be computed with:

\[ \text{retrieval score}(q, d) = \sum_{i : |\vec{q}_i| > 0} \vec{q}_i \cdot \vec{d}_i \tag{2.13} \]

The simplicity of this function is essential to the efficiency of the model, since the matching function is used frequently at inference time. The matching function will calculate a retrieval score for all the documents that have a non-zero value for at least one of the non-zero latent elements in the sparse latent representation of the query.

Finally, using the new document representations, an inverted index can be constructed. In this case, each dimension of an SLR can be seen as a "latent term", capturing meaningful semantics about the document. The index entry for a latent term contains the IDs of the documents that have a non-zero value for that specific latent term. So, for the construction, if the i-th dimension of a document representation is non-zero, this document is added to the inverted index entry of latent term i.
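The following sketch illustrates both the latent-term inverted index and the scoring of Equation 2.13 on three toy document representations; it is an illustration of the idea, not the Anserini implementation.

import numpy as np
from collections import defaultdict

# Toy sparse latent representations: one vector of latent-term weights per document.
doc_reprs = {
    "d1": np.array([0.0, 0.7, 0.0, 0.2]),
    "d2": np.array([0.5, 0.0, 0.0, 0.9]),
    "d3": np.array([0.0, 0.0, 0.3, 0.0]),
}

# Inverted index over latent terms: dimension i -> documents with a non-zero value.
index = defaultdict(list)
for doc_id, vec in doc_reprs.items():
    for i in np.nonzero(vec)[0]:
        index[int(i)].append(doc_id)

def retrieve(query_repr):
    # Equation 2.13: dot product restricted to the query's non-zero latent terms.
    scores = defaultdict(float)
    for i in np.nonzero(query_repr)[0]:
        for doc_id in index[int(i)]:
            scores[doc_id] += float(query_repr[i] * doc_reprs[doc_id][i])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(retrieve(np.array([0.0, 0.4, 0.0, 0.6])))  # d2 and d1 are scored, d3 is never touched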

2.2 Question Answering

Question Answering (QA) is the objective of creating systems that can automatically answer questions formulated in natural language, using a combination of Information Retrieval and Natural Language Processing (NLP).


Question Answering can be divided into two sub-domains: open-domain QA and closed-domain QA [Lopez et al., 2011]. The main difference is that closed-domain question answering is usually confined to a specific domain and makes use of formal models such as pre-structured databases, applying reasoning and logic to formulate an answer, whereas open-domain question answering deals with finding answers in natural sources of text.

Although the origins of closed-domain Question Answering can be traced back to the 1960s with BASEBALL, a computer program able to answer baseball-related questions using stored information [Green et al., 1961], the open-domain field did not take its current shape until the late 1990s with the Text Retrieval Conferences (TREC) [Voorhees and Tice, 2000]. TREC is an ongoing series of conferences focusing on various different IR tasks or 'tracks', including a QA track. Because of this origin in IR, the conventional approach to QA was to continually narrow down the retrieval of candidates in several stages of an end-to-end pipeline. These stages often include: document retrieval, passage ranking and finally, answer extraction. However, as NLP became more concerned with QA, the field has seen a shift over the years from IR to a new field called 'reading comprehension', the ability to read text and understand its meaning [Haynes, 2010]. This can be exemplified by the nature of various popular QA benchmark datasets today, such as TrecQA [Yao et al., 2013] and SQuAD [Rajpurkar et al., 2018]. By already supplying a relatively small input of text containing the answer, these benchmark datasets place the focus on the later stages of the pipeline and can therefore best be described as answer extraction tasks.

Given the large amount of natural language data available through the internet and the developments in Neural Networks to contextually understand text [Peters et al., 2018], open-domain QA has gained a lot of traction in the last decade [Devlin et al., 2019, Yu et al., 2014].

2.2.1 QA Approaches

Due to their ability to train on sequential data, convolutional neural networks and recurrent neural networks have in recent years firmly been the basis for state-of-the-art approaches, both for open-domain question answering and for language models in general. Building on recurrence and convolution, the boundaries have been pushed by the development of long short-term memory networks [Hochreiter and Schmidhuber, 1997] and later with gated recurrent neural networks [Chung et al., 2014].

However, in 2017 an innovative new model architecture called the Transformer was proposed in the paper 'Attention Is All You Need' [Vaswani et al., 2017]. The newly proposed architecture demonstrated that the attention mechanism alone is enough to achieve state-of-the-art results on the machine translation task. The attention mechanism is able to learn to focus on things such as pronouns and a word's meaning in context, in order to conclude what or who is being referred to. One approach based on the Transformer model will be explained in the next section.

Figure 2.3: Transformer Architecture [Vaswani et al., 2017]

2.2.2 BERT

Another large breakthrough in natural language modelling in recent years came with the publication of the Bidirectional Encoder Representations from Transformers, or 'BERT' [Devlin et al., 2019]. BERT innovates on the previously discussed Transformer by applying bidirectional training. Previously, models were trained by reading text sequences from either left-to-right, right-to-left or a 'shallow' combination of both. The paper concludes that a language model trained using this new bidirectional reading has a deeper sense of language context than previous efforts. Another strong aspect of BERT is that it uses a fine-tuning strategy (as opposed to a 'feature-based' strategy). This essentially means that the framework consists of two steps: the pre-training step, where a general language model is learned using unsupervised learning tasks, followed by the fine-tuning step, where the pre-trained model is fine-tuned for a specific NLP task. Using this strategy, BERT was able to present state-of-the-art results on multiple different tasks, including Question Answering.


Figure 2.4: Overall procedure for BERT training from [Devlin et al., 2019]. The [CLS] token is added in front of every input example and the [SEP] token is used to separate two inputs, in this case question/answer.

The following paragraphs elaborate on the model's architecture and on the pre-training and fine-tuning steps, visualised in Figure 2.4.

Architecture The architecture is largely based on the Transformer model proposed in [Vaswani et al., 2017]. In the original BERT paper, two models of differing sizes are proposed: BERT_BASE and BERT_LARGE. BERT_BASE consists of 12 layers (i.e. Transformer blocks), a hidden size of 768 and 12 self-attention heads, for a total of 110 million parameters. This size was selected for comparison purposes with the OpenAI GPT, a language model developed by OpenAI. BERT_LARGE has 24 layers, a hidden size of 1024 and 16 self-attention heads, with a combined total of 340 million parameters.

Pre-Training Two different unsupervised learning tasks are employed for the pre-training stage. The data used for this training stage are the BookCorpus [Zhu et al., 2015], consisting of around 800 million words and the text passages from the English Wikipedia, consisting of roughly 2500 million words.

The first task is the "masked language model" procedure (MLM). For this task, 15% of the input tokens are randomly selected and 'masked' by replacing them with a special [MASK] token; the model is then trained to predict the original token. Given this setup, the model would normally only be able to predict a token when the [MASK] token is present. This would cause problems in the fine-tuning stage, where the token does not appear. To resolve this issue, only 80% of the originally selected 15% of tokens are actually replaced by the [MASK] token, 10% are replaced by a random word and 10% are kept the same.
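The selection and replacement rule can be sketched as follows. The token strings and the tiny vocabulary stand in for BERT's actual WordPiece ids, so this is only an illustration of the 15% / 80-10-10 procedure.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Sketch of BERT's masked-language-model corruption (80/10/10 rule)."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # token not selected for prediction
        targets[i] = tok                      # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with the mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, targets

vocab = ["he", "go", "store", "lion", "the"]
print(mask_tokens(["he", "goes", "to", "the", "store", "everyday"], vocab))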


The second task is called Next Sentence Prediction (NSP). Because understanding cross-sentence relationships is an important aspect of QA and Natural Language Inference, the Next Sentence Prediction task is used to learn whether a given sentence B is the next sentence in a corpus after sentence A. Training is done by feeding the model sentence pairs A and B from a corpus, where B is the actual next sentence 50% of the time and a randomly selected sentence the other 50%.

Fine-tuning The fine-tuning stage is fairly straightforward and relatively inexpensive compared to pre-training. The model is designed to take in various different input representations, such as single texts or text pairs; therefore it can easily be fine-tuned for a specific NLP task by only modifying the model with a small task-specific addition. For Question Answering this is the introduction of a start vector and an end vector, containing the probabilities for each token in a context to be the start or end of the answer. To fine-tune the model for an NLP task, the task-specific inputs and outputs can simply be plugged into the modified model and the parameters are tuned end-to-end.


Chapter 3

Anserini: The (Neural) Search Engine

In this chapter, the implementation of a search engine that utilizes SLRs in Anserini is examined. First, Section 3.1 covers how a search engine using the standard searching approach can be constructed. Then, in Section 3.2, SLRs are employed for re-ranking, and the final section focuses on the implementation of the SNRM.

Open-source toolkits play an important role in information retrieval research. In particular, the ability to evaluate performance benchmarks is an important aspect of these toolkits and a valuable addition to IR research.

This is the exact motivation behind Anserini: Reproducible Ranking Baselines Using Lucene [Yang et al., 2017]. As its description states, Anserini focuses on the ability to reproduce IR benchmarking results, serving as a widely accessible point of comparison in IR research. Due to this 'reproducibility promise' of Anserini [Lin et al., 2016], it is a suitable toolkit for research into the use of sparse latent representations for document ranking and the impact that this use has on IR and downstream NLP tasks. The fact that Anserini is built on Lucene is another argument for using it. Lucene is a Java library that provides many advanced searching functionalities. It has become the de-facto standard for many search engines and would therefore make this research widely applicable.

Anserini provides the basis for the neural search engine. The search engine is an important part, because it provides a way to index and retrieve documents based on their sparse latent representation. However, in order for Anserini to actually be able to use the SNRM's sparse latent representations, some changes had to be made. The changes necessary to support a sparse latent representation alongside a term frequency representation can be divided into those related to document indexing and those related to query representation and search. In this chapter, the standard implementation of the Anserini indexing and searching processes is outlined, after which a description of the SNRM implementation follows. The complete implementation can be found in the Anserini fork [van Keizerswaard et al., 2020].

3.1 Standard Ranking

3.1.1 Standard Indexing

The Anserini indexing component consists of several Java files and classes. The Java class "IndexCollection", located in "IndexCollection.java", is the main Anserini indexing component and controls the indexing pipeline.

For the task of inverted indexing, the Anserini component can be seen as a wrapper around Lucene, because Anserini assembles the Lucene indexing components into an end-to-end indexer. Lucene supports multi-threaded indexing, but it only provides access to a collection of indexing components. Hence, an end-to-end indexer cannot be built using Lucene alone.

Usually the Anserini indexing pipeline is used for indexing text from a collection of text documents. In this section this standard indexing approach is discussed. To run the indexing pipeline, a command has to be executed, with several flags that determine the collection, the output folder and the index settings.

Index The file "DefaultLuceneGenerator.java" is the first component in the indexing pipeline. A given collection is converted into Lucene documents containing three fields: contents, id and raw. By converting a collection into Lucene documents, an index can be constructed consisting of the inverted index and the documents.

Inverted Index The 'contents' field stores the data which Anserini will use to build the inverted index. In the standard indexing approach an inverted index is constructed from the text of the documents in the collection. Hence, the 'contents' field of Lucene document j stores the text of document j. As discussed in 2.1.1 and illustrated in 3.1, an inverted index consists of terms (words in the case of standard indexing) and the IDs of the documents that contain these words (and document frequencies). The ID value of a document is stored in the 'id' field of the Lucene document and connects the inverted index with the documents stored in the index.


Figure 3.1: Overview of the index

Documents The documents themselves are stored together with the document ID and the content of the 'raw' field. This is an optional field and can store additional information about the document, like a title, the number of words in the document or the document's sparse latent representation (see 3.2.1).

3.1.2 Standard Querying

Given the index described above, the traditional TF-IDF or BM25 algorithms can perform a ranked search on this index. Inside the search function of the SearchCollection object, several objects are constructed to perform scoring. This function takes in the query, an IndexReader and a Similarity object, and returns the scored documents in a ScoredDocuments object. The query is the raw input string read by Anserini, the IndexReader provides a way to read the inverted index, and the Similarity object is used to compute document scores. The construction sequence consists of the QueryGenerator, BooleanQuery, one or multiple TermQuery objects, TermWeights, FixedSimilarity objects and Scorer objects, where the type of the last two depends on the type of ranking function that is used. The parts of these objects that are significant with respect to this project are discussed below. An overview of this process can be found in Figure 3.2.


Figure 3.2: A simplified overview of the standard document scoring process of Anserini.

QueryGenerator The QueryGenerator object takes in the query and extracts tokens (terms) from it. Then, an analyser removes stopwords and stems the terms if needed. What remains is a collection of terms, also called a bag of words. For every term a TermQuery object is created, which is then added to the BooleanQuery. This process effectively splits the scoring of documents into scoring per term.

TermQuery For the purpose of this project, the TermQuery object is simply a wrapper for the TermWeight object. The TermQuery contains the term that needs to be matched and scored, and creates a TermWeight object.

TermWeight The function of the TermWeight object is to compute the data needed to score a document, and to find the documents that match the term. The IndexReader is used to gather the information required to create a TermStatistics and a CollectionStatistics object for the term. These objects are passed to a Similarity object to create a FixedSimilarity object (SimFixed in Figure 3.2), which has a score function that can score a document for a single term. The document matching also makes use of the IndexReader object to create a document enumeration, which confusingly is called a TermsEnum in the Anserini source code. This is done by looking up the document IDs that the term maps to, as described in Section 2.1.1, in the inverted index. Both the SimFixed and the document enumeration are passed to the TermScorer object.

Similarity The Similarity object determines the type of ranking function that is used (options include TF-IDF, BM25, query likelihood, and others).


FixedSimilarity The SimFixed object's only function is to take a document as a parameter and return a term-document score. The exact computation of the score differs between ranking functions.

Scorer A Scorer object receives the list of matched documents and a term-specific FixedSimilarity object. The search function scores every matched document for each term using the score function of the SimFixed object. The score for every document is calculated by taking the sum over the matched term-specific scores. These scores are accumulated in a ScoredDocuments object and returned to the parent function to obtain a document ranking.

3.2 Sparse Neural Re-Ranking

3.2.1 Sparse Neural Re-Ranking Indexing

One way to improve re-ranking efficiency is to precompute all document embeddings. For this reason an index is created that stores the standard text-based representation as well as the sparse latent representation. Posting lists are still created using the text-based representation to facilitate efficient first-stage ranking using BM25.

The raw field is changed to a dictionary that can contain both the raw text and the document embedding. This way the embeddings can simply be retrieved at query time instead of having to be computed live. This allows the re-ranking to score more of the top documents in roughly the same time, leading to a higher accuracy.

3.2.2 Sparse Neural Re-Ranking Querying

Building on top of the Standard Querying discussed previously, we implemented a way to compare the ranking scores of the conventional retrieval to the SLR scoring. This ’hybrid’ implementation searches using the conventional term based ranking and then recomputes the scores for the top-k documents based on their SLR.

As stated in 3.2, the SLR is stored in the index as a string value and can be retrieved by accessing the dictionary in the raw document field. This string is converted to a vector and its dot product with the query SLR is computed to get the neural score.

An advantage of this method is that it uses the efficiency of the traditional retrieval to directly examine the effectiveness of the neural search, by re-ranking the top-k documents using the SLR ranking capabilities. A disadvantage of this approach is that the SLR score re-ranking only happens for the top-k documents, and other documents with potentially higher neural scores might be withheld by the first stage ranker.

This method is only aimed at directly comparing the ranking effectiveness of the two rankers; it does not evaluate the efficiency of the SLR. For that purpose, a fully standalone neural ranker had to be implemented.

3.3 Standalone Neural Ranking

To keep the implementation of the indexing process as simple as possible, the choice was made to store the activation value of each token (latent index) in the term frequency, as if the neural search were a word-based search. Because large parts of the standard analysis pipeline can be reused this way, the implementation costs considerably less time and the difference between standard and neural searching is minimal. This means that the impact of neural search can be better understood.

Using this approach, the extension to Anserini has to store the activation value, which is a floating point number, as a term frequency, which is an integer. A first version of this could be to store latent term $i$ a total of $x = d_i \cdot 10^p$ times in a string, where $d_i$ is activation value $i$ of the document and $p$ is the desired decimal precision. This string can then be put in the content field of a document and fed to the default Anserini word processing pipeline. For small collections with a value of $p < 3$ this would be an option, but when the collection gets larger, the string generation complexity explodes. For a collection of size $N$ and a sparsity model with $d$ dimensions and sparsity ratio $s_r$, the generated data in bytes would be

\[ S = \big(d \cdot (1 - s_r)\big) \cdot \big(\lceil \log_{10}(d) + 1 \rceil \cdot 10^{p}\big) \cdot N. \tag{3.1} \]

The first part of Equation 3.1 is the average number of active dimensions per document, the second part is the average number of bytes each active dimension takes up ($\log_{10}(d)$ digits and a space), and the last part is the number of documents in the collection. As an example, a collection of 10,000 documents, a sparsity model with 1000 dimensions, a sparsity ratio of 0.9 and a precision of 2 decimal places would generate 476 MB of data. However, when the desired decimal precision is 5 decimal places, a total of $47\,\mathrm{MB/doc} \cdot 10{,}000\,\mathrm{docs} \approx 465$ GB of data would be generated. This algorithm requires a lot of resources even for small collections, and is not desirable for the larger-scale goal of this project.

The final efficient implementation used for the Standalone Neural Ranking has been created by Lodewijk van Keizerswaard [van Keizerswaard et al., 2020] and is extensively explained in his thesis. This section will provide a short overview of the retrieval process.


Figure 3.3: A simplified overview of the document scoring process implemented in [van Keizerswaard et al., 2020]

This implementation of the SNRM indexing and searching stays close to the word-based version. At indexing time, the SLRGenerator takes in precomputed document representations and stores them in a specific format. This format is then read by the SLRTokenizer, which extracts the original representation and writes the activation value to the term frequency of every latent term, up to a precision of 9 decimals. The document ranking for a precomputed query representation is retrieved by passing down the activation value to the term-document scoring function, through the SLRQueryGenerator, SLRQuery, SLRWeight, SLRSimFixed and SLRScorer classes. An overview of the SNRM query process can be found in Figure 3.3.

3.4 Summary

In summary, Anserini's indexing pipeline consists of a document generator and an analyser. The corresponding classes, DefaultLuceneGenerator and StandardEnglishAnalyser, are responsible for pre-processing, tokenizing and analysing the documents. At query time, the document scores are computed by looking up the document list per term and then computing individual term-document scores. These are then added to obtain the final document scores and the resulting document ranking.

As an intermediate step between the word-based retrieval and the SNRM, the neural re-ranker was created to evaluate the difference in effectiveness between both approaches. By embedding the SLRs in the word-based index for each document, documents could be retrieved using the word-based ranking and, on top of that, the top-k hits re-ranked according to their SLR. Using this approach, the efficiency of the word-based ranking could be retained by only calculating SLR scores for the top-k relevant documents.

Finally, the implementation of the Standalone Neural Ranking model from [van Keizerswaard et al., 2020] is used to index and retrieve documents based on their sparse latent representations.


Chapter 4

Method

In order to evaluate the impact of using Sparse Latent Representations for information retrieval on the downstream Question Answering task, a full Question Answering pipeline is created, similar to the system proposed in [Yang et al., 2019]. As stated in the introduction, the system consists of three stages: a retrieval stage, where candidate documents that possibly contain the answer are retrieved using a search engine; an answer extraction stage, where an answer is extracted from each of the candidate documents; and finally a selection stage, where the final target answer is selected out of the candidate answers based on the combined score of the retrieval and answer extraction stages.

The first section of this chapter describes the dataset that will be used for the experiments. Then, Section 4.2 provides an overview of the retrieval stage component. Section 4.3 details the answer extraction stage using BERT. Section 4.4 explains how the final answer is selected in the final system stage. Lastly, the different evaluation methods that are used to evaluate the QA system at its different stages are explained in Section 4.5.

Figure 4.1: Overview of the end-to-end system.


4.1 The SQuAD Dataset

The dataset used is the Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset purposed for QA [Rajpurkar et al., 2016, Rajpurkar et al., 2018]. It contains around 100,000 questions posed by crowdworkers based on a set of Wikipedia articles. Along with a question, each entry also provides the correct answer and the original context from which the answer was retrieved. The questions are sourced from over 500 different Wikipedia articles in total, all from various different domains, to provide extensive coverage of a broad range of subjects.

The version of SQuAD that is used is version 1.1. SQuAD v1.1 consists of two different files, one for the training set and another for the development set. The training set, containing 87,599 questions over a total of 18,891 different answer contexts, was used for the fine-tuning of the BERT model. The smaller development set, containing 10,570 questions based on 2,067 unique answer contexts, is used to evaluate the QA performance of the system. The structure of the development set can be seen in Listing 1.

{
  "title": "Super_Bowl_50",
  "paragraphs": [
    {
      "context": "Super Bowl 50 was an American...",
      "qas": [
        {
          "answers": [
            {
              "answer_start": 177,
              "text": "Denver Broncos"
            }
          ],
          "question": "Which NFL team represented...?",
          "id": "56be4db0acb8001400a502ec"
        }
      ]
    }
  ]
}

Listing 1: Structure of the SQuAD development set.


4.2 Context Retrieval

The first stage in the QA system pipeline is the retrieval stage. In order to be able to retrieve candidate answer contexts for a given question, an index is created for the SQuAD dataset.

In order to create this index, some pre-processing is required. Since SQuAD is more of an answer extraction task than a question answering task (as mentioned in Section 4.1), the dataset is not inherently designed for IR tasks. To simulate a more realistic open-domain question answering task, all answer contexts are combined into a joint collection of unique contexts. This adds the IR objective of selecting the correct context before extracting the answer. As a result, the context collection contains 2,067 unique answer contexts. This collection is saved in a JSON file with the following format, usable by the Anserini JsonCollection class:

[
  {"id": context_id, "contents": answer_context},
  ...
]
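A sketch of this pre-processing step is given below. The file names and the id scheme are placeholders, and it assumes the standard SQuAD v1.1 JSON layout with a top-level "data" list.

import json

# Placeholder paths for the SQuAD v1.1 development file and the output collection.
with open("squad-dev-v1.1.json") as f:
    squad = json.load(f)

contexts, seen = [], set()
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        if context in seen:
            continue                  # keep only unique answer contexts
        seen.add(context)
        contexts.append({"id": str(len(contexts)), "contents": context})

with open("squad-dev-v1.1-contexts/docs.json", "w") as f:
    json.dump(contexts, f)

print(len(contexts), "unique contexts")  # 2,067 for the dev set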

Collection indexing is done using two different methods: the standard word-based approach and the SNRM approach. By using two indexes, the difference in effectiveness can be compared directly at the retrieval stage.

Standard Context Ranking The first method creates a standard index as described in section 3.1. This allows for a word based index that can use the TF-IDF and BM25 algorithms to retrieve documents.

Indexing and retrieval can be done using the following commands in Anserini:

sh target/appassembler/bin/IndexCollection -collection JsonCollection \
  -input /path/to/squad-dev-v1.1-contexts/ -threads 4 \
  -index /indexes/lucene-index.squad-dev-v1.1.raw \
  -generator DefaultLuceneDocumentGenerator \
  -storePositions -storeDocvectors -storeRaw &

sh ./target/appassembler/bin/SearchCollection \
  -index indexes/lucene-index.squad-dev-v1.1.raw \
  -topicreader TsvInt -topics path/to/squad-dev-v1.1-questions.tsv \
  -slr -slr.ip 7 \
  -output /results/results_lucene-index.squad-dev-v1.1_slr.txt

SLR Context Ranking For the SNRM approach, sparse latent representations are obtained for each context and question by using a model trained on the Robust04 dataset, which contains around 525k news articles. The exact parameters for the SNRM training can be found in Appendix B.


After obtaining the sparse latent representations for each document in the collection and the questions, indexing and retrieval are done using the following commands:

1 sh ./target/appassembler/bin/IndexCollection -collection JsonCollection -input /path/to/squad-dev-v1.1-contexts-slr -index indexes/slr-index.squad-dev-v1.1.raw -generator SLRGenerator -slr -slr.index -slr.decimalPrecision 7 -storeRaw &

2 sh ./target/appassembler/bin/SearchCollection -index indexes/slr-index.squad-dev-v1.1.raw -topicreader TsvInt -topics path/to/squad-dev-v1.1-questions-slr.tsv -slr -slr.ip 7 -output /results/results_slr-index.squad-dev-v1.1_slr.txt

where the slr.ip flag controls the precision (number of decimals) used for the representation values during indexing and retrieval.

4.3 Answer Extraction

After the context retrieval stage, BERT is used to extract an answer out of each context for the top-5 results. The version of BERT used is the 'bert-large-uncased-whole-word-masking-finetuned-squad' model, available from the huggingface Transformers library [Wolf et al., 2019]. This package provides a pre-trained Tokenizer and a prediction model. The model is based on the BertForQuestionAnswering architecture and, as the name suggests, has been fine-tuned for the SQuAD QA task using the SQuAD v1.1 training data. The specifics of the model's architecture can be found in table 4.1.

Several steps are involved in using BERT to select an answer out of a context. These steps are described below; the full code for the answer extraction can be found in Appendix A.

Encoding Firstly, the encoder is applied, taking the raw question and answer context as input and treating them as a text-pair. As mentioned in 2.2.2, these are combined into a text-pair because BERT is designed to handle many different NLP tasks, some of which require only a single input. The question and context are separated using a special [SEP] token and a [CLS] token is added at the start of the sequence. The encoder then translates each word into an id from the fixed 30,000-token vocabulary [Wu et al., 2016] used in BERT to create the Token Embeddings. In order to later reconstruct the textual answer, the Tokenizer is applied to convert the Token Embeddings back to their actual textual representations. Secondly, Segment Embeddings are used to differentiate the question from the answer text. The Segment Embedding is a single binary sequence, where segment A represents the question, including the [SEP] token, and segment B represents the answer context. The Positional Embeddings, which encode the positions of the tokens, are created internally by the model at a later stage.
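
As an illustration of this encoding step, the sketch below uses the huggingface tokenizer to build the token ids and segment ids for a question-context pair. It is a minimal sketch (the example strings and variable names are illustrative), not the exact code of Appendix A:

from transformers import BertTokenizer

# Tokenizer matching the fine-tuned BERT model used in this thesis.
tokenizer = BertTokenizer.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad")

question = "Which NFL team represented the AFC at Super Bowl 50?"
context = "Super Bowl 50 was an American football game ..."

# encode() treats the two strings as a text-pair, adds the [CLS] and [SEP]
# tokens and maps every (sub-)word to an id in the fixed vocabulary.
input_ids = tokenizer.encode(question, context)
tokens = tokenizer.convert_ids_to_tokens(input_ids)

# Segment A (0) covers the question up to and including the first [SEP],
# segment B (1) covers the answer context.
sep_index = input_ids.index(tokenizer.sep_token_id)
segment_ids = [0] * (sep_index + 1) + [1] * (len(input_ids) - sep_index - 1)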


attention_probs_dropout_prob   0.1
hidden_act                     gelu
hidden_dropout_prob            0.1
hidden_size                    1024
initializer_range              0.02
intermediate_size              4096
layer_norm_eps                 1e-12
max_position_embeddings        512
model_type                     bert
num_attention_heads            16
num_hidden_layers              24
pad_token_id                   0
type_vocab_size                2
vocab_size                     30522

Table 4.1: BERT Model Parameters

Figure 4.2: Visualisation of the Embeddings used to represent input text-pairs in BERT, from [Devlin et al., 2019]


Model Prediction After pre-processing, the token embeddings and segment embeddings are converted into pytorch tensors. The BERT model can then easily be applied by inputting the token and segment embeddings, after which the position embeddings are generated internally. To create the final input representation, BERT sums the corresponding token, segment and position embeddings. After the input is fed through the model, the model outputs two vectors, a start vector S ∈ R^H and an end vector E ∈ R^H, containing the scores for each token as the answer start and end token, respectively.
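
Continuing the encoding sketch above, the forward pass can be written as follows. This is again an illustrative sketch; the handling of the model output assumes a recent transformers version (older versions return a plain tuple of start and end scores):

import torch
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad")
model.eval()

# Batch of size one: the token ids and segment ids from the encoding step.
input_tensor = torch.tensor([input_ids])
segment_tensor = torch.tensor([segment_ids])

with torch.no_grad():
    outputs = model(input_tensor, token_type_ids=segment_tensor)

start_scores = outputs.start_logits[0]  # score of each token as answer start
end_scores = outputs.end_logits[0]      # score of each token as answer end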

Answer reconstruction Given the start and end score vectors, the positions of the start and end token of the answer are selected by taking the argmax of their corresponding vectors. Using these indices, the answer text is reconstructed by concatenating the original word tokens in the index span. Note that tokenized sub-words are prefixed with '##'; this prefix is removed when recombining words with their sub-words.

After selecting the answer from a context, the answer confidence score is calculated by multiplying the answer span's start score and end score. This score represents how confident BERT is in its answer prediction and can be used to compare answers from different answer contexts for the same question.
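
A minimal sketch of the reconstruction and scoring step, continuing the variables from the previous sketches:

# Positions with the highest start and end scores delimit the answer span.
start_index = int(torch.argmax(start_scores))
end_index = int(torch.argmax(end_scores))

# Rebuild the answer text, merging WordPiece sub-words (prefixed with '##').
answer = tokens[start_index]
for token in tokens[start_index + 1:end_index + 1]:
    if token.startswith("##"):
        answer += token[2:]
    else:
        answer += " " + token

# Confidence score used later to compare candidate answers across contexts.
confidence = float(start_scores[start_index] * end_scores[end_index])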

4.4 Answer Selection

In order to select the final answer out of a list of extracted answers, a new final answer score is calculated for each candidate. This is done using the scores from the retrieval stage and the extraction stage.

Firstly, the Anserini ranking scores and BERT answer confidence scores are normalized over the total candidate scores using min-max normalization. After normalization, the final candidate score for a question-document combination’s candidate answer is calculated using the following formula:

S_qd = (1 − µ) · S_Anserini + µ · S_BERT    (4.1)

where qd represents the question-document combination, S_Anserini is the ranking score of the document given the question as query in Anserini, and S_BERT is the calculated confidence score for the extracted answer using BERT. The final answer is then selected by taking the argmax of the final candidate scores.
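
A minimal sketch of this selection step (the function and list names are illustrative, assuming one extracted answer per retrieved context):

def min_max(scores):
    # Min-max normalize a list of scores to the [0, 1] range.
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def select_answer(anserini_scores, bert_scores, answers, mu):
    # Combine normalized retrieval and extraction scores per candidate
    # according to equation 4.1 and return the highest-scoring answer.
    s_anserini = min_max(anserini_scores)
    s_bert = min_max(bert_scores)
    combined = [(1 - mu) * a + mu * b for a, b in zip(s_anserini, s_bert)]
    best = max(range(len(combined)), key=lambda i: combined[i])
    return answers[best]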

Several values of the hyperparameter µ will be evaluated to find its optimum for the QA prediction task, which in turn provides insight into the relative importance of the two individual scores. To provide intuition: at µ = 0, the final answer is selected purely based on the corresponding answer context's anserini retrieval score, and at µ = 1, the answer is selected based on the highest BERT confidence score.

4.5 Evaluation Methods

In order to get a deeper understanding of the question answering system, several evaluation methods are employed after the retrieval and answer selection stages. The evaluation for these stages is described in the following two sections.

4.5.1 Retrieval Evaluation

In order to evaluate the quality of the Retrieval Stage, the success rate (equation 4.2) is used. The motivation behind this measure is that for this task, every question has only one specific answer context containing the target answer, and therefore only one ranking result has to be taken into account. This is different from more popular IR evaluation measures such as the Mean Average Precision, where multiple relevant results are taken into account.

Success rate We define success as the target answer context being retrieved in the top-k results (the number of hits the searcher will return) for its corresponding query. It can be calculated using this formula:

SR = (1 / |Q|) · Σ_{i=1}^{|Q|} s(q_i),   where s(q_i) = 1 if the target context of q_i is in the top-k results, and 0 otherwise    (4.2)

In other words, if the desired answer context for a question can be found in the top-k results of the index by using the question as the query, that query is considered to have a successful hit. Next, the fraction of all questions that correspond to a hit is calculated.
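
A sketch of this computation, assuming a dictionary that maps each question id to its ranked list of retrieved context ids and a second dictionary with the desired context id per question (both names are illustrative):

def success_rate(ranked_results, target_context, k):
    # Fraction of questions whose target context appears in their top-k results.
    hits = sum(
        1 for qid, ranking in ranked_results.items()
        if target_context[qid] in ranking[:k]
    )
    return hits / len(ranked_results)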

4.5.2 Answer Selection Evaluation

The BERT model evaluation is done using the official SQuAD evaluation script. The evaluation script is available on the official SQuAD github and can be run with the following command:

python evaluate-v1.1.py <path to dev-v1.1> <path to predictions>

where <path to dev-v1.1> specifies the path to the original SQuAD dev file and <path to predictions> specifies the path to the model's output file, using the format specified in listing 2:

{
    unique question id: answer string, ...
}

Listing 2: SQuAD Prediction Format

Upon evaluation, several string normalization steps are taken to ensure the model's predicted answers can correctly be compared to the ground truth answers. These steps include punctuation, article and extra whitespace removal, as well as text lowercasing. As an example, the following two answers are counted as corresponding:

The European Union = european union
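
A sketch of these normalization steps, in the spirit of the official evaluation script (the helper name is illustrative):

import re
import string

def normalize_answer(text):
    # Lowercase, then strip punctuation, articles and extra whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # remove articles
    return " ".join(text.split())                  # collapse extra whitespace

assert normalize_answer("The European Union") == normalize_answer("european union")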

Evaluating and comparing the performance of different QA systems requires reliable evaluation metrics. The SQuAD evaluation script uses two different metrics: the accuracy or 'exact match' score and the F1-score, which, according to [Yao, 2014], satisfy all the requirements for QA evaluation.

Exact Match For this specific task only one answer is selected, which is either a True Positive or a False Positive, depending on whether it is correct or incorrect. Hence, all non-selected candidate answers are ignored. Due to this property, the accuracy or 'exact match' score can be calculated with:

EM = Correct Answers / Total Answers    (4.3)

where the number of correct answers is the total number of exact matches between a prediction and any one of the corresponding ground truth answers.

An issue that arises with this metric for Question Answering system evaluation is the inability to classify an answer as partially correct. Since there is only one exact answer, any slight difference in the answer span would be counted as incorrect.

For this task, the F1-score is calculated for each individual prediction and the total score is averaged over the maximum scores of each prediction. To calculate the F1-score, the prediction and ground truth answers are treated as bags of tokens. In this case, correctly selected words are True Positives, incorrectly selected words are False Positives and not-selected correct words are False Negatives. True Negatives are neglected, since those are simply absent in both answer spans.

F1 = 2 · (Precision · Recall) / (Precision + Recall)    (4.4)

Precision = True Positives / (True Positives + False Positives)    (4.5)

Recall = True Positives / (True Positives + False Negatives)    (4.6)

A more simplified formula for Precision and Recall can be formulated as:

Precision = Matching Tokens / Predicted Tokens    (4.7)

Recall = Matching Tokens / Ground Truth Tokens    (4.8)
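
A sketch of this token-level F1 computation, reusing the normalize_answer helper from the sketch above (the function name is illustrative):

from collections import Counter

def token_f1(prediction, ground_truth):
    # Token-level F1 between a predicted and a ground truth answer string.
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)   # matching tokens
    matching = sum(common.values())
    if matching == 0:
        return 0.0
    precision = matching / len(pred_tokens)
    recall = matching / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)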

Finally, after evaluating all predictions, the script outputs the overall exact match and F1 scores.


Chapter 5

Results

5.1 Context Retrieval

This section provides the results for the retrieval stage on the SQuAD dataset for both indexing methods, according to the evaluation measure discussed in section 4.5.1. The success rate can be interpreted here as follows: given the top-k retrieved answer context documents using the question as corresponding query, how often will the desired answer context be present in these results. The retrieval scores for the word-based BM25 and SLR-based ranking can be found in table 5.1. Along with the evaluation scores, a visualisation of the ranking results is provided in figure 5.1, which shows the sorted rank of the desired answer context for each query instance, representing the rank distributions of both ranking methods.

k:            1       2       5       10      50      100     500     1000
Word-based:   0.7707  0.8581  0.9217  0.9498  0.9842  0.9910  0.9971  0.9974
SLR-based:    0.0086  0.0149  0.0306  0.0543  0.1490  0.2291  0.5390  0.7454

Table 5.1: Success rate scores for both retrieval methods at different values of k.

In total, the word-based approach was able to retrieve the right context for a question in 10542 out of 10570 cases at k = 1000, with the lowest-ranked desired context appearing at rank 885. For the SLR-based approach, a total of 7879 out of 10570 desired contexts were retrieved, with the lowest-ranked context appearing at the maximum rank of 1000.


Figure 5.1: The sorted rank distribution of desired answer contexts using SLR and BM25 retrieval.

5.2 Answer Selection

This section provides the results for the QA evaluation of the final Answer Selection stage. As stated in section 4.4, the final answer is selected out of the answer candidates extracted from the top-5 retrieved documents using the previously mentioned formula:

S_qd = (1 − µ) · S_Anserini + µ · S_BERT

Results for the different values of µ are acquired by running the official squad-dev-v1.1 evaluation script and can be found in table 5.3. For comparison purposes, table 5.2 shows the BERT scores for the ideal situation where the desired target context for a given question is directly given.

                         Exact Match:  F1-Score:
BERT + Target Context    0.7670        0.8572

Table 5.2: QA scores for BERT when the desired target context is given directly.


       (a) Word-based           (b) SLR-based
µ      Exact Match:  F1-Score   Exact Match   F1-Score
0.0    0.6723        0.7555     0.0014        0.0273
0.1    0.6798        0.7633     0.0150        0.0285
0.2    0.6858        0.7690     0.0161        0.0294
0.3    0.6875        0.7721     0.0174        0.0307
0.4    0.6882        0.7737     0.0199        0.0331
0.5    0.6775        0.7623     0.0248        0.0385
0.6    0.6644        0.7475     0.0294        0.0431
0.7    0.6521        0.7350     0.0301        0.0432
0.8    0.6397        0.7215     0.0301        0.0427
0.9    0.6309        0.7122     0.0301        0.0428
1.0    0.6217        0.7024     0.0301        0.0428

Table 5.3: QA evaluation scores for the (a) word-based and (b) SLR-based systems at different values of µ.


Chapter 6

Discussion

At both evaluation stages, the difference in performance between the word-based system and the SLR-based system is quite clear.

The retrieval results show a drastic difference in performance between the two systems. At k = 1 the word-based system is able to retrieve roughly 77% of the desired contexts, whereas the SLR-based system only finds roughly 0.9%. When k is increased, the word-based system continually finds more relevant contexts, although the improvement gradually diminishes. The success rate for the SLR-based system also increases, but more slowly and more evenly (figure 5.1). At k = 1000 the word-based system is able to find almost every desired answer context, whereas the SLR-based system still has a success rate lower than that of the word-based system at k = 1. This shows that the word-based ranker is able to find the correct answer context while retrieving only a few documents per question.

Upon further evaluation of the retrieval results of each system, figure 6.1 was plotted to show the retrieval frequency of each unique answer context in the combined top-5 answer context results for all questions. It shows that for the SLR-based system, a few answer contexts appear highly frequently, while many do not appear at all (figure 6.1a). In total for the SLR-based system, only 266 out of a total of 2067 answer contexts appear in the top-5 results, indicating a bias towards the SLRs of a few answer contexts. For the word-based model, each answer context appears at least once in the total top-5 results. This can be seen from the more evenly distributed frequencies in figure 6.1b.

Since the final answer selection stage is entirely dependent on the quality of the retrieved answer contexts in the first stage, performance of the system in the final QA evaluation is consequently limited by the first stage performance.

For the word-based system, the highest score was achieved at µ = 0.4 with an exact match score of roughly 68.8% and an F1-score of roughly 77.4%. These are only decreases of 11.4% and 11.0% respectively compared to the optimal situation
