
Document Ranking Using Sparse Latent Representations in Anserini


Layout: typeset by the author using LaTeX.


Document Ranking Using Sparse Latent Representations in Anserini

evaluating the performance of standalone neural ranking models using a Lucene back end

Lodewijk A.W. van Keizerswaard 11054115

Bachelor thesis Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam Faculty of Science

Science Park 904 1098 XH Amsterdam

Supervisor Prof. dr. ir. J. Kamps

Institute for Logic, Language and Computation Faculty of Science

University of Amsterdam Science Park 904 1098 XG Amsterdam


Abstract

In a recent paper [Zamani et al., 2018] a standalone neural ranking model (SNRM) was proposed that outperforms standard word-based models, and does so without compromising on execution time. The authors overcome the computational challenges of previous attempts at a standalone neural document ranker by using sparse document and query representations. Although these results are very promising, the lack of a search engine capable of using sparse vectors makes them hard to reproduce. This thesis implements such a search engine in Anserini, which uses the highly optimized industry standard Lucene as a back end. Neural ranking is enabled by storing the document activation values in the latent term frequency as integers, and reconverting them to floats at query time. Comparing four different ranking models confirmed the findings of Zamani et al. (2018). This engine can be used to build interactive neural information retrieval systems and enables further research into the use of sparse latent representations in information retrieval.


Contents

1 Introduction 3

2 Document Ranking 5
2.1 Indexing and Searching 5
2.2 Standard Ranking Functions 7
2.3 Neural Ranking 9
2.4 Standalone Neural Ranking Model 10
2.5 Summary 13

3 Anserini: The (Neural) Search Engine 14
3.1 Standard Ranking 14
3.1.1 Standard Indexing 14
3.1.2 Standard Querying 15
3.2 Standalone Neural Ranking 17
3.2.1 Standalone Neural Indexing 17
3.2.2 Standalone Neural Querying 21
3.3 Summary 23

4 Experiments 25
4.1 Data Set 25
4.2 Space Complexity, Performance and Effectiveness 25
4.3 Document Ranking Models 26
4.4 Experimental Setup 28
4.5 Results 28
4.5.1 QET 28
4.5.2 MAP 29
4.5.3 Disk Usage 29
4.6 Summary 29


Chapter 1

Introduction

To date, information retrieval (IR), and specifically document ranking, has not seen the extensive use of neural networks that other natural language processing tasks have. Traditional retrieval models are still irreplaceable because of efficiency constraints. Neural models have so far only been deployed as second-stage rerankers, behind an efficient first-stage traditional gatekeeper. Dense document and query representations are too computationally demanding to use in any real-time information retrieval system, and the influence of the efficient first-stage ranker on the reranking neural model is unclear.

These are the findings of Zamani et al. (2018), who propose a new kind of neural model that promises to deal with these computational and combinatorial challenges. They showed, as a proof of concept, that a standalone neural ranking model with sparse query and document representation vectors can be used in real-time information retrieval systems. However, the lack of a search engine capable of exploiting sparsity in neural representations makes it hard to reproduce these results. The question of this thesis is: how can support for sparse latent representations be implemented, and what are the performance implications of their use in a standalone neural ranking model?

The first part is addressed by implementing a search engine that supports standalone document ranking on sparse latent representations (SLRs). This implementation takes the form of an extension to the research-focused search engine Anserini, which uses the highly optimized, industry standard Lucene framework as back end. The activation values of document representations are stored in the term frequency of the latent term, by converting floats to integers at indexing time and reconverting them at query time. The drawback of using the Lucene framework is that its heavy efficiency optimizations make the actual implementation considerably more complex. However, this also means that if the implementation is successful, the use of SLRs becomes more widely available, immediately for future research and potentially later also for live systems.

The second part is to evaluate the performance of the search engine by gathering data on performance, effectiveness and space complexity for dense and sparse standalone neural ranking models. The comparison between these models makes the performance difference between sparse and dense neural representations clear. Data on a traditional word-based model is also gathered to put this performance in a broader context.

The required background information on traditional and neural document ranking is provided in Chapter 2. This chapter is co-authored with Tijmen van Etten, Felix Rustemeyer and Roel Klein, who use this information for their research on the impact of sparse latent representations on downstream natural language processing tasks. Each of us contributed equally to the writing of this chapter. The four of us also worked closely together on exploring the standard word-based ranking in Anserini and documented the relevant findings in Section 3.1. After creating a first working implementation of the neural search engine, described at the beginning of Section 3.2.1, our paths split. This thesis then continues with a description of a more precise and computationally less intensive implementation in the remainder of Chapter 3. The data set used for testing, the evaluation metrics, the compared document ranking models, the experimental setup as well as the results can be found in Chapter 4. This thesis ends with an interpretation of the results, the presentation of a live demonstration of a working standalone neural ranking model, and a discussion of further research directions.


Chapter 2

Document Ranking

Information Retrieval (IR) is the process of retrieving information from a resource given an information need. Areas of IR research include text-based retrieval, image or video retrieval, audio retrieval and other specialized areas. Text-based retrieval is by far the most active, driven by the rise of the World Wide Web. Today many systems are able to retrieve documents from enormous collections consisting of multiple terabytes or even petabytes of data. These systems have existed for some time, and continue to improve on traditional IR methods. In this chapter, the basis of text-based IR is established, in order to compare different IR methods and to motivate the design choices of the neural search engine.

The following section introduces the concepts of the inverted index and ranking functions, after which two traditional ranking functions are introduced as a point of reference. The remainder of this chapter discusses the current use of neural ranking in IR, its computational challenges, and the way that sparse latent representations (SLRs) promise to deal with those problems.

2.1 Indexing and Searching

When discussing IR, a distinction is made between the indexing time and the query time. The query time is defined as the time it takes the algorithm to retrieve a result. This time is dependent on the algorithm that is used as well as the hardware that it is running on. The goal of IR is then to minimize the query time, which is achieved by moving as much computation as possible from query to indexing time. What to compute at indexing time and how to search this computed data is then the question that IR research is concerned with. A general description of these two stages is given below. The following information is heavily based on [Manning et al., 2008, Ch. 2-4, 6].

Indexing

At indexing time an inverted index is computed from a collection of documents. An inverted index is a data structure that maps content such as words and numbers, sometimes also called tokens, to the locations of these tokens in the document collection. Inverted indexes are widely used in information retrieval systems, for example on a large scale in search engines [Zobel et al., 1998]. The computation of an inverted index can be split into three parts: the tokenization of the documents in the collection, the stemming of the extracted tokens and the mapping of each token to each document. The inverted index can optionally be appended with extra data, which can be attached to either tokens or documents. These stages are discussed below.

Tokenization Tokenization is the extraction of tokens from a document. In traditional IR the extracted tokens are usually words. However, it is also possible to use other tokens such as word sequences (also called n-grams), vector elements or labels. There is a wide range of tokenization algorithms, since different applications and languages ask for specialized algorithms.

A rough token extraction algorithm would split a document on whitespace and punctuation. For example, when using words as tokens, the following sentence

"He goes to the store everyday." (2.1)

would be split into {"He", "goes", "to", "the", "store", "everyday"}. When using n-grams, the result of token extraction with n = 2 is: {"He goes", "goes to", "to the", "the store", "store everyday"}.

Stemming The next step is to stem these words. Stemming is the process of reducing each word to its stem. In the example given above the word "goes" would be changed into its stem: "go". The remaining words are already in their stemmed form. An example of a very popular stemming algorithm is Porter's algorithm. This step usually also includes case-folding, the lowercasing of each word. For more information on stemming algorithms and case-folding, see [Manning et al., 2008, Ch. 2]. The stemmed token set of Sentence 2.1 would be {"he", "go", "to", "the", "store", "everyday"}.

Token Mapping Given the set of stemmed tokens for a document, the tokens can now be mapped to each document. A simple algorithm would just indicate whether a token is present in the document. This mapping can be represented by a dictionary with the token as key and the set of documents containing the token as value: ["token" → {docID, docID, ..., docID}]. Using Sentence 2.1 as content for document 1, and

"We go to the store today." (2.2)

as content for document 2, the following mapping would be acquired after word tokenization and stemming:

Token Mapping: (2.3)
"he" → {1}
"we" → {2}
"go" → {1, 2}
"to" → {1, 2}
"the" → {1, 2}
"store" → {1, 2}
"everyday" → {1}
"today" → {2}
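The mapping above can be built with only a few lines of code. The sketch below is a minimal, self-contained illustration of the token mapping step (it is not part of Anserini); the tokenizeAndStem helper is a toy stand-in for the tokenization and stemming stages described above.

import java.util.*;

public class InvertedIndexExample {
    // Toy stand-in for the tokenization and stemming steps described above.
    static List<String> tokenizeAndStem(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[\\s\\p{Punct}]+")) {
            if (!t.isEmpty()) tokens.add(t.equals("goes") ? "go" : t); // toy "stemmer"
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = Map.of(
                1, "He goes to the store everyday.",
                2, "We go to the store today.");
        // token -> set of document IDs containing that token
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String token : tokenizeAndStem(doc.getValue())) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        index.forEach((token, ids) -> System.out.println(token + " -> " + ids));
    }
}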

Inverted Index Construction After the token mapping is complete, the inverted index can be constructed. In the example above, the inverted index only indicates the presence of a word in a document by saving the document ID. This can be sufficient for small collections with simple search algorithms, but with bigger collection sizes this quickly becomes insufficient. To improve the search functionality, several other statistics can be included in or appended to the inverted index. For example, the location of each term in a document can be saved by storing a list of positions for each mapping from a term to a document. A mapping of token t to a collection of n documents d ∈ D would then be of the form "t" → {..., d_i: [p_1, p_2, ..., p_k], ...}, where the p's are positions and k is equal to the number of occurrences of t in d_i. Other examples of data that can be included are the token frequency per document (storing the value of k explicitly for each token) and the number of tokens in each document. These are commonly used in traditional search algorithms such as TF-IDF and BM25, which will be introduced in Section 2.2.

Searching

Given an inverted index, it is possible to perform search operations using queries. There are two categories of search algorithms: unranked search algorithms and ranked search algorithms. The first category is a simple retrieval of all document IDs that contain (a part of) the query. This is often insufficient, as mentioned in the previous paragraph, so this category will not be discussed any further. Ranked search algorithms rank each document for a query by computing a score representing how relevant the document is to the query. A document with a high score should have a high relevance for the given query. The following section discusses two efficient traditional ranking functions.

2.2 Standard Ranking Functions

As mentioned previously, ranking is the task of sorting given documents in order of relevance to a given query. Ranking can be conducted with the use of standard ranking functions like TF-IDF and BM25. The information about these ranking functions is from [Christopher et al., 2008] and [Robertson and Zaragoza, 2009].

TF-IDF

The TF-IDF ranking approach consists of two phases. First, the given query and documents are converted to a TF-IDF representation. A similarity score is then calculated between these query and document vector representations, using a scoring function like cosine similarity. After the documents are ranked according to the similarity score, the ranking task is completed.

Term frequency - inverse document frequency (TF-IDF) is a vector based model; for each document a vector representation can be built. This vector consists of the TF-IDF values of all the terms in the collection. TF-IDF is a factor weighting the importance of a term to a document in a collection of documents. For an important term in a document, that is, a term that occurs frequently in the document and rarely in other documents, a high TF-IDF value is obtained. A low TF-IDF value is calculated for unimportant terms, such as terms with a very high document frequency.

Given term t, document d and a collection of documents D, the TF-IDF of term t in document d can be calculated by multiplying two statistics of term t: the term frequency and the inverse document frequency.


Term Frequency The term frequency of a term is calculated by dividing the frequency of term t in document d by the number of occurrences of the most frequent term in document d ($\max_k f_{k,d}$):

$$TF_{t,d} = \frac{f_{t,d}}{\max_k f_{k,d}} \qquad (2.5)$$

Hence, a TF of 1 is calculated for the most frequent term in a document, and fractions are calculated for the other terms in the document.

Inverse Document Frequency Solely using TF as a ranking function is not an accurate weighting scheme. TF does not account for the fact that the most frequent words, "stopwords", are not important, because these words are included in a lot of documents. Therefore, documents are not distinguishable by using TF alone.

Hence, TF is multiplied with IDF, decreasing the weight of very frequent terms and increasing the weight of rarely occurring terms. The IDF of a term t is calculated as the binary logarithm of the number of documents N divided by the number of documents containing the term, $N_t$:

$$IDF_t = \log_2\left(\frac{N}{N_t}\right) \qquad (2.6)$$

By multiplying a term's TF and IDF for all the terms in a document, the TF-IDF representation of a document is constructed.

Cosine similarity After TF-IDF vector representations are built, the similarity between documents and queries can be calculated with a scoring function. A standard scoring function, often used for calculating similarity scores between TF-IDF vector representations, is cosine similarity. The cosine similarity between two TF-IDF vectors ranges between zero and one. Because TF-IDF values cannot be negative, the angle between two TF-IDF vectors cannot be greater than 90°. The cosine similarity between the query Q = (Q_1, Q_2, ..., Q_N) and document D = (D_1, D_2, ..., D_N) of length N, where N is equal to the vocabulary size |V|, is given by the following formula:

$$\cos(\theta) = \frac{\sum_{i=1}^{N} Q_i D_i}{\sqrt{\sum_{i=1}^{N} Q_i^2}\,\sqrt{\sum_{i=1}^{N} D_i^2}} \qquad (2.7)$$

Now the documents can be ranked based on the cosine similarity score between the documents and the query.
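As an illustration of Equations 2.5-2.7, the following minimal sketch computes TF, IDF and cosine similarity for already-counted terms; the helper names are illustrative and not taken from any library.

import java.util.*;

public class TfIdfExample {
    // TF as in Equation 2.5: term count divided by the count of the most frequent term.
    static double tf(Map<String, Integer> termCounts, String term) {
        int max = Collections.max(termCounts.values());
        return termCounts.getOrDefault(term, 0) / (double) max;
    }

    // IDF as in Equation 2.6: log2(N / N_t).
    static double idf(int numDocs, int docsContainingTerm) {
        return Math.log((double) numDocs / docsContainingTerm) / Math.log(2);
    }

    // Cosine similarity as in Equation 2.7, over two equally sized TF-IDF vectors.
    static double cosine(double[] q, double[] d) {
        double dot = 0, qNorm = 0, dNorm = 0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            qNorm += q[i] * q[i];
            dNorm += d[i] * d[i];
        }
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }
}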

BM25

Another widely used ranking function is BM25, which ranks documents on the basis of the occurrences of query terms in the documents. Instead of the TF-IDF approach, where a scoring function is employed to calculate the similarity scores between the document vectors and the query vector, BM25 is a single function that calculates a BM25 score between document D and query Q. Lucene has switched from TF-IDF to BM25 as its default ranking function, because of several advantages that are incorporated in BM25.

Similar to TF-IDF, the formula of BM25 contains TF and IDF. However, the versions of these statistics used in BM25 differ from the traditional forms.


Asymptotic TF With the standard TF, a term occurring 1000 times is twice as special as a term with 500 occurrences. Due to this type of TF, the TF-IDF values sometimes do not match the expectations of users. For example, an article containing the word "lion" 8 times is considered twice as relevant as an article containing the word "lion" 4 times. Users do not agree with this; according to them the first article is more relevant, but not twice as relevant. BM25 handles this problem by adding a maximum TF value, the variable k. Due to this maximum, the TF curve of BM25 asymptotically approaches k but never exceeds it. In Equation 2.8, the term frequency is indicated with f(q_i, D).

Smoothed IDF The only difference between the standard IDF, used in the calculation of TF-IDF, and the BM25 IDF is the addition of one before taking the logarithm. This is done because the IDF of terms with a very high document frequency can otherwise be negative. This is not the case with the IDF in the BM25 formula, due to the addition of one.

Document length Another distinction between TF-IDF and BM25 is the fact that BM25 includes document length. TF-IDF does not take document length into account, and this leads to unexpected TF-IDF values. For example, the TF-IDF value of the term "lion" that appears two times in a document of 500 pages is equal to the TF-IDF value of the same term appearing twice in a document of one page. Thus, according to TF-IDF the term "lion" is equally relevant in a document of 500 pages and a document consisting of one page. This is not intuitive, and BM25 handles this by normalizing TF with the document length factor (1 − b + b · |D|/avgdl), where |D| is the length of document D and avgdl is the average document length. The influence of document length on the BM25 score is controlled by the variable b.

$$\text{BM25 Score}(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k + 1)}{f(q_i, D) + k \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)} \qquad (2.8)$$

After the calculation of the BM25 scores for all the documents in the collection, the documents can immediately be ranked based on these scores.
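The following sketch shows how Equation 2.8 can be evaluated per query term, assuming the term and document statistics are already available. The k and b values are common defaults (k = 1.2, b = 0.75) and the smoothed IDF variant shown is one possible choice; neither is prescribed by the text above.

public class Bm25Example {
    // A smoothed IDF variant: log(1 + (N - n_t + 0.5) / (n_t + 0.5)).
    // The text above only states that one is added before taking the logarithm.
    static double idf(int numDocs, int docsContainingTerm) {
        return Math.log(1.0 + (numDocs - docsContainingTerm + 0.5) / (docsContainingTerm + 0.5));
    }

    // One term's contribution to the BM25 score of Equation 2.8.
    static double termScore(double tf, double idf, double docLength, double avgDocLength,
                            double k, double b) {
        double norm = 1 - b + b * (docLength / avgDocLength);
        return idf * (tf * (k + 1)) / (tf + k * norm);
    }

    // Sum over all query terms, given parallel arrays of per-term statistics.
    static double score(double[] tfs, double[] idfs, double docLength, double avgDocLength) {
        double k = 1.2, b = 0.75; // common defaults, assumed here
        double score = 0;
        for (int i = 0; i < tfs.length; i++) {
            score += termScore(tfs[i], idfs[i], docLength, avgDocLength, k, b);
        }
        return score;
    }
}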

2.3 Neural Ranking

In recent years, neural networks have been successfully applied to a variety of information retrieval tasks, including ad-hoc document ranking [Guo et al., 2017, Xiong et al., 2017, Dehghani et al., 2017a]. Neural ranking can in general be understood by dividing the ranking process into three steps: generating a document representation, generating a query representation, and relevance estimation. Neural ranking models can be classified based on the step where a neural network is applied.

In general, there are two types of neural ranking models: early combination models and late combination models [Dehghani et al., 2017a]. Early combination models often use text-based representations of the document and the query, after which interactions between the document and the query are captured in a single representation. This interaction can, for example, consist of manually created features or exact query term matches. The document-query representation is then given as input to a neural model to compute a similarity score. Examples of early combination models include DRMM, DeepMatch and ARC-II.

On the other hand, the late combination models use neural networks to create separate query and document embeddings. The document relevance is then computed using a simple similarity function, such as the cosine similarity or a dot product. Examples include DSSM, C-DSSM and ARC-I.

One advantage of late combination is that all document embeddings are query independent, which means that they can be precomputed. This way only the query embedding and a simple similarity function have to be calculated at query time, whereas early combination models have a more complex neural network for relevance estimation.

However, most of the time late combination models are still not efficient enough to provide close to real-time retrieval. This is because the document and query embeddings lack the sparsity of traditional word-based representations. Creating an inverted index using document embeddings is still technically possible, but will not give a speedup unless they are sparse vectors. Document embeddings are usually so dense that there are hardly any zero values. As a result, the posting lists for each latent term would contain all documents and no efficiency is gained. This is one of the main reasons that standalone neural ranking models are not used in practice. However, there are methods to still include neural ranking models in the ranking process.

The most popular way is to use a multi-stage ranker. Here a traditional first stage ranker passes the top documents to a second stage neural ranking model. These stacked ranking models seem to take advantage of both the efficiency of traditional sparse term based models and the effectiveness of dense neural models. However, the first-stage ranker acts as a gatekeeper, possibly reducing the recall of relevant documents.

2.4 Standalone Neural Ranking Model

In a recent paper, Zamani et al. (2018) took a different approach to neural ranking in IR and proposed the Standalone Neural Ranking Model (SNRM). According to the paper, the original multistage approach leads to inefficiency due to the stacking of multiple rankers that are all working at query time, and also to a loss in retrieval quality due to the limited set of documents continuing to the neural stage. They propose to use a sparse latent representation (SLR) for documents and queries instead of a dense representation, to achieve similar levels of efficiency and quality as a multistage ranker. These representations consist of significantly more zero-valued elements than dense representations, but are still able to capture meaningful semantic relations through the learned latent dimensions. Figure 2.1 shows the distributions of different representations and illustrates the Zipfian nature of the SLR, which matches far fewer documents than the dense representation used by multistage rankers. This approach with sparse latent representations allows for a fully neural standalone ranker, since a first stage ranker is not required for efficiency reasons. The following sections provide an overview of how the SNRM works and how it is trained, as explained in [Zamani et al., 2018].

Model

The efficiency of the SNRM lies in the ability to represent the query and documents using a sparse vector rather than a dense vector. For ranking purposes, these sparse representation vectors for documents and queries should be in the same semantic space. To achieve this desideratum, the parameters of the representation model remain the same for both the document and the query. Simply using a fully-connected feed-forward network for the implementation of this model would result in similar sparsity for both query and document representations. However, since queries are intuitively shorter than the actual documents, a desirable characteristic for efficiency reasons is that queries have a relatively sparser representation. To achieve this, a function based on the input length is used for the sparse latent representation. The function to obtain the representation for a document, and similarly for a query, is provided below.

$$\phi_D(d) = \frac{1}{|d| - n + 1} \sum_{i=1}^{|d| - n + 1} \phi_{ngram}(w_i, w_{i+1}, \ldots, w_{i+n-1}) \qquad (2.9)$$

where $w_1, \ldots, w_{|d|}$ are the words of document d and $\phi_{ngram}$ produces a sparse representation for a single n-gram.

Figure 2.1: The document frequency for the top 1000 dimensions in the actual term space (blue), the latent dense space (red), and the latent sparse space (green) for a random sample of 10k documents from the Robust collection.

As can be seen, the final representation is averaged over a number of n-grams that is directly controlled by the length of the input. The larger the input, the more n-grams the representation is averaged over and the more likely it is that different activated dimensions are present. Another advantage of this function is that it captures term dependencies through the sliding window over n-grams. Such dependencies have been shown to be helpful for information retrieval [Metzler and Croft, 2005].
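A schematic sketch of Equation 2.9 is given below, assuming a phiNgram placeholder that maps one n-gram to a fixed-size latent vector; the dimensionality and n-gram size are illustrative values, not those used by Zamani et al. (2018).

public class SlrRepresentationExample {
    static final int DIMS = 20000; // latent dimensionality, illustrative value
    static final int N = 5;        // n-gram size, illustrative value

    // Placeholder for the phi_ngram network: maps one n-gram to a DIMS-dimensional vector.
    static double[] phiNgram(String[] words, int start) {
        return new double[DIMS]; // a real model would run the feed-forward network here
    }

    // Equation 2.9: average the n-gram representations over a sliding window.
    static double[] phiDocument(String[] words) {
        int windows = words.length - N + 1;
        double[] rep = new double[DIMS];
        for (int i = 0; i < windows; i++) {
            double[] ngramRep = phiNgram(words, i);
            for (int j = 0; j < DIMS; j++) rep[j] += ngramRep[j] / windows;
        }
        return rep;
    }
}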

The φ_ngram function is modelled by a fully-connected feed-forward network. This does not lead to density problems because the input is of fixed size. Using a sliding window, embedding vectors are first collected for the input using a relevance based embedding matrix. After embedding the input, the vectors are fed into a stack of fully-connected layers with an hourglass structure (see Figure 2.2). The middle layers of the network have a small number of units that are meant to learn the low dimensional manifold of the data. In the upper layers the number of hidden units increases, leading to a high dimensional output.

Training

The SNRM is trained with a set of training instances consisting of a query, two document candidates and a label indicating which document is more relevant to the query. In the training phase there are two main objectives: the Retrieval Objective and the Sparsity Objective. In order to achieve the Retrieval Objective, hinge loss is employed as the loss function during the training phase, since it has been widely used in the ranking literature for pairwise models.

$$L = \max\big(0,\ \epsilon - y_i\big[\psi(\phi_Q(q_i), \phi_D(d_{i1})) - \psi(\phi_Q(q_i), \phi_D(d_{i2}))\big]\big) \qquad (2.10)$$

For the Sparsity Objective, the minimization of the L1 norm is added to this loss function, since minimizing the L1 norm has a long history in promoting sparsity [Kutyniok, 2013]. The final loss function can be computed as follows:

$$L(q_i, d_{i1}, d_{i2}, y_i) = \max\big(0,\ \epsilon - y_i\big[\psi(\phi_Q(q_i), \phi_D(d_{i1})) - \psi(\phi_Q(q_i), \phi_D(d_{i2}))\big]\big) + \lambda\, L_1\big(\phi_Q(q_i) \,\|\, \phi_D(d_{i1}) \,\|\, \phi_D(d_{i2})\big) \qquad (2.11)$$

Figure 2.2: A general overview of the model learning a sparse latent representation for a document.

Here q is the query, d1 is document 1, d2 is document 2, and y indicates the more relevant document using the set {1, -1}. Furthermore, φ_Q denotes the SLR function for a query and φ_D for a document. Lastly, ψ() is the matching function between query and document SLRs, explained in the next section.

An additional method called 'weak supervision' is used to better train the model. This unsupervised learning approach works by obtaining 'pseudo-labels' from a different retrieval model (a 'weak labeler'), and then uses the obtained labels to create new training instances. For each instance (q_i, d_{i1}, d_{i2}, y_i), either two candidate documents are sampled from the result list of the weak labeler, or one is sampled from the result list along with one random negative sample from the collection. y_i is defined as:

$$y_i = \text{sign}\big(p_{QL}(q_i \mid d_{i1}) - p_{QL}(q_i \mid d_{i2})\big) \qquad (2.12)$$

where p_QL denotes the query likelihood probability, since in this case a query likelihood retrieval model [Ponte and Croft, 1998] with Dirichlet prior smoothing [Zhai and Lafferty, 2017] is used as the weak labeler.

This method has been shown to be an effective approach to training neural models for a range of IR and NLP tasks, including ranking [Dehghani et al., 2017b], learning relevance-based word embeddings [Zamani and Croft, 2017] and sentiment classification [Deriu et al., 2017].

Retrieval

The matching function between document and query is the dot product over the intersection of the non-zero elements of their sparse latent representations. The score can be computed as:

$$\text{retrieval score}(q, d) = \sum_{i:\ |\vec{q}_i| > 0} \vec{q}_i \cdot \vec{d}_i \qquad (2.13)$$

The simplicity of this function is essential to the efficiency of the model, since the matching function is used frequently at inference time. The matching function will calculate a retrieval score for all the documents that have a non-zero value for at least one of the non-zero latent elements in the sparse latent representation of the query.

Finally, using the new document representations, an inverted index can be constructed. In this case, each dimension of an SLR can be seen as a latent term, capturing meaningful semantics about the document. The index for a latent term contains the IDs of the documents that have a non-zero value for that specific latent term. So, for the construction, if the i-th dimension of a document representation is non-zero, this document is added to the posting list of latent term i.
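A minimal sketch of this retrieval step over a latent-term inverted index is shown below; the posting-list layout and names are illustrative and do not reflect Anserini's internal data structures.

import java.util.*;

public class SparseRetrievalExample {
    // Posting list per latent term: document ID -> activation value in that document.
    static Map<Integer, Map<Integer, Double>> index = new HashMap<>();

    // Equation 2.13: only documents matching at least one non-zero query dimension get scored.
    static Map<Integer, Double> score(Map<Integer, Double> queryRep) {
        Map<Integer, Double> scores = new HashMap<>();
        for (Map.Entry<Integer, Double> q : queryRep.entrySet()) {
            Map<Integer, Double> postings = index.getOrDefault(q.getKey(), Map.of());
            for (Map.Entry<Integer, Double> posting : postings.entrySet()) {
                scores.merge(posting.getKey(), q.getValue() * posting.getValue(), Double::sum);
            }
        }
        return scores; // rank documents by descending score
    }
}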

2.5 Summary

This chapter started with an introduction of a traditional indexing and searching pipeline, including a description of tokenization, stemming, token mapping and two traditional ranking algorithms: TF-IDF and BM25. These efficient traditional word based ranking algorithms form the basis for multistage neural reranker models, which are currently the most popular way to circumvent the computational challenges of using dense neural representations. These models have their own challenges, since the combined effects of two ranking stages are not clearly understood, and the first stage ranker functions as a gatekeeper, thereby possibly reducing the potential of the second stage neural ranker.

The standalone neural ranking model (SNRM) described in the previous section promises to overcome both the computational challenges of dense neural ranking models and the combinatorial challenges of multistage rankers by using sparse latent representations. The model is trained on both a sparsity and a retrieval objective, which results in sparse latent query and document representations. This sparsity is reflected in the Zipfian characteristic of the latent term frequency distribution.

These sparse latent representations can be indexed similarly to traditional word based inverted indexing. The remaining challenge is integrating support for the computation of document relevance scores, based on these sparse vectors, into Anserini.


Chapter 3

Anserini: The (Neural) Search Engine

Open-source toolkits play an important role in information retrieval research. The ability to evaluate performance benchmarks is an important aspect of these toolkits that provides a valuable addition to information retrieval (IR) research.

This is the exact motivation behind Anserini: Reproducible Ranking Baselines Using Lucene [Yang et al., 2017]. As its description states, Anserini focuses on the ability to reproduce results of IR benchmarking, serving as a widely accessible point of comparison in IR research. Due to this 'reproducibility promise' of Anserini [Lin et al., 2016], it is a suitable toolkit for research into the use of sparse latent representations for document ranking and the impact that this use has on IR and downstream NLP tasks. The fact that Anserini is built on Lucene is another argument to use it. Lucene is a Java library that provides many advanced searching functionalities. It has become the de-facto standard for many search engines and would therefore make this research widely applicable.

Anserini provides the basis for the neural search engine. In order for Anserini to actually be able to use the SNRM's sparse latent representations, some changes had to be made. The changes necessary to use a sparse latent representation besides a term frequency representation can be divided into those related to document indexing and those related to query representation and search. In this chapter, the relevant parts of Anserini's standard indexing and searching implementation are outlined, after which a description of the SNRM implementation follows. The complete implementation can be found on the GitHub page of a forked Anserini version [van Keizerswaard et al., 2020].

3.1 Standard Ranking

To be able to implement an extension to Anserini that supports SLR based ranking, the standard indexing and searching pipelines have to be understood. The relevant parts of these pipelines are presented below.

3.1.1 Standard Indexing

Lucene provides a lot of functionality, such as multi-threaded indexing, but only as a collection of indexing components. This means that an end-to-end indexer cannot be obtained by using Lucene alone. For the task of inverted indexing, Anserini can be seen as a wrapper around Lucene, because it assembles the Lucene indexing components into an end-to-end indexer.

The Anserini indexing pipeline is used for indexing text from a collection of text documents. The indexing component consists of several Java files and classes. The Java class IndexCollection, located in "IndexCollection.java", is the entry point for the indexing process and controls the indexing pipeline. In this section this standard indexing approach is discussed. To run the indexing pipeline, a command has to be executed with several flags that determine the collection, the output folder and the index settings.

Index The file "DefaultLuceneGenerator.java" is the first component in the indexing pipeline. A given collection is converted into Lucene documents containing three fields: contents, id and raw. By converting a collection into Lucene documents, an index can be constructed consisting of the inverted index and the documents.

Inverted Index 'Contents' stores the data that Anserini will use to build the inverted index. In the standard indexing approach an inverted index is constructed from the text of the documents in the collection. This is done by the StandardEnglishAnalyser, which processes each term in the 'contents' field. As discussed in Section 2.1 and illustrated in Figure 3.1, an inverted index consists of terms (words, in the case of standard indexing), the IDs of the documents that contain these words, and the document frequencies per term. The ID value of a document is stored in the 'id' field of the Lucene document and connects the inverted index with the documents stored in the index.

Documents The documents are stored together with the document ID and the content from the 'raw' field. This is an optional field and can store additional information about the document, like a title, the number of words in the document or the document's sparse latent representation (see 3.1.2).
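As an illustration, the sketch below assembles a Lucene document with these three fields using the standard Lucene document API; the field options (tokenized contents, stored id and raw) follow the description above rather than Anserini's exact generator code.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class LuceneDocumentExample {
    // Builds a Lucene document with the three fields described above.
    static Document makeDocument(String id, String contents, String raw) {
        Document doc = new Document();
        // 'id' connects the inverted index with the stored document; stored, not tokenized.
        doc.add(new StringField("id", id, Field.Store.YES));
        // 'contents' is analysed and feeds the inverted index; not stored here.
        doc.add(new TextField("contents", contents, Field.Store.NO));
        // 'raw' is an optional stored-only field for extra per-document data.
        doc.add(new StoredField("raw", raw));
        return doc;
    }
}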

3.1.2 Standard Querying

Given the index described above, the traditional TF-IDF or BM25 algorithms can perform a ranked search on this index. Inside the search function of the SearchCollection object, several objects are constructed to perform scoring. This function takes in the query, an IndexReader and a Similarity object, and returns the scored documents in a ScoredDocuments object. The query is the raw input string read by Anserini, the IndexReader provides a way to read the inverted index, and the Similarity object is used to compute document scores. The construction sequence consists of the QueryGenerator, BooleanQuery, one or more TermQuery, TermWeight, FixedSimilarity and Scorer objects, where the type of the last two depends on the type of ranking function that is used. The significant parts of all these objects, with respect to this project, are discussed below. An overview of this process can be found in Figure 3.2.

QueryGenerator The QueryGenerator object takes in the query and extracts tokens (terms) from it. Then, an analyser removes stopwords and stems the terms if needed. What remains is a collection of terms, also called a bag of words. For every term a TermQuery object is created, which is then added to the BooleanQuery. This process effectively splits the scoring of documents into scoring per term.


Figure 3.1: Overview of the index


TermQuery For the purpose of this project, the TermQuery object is simply a wrapper for the TermWeight object. The TermQuery contains the term that needs to be matched and scored, and creates a TermWeight object.

TermWeight The function of the TermWeight object is to compute the data needed to score a document, and to find the documents that match the term. The IndexReader is used to gather the information required to create a TermStatistics and a CollectionStatistics object for the term. These objects are passed to a Similarity object to create a FixedSimilarity object (SimFixed in Figure 3.2), which has a score function that can score a document for a single term. The document matching also makes use of the IndexReader object to create a document enumeration, which confusingly is called a TermsEnum in the Anserini source code. This is done by looking up the document IDs that the term maps to in the inverted index, as described in Section 2.1. Both the SimFixed and the document enumeration are passed to the TermScorer object.

Similarity The Similarity object determines the type of ranking function that is used (options include TF-IDF, BM25, query likelihood, and others).

FixedSimilarity The SimFixed is an object whose only function is to take a document as a parameter and return a term-document score. The exact computation of the score differs between ranking functions.

Scorer A Scorer object receives the list of matched documents and a term specific FixedSimilarity object. The search function scores every matched document for each term using the score function of the SimFixed object. The score for every document is calculated by taking the sum over the matched term specific scores. These scores are accumulated in a ScoredDocuments object and are returned to the parent function to obtain a document ranking. Worth noting is that this search algorithm architecture does not score every document. If a document contains none of the terms from the query, it is never matched, and thus never scored. This is a property that is later used in the design of the sparse dot product computation (see Section 3.2.2).

3.2 Standalone Neural Ranking

The changes that were made to the Anserini engine to support a full SNRM are extensions to the existing implementation. By implementing the support for the SNRM in new classes that can be used by specifying new command line arguments, the Anserini engine loses none of its functionality. In the following paragraphs, the design of the new classes is discussed, first for indexing and then for searching. To make the impact analysis of using an SNRM easier, the implementation stays as close as possible to the standard indexing and searching algorithms described in Section 3.1.

3.2.1 Standalone Neural Indexing

To keep the implementation of the indexing process as simple as possible, the choice was made to store the activation value of each token (latent index) in the token frequency, as if the neural search were a word based search. Because large parts of the standard analysis pipeline can be reused this way, the implementation costs considerably less time and the difference between standard and neural searching remains minimal. This means that the impact of neural search can be better understood.

Figure 3.3: An example JSON document collection containing sparse latent representations for each document.

Using this approach, the extension to Anserini has to store the activation value, which is a floating point number, as a term frequency, which is an integer. A first version could be to store the latent term x = d_i · 10^p times in a string, where d_i is activation value i of the document and p is the desired decimal precision. This string can then be put in the content field of a document and fed to the default Anserini word processing pipeline. For small collections with a value of p < 3 this would be an option, but when the collection gets larger, the string generation complexity explodes. For a collection of size N and a sparsity model with d dimensions and sparsity ratio s_r, the generated data in bytes would be

$$S = \Big[d \cdot (1 - s_r)\Big] \cdot \Big[\big(\lceil \log_{10}(d) \rceil + 1\big) \cdot 10^{p}\Big] \cdot \Big[N\Big]. \qquad (3.1)$$

The first factor of Equation 3.1 is the average number of active dimensions per document, the second factor is the average number of bytes each active dimension takes up (log10(d) digits and a space), and the last factor is the number of documents in the collection. As an example, a collection of 10,000 documents, a sparsity model with 1000 dimensions, a sparsity ratio of 0.9 and a precision of 2 decimal places would generate 476 MB of data. However, with a desired decimal precision of 5 decimal places, a total of 47 MB per document × 10,000 documents ≈ 465 GB of data would be generated. This algorithm requires a lot of resources even for small collections, and is not desirable for the larger scale goal of this project.
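For reference, Equation 3.1 can be transcribed directly into code; the sketch below is only a helper for evaluating the formula, with variables named as defined above.

public class StringGenerationCost {
    // Equation 3.1: generated bytes for n documents, d latent dimensions,
    // sparsity ratio sr and decimal precision p.
    static double generatedBytes(int d, double sr, int p, long n) {
        double activeDims = d * (1 - sr);                                       // active dimensions per document
        double bytesPerDim = (Math.ceil(Math.log10(d)) + 1) * Math.pow(10, p);  // digits plus a space, repeated
        return activeDims * bytesPerDim * n;
    }
}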

A second version, and for this project the final version, is to set the term frequency directly instead of relying on the word counter of Lucene. This eliminates the need to generate enormous amounts of data. Setting a custom term frequency is possible by adding a TermFrequencyAttribute to a custom document analyser. The way that the latent terms are fed to this analyser needs to be customized as well for this to work. The classes that achieve this are the SLRGenerator, the SLRAnalyzer and the SLRTokenizer. The significant differences between these classes and their standard counterparts follow below.

SLRGenerator An SLRGenerator object is created that converts a given document into a Lucene document. An example of an input document collection in JSON format can be found in Figure 3.3. Both the value of "content" and "id" are passed to the generator object per document. The difference between the DefaultLuceneGenerator and the SLRGenerator is that the latter has to retrieve the sparse latent representation from the document and store it in a specific way. The first step is done by reading the content value and passing it to the getSLRFromContent function. As seen in Listing 3.1, the tab separated values are split and stored in a HashMap that contains pairs of latent term IDs and activation values. Floats that are formatted in scientific notation get converted to the normal float representation by the normalizeFloatFormat function. For instance, this converts "3.0194822e-06" to "0.0000030194822". This is important to prevent unwanted formatting by the indexing pipeline in between the SLRGenerator and the SLRTokenizer.

// Splits the tab separated content field and stores every non-zero
// (latent term id, activation value) pair in slrMap.
private void getSLRFromContent(String content) {
    slrMap.clear();
    String[] splitValues = content.split("\\t");
    for (int i = 0; i < splitValues.length; i++) {
        if (splitValues[i] != null && !splitValues[i].isEmpty() && !splitValues[i].equals("\n")) {
            try {
                if (Float.parseFloat(splitValues[i]) != 0) {
                    // Rewrite scientific notation before storing the value.
                    String mapValue = normalizeFloatFormat(splitValues[i]);
                    slrMap.put(Integer.toString(i), mapValue);
                }
            } catch (Exception e) {
                // Non-numeric entries are skipped.
            }
        }
    }
}

Listing 3.1: The getSLRFromContent function of the SLRGenerator object that fills the slrMap.

When constructing an SLR based index, indicated by the corresponding "-slr.index" command line argument, the document representation is stored in the content field. The format is a string based concatenation of latent term and activation value: "$(index)$(activation value) ...". This concatenation is computed by the slrToContent function in Listing 3.2.

// Concatenates every latent term id and activation value into a single
// space separated string: "$(index)$(activation value) ...".
private String slrToContent() {
    String rep = "";
    for (Map.Entry<String, String> cursor : slrMap.entrySet()) {
        rep += " " + cursor.getKey() + cursor.getValue();
    }
    return rep;
}

Listing 3.2: The slrToContent function of the SLRGenerator object that concatenates the latent-term-value pairs in a string.

This format will later be used by the SLRAnalyzer and SLRTokenizer to reconstruct the term-value pairs.

SLRAnalyzer The SLRAnalyzer constructs the components of a TokenStream, just like the DefaultEnglishAnalyzer. However, for the construction of the neural index the only content that needs to be analysed is the sparse latent representation. Since the format of the content is completely known, a single component suffices to correctly tokenize a document: the SLRTokenizer.

SLRTokenizer The "incrementToken" function of the SLRTokenizer processes every token in the content field constructed by the SLRGenerator. This means each token is a latent-term-value pair. The first step is the extraction of the latent term and the activation value into the corresponding buffers of this object. This is done based on the position of the '.' (dot) character, since the format output by the SLRGenerator guarantees that the activation value starts with "0.". The latent term is then extracted by the getSLRToken function shown in Listing 3.3.


private void getSLRToken(char[] buffer) {
    int valStart = getSLRDotPos(buffer) - 1;
    int zeroPaddingLenght = SLR_TOKEN_LENGHT - valStart;
    // Left-pad the latent term with zeros up to the fixed token length.
    for (int i = 0; i < tokenBuffer.length; i++) {
        tokenBuffer[i] = (i < zeroPaddingLenght) ? '0' : buffer[i - zeroPaddingLenght];
    }
}

Listing 3.3: The getSLRToken function of the SLRTokenizer object that extracts the latent-term from a character array.

Since the length of the character buffer that holds the latent term determines the maximum number of neural dimensions, this length needs to be considered. Because the size of the buffer does not have a significant influence on indexing or searching performance, and a neural model with more than 100,000 dimensions is very rare, the latent term buffer has a length of five characters. If the latent term has fewer than five characters, its left side is padded with zeros, to prevent reallocating a character array with a shorter length.

After the latent term, the activation value is extracted. Since the float format is non-scientific, this function can extract the float value given the position of the dot character (see Listing 3.4).

private void getSLRValue(char[] buffer) {
    int decimalStart = getSLRDotPos(buffer) + 1;
    // Copy the digits trailing the dot; anything else becomes a zero.
    for (int i = 0; i < valueBuffer.length; i++) {
        valueBuffer[i] = (Character.isDigit(buffer[i + decimalStart])) ? buffer[i + decimalStart] : '0';
    }
}

Listing 3.4: The getSLRValue function of the SLRTokenizer object that extracts the activation value from a character array.

Since the term frequency needs to be stored as an integer, and the conversion from a floating point number to an integer is imprecise, the valueBuffer is filled with the n characters trailing the '.'. This can later be precisely converted to an integer representing the activation value multiplied by 10^p, where p is the desired decimal precision. This value can be set using the "-slr.decimalPrecision [p]" command line argument, and is subject to some constraints. First of all, the integer representing the activation value cannot be larger than the maximum value of a 32-bit integer (2,147,483,647 ≈ 2 · 10^9). This means that the decimal precision has a maximum value of 9.

However, this is not the only constraint. Every document also has a maximum term count, which is equal to 2,147,483,647. This means that the maximum decimal precision p_max depends on the document with the highest sum of active latent terms (see Equation 3.2).

$$p_{max} = p \ \Big|\ \max_{doc \in C}\Big(\sum_{i \in doc} doc_i \cdot 10^{p}\Big) < 2{,}147{,}483{,}647 \qquad (3.2)$$

This equation does not have to be calculated for every index creation, but serves as a guideline when choosing a value for the decimal precision. When this value is set too high, the program will raise an integer overflow exception when processing the document whose activation value sum is too high.

Given the decimal precision value, the TermFrequencyAttribute is added to the SLRTokenizer to set the term frequency of a latent term in the inverted index. The term frequency of the latent index can now simply be set using the "setTermFrequency" function. The activation value is now stored in the term frequency as an integer.
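To illustrate the mechanism, the sketch below shows a minimal Lucene tokenizer that emits one token per pre-parsed latent term and writes the integer-encoded activation value through TermFrequencyAttribute. It is a simplified stand-in for the SLRTokenizer, not the thesis code: it takes the already-parsed pairs as a constructor argument instead of reading them from the token stream input.

import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

// Simplified stand-in for the SLRTokenizer: emits one token per latent term and
// stores the scaled activation value as that token's term frequency.
public final class LatentTermTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final TermFrequencyAttribute tfAtt = addAttribute(TermFrequencyAttribute.class);
    private final Iterator<Map.Entry<String, Integer>> pairs;

    // 'scaledActivations' maps a zero-padded latent term id to its activation value * 10^p.
    public LatentTermTokenizer(Map<String, Integer> scaledActivations) {
        this.pairs = scaledActivations.entrySet().iterator();
    }

    @Override
    public boolean incrementToken() {
        if (!pairs.hasNext()) {
            return false;
        }
        clearAttributes();
        Map.Entry<String, Integer> pair = pairs.next();
        termAtt.setEmpty().append(pair.getKey());   // the latent term, e.g. "00042"
        tfAtt.setTermFrequency(pair.getValue());    // the integer-encoded activation value
        return true;
    }
}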


Figure 3.4: A simplified overview of the document scoring process implemented in [van Keizerswaard et al., 2020]

In short, this indexing implementation can store activation values up to a precision of nine decimal places. Furthermore, the indexing process is efficient: using pre-computed representations, this implementation is just as fast as word based indexing (the computation of the document representations themselves is not included). Finally, because the vocabulary size is much smaller, the space complexity of the resulting Lucene index is lower than that of a word based index.

3.2.2 Standalone Neural Querying

The goal of the extension at query time is to get the sparse latent representation of the query down to a document scoring function that implements the sparse dot product between two vectors. To be able to do this, the entire scoring sequence is customized. However, most objects are only slightly different from their traditional counterparts. The way that a document score is calculated for a word based query, using the BooleanQuery object, is very similar to a sparse dot product calculation: only documents that match a term get scored and ranked. The description of the new implementation below only covers the significant changes relative to word based ranking, described in Section 3.1.2. An overview of the SNRM query process can be found in Figure 3.4.

SLRQueryGenerator The task of the SLRQueryGenerator is twofold. First it extracts the neural representation from the provided query. The input query is required to be a whitespace separated array of floats. The getSLRFromQuery function in Listing 3.5 extracts the non-zero latent terms, pads every latent term with zeros on the left to match the latent terms in the index, and then puts every pair in a hash map.


// Parses the whitespace separated query representation and stores every
// non-zero (zero-padded latent term, activation value) pair in slrMap.
private void getSLRFromQuery(String query) {
    slrMap.clear();
    String[] latentTerms = query.split("\\s");

    for (int i = 0; i < latentTerms.length; i++) {
        if (Float.parseFloat(latentTerms[i]) != 0) {
            slrMap.put(zeroPaddingLatentTerm(Integer.toString(i)),
                    Float.parseFloat(latentTerms[i]));
        }
    }
}

Listing 3.5: The getSLRFromQuery function of the SLRQueryGenerator object that fills the slrMap.

The second step is the creation of an SLRQuery object for every pair of latent term and activation value. They are all added to the BooleanQuery object.
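A minimal sketch of this step with plain Lucene classes is given below; it uses TermQuery and BoostQuery as stand-ins for the SLRQuery described next, so it only illustrates how the latent term-value pairs end up in one BooleanQuery.

import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SlrQueryExample {
    // Builds a disjunctive query over all non-zero latent terms of the query SLR.
    // The activation value is carried as a boost here, purely for illustration.
    static Query buildQuery(Map<String, Float> slrMap) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (Map.Entry<String, Float> pair : slrMap.entrySet()) {
            Query termQuery = new TermQuery(new Term("contents", pair.getKey()));
            builder.add(new BoostQuery(termQuery, pair.getValue()), BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }
}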

SLRQuery The SLRQuery can be seen as a normal TermQuery from the Lucene engine, with a string representation of the latent index number in the term property. Besides this standard term property, the SLRQuery has an activation value property. The rest of the implementation is quite similar to the TermQuery implementation. The only exception worth mentioning is the createWeight function, which returns an SLRWeight object instead of a TermWeight object.

SLRWeight The SLRWeight, very similar to the TermWeight, is responsible for the creation of a TermStatistics and a CollectionStatistics object, for finding the documents that contain the latent term, and for using these to create a term specific FixedSimilarity object. Only the computation of the statistics is different from the standard version. A simple way to get the activation value of a term into the scoring function is by storing it in the TermStatistics, which normally stores the term frequency in a long. This is why the bytes of the activation value, a float, are directly copied (twice) to the bytes of a long and passed on to the constructor of the TermStatistics. The CollectionStatistics object is left empty, since no collection statistics, such as term frequency over the entire collection, are used.
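A minimal sketch of this float-to-long packing, using the standard Float bit-conversion methods, is shown below; the helper names are illustrative and do not mirror the SLRWeight source.

public class FloatLongPacking {
    // Stores the float bits in both halves of a long, so the value survives
    // being passed through a field that normally holds a term frequency.
    static long floatToLong(float value) {
        long bits = Float.floatToIntBits(value) & 0xFFFFFFFFL;
        return (bits << 32) | bits;
    }

    // Recovers the float from the lower half of the long.
    static float longToFloat(long packed) {
        return Float.intBitsToFloat((int) packed);
    }

    public static void main(String[] args) {
        float activation = 0.0312f;
        System.out.println(longToFloat(floatToLong(activation))); // prints 0.0312
    }
}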

Although the matching of documents is not different from the standard implementation, the matched documents in the DocumentEnum object are worth mentioning, because the computation of this list is where the functionality of the Lucene engine can be used for the efficient neural scoring of sparsely represented documents. Since the latent terms are sparse in the representation, the list of matched documents is significantly shorter for each term than when using a dense neural representation.

SLRSimFixed The TermStatistics and CollectionStatistics objects are used for the creation of an SLRSimFixed object, which is returned by the SLRSimilarity scorer function (see Listing 3.6). This function extracts the term specific activation value from the TermStatistics and uses it together with the decimal precision of the index (specified by the "-slr.ip [p]" command line argument) to create an SLRSimFixed object. The activationValueDivider is equal to 10^p.


public final SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Objects.requireNonNull(termStats);
    TermStatistics ts = termStats[0];

    // The activation value was packed into the docFreq long by the SLRWeight.
    float queryValue = longToFloat(ts.docFreq());

    return new SLRSimFixed((float) this.activationValueDivider, queryValue);
}

Listing 3.6: The scorer function of the SLRSimilarity object returns a SLRSimFixed object that can score a document on a single term.

The final score function in Listing 3.7 then takes in the term frequency of a document and an unused normalization factor, and returns the score for that document on the object specific latent term. This function is called by the SLRScorer object.

public float score(float freq, long norm) {
    // freq holds the integer-encoded document activation value; norm is unused.
    return this.queryValue * freq / this.activationValueDivider;
}

Listing 3.7: The function of the SLRSimFixed object that scores a single document for a single latent term.

All these objects are passed to the SLRScorer object that is responsible for putting everything together.

SLRScorer The behaviour of the SLRScorer object is, in the current version, exactly the same as that of the TermScorer object; it was used for debugging in previous versions.

In short, this document ranking implementation uses the same matching and scoring pipeline as the word based version, but takes precomputed sparse vectors as input. The activation values of active latent terms are passed on in this pipeline and individual latent term-document scores are calculated based on these values.

3.3 Summary

Anserini's indexing pipeline consists of a document generator and an analyser. The corresponding classes, DefaultLuceneGenerator and StandardEnglishAnalyser, are responsible for pre-processing, tokenizing and analysing the documents. At query time, the document scores are computed by looking up the document list per term and then computing individual term-document scores. These are then added to obtain the final document scores and the resulting document ranking.

The full implementation of SNRM indexing and searching stays close to the word based version. At indexing time, the SLRGenerator takes in precomputed document representations and stores them in a specific format. This format is then read by the SLRTokenizer, which extracts the original representation and writes the activation value to the term frequency of every latent term up to a precision of 9 decimals. The document ranking for a precomputed query representation is retrieved by passing down the activation value to the term-document scoring function, through the SLRQueryGenerator, SLRQuery, SLRWeight, SLRSimFixed and SLRScorer classes.

In conclusion, this contribution to the Anserini search engine is able to index precomputed sparse document representations up to a theoretical maximum of 9 decimal places per activation value. In practice a precision of 7 decimal places should be usable by any sufficiently sparse model. This index can then be searched efficiently using precomputed query representations. This all works fairly universally as long as the document and query representations remain sparse; preliminary testing made it clear that using dense representations increases searching and indexing times considerably. The performance implications are the subject of the next chapter.


Chapter 4

Experiments

To test the performance and effectiveness of a standalone neural ranking model (SNRM) for document ranking in the implemented search engine, the Robust04 data set was chosen; it is described in the first section. Using the specifics of this data set, the evaluation metrics for performance, effectiveness and space complexity were determined; they can be found in Section 4.2. The SNRM is compared to a simulated dense SNRM, a simulated Zipfian sparse SNRM and a word based BM25 model. These four models are discussed in Section 4.3. The performance comparison covers all models, while effectiveness was only compared between the sparse SNRM and the BM25 model, since the simulated representations are randomly generated and thus not relevant. The results of the experiments can be found in the second to last section of this chapter.

4.1 Data Set

The collection that was used is from the Text REtrieval Conference (TREC) 2004 Robust Track, and contains about half a million documents, 250 queries (also called topics), and relevance labels for those topics. Since this collection is part of the TREC program, a standard program for evaluating effectiveness is available. How these scores are retrieved is discussed in the next section.

4.2 Space Complexity, Performance and Effectiveness

The evaluation metric for query execution time was kept relatively simple: the time it takes for Anserini to take in a query and give back a scored document list of up to 1000 documents. Since the execution time of a single query is not representative of overall performance, the average query execution time (QET) of the 250 queries was taken for each of the models. Measuring QET is not part of the TREC evaluation, so a timer was added inside the SearchCollection program that prints every QET to the command line in milliseconds.
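A minimal sketch of such a timer (assuming access to a Lucene IndexSearcher; the class and method names here are illustrative, not the actual SearchCollection code):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class QetTimerSketch {
    // Run a single query, print its execution time in milliseconds, and return the top 1000 hits.
    static TopDocs timedSearch(IndexSearcher searcher, Query query) throws Exception {
        long start = System.nanoTime();
        TopDocs hits = searcher.search(query, 1000);   // up to 1000 documents, as in the experiments
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("QET: " + elapsedMs + " ms");
        return hits;
    }
}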

One thing to note about query execution time is that it depends on the hardware setup that runs the Anserini program. This means that performance will be affected by the availability of system resources. By running the 250 queries 4 times for each of the models and then taking the average, the influence of this factor on the results should be minimized. The relative average QET of all the models will be compared.


Several metrics are available to score the effectiveness of the ranking models. Since the focus of this project is on performance, only the mean average precision (MAP) over the returned relevant documents is compared, to roughly indicate the relative effectiveness of each of the models. MAP is part of the TREC evaluation toolkit, which makes computing it straightforward using the provided relevance labels.
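For reference, MAP as commonly defined for TREC-style evaluation (and as computed here over the top 1000 returned documents) is

\[
\mathrm{AP}(q) = \frac{1}{|R_q|} \sum_{k=1}^{1000} P_q(k)\,\mathrm{rel}_q(k), \qquad \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q),
\]

where $P_q(k)$ is the precision of the ranking for query $q$ at rank $k$, $\mathrm{rel}_q(k)$ is 1 if the document at rank $k$ is relevant and 0 otherwise, $|R_q|$ is the number of relevant documents for $q$, and $|Q| = 250$.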

Besides average query length and term frequency, another model dependent variable that influences real time usability is the space complexity of an index. The metric that will be used as an indication of model space complexity is the size of the resulting Lucene index. If an index takes up less disk space, the cost of transferring the index or storing multiple versions of it goes down, which increases the model's (re)usability.

4.3 Document Ranking Models

The central comparison in this thesis was between a sparse SNRM (SSNRM) and a dense SNRM (DSNRM). By comparing the performance of these two models, it became clear whether sparse latent representations make a standalone neural ranking model a computationally feasible option. Two other models were also added to the comparison. The first addition was a simulated true Zipfian sparse SNRM, which was added because the trained SSNRM did not show the characteristic term frequency distribution that was found by Zamani et al. (2018) (as can be seen in Figure 4.1). By adding this model, the performance of an SSNRM under the expected circumstances can be simulated without having to retrain the SSNRM, which would give no guarantee that the new model would show the Zipfian term frequency distribution. The second additional model is the default BM25 model provided by Anserini. This model is added to the comparison to indicate where the neural ranking solution stands with regard to word based document ranking. The BM25 performance does not serve as a baseline, since word based ranking has been highly optimized over the last fifty years and this level of performance cannot be matched by standalone neural ranking solutions. Each of these models is discussed in the following paragraphs. For every model, a plot of the dimension frequency distribution can be found in Figure 4.1, and some more general model information can be found in Table 4.1. This section closes off with a comparison of space complexity for each of the models.

SSNRM The used SSNRM (Section 2.4) was pretrained on Robust04 using 5-grams. The training method was different from Zamani et al. (2018): strong supervision was used, since weak supervision was too computationally intensive for the available GPU cluster. This model created vector representations of 5000 latent terms for queries and documents. On closer inspection of the document and query representations, the dimension frequency per document, it stands out that the model only uses 2381 dimensions; the remaining latent terms are zero for all documents. Something else worth noting is that the distribution of the sorted dimension frequency per document is not the Zipfian distribution that is expected (see Figure 4.1).

DSNRM For the DSNRM, the query representations of the SSNRM model were copied, and then every latent term that was equal to zero was set to a small value (10^-7). This was used to simulate the performance of a semi-dense SNRM: semi, because the document representations are copied unmodified from the SSNRM. The average number of term-document matches (TDM) of the DSNRM is a factor of 10 higher than that of the SSNRM. A fully dense SNRM (DSNRM* in Table 4.1) has an even higher average number of TDM, which is why it was not used in the experiments; preliminary testing showed that the DSNRM* was not practical to work with. Nothing was changed in the document scoring and ranking.
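A minimal sketch of this densification step (the actual preprocessing script is not part of the Anserini extension; the method below is illustrative):

import java.util.Arrays;

public class DensifySketch {
    // Copy a sparse query vector and replace every zero activation with a small
    // epsilon (10^-7), so that all latent dimensions become active query terms.
    static float[] densify(float[] sparseQuery) {
        float[] dense = Arrays.copyOf(sparseQuery, sparseQuery.length);
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] == 0f) dense[i] = 1e-7f;
        }
        return dense;
    }
}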


Figure 4.1: A plot of the top 2381 dimension frequency over all document representations for a word based (blue), the SSNRM (green), the Zipfian SSNRM-Z (light green) and the DSNRM* (red) representational model. The sparse latent document representations only have 2381 active dimensions.

Model name   avg. QL   eff. avg QL   avg. TFd   avg. TDM
DSNRM*       5000 t    5000 t        500000     2.5 ∗ 10^9
DSNRM        5000 t    2381 t        405945     9.67 ∗ 10^8
SSNRM        404 t     404 t         405945     1.64 ∗ 10^8
SSNRM-Z      390 t     390 t         38929      1.5 ∗ 10^7
BM25         2.9 t     2.9 t         107        310

Table 4.1: Basic information for the four used models and the theoretical DSNRM*: average query length, effective average query length, average number of documents matched per (latent) term (avg. TFd), and the average number of term-document matches that the Lucene engine finds per query (avg. TDM). * marks a theoretical model that is not included in the experimental setup.

SSNRM-Z Document and query representations for the SSNRM-Z were randomly generated, and have a guaranteed Zipfian distribution, as can be seen in Figure 4.1. The main reason to include the simulated Zipfian SSNRM was to provide an indication of SNRM performance when it has the characteristic Zipfian term frequency distribution. The average TDM for this model is a factor of 10 lower than the average TDM of the SSNRM. Since the queries and documents are generated in the exact same way, the queries are not sparser than the documents, contrary to Zamani et al. (2018).
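One possible way to generate such representations is sketched below; the thesis does not specify the exact generation procedure, so the sampler, exponent and activation values are all assumptions:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class ZipfRepresentationSketch {
    // Draw 'active' distinct latent dimensions with Zipfian probabilities p(k) ~ 1/k^s
    // and assign each a random activation value, so that dimension frequencies over a
    // collection of such vectors follow a Zipfian distribution.
    static Map<Integer, Float> sample(int dims, int active, double s, Random rng) {
        double[] cdf = new double[dims];
        double norm = 0.0;
        for (int k = 1; k <= dims; k++) norm += 1.0 / Math.pow(k, s);
        double acc = 0.0;
        for (int k = 1; k <= dims; k++) {
            acc += (1.0 / Math.pow(k, s)) / norm;
            cdf[k - 1] = acc;
        }
        Map<Integer, Float> vector = new HashMap<>();
        while (vector.size() < active) {
            int dim = Arrays.binarySearch(cdf, rng.nextDouble());
            if (dim < 0) dim = -dim - 1;
            if (dim >= dims) dim = dims - 1;
            vector.putIfAbsent(dim, rng.nextFloat());
        }
        return vector;
    }
}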

BM25 The used BM25 model is described in Section 2.2 and uses k = 0.9 and b = 0.4. The average number of term-document matches per query (see Table 4.1) for this model is extremely low, which is the main reason that this model has such a high performance.
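For reference, a common formulation of the BM25 scoring function is given below; the exact variant defined in Section 2.2 (and implemented in Lucene) may differ slightly in its IDF definition:

\[
\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{\mathrm{tf}(t, d)\,(k + 1)}{\mathrm{tf}(t, d) + k\left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)},
\]

with $k = 0.9$ and $b = 0.4$, where $\mathrm{tf}(t, d)$ is the frequency of term $t$ in document $d$, $|d|$ the document length and $\mathrm{avgdl}$ the average document length in the collection.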


4.4 Experimental Setup

To compute the evaluation metrics for each of the models, the following steps were taken. Firstly, three versions of the inverted index were created for the Robust04 data set. The first version is a traditional word based version, created using the DefaultLuceneGenerator on the raw TREC document collection. The second version is based on the neural representations of the SSNRM, and the third version was created using the document representations of the SSNRM-Z. These last two were created using the extension to Anserini described in Section 3.2.1, with a decimal precision of 7 digits. A higher decimal precision was not possible for the SSNRM. An inverted index based on the representations of the DSNRM was not used since preliminary testing showed that the average QET was very large, and the point of this thesis is not to prove that dense representations are impractical to work with. Since the actual raw document content is not needed for the performance and effectiveness measurements, it was not stored in any index.

The second step was to obtain the QETs and document rankings, using the model-specific query representations, by running the SearchCollection program with either the bm25 or slr command line option. Something that is important to note is that the neural query representations were precomputed before running the experiments. As mentioned earlier, the QETs were averaged over 4 runs for every model to rule out the influence of system resource availability. The hardware that was used was a 9th generation Intel® Core™ i5 processor clocked at 5.0 GHz. The document ranking was evaluated by the TREC evaluation tool.

4.5 Results

Four models were compared on query execution time (QET) and two models were compared on mean average precision (MAP). The obtained results for both metrics are presented below. The size of each of the produced inverted index versions is also presented.

4.5.1 QET

The average QET for the four models can be found in Table 4.2. The value for n is equal to 250, since the 4 runs served to minimize the influence of system resources, and do not represent additional QET data.

Model name   avg. QET (n=250)   std. QET
DSNRM*       n.a.               n.a.
DSNRM        3.575 min          0.07 min
SSNRM        28.9 s             7.3 s
SSNRM-Z      3.5 s              0.2 s
BM25         9.1 ms             7.4 ms

Table 4.2: The average query execution times (QET) and corresponding standard deviation per model.

As expected, the dense SNRM already has impractically large QETs, with an average of 3.5 minutes per query. The truly dense model (DSNRM*) would have an even higher average QET, because the semi-dense SNRM uses the same inverted index as the SSNRM. The use of sparse queries by the SSNRM achieves QETs that are lower by about a factor of 7 relative to the DSNRM. Although this model does not have the Zipfian term frequency distribution, its QETs are significantly improved. The SSNRM-Z has an even lower average QET of about 3.5 seconds per query. Word based document ranking is very fast, shown by the average 9.1 ms per query for the BM25 model.

4.5.2 MAP

The MAP scores for the two non-simulated models can be found in Table 4.3. The MAP score is computed over the top returned documents, up to 1000. The effectiveness of the neural model did not come close to the BM25 score.

Model name   MAP@1000 (n=250)
SSNRM        0.0445
BM25         0.2514

Table 4.3: The mean average precision (MAP) over the top 1000 returned documents for the SSNRM and the BM25 model.

4.5.3 Disk Usage

Since the actual value of every stored term frequency does not have an impact on the resulting size, the disk usage of the three index versions can serve as an indication of the space complexity of each term frequency distribution. The measured values in Table 4.4 show the disk usage for the SSNRM, the SSNRM-Z and the word based index.

Model name   Index disk usage
SSNRM        2.6 GB
SSNRM-Z      648 MB
BM25         180 MB

Table 4.4: The disk usage of the index for every model.

This table shows that the trained neural model has the highest disk usage, followed by the SSNRM-Z with an index that is 4 times smaller. The word based index is the smallest, being another 3.5 times smaller.

4.6 Summary

To answer whether sparse latent representations are viable for use in real time systems, four document ranking models were compared on space complexity, performance and effectiveness, using the neural search engine described in the previous chapter. The measures used were disk usage, query execution time and mean average precision, respectively. The four models are: a simulated semi-dense standalone neural ranking model (DSNRM), a sparse SNRM trained on Robust04, a simulated SNRM with a Zipfian term frequency distribution and a BM25 model (using the raw Robust04 collection). The results are discussed in the next chapter.


Chapter 5

Conclusion and Future Work

The research question of this thesis, posed in the introduction, was: "how can support for sparse latent representations be implemented and what are the performance implications of their use in a standalone neural ranking model?" To answer the first part of this question, an extension was built on the Lucene based Anserini search engine that enables the use of sparse latent representations for the query based document ranking task. Then, the query execution time, mean average precision and disk usage of four models, two based on the Robust04 data set and two with randomly generated representations, were compared to answer the second part of the question.

The extension to Anserini is able to index and search an arbitrarily large collection of precomputed document vectors. This implementation has a theoretical precision of 9 decimal places per activation value for document representations. These are stored as an integer in the (latent) term frequency, and are converted back to floats at query time. In practice a value of 7 will almost always be achievable, provided that the model is sufficiently sparse (see Equation 3.2). This index can be searched with precomputed query vectors, and there are no software limits to query length or accuracy. Since this implementation works quite universally, it enables further research into the use of sparse latent representations in information retrieval. In future development the maximum precision for activation values in the index should be addressed, which should also increase performance by removing the conversion from an integer to a float.

When analysing the results, the relative model execution times are very similar to the average term-document matches (TDM), calculated from average query length and average term frequency. The performance comparison between ranking models showed that the queries of the neural model that was trained on Robust04 had an execution time that was on average 7 times lower than when using the randomly generated dense queries on the same index. Although this was not part of the experiments, when using an index of dense document representations this factor could be as high as 70 for this data set and model. This is a very significant improvement over the impractical computational requirements of indexing and searching dense document representations. The comparison of the model trained on Robust04 and the randomly generated sparse model that has a Zipfian term frequency distribution, corresponding with a lower average TDM, shows that these query execution times can be lowered by another factor of 10. The average query execution time for this model was around 3 seconds. Since the generated Zipfian queries are of the same length as documents, a properly sparse model would be able to achieve even lower query execution times. Future models should be able to achieve sub-second query execution times with further optimization.
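To make this relationship concrete with the values from Table 4.1:

\[
\mathrm{TDM} \approx \overline{\mathrm{QL}} \times \overline{\mathrm{TF}_d}: \quad \mathrm{SSNRM}: 404 \times 405945 \approx 1.64 \times 10^{8}, \qquad \mathrm{SSNRM\text{-}Z}: 390 \times 38929 \approx 1.5 \times 10^{7},
\]

an order-of-magnitude difference that mirrors the gap between their average query execution times.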
