
University of Amsterdam

Historical Document Retrieval

with Corpus-derived Rewrite Rules

Master thesis in Artificial Intelligence

Author: Agnes van Belle

Supervisors: Tom Kenter, Maarten de Rijke


Chapter 1

Introduction

This thesis examines how to account for historical spelling and vocabulary variations and OCR (Optical Character Recognition) errors within the context of information retrieval of historical documents. The goal is to facilitate the retrieval of relevant historical documents given a query consisting of modern words.

1.1

Historical document retrieval

In the past decades various historical document digitization initiatives have been conducted; an overview can be found in Raitt (2000). A few recent examples are Google Books1, the British Newspaper Archive2, the Deutsches Textarchiv3 and the Dutch Databank of Digital Daily Newspapers4. Their aim is to preserve cultural heritage for non-expert users, and to make this information available to expert users for their research, e.g. historians or researchers in the field of digital humanities.

The presence of historical spelling variations and OCR errors, however, poses a challenge for searching a database of historical documents using modern terms. For example, the modern Dutch word “suiker” (sugar) used to be spelled “suycker” in the 17th and 18th centuries. Examples of OCR errors can be found in the mis-scanning of “sneeuw” (snow) as “fneeuvv”, where the “s” → “f” mistake is partly due to the fact that the historical glyph for “s” was similar in font to that for “f”, and the “w” → “vv” mistake is due to the poor quality of the original document.

These spelling variants and OCR errors hamper the retrieval of relevant documents for a query. Braun et al. (2002) identified two main bottlenecks with regard to retrieving documents written in historical language: the spelling bottleneck and the vocabulary bottleneck. Adding OCR errors to these, effective information retrieval (IR) on historical collections faces three challenges:

• Spelling bottleneck: Differences in spelling between modern and historic words, compounded by many inconsistencies in historical spelling (for Dutch, until the 19th century).

• Vocabulary bottleneck: Use of a different vocabulary, i.e. a shift in the meaning of a word, a shift in the words used for the same meaning, or the disappearance or introduction of words over time. Braun et al. (2002) also note that historical documents appear to contain more synonyms for a given term than modern documents.

• OCR bottleneck: The conversion of documents via automatic OCR often leads to poor results, partly due to special font styles and partly due to the poor quality of the documents.

Our approach to improving historical document retrieval mainly aims to reduce the spelling bottleneck, in a way that is likely to reduce the OCR bottleneck too.

1. http://books.google.com/
2. http://www.britishnewspaperarchive.co.uk/
3. http://www.deutschestextarchiv.de/
4. http://kranten.delpher.nl


1.2

Taxonomy of HDR

In recent years, research on historic document retrieval (HDR) has been flourishing. There has also long been relevant research in related fields, for example approaches to bilingual IR, spelling correction, and approximate string matching.

Research on HDR often evaluates adaptations of methods from these related fields, or combinations of such methods with each other, which makes a purely algorithmic classification of approaches difficult. Perhaps for these reasons, no clear overview of HDR has been made so far. To understand our approach and how it compares with related work, it is however necessary to break the general concept of HDR down into several consecutive steps. Below, we therefore outline a general workflow framework typical for HDR. All work related to HDR known to us fits into this framework or into a specific (sub)step of it.

General HDR workflow framework

1. Word variant matching, between modern words (e.g. in a query or newer document) and older words, or vice versa.

This can be done manually or automatically.

This first step can be continued by step 2, but the results can also be directly used for improving historic document retrieval in which case the next step would be 3.

2. Constructing translation resources to be able to translate older texts to newer texts or vice versa.

This is mainly a filtering or generalization step that may be applied to the output of step 1, with the aim of preventing overly resource-intensive, superfluous, or overfitted translation models. This results in:

(a) A word (n-grams) dictionary, and/or
(b) A set of rewrite rules

Either of these can be constructed manually, or automatically with step 1 as a preprocessing step. The result can then be used as a preprocessing step for step 3.

3. Applying the word matching (step 1) or translation (step 2) with regard to IR. This can be achieved with:

(a) Rewriting queries, or
(b) Rewriting documents, or
(c) Rewriting both to an intermediate form, and/or
(d) Editing of the retrieval function of an IR system

1.3

Motivation

Few completely unsupervised approaches to HDR have been investigated. For example, Ernst-Gerlach (2013) automatically constructs translation rules from word alignments, but the alignments from the word variant matching step are manually constructed. The few unsupervised approaches are language-dependent (Kamps et al., 2007), tailored for OCR error correction rather than the historical spelling bottleneck (Reynaert, 2004), or not evaluated on document retrieval (Reynaert, 2009). Furthermore, all background work in HDR is tailored to optimizing retrieval from only one particular, restricted and known “historical” period.

Our approach is completely unsupervised and covers the complete HDR workflow as outlined in Section 1.2.

It is furthermore based on the assumption that the generation of HDR translation resources for a vast body of historical text can exploit the fact that vocabulary changes occur on a continuum. Vocabulary changes have taken place consecutively over time, and we can additionally assume that OCR errors are more pronounced the older the text, because of paper deterioration. It should therefore be easier to construct a mapping between text from the year 2000 via text from 1950 towards text from 1900, than to construct a mapping directly between text from 2000 and text from 1900. Furthermore, by considering all consecutive distinct vocabulary periods in our collection, we do not artificially restrict ourselves to a certain specific historical time frame for which to apply HDR.

Finally, our method aims to be applicable within an information retrieval (IR) setting without much configuration of the base search engine or the document collection itself.

1.4

Approach

In short, we propose an unsupervised method for extracting time-dependent, character-based, modern-to-historical rewrite rules, and a way to let these rules translate the modern words in a query to their historical counterparts in a query expansion step, with the aim of facilitating historic document retrieval.

Our approach builds on the three steps from the general HDR workflow, but extends it, to be able to leverage the fact that spelling changes occur on a continuum. We traverse consecutive time frames (periods) from new to old, and construct translation resources for each pair of adjacent periods. The resulting period-pair specific resources are merged in a cascading manner in the final application step. This means that a modern word from the newest period 1, is translated to apply to the oldest period N by first being translated to period 2 by using translation rules from period pair (1, 2), then, using translation resources specific to period pair (2, 3), to period 3, etc., and finally from period N − 1 to period N .

Because of this cascading or “waterfall”-like approach we call our HDR method Cascading Temporal Rules, in short, CasTeR.

We add one step to the general workflow, and rephrase the other steps:

Cascading Temporal Rules: consecutive HDR steps

1. Prior identification of sequential vocabulary periods

2. Word variant matching (between each two consecutive historical periods)

3. Construction of translation resources (between each two consecutive historical periods)

4. Applying the translation resources with regard to IR. This includes merging the sequential translation resources from step 3 in a “waterfall”-like manner, such that the contemporary vocabulary translates to each historical period.

In Figure 1.1 we present an overview of how the previously mentioned steps fit into our HDR approach. First, periods with a distinct vocabulary are identified in a “period detection” step (step 1 in Figure 1.1), which defines a periodical division of the document collection. Each pair of consecutive periods is then selected in turn and given to a “word variant matching” module (step 2). The resulting word alignments are then used by a “rule generation” module (step 3) that constructs probabilistic rewrite rules applicable for these two periods. Because we traverse periods from new to old, this results in rules for mapping from each vocabulary period to the immediately older vocabulary period. During application, these are then applied to a user query by the “temporal query expansion” module (step 4 in Figure 1.1), which results in an expanded query that contains translations specific per vocabulary period. The IR system can then use this query to retrieve documents from the document collection.

Our experiments examine which word variant matching and rule generation methods work best, as well as the final application step, in which retrieval performance is used as the evaluation measure.

Our main contributions are: first of all, a completely unsupervised, language-independent method for word variant matching (Section 3.1) tailored to modern-historical word pairs. Second, a method for probabilistic translation rule generation from word matchings (Section 3.2) tailored to modern-to-historical rewrites. Third, a way to exploit the spelling change continuum by a method to combine these word variant matchings and translation rules from consecutive periods, letting a large historical collection become accessible to modern queries by means of a probabilistic query expansion method within an IR system (Section 3.3).

Figure 1.1: CasTeR system overview

In the following Chapter we first discuss related work, which motivates and illustrates how we derived the CasTeR approach; the approach itself is described in detail in Chapter 3. In Chapter 4 we describe the setup of our experiments, and in Chapter 5 their results. Finally, Chapter 6 contains our conclusions and suggestions for future research.


Chapter 2

Related work

In Section 1.4 we have outlined the four consecutive steps of our approach to HDR as:

1. Identification of distinct vocabulary periods

2. Word variant matching

3. Construction of translation resources

4. Applying the translation resources with regard to IR

Below, we discuss related work according to these steps.

2.1

Identification of distinct vocabulary periods

For the identification of distinct vocabulary periods we have found no examples in the literature. Many researchers use the modern era or current vocabulary as the most “new” vocabulary period, and use for the “old” period historical texts which are guaranteed to have a vocabulary distinct from the current one (e.g. corpora from the 17th century).

In preliminary experiments, we have examined the agglomerative clustering of consecutive years based on counts of n-gram sequences, consonant-vowel sequences (e.g. ‘v’, ‘o’, ‘lck’ for ‘volck’) and generalizations of n-gram sequences with consonant (C), vowel (V) or blank (B) symbols (e.g. for ‘vbeen’ the sequence ‘CCVeC’ is generated). We examined applying several co-occurrence metrics (maximum likelihood estimation, the χ² statistic, the log-likelihood ratio) to these per-year counts, and choosing different numbers of clusters; the silhouette score was used as a cluster validity measure. A sketch of such a setup is given below. However, the results were all unsatisfactory when compared to actual spelling history and did not disclose a single best practice. In general, the clusters were unequally distributed over time, with the majority of clusters concentrated on older eras, which is likely caused by an increase in OCR errors for older years. Because a clear period division is not essential for our approach, we did not examine automatic period division further and use a hand-picked period division.
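To illustrate, a minimal sketch of such a clustering setup, assuming per-year feature count vectors have already been extracted (the use of scikit-learn and all names are illustrative, not the implementation used in our experiments):

# Illustrative sketch: agglomerative clustering of consecutive years based on
# per-year feature vectors (e.g. n-gram or consonant-vowel sequence counts).
import numpy as np
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def cluster_years(year_vectors, n_clusters):
    # year_vectors: matrix of shape (n_years, n_features), rows ordered by year.
    X = np.asarray(year_vectors, dtype=float)
    # Normalize rows so that years with more text do not dominate.
    X = X / np.maximum(X.sum(axis=1, keepdims=True), 1.0)
    # Connectivity linking each year only to its neighbours keeps clusters contiguous in time.
    n = X.shape[0]
    connectivity = diags([1, 1], offsets=[-1, 1], shape=(n, n))
    labels = AgglomerativeClustering(n_clusters=n_clusters,
                                     connectivity=connectivity).fit_predict(X)
    return labels, silhouette_score(X, labels)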

2.2

Word variant matching

We consider historical words that are translations of modern words to be variants of those modern words. If we had a mapping from modern words to these historical words, we could create translation resources from it for an IR system.

Existing approaches to word variant matching can be divided into two categories of methods: token-based approaches and type-based approaches. We will discuss each of these two approaches in the subsections below.


2.2.1 Token-based approaches

Token-based procedures can be divided into morphological normalization methods, explicit cross-language approaches and approximate matching methods.

Morphological normalization procedures are tailored for monolingual IR to deal with inflections and compound words, e.g. stemming, decompounding (splitting compound words into their parts and adding these to the original word) and n-gramming (all possible n-gram substrings of the word are added to the word or replace the word). More cross-lingual token-based approaches are for example phonetic coding algorithms such as SoundEx (Russell and Odell, 1918), which map words to a more general intermediate representation. Approximate matching strategies, finally, are mostly used in monolingual IR to deal with spelling errors. Examples are string similarity metrics such as the Levenshtein distance (Levenshtein, 1966).

Monolingual HDR

If morphological normalization methods work sufficiently well, there is no need to go through step 3 of the framework from section 1.2.

Results indicate that in monolingual HDR, n-grams in particular are a valuable source of token-based information for finding historical matches. Kamps et al. (2004) examined the effectiveness of stemming, decompounding and n-gramming for monolingual IR in seven languages. For most of the seven languages in the monolingual retrieval case, n-gramming worked better than decompounding plus stemming. Braun et al. (2002) compared stemming, n-gramming and the character-based edit cost algorithm of Wagner and Fischer (1974) to improve historical document retrieval on 17th-century Dutch (in combination with handcrafted rewrite rules). The results show that the n-gramming method is always a significant improvement over the baseline vector space model, and that the edit-cost-based Wagner-Fischer conflation procedure performs poorly compared with it.

The question however is whether (n-gram) strategies that explicitly model HDR as cross-language IR would improve over the monolingual conflation procedure of n-gramming.

Cross-language approaches

Adriaans et al. (2005) have compared the application of a rule set for modern-to-17th-century Dutch with monolingual conflation procedures and with the phonetic algorithm SoundEx (Russell and Odell, 1918). This rule set was constructed by Koolen (2005) using results of a cross-language (n-gram) sequence matching approach to HDR from Kamps et al. (2007). From their results, it can be concluded that applying this generated rule set works best. Morphological normalization procedures turn out to be clearly limited: the best morphological normalization procedure, n-gramming, is outperformed by both SoundEx and the rule set approach.

The approach the rule set was created from is a combination of three approaches in which phonetic, consonant/vowel and n-gram sequences are used to find matches between historical 17th-century Dutch words and modern Dutch words. These approaches were:

Phonetic Sequence Similarity (PSS): If the phonetic transcriptions (using the NeXTeNS (2005) phonetic transcription tool for Dutch) of a modern and a historical word are the same but their orthography is different, they are considered matches and rules will be generated from their alignment.

Relative Sequence Frequency (RSF): All words in the modern and historical corpus are split into sequences of vowels and sequences of consonants, e.g. volck is split into v, o, lck. If a sequence occurs more frequently in the historical corpus relative to the modern corpus and above a certain threshold, the sequence is considered a typical historical sequence. In the historical words, that sequence is then replaced with a wildcard, and rules are generated from a matching procedure with modern words.


Relative N-gram Frequency (RNF): A variant of RSF that uses n-grams instead of sequences of consonants and vowels. E.g. volck is split into #vo, vol, olc, lck, ck#. The algorithm otherwise proceeds the same as RSF.

Note that the latter two approaches explicitly incorporate relative corpus frequency information. Of these three, the RNF algorithm performs best. The best retrieval performance is achieved using a combination of these approaches.

We however refrain from using phonetic similarity measures or the RNF+PSS+RSF combination in our approach, first of all because they suffer from the limitation of language specificity. For PSS, a phonetic tool specific to Dutch was used. SoundEx was designed for English, and has as an additional problem (noted by Adriaans et al. (2005)) that it always keeps the first letter of a word. Using RSF is also somewhat language-specific, because what is a vowel in the present day might have been a consonant in historic times. A more serious problem with applying RNF or RSF directly is that neither is really a word variant matching procedure; rather, each is intertwined with a rule generation approach that is consequently based on n-grams only.

Character-based approximate matching

As n-gram sequence matching seems to perform well, we compared it with other, perhaps more sophisticated, character-based matching approaches used in approximate matching strategies, such as string similarity metrics based on edit costs. A major problem with such metrics is that they are usually computationally expensive. For example, Hauser and Schulz (2007) successfully use the Levenshtein edit cost as a distance metric to generate matching pairs of historical and modern words, but they evaluated this method using a database that explicitly notes whether a word is historical or modern, so they avoided the costly process of having to identify words as typically historical or modern with this method. Similarly, Brill and Moore (2000) use the Damerau–Levenshtein distance calculation to construct spelling correction rules, but again this is trained on a set of known spelling errors and their correct forms, and even then a trie is needed to store all possible correct (modern) words. One solution to this is to combine a coarse matching technique to obtain a first selection of possible matches and then apply a finer search technique on that selection. Zobel and Dart (1995) examined various coarse search techniques: storing the n-grams, the SoundEx or Phonix representations of the n-grams, or a permuted-lexicon-based generalization of n-grams (Bratley and Choueka, 1982; Zobel et al., 1993) in an inverted index. Of these, simply counting the number of n-grams two words have in common worked best as a coarse technique.

Another method that can be used as a coarse search technique is the tisc (Text Induced Spelling Correction) system by Reynaert (2004) (of which several variants exist, e.g. ticcl (Reynaert, 2008) or particcl (Reynaert, 2009)). The main idea behind it is easy insertion/deletion/substitution lookup by hashing each dictionary word and n-gram in the language with a hash function that deliberately generates collisions for anagrams. With particcl (Reynaert, 2009), historical variants are found by changing up to two characters anywhere in the word (insertion/deletion/substitution). This generates many spurious matches, so a finer search technique based on filtering (discarding two-to-two character confusions) and statistics is applied on top. While the results are interesting, the accuracy of this test is not evaluated. There is no reason to assume its performance as a coarse search technique would be better than the above-discussed technique of using an n-gram index to count the number of n-grams two words have in common, yet it requires considerably more resources.

Zobel and Dart (1995) researched various finer search strategies to be applied on top of a coarse one. A measure called gram-dist0 greatly outperformed the phonological algorithms and performed as well as the edit distance cost algorithm. This measure is an efficient approximation of the original gram-dist by Ukkonen (1992) and takes the lengths of the strings s and t into account when calculating n-gram similarity:

gram-dist0(s, t) = |Gs| + |Gt| − 2|Gs ∩ Gt|    (2.1)

where Gs and Gt are the sets of n-grams in strings s and t; both the sizes |Gs| and |Gt| and the overlap |Gs ∩ Gt| can easily be computed from an inverted n-gram index.


We conclude that using n-gram counting as a coarse search technique and then applying gram-dist0 seems an appropriate and efficient way to find historical word variants based on token clues.

2.2.2 Type-based approaches

From the previous section we can conclude that HDR cannot be treated as a monolingual IR problem. While n-gram-based metrics appear to be good for finding token-based matches, another source of more type-based word variant matching comes from the field of bilingual lexicon extraction (BLE), which concerns itself with constructing a translation lexicon between two natural languages for which no parallel corpora are available. BLE methods often use contextual clues as used in distributional semantics, where the idea is that words that appear in the same context as a source word should be similar to words that appear in the context of its target-language translation. The raw co-occurrence counts are often transformed by co-occurrence metrics. To match words based on context alone, a seed lexicon is used to translate the context words.

Results showing that using context clues improves over using only token-based clues in BLE are for example found by Koehn and Knight (2002). They compared using the longest common subsequence ratio, corpus frequency, and co-occurrence counts of context words in surrounding positions to iteratively match German and English words. Adding the frequency clue did not result in any improvement, but the combination of the context and spelling clues yielded the best results.

These positive results for contextual clues are confirmed by e.g. Haghighi et al. (2008). They view BLE as a maximum bipartite matching problem and propose a generative model (MCCA: Matching Canonical Correlation Analysis) to derive mappings between English and Spanish words. Each word is represented as a feature vector with both n-gram features and context features (co-occurrence counts). It is assumed that a latent concept vector generates the two feature vectors of both languages, and by using CCA (Weenink, 2003) to find the latent directions in the feature space in which already-matched feature vectors (based on words in a seed lexicon) are maximally correlated, minimal-distance new word pairs can be computed. The results show that using only the context features actually performed better than using only the trigram features (which already showed an improvement over the baseline). Again, a combination performed best.

The problem with applying MCCA and related models to word matching in HDR is that viewing HDR as a bipartite matching problem does not allow words to have multiple translations, and it also does not model words having no translation at all. This would often be the case when mapping historical words to modern ones, however.

We conclude that calculating the distance between context vectors works well. We will combine it with an n-gram token clue as in Koehn and Knight (2002) to not only increase performance but also to reduce the computational complexity of constructing and comparing context vectors for each possible word pair in the two languages.

2.3

Construction of translation resources

Once a word matching has been either automatically generated or manually constructed, we can construct translation resources in two ways: by constructing rewrite rules, or by constructing a dictionary (i.e. word-to-word mapping).

2.3.1 Dictionaries

Under the concept of a dictionary-based approach we consider all translation methods that consist of a word-to-word mapping. These can be applied during query time, but the dictionary can also be applied beforehand and used directly in the form of document translation.

The downside of constructing a word-based dictionary to be applied during query time is that it would require considerable storage and computational resources, especially when multiple, probabilistic rewrites are permissible. Document translation is performance-wise likely the most viable way to implement a “dictionary”, as compared to query expansion or editing of retrieval functionality, but it requires the translations to be strictly accurate and non-probabilistic. The approach of Reynaert (2008), also mentioned in Section 2.2.1, can for example be seen as a dictionary-based approach, in the sense that it can be used to traverse each presumed historical word occurring in the historical corpus and translate it to some modern word. However, from the results in Reynaert (2009), of which some are listed in Figure 2.1, we see that when explicitly used for finding historic variants, many mappings are found that are superlatives, inflections, or plain wrong (“diffuse”), making document translation inappropriate.

Figure 2.1: Overview of the top 14 confusions found by (Reynaert, 2009) for finding historical variants for Dutch.

Conceivably the largest limitation of a rule-based approach compared to a dictionary-based approach is incompleteness: because of the large spelling variation and the OCR errors in historic corpora, rules cannot cover all word translations. Gotscharek et al. (2009) examined this issue with a manually constructed list of 140 substring-to-substring rules (e.g. ein → eyn) that should translate modern words to historic variants. Their results show that with their rules, the percentage of missed word variants increases the further one goes back in time, yet that rules can actually perform quite well: excluding stopwords, 98% of variants are retrieved for the 18th century and still 90% for the 16th century. Their research also suggests that dictionaries can still be helpful, but simply as an addition to retrieving word forms with rules for older centuries: with a manually constructed small dictionary of 5,000 words, 33% or 25% (depending on the data set) of the missed historic word forms from the 16th century become available to an IR system. There appears to be an essential limit to the effectiveness of dictionaries though, as even with a ten-fold increase in dictionary size to 50,000 entries, in the best case just 50% of the missed historic vocabulary is covered. Based on these results, as well as the mentioned resource and complexity considerations and the model flexibility when used with query expansion, we chose rewrite rules over a dictionary as translation resource.

2.3.2 Rewrite rules

Approaches that construct rewrite rules from string-distance-based conversion algorithms have been pursued e.g. by Brill and Moore (2000) (spelling errors) and Hauser and Schulz (2007) (modern to historical words). Both assume application to a set of words that are known to need rewriting, but Bollmann et al. (2011) devised an edit-distance-based, probabilistic rule construction algorithm for HDR that also models the cases when a rule should not be applied, by also extracting identity rules. This method considerably improved exact word translation without explicitly knowing beforehand which words should be translated. A problem with the edit-cost-based approach, however, is that its single-character conversions (insertions, deletions, substitutions) always have to be extended in a certain manner to allow for substring-to-substring rules. As there are multiple optimal ways to do this, we are presented with ambiguity. Furthermore, because the identity rules need to be extracted to model the precision of rewrite rules, Bollmann et al. (2011) encounter the problem that substring pairs are split whenever two identical letters arise. Applying their method to e.g. the German fladernholtz → zypressenholz yields “flad” matched to “zyp” and “r” to “sse”, and not the optimal alignment of “flader” to “zypresse”, because the “e” present in both “flade” and “zypre” needs to be used for an identity rule.

The approach of Ernst-Gerlach (2013) deals with the multiple-optimal-paths ambiguity problem present in edit-cost-based rule creation algorithms by directly extracting substring pairs. They generate a set of substring mappings (“rule cores”) and their literal context together recursively, where the rule cores are determined by the longest consecutive subsequences (LCS) that the aligned pairs do not have in common. For e.g. unnütz → unnuts this leads to the rules ü → u with left context unn and right context t, and z → s with left context t and right context ‘ ’ (the word boundary). These contexts are then made more generalizable by recursively generalizing over consonants, vowels and blanks, encoding these as the special symbols ‘C’, ‘V’ and ‘B’, respectively. Contexts follow an “inwardly-progressing” generalization: the regular expression syntax is B?[CV]*[a-z]* for left contexts and [a-z]*[CV]*B? for right contexts. In this example, for the rule ü → u the following rules (among others) are generated: (‘CC’, ‘ü’, ‘’, ‘u’), (‘’, ‘ü’, ‘t’, ‘u’), (‘nn’, ‘ü’, ‘C’, ‘u’), (‘VCn’, ‘ü’, ‘t’, ‘u’), etc. Note that identity rules are not generated in this approach – the precision of a rule is calculated by considering how many word pairs in the training set it does not apply to. Compared with manually constructed rules and a variant-graph-based method, they achieve much better results on retrieving historical variants.2 However, this LCS approach still means that the source and target side of substring pairs cannot have the same letters.

To solve this latter problem we adapt the method of phrase pair extraction from statistical machine translation (SMT) to work for substring pairs. The original intent of phrase pair extraction is to extract phrase-based rewrite rules (a phrase is a sequence of words) given pairs of source and target sentences, by considering words in those phrases that are known to be translations of each other (from a preprocessing step) to be “aligned”. Extraction of all possible phrase pairs is infeasible, so constraints are imposed on which pairs are deemed optimal to extract. According to the so-called consistency criterion by Och and Ney (2004), a phrase pair (s_{i1..i2}, t_{j1..j2}) is consistent if the following criteria hold with regard to the alignment matrix A:

∀(i′, j′) ∈ A : j1 ≤ j′ ≤ j2 ⟺ i1 ≤ i′ ≤ i2, and
∃(i′, j′) ∈ A : j1 ≤ j′ ≤ j2 ∧ i1 ≤ i′ ≤ i2.

I.e., all words within the source phrase should be aligned only with words within the target phrase, and vice versa; furthermore, at least one word in the source phrase should be aligned to at least one word in the target phrase. An example of resulting phrase alignments is shown in Figure 2.2.

Figure 2.2: Some example phrase pairs (boxes) extracted from word alignments (black squares) in phrase-based SMT.
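To make the criterion concrete, a minimal sketch of the consistency check for one candidate pair, assuming 0-indexed alignment points (i, j); the function name is illustrative:

# Illustrative sketch of the consistency check for a candidate phrase (or, in our
# adaptation, substring) pair, given alignment points (i, j) linking source
# position i to target position j.
def is_consistent(alignment, i1, i2, j1, j2):
    inside_source = [(i, j) for (i, j) in alignment if i1 <= i <= i2]
    inside_target = [(i, j) for (i, j) in alignment if j1 <= j <= j2]
    # No alignment point may link the candidate spans to positions outside them.
    if any(not (j1 <= j <= j2) for (_, j) in inside_source):
        return False
    if any(not (i1 <= i <= i2) for (i, _) in inside_target):
        return False
    # At least one alignment point must fall inside the candidate pair.
    return len(inside_source) > 0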

We adapt this approach to account for substring pair extraction by considering the paired historical and modern word as the paired source and target “sentences”. The alignments between words become alignments between characters in our approach, where we consider two characters to be aligned if they are the same letter and the distance between their indices is less than or equal to a certain threshold. This way we solve both the ambiguity problem and remove the same-letter-split limitation.

2 Note, however, that their calculation of recall and precision is based on the retrieved historical words’ collection frequency.

2.4

Applying the matching / translation resources with regard to IR

As outlined in Section 1.2, we can distinguish four different methods for applying translation resources to an IR system: rewriting queries, rewriting documents, rewriting both to an intermediate form, and/or editing the retrieval function of the IR system.

The case of rewriting both the query and the document terms to an intermediate form is the appropriate solution for phonological coding methods such as SoundEx (Russell and Odell, 1918) and Phonix (Gadd, 1990). However, we have already decided to use our own rewrite rules. We did not encounter any editing of the retrieval functionality (i.e. the query-document scoring function) in our investigation of related work. This also does not seem necessary when using rewrite rules, given that we have document rewriting and query expansion as other options.

With document rewriting, it is easier to apply stemming and other morphological normalization procedures to the translated words; see e.g. Kamps et al. (2007), who found document translation to outperform query translation for their approach. Furthermore, it is faster during query time, because the query does not have to be expanded. However, document translation needs to be redone for each new document added to the collection, results in translation model inflexibility, and precludes the incorporation of translation probabilities.

With query rewriting one can incorporate probabilities, does not have to account for changes in the collection, and can easily incorporate changes in translation resources. In theory, extensions such as morphological normalization can still be applied by using a contemporary dictionary that first translates each modern query term to all its inflections, to which the translations to historic words are then applied (Ernst-Gerlach and Fuhr, 2006). The only disadvantage is that one needs to generate or look up the translations each time a query is issued. With regard to applying the generated rules to IR, we therefore chose query expansion.

2.5

Summary

In this Chapter we examined related work relevant to the four consecutive steps of our approach to HDR as outlined in Section 1.4. An unsupervised execution of the first step, identification of distinct vocabulary periods, remains an open challenge, and we decide to use a supervised approach for this step.

For word variant matching we have identified that monolingual methods such as stemming do not suffice; of the token-based clues, gram-dist0 (Equation 2.1) is the most promising, and of the type-based clues, a cosine similarity between context vectors containing co-occurrence counts is the most suitable for HDR. These clues we chose to combine. With regard to the construction of translation resources, we have motivated our choice for rewrite rules and proposed our own substring pair extraction algorithm based on a phrase pair extraction algorithm, to circumvent the ambiguity and artificial constraint issues of existing substring pair extraction approaches. Finally, we have motivated why we choose query expansion as the IR application technique for the generated rewrite rules.


Chapter 3

Approach

In Section 1.4 we have outlined the four consecutive steps of our approach to HDR as:

1. Prior identification of distinct vocabulary periods

2. Word variant matching

3. Construction of translation resources

4. Applying the translation resources with regard to IR

Based on preliminary experiments, we rely on a user-defined periodical division in step 1. As described in Section 2.1, we have investigated automating step 1 by clustering of (n-gram) sequences in various ways and calculating cluster validity scores. The results were all unsatisfactory when compared to actual spelling history and did not disclose a single best practice. A clear period division is furthermore not essential for our approach, as we exploit gradual changes in spelling.

Below we outline the rest of our approach as presented in Section 1.4 in more detail.

3.1

Word variant matching

Word variant matching operates on the level of having one modern and one old period. The modern period can be seen as the “source” side whose words need to be mapped to the set of words from the older period, the “target” side.

Given this modern and old corpus, which are from eras sequential to each other, many words in the modern corpus will not have to be translated to old words, or vice versa. We therefore first distinguish the words that are likely to have changed in spelling. For those words, we then construct a matching based on both character-based and context features.

The algorithm selectively prunes matching candidates from the modern side and the historical side, starting with the words in the modern period. Words with a very low frequency are ignored, as they are likely to be erroneous, anomalous or not representative of that period. We furthermore select for the source side of the mapping only words that are typically “modern” words. To determine this we use the ratio of relative frequency of the word between the modern and the old period. We then determine possible translation candidates from the old period for these typically “modern” words. This happens in a multiple-stage pruning process. In the first stage an n-gram distance metric is computed, and those candidates close in distance are kept. We then filter those candidates based on frequency and the ratio of relative occurrence in the old and modern period. Finally the candidate list is pruned based on a context distance metric between the candidate and the modern word in question.

This procedure is outlined in Algorithm 3.1.


Algorithm 3.1 Basic word variant matching procedure

Require: Functions: rel_freq, rel_freq_ratio, having_bigrams_common, gram_dist0, context_dist
Require: Global feature thresholds: min_rel_freq1, min_rel_freq2, min_rel_freq_ratio1, min_rel_freq_ratio2, max_gramdist, context_metric, max_context_dist

function Match(P1, P2)                                    ▷ P1 and P2 are the new and old period
    matching_map ← {}
    for t ∈ P1 do
        if rel_freq(t, P1) < min_rel_freq1 or rel_freq_ratio(t, P1, P2) < min_rel_freq_ratio1 then
            continue
        list_candidates ← having_bigrams_common(t, 2, P2)   ▷ Fetch words from the older period sharing ≥ 2 bigrams
        for c ∈ list_candidates do
            if gram_dist0(t, c) > max_gramdist then
                list_candidates.remove(c)
        for c ∈ list_candidates do
            if rel_freq(c, P2) < min_rel_freq2 or rel_freq_ratio(c, P2, P1) < min_rel_freq_ratio2 then
                list_candidates.remove(c)
        for c ∈ list_candidates do
            if context_dist(context_metric, t, c, P1, P2) > max_context_dist then
                list_candidates.remove(c)
        matching_map[t] ← list_candidates
    return matching_map

3.1.1 Features

Frequency statistics

The relative frequency of a word t in a period P is calculated as rel_freq(t, P) = frequency(t, P) / nr_terms(P). The ratio of relative frequency of a new word t in one period P1 versus another period P2 is calculated as rel_freq_ratio(t, P1, P2) = rel_freq(t, P1) / (rel_freq(t, P1) + rel_freq(t, P2)).
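A minimal sketch of these two statistics, assuming plain per-period frequency dictionaries (names are illustrative):

# Illustrative sketch of the frequency statistics used in Algorithm 3.1.
# freq[P] maps a word to its raw frequency in period P; nr_terms[P] is the total
# number of tokens in period P.
def rel_freq(t, P, freq, nr_terms):
    return freq[P].get(t, 0) / nr_terms[P]

def rel_freq_ratio(t, P1, P2, freq, nr_terms):
    f1 = rel_freq(t, P1, freq, nr_terms)
    f2 = rel_freq(t, P2, freq, nr_terms)
    return f1 / (f1 + f2) if (f1 + f2) > 0 else 0.0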

Character features

We use as string distance metric the function gram_dist0 (Zobel and Dart, 1995): gram_dist0(s, t) = |Gs| + |Gt| − 2|Gs ∩ Gt|, where Gs is the set of n-grams in string s, in our case bigrams. This is a computationally cheaper approximation of the gram-dist definition of Ukkonen (1992): Σ_{g ∈ Gs ∪ Gt} |s[g] − t[g]|, where s[g] is the number of times n-gram g occurs in string s. (If the strings do not contain repeated occurrences of the same n-gram, which will often be the case, gram_dist0 will be equal to gram-dist.) We however first construct an inverted n-gram index as a coarse search technique, to speed up the retrieval process: having_bigrams_common(t, 2, P2) in Algorithm 3.1 returns those words from period P2 having at least two bigrams in common with modern word t, using that inverted index.

On top of this, after a first selection of word pairs, we will also examine the effect of some other minor character-level features: equivalence of the first or last letters; the length of the words; and the Levenshtein distance.
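A minimal sketch of this coarse-then-fine character matching, assuming in-memory vocabularies (names and signatures are illustrative):

# Illustrative sketch: an inverted bigram index as coarse filter, gram_dist0 as
# the finer string distance (with bigrams as n-grams).
from collections import defaultdict

def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def build_bigram_index(vocabulary):
    index = defaultdict(set)
    for word in vocabulary:
        for bg in bigrams(word):
            index[bg].add(word)
    return index

def having_bigrams_common(t, min_common, index):
    # Coarse step: words sharing at least min_common bigrams with t.
    counts = defaultdict(int)
    for bg in bigrams(t):
        for word in index.get(bg, ()):
            counts[word] += 1
    return {w for w, c in counts.items() if c >= min_common}

def gram_dist0(s, t):
    gs, gt = bigrams(s), bigrams(t)
    return len(gs) + len(gt) - 2 * len(gs & gt)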

Context distance

For modeling the context distance we use the co-occurrence vector approach as described in Section 2.2.2. We choose a fixed co-occurrence window size of 5 and determine how often a pair of words occurs within a text window of this size. To compare the co-occurrence vector of the new word t with the co-occurrence vector of its old candidate word c, the co-occurrence vectors are normalized by the L1-norm and we calculate the cosine distance (the complement of the cosine similarity) between them.

A co-occurrence statistic based correlation might be improved by not using the co-occurrence counts directly, but association strengths between words instead (Rapp (1999), Bron et al. (2010)).


We examine several co-occurrence transformation metrics (for details see Appendix A): the normalized pointwise mutual information, the χ² ratio transformation, the log-likelihood ratio (LLR) transformation, and finally an IDF-based transformation which weights each co-occurring word's count by its inverse document frequency. A sketch of the resulting context distance computation is given below.
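The sketch assumes tokenized documents per period; the transformation step is omitted and all names are illustrative:

# Illustrative sketch: co-occurrence vectors within a text window of 5 tokens,
# L1-normalization, and the cosine distance between a new word and an old candidate.
from collections import Counter
import math

def cooccurrence_vector(word, documents, window=5):
    half = window // 2
    vec = Counter()
    for tokens in documents:
        for i, tok in enumerate(tokens):
            if tok == word:
                for j in range(max(0, i - half), min(len(tokens), i + half + 1)):
                    if j != i:
                        vec[tokens[j]] += 1
    return vec

def l1_normalize(vec):
    total = sum(vec.values())
    return {w: c / total for w, c in vec.items()} if total else {}

def cosine_distance(v1, v2):
    dot = sum(v1[w] * v2.get(w, 0.0) for w in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    if n1 == 0 or n2 == 0:
        return 1.0
    return 1.0 - dot / (n1 * n2)

def context_dist(t, c, docs_new, docs_old):
    v_new = l1_normalize(cooccurrence_vector(t, docs_new))
    v_old = l1_normalize(cooccurrence_vector(c, docs_old))
    return cosine_distance(v_new, v_old)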

3.1.2 Computational complexity

Algorithm 3.1 warrants an explanation of its complexity, which can be broken down into the trivial complexity of obtaining the (relative) word frequencies, the complexity of the pruning based on the n-gram distance gram_dist0, and the complexity of the pruning based on the context distance context_dist.

The complexity of the pruning based on the n-gram distance is determined by the complexity of fetching, for each modern word, the words that share at least two bigrams with it. Let there be N1 “new” words in the newest period (P1) that satisfy the minimal (relative) frequency thresholds. Assume we also calculate beforehand the “old” words from period P2 that satisfy the (relative) frequency thresholds. If we put each of the N1 words and each n-gram it contains in a row in a table, do the same for period P2, and assume the tables are indexed by n-gram so that ⟨word, n-gram⟩-pair lookup by n-gram occurs in logarithmic time, fetching the words that share at least two bigrams for each modern word will be O(N1 B1 log(N2 B2)), where Bi denotes the average number of bigrams per word in period i (or O(M1 log(M2)), where Mi denotes the number of unique word-bigram pairs in period i).

For the pruning based on the context we need to generate co-occurrence vectors per word, for all the remaining new words and all their historical matching candidates. This requires knowing the number of windows a word occurs in, the number of windows a context word of that word occurs in, and the number of windows a word and its context word both appear in (see also Appendix A). Finding the number of windows they both appear in, i.e. ⟨word, context word, number of shared windows⟩ tuples, is the most computationally complex operation. To accomplish this we use a table containing, per document identifier, each word and the position of that word in the document, and per period we first copy the ⟨document, word, position⟩ tuples for documents from that period Pi to a separate table. To find the context words’ positions and the shared window count, we first generate new positions that are within the window centered on the position of the words we are interested in, e.g. for position 3 we generate the new positions {1, 2, 4, 5}, by adding, for our window size of 5, the numbers −2, −1, 1 and 2. For these new positions we then again generate the positions that are within a window centered on that position, e.g. for the earlier generated 5 we generate {3, 4, 6, 7}, such that only unique document-position pairs remain. These are the positions of words that are within a window of size 5 of the original words, e.g. the word on position 7 will share a window of size 5 with the word on position 3 in the same document. The upper bound for this operation is determined by the lookup of each context word given the generated positions, and is O(Fi (W − 1) log(Ti)), where Fi is the sum of the occurrences of all words in period Pi for which we want to calculate context vectors, W is the window size, and Ti is the sum of the document sizes in period Pi (i.e. the size of the ⟨document, word, position⟩ table). As we need to calculate context vectors for multiple old words per new word, and the window size is a constant, the complexity is effectively O(F2 log(T2)). As T2 > F2 > N2 and, in practice, F2 > N2 B2, the calculation of the context vectors is more computationally expensive than the calculation of the n-gram distance and in practice sets the upper bound on the complexity of Algorithm 3.1.

3.2

Construction of translation resources

We use a probabilistic rule-based approach towards the construction of translation resources.

We construct a substring pair extraction (SPE) algorithm based on the so-called alignment template approach (Och and Ney, 2004) for phrase pair extraction used in the field of statistical machine translation (SMT), as described in Section 2.3.2.

As a baseline, we use the rule generation algorithm of Ernst-Gerlach (2013), the longest consecutive subsequence (LCS) approach as described in Section 2.3.2, from which we also borrow the rule pruning methodology and a way of representing contextual information. Below we explain our approach and how it relates to this baseline approach.

3.2.1 Substring pair extraction

The first part of our approach to rule generation is to extract substring-to-substring mappings, given a set of word pairs that are considered to be translations or variants of each other. We do this in a way akin to phrase pair extraction in SMT.

The examination of each possible source-to-target substring combination (see Och and Ney (2004) for pseudocode) is resource intensive, but can be implemented efficiently by constructing, from the alignment matrix A, source and target word index ranges that comprise letters that are aligned to each other. Algorithm 3.2 outlines our substring pair extraction algorithm given paired historic and modern words.

Algorithm 3.2 Substring pair extraction

Require: Functions SoRange and TaRange that return the covered range in the source and target word, respectively, given a range in the other word. A range is a tuple of two integers.

function Extract(T, S)                                    ▷ T and S are the target and source word
    for t1 = 0 → Len(T) do
        if Aligned(t1) then
            for t2 = t1 → Min(t1 + maxSubstringLen, Len(T)) do
                if Aligned(t2) then
                    ⟨s1, s2⟩ ← SoRange(⟨t1, t2⟩)
                    if s2 − s1 < maxPhraseLen then
                        ⟨ts1, ts2⟩ ← TaRange(⟨s1, s2⟩)
                        if ts1 < t1 then
                            break                          ▷ Skip this t1
                        else if ⟨ts1, ts2⟩ = ⟨t1, t2⟩ then  ▷ Found consistent pair
                            AddPair(⟨t1, t2⟩, ⟨s1, s2⟩)
                            AddUnaligned(⟨t1, t2⟩, ⟨s1, s2⟩)

Figure 3.1 shows the substring pairs that would be extracted from a pairing of ^lopen$ (walking) and ^loopende$ (participle of walking, historical spelling) in our approach when the minimum substring size is set to 2, the maximum to 3, and the value maxDistanceAlignedLetters to 3. This word pair under these parameter settings would yield the following substring mappings: ^l → ^l, lo → loo, op → oop, pe → pe, pen → pen, en → end, and en → en. Note that these substring mappings contain identity mappings (mappings between equivalent substrings).

Figure 3.1: The substring pairs (dark boxes) extracted from letter alignments (gray squares) in our substring pair extraction approach using specific settings (see text for details).

As explained, the LCS approach of Ernst-Gerlach (2013) to substring mapping is based only on the longest consecutive subsequences that the aligned pairs do not have in common; this differs from our approach, which extracts substring pairs based on letter alignments in which two mapped substrings always include at least one equivalent letter. A clear difference of the SPE approach compared to the LCS approach implied by this is that our SPE approach does not generate “insertion” or “deletion” rules, i.e. rules which map voids to characters or vice versa. The LCS approach, on the contrary, would yield for the example in Figure 3.1 only insertion rules, of the form ‘’ → o, ‘’ → de. As noted by Bollmann et al. (2011), insertions are constrained by their context characters only, which leads to an overabundance of insertion rules that could possibly be applied to a given word – insertion rules which, because of having a less confined LHS, are also not well equipped to compete with substitution and deletion rules. These issues do not arise with our SPE approach, because we do not generate any rules that map voids to characters or vice versa. It also leads to our approach generating more (non-identity) substring mappings in general.
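A minimal sketch of the letter-alignment step that precedes the extraction, assuming words padded with the ^ and $ boundary markers used above (the function name and default threshold are illustrative):

# Illustrative sketch: two characters are aligned if they are the same letter and
# their positions differ by at most max_distance (cf. maxDistanceAlignedLetters).
def align_letters(source, target, max_distance=3):
    alignment = set()
    for i, s_char in enumerate(source):
        for j, t_char in enumerate(target):
            if s_char == t_char and abs(i - j) <= max_distance:
                alignment.add((i, j))
    return alignment

# Example: align_letters("^lopen$", "^loopende$") yields alignment points such as
# (0, 0) for '^', (1, 1) for 'l', (2, 2) and (2, 3) for 'o', and (6, 9) for '$'.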

3.2.2 Adding context

After we have created sets of substring mappings the next step in constructing rules is to generate context for the substring mappings.

To add context to a substring pair, we check which substring pairs extracted from the same word pair alignment are identity mappings and lie directly to the left or right of this substring pair.

We then also generalize over consonants, vowels and blanks (word boundaries) as in the LCS approach. We examine using the “inwardly progressing” context generalization (“IPCG” – following the regular expression B?[CV]*[a-z]* for left contexts and [a-z]*[CV]*B? for right contexts), as did Ernst-Gerlach (2013), as well as using all possible generalization-letter combinations, i.e. uniform context generalization (“UCG” – B?[CVa-z]* for left contexts and [CVa-z]*B? for right contexts). Using UCG would, for example, result in the rule (‘Bl’, ‘op’, ‘Vn’, ‘oop’), which is not generated in the IPCG setting.
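A minimal sketch of the IPCG-style generalization for a left context (the UCG variant would instead enumerate all letter/symbol combinations; names and the vowel set are illustrative):

# Illustrative sketch of "inwardly progressing" context generalization (IPCG) for
# left contexts: take every suffix of the literal left context and generalize its
# outermost characters to C (consonant), V (vowel) or B (word boundary).
VOWELS = set("aeiouyáéíóúäëïöü")

def symbol(ch):
    if ch in ("^", "$", " "):        # word boundary markers
        return "B"
    return "V" if ch.lower() in VOWELS else "C"

def generalize_left_contexts(left):
    variants = set()
    for start in range(len(left) + 1):       # every suffix, from full context to empty
        suffix = left[start:]
        for k in range(len(suffix) + 1):     # generalize the k outermost characters
            variants.add("".join(symbol(c) for c in suffix[:k]) + suffix[k:])
    return variants

# Example: generalize_left_contexts("unn") contains "unn", "Vnn", "VCn", "VCC",
# "nn", "Cn", "CC", "n", "C" and "", matching the left contexts of the rules
# generated for the u-umlaut example in Section 2.3.2.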

3.2.3 Rule set pruning

As many rules are generated by this algorithm, we want to retain only those with a high precision and recall. The recall we are aiming for, however, is an IR recall, and not a recall on the word pairing alone. In calculating precision and recall we thus take into account the corpus frequency of the historical word that a rule is able to generate when applied to a modern word.

We define a rule i as a tuple (l_ci, ci, r_ci, ri), where c is the “core”, r the replacement, and l_c and r_c the left and right contexts. Together, l_c, c and r_c thus denote the LHS of a rule. The rules are generated from paired (matched) modern and historical words; such a pair we notate as (m, h). Assume we have a function G (from “generated by”) that for a given rule yields all modern-historical word pairs (m, h) that have generated that rule. We then denote the positive collection frequency score of a rule with the variable qi:

qi = Σ_{(m,h) ∈ G((l_ci, ci, r_ci, ri))} cf_P2(h)    (3.1)

where cf_P2(h) denotes the collection frequency of word h in period P2, and P2 denotes the historical period (period notation is as in the system overview in Figure 1.1).

Many times, however, the LHS of a rule will also occur in other generated rules, among which are many identity rules. These other rules serve as (an approximation of the set of) negative examples. We denote the so-called LHS-occurrence score ti of a rule as:

ti = Σ_r Σ_{(m,h) ∈ G((l_ci, ci, r_ci, r))} cf_P2(h)    (3.2)

We then calculate the precision of rule i as:

pi = qi / ti    (3.3)

The minimum values for pi and qi are used as thresholds in our pruning algorithm. This algorithm is described below.


In the pruning algorithm we make use of the notion of the generalizability of a rule. To determine whether a rule uj = (l_cj, cj, r_cj, rj) is a more “general” variant of a rule ui = (l_ci, ci, r_ci, ri), where cj = ci and rj = ri, we count the number and type of contextual characters in the left context l_cj and the right context r_cj. If the sum of left and right context characters of rule uj is smaller than that of ui, rule uj is identified as being more general than ui. If rules ui and uj have the same number of contextual characters, rule uj is identified as being more general than ui if it contains more consonant/vowel generalizations (i.e. C or V symbols instead of actual letters).

The pruning algorithm is then given by the following steps:

Rule pruning algorithm

1. Let F denote the set of all generated non-identity rules and E ← ∅ the result set.
2. Remove rules ui from F whose pi < minp and qi < minq.
3. While F ≠ ∅:
   3.1 From the rules in F with the highest p values, select those with the highest q values, and call this set C.
   3.2 From these candidates, choose a rule ui for which no more general variant exists in C.
   3.3 Add rule ui to E.
   3.4 Remove the more general variants of ui from F.

Some rules that might be generated are, for example: (‘’, ‘onnoemel’, ‘VCC’, ‘onnoeml’) with precision 1, (‘BCV’, ‘rteluk’, ‘VC’, ‘rtelyk’) with precision .714, and (‘C’, ‘ciele’, ‘’, ‘cieeie’) with precision .86.
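A minimal sketch of the scoring in Equations 3.1–3.3, assuming we have the mapping from each generated rule to the word pairs that generated it, and collection frequencies for the historical period (all names are illustrative):

# Illustrative sketch of rule scoring (Equations 3.1-3.3).
# rule_to_pairs maps a rule (l_c, c, r_c, r) to the set of (modern, historical)
# word pairs that generated it, and should include identity rules so that t_i
# also counts the cases in which the LHS occurs but should not be rewritten.
# cf_P2 maps a historical word to its collection frequency in period P2.
from collections import defaultdict

def score_rules(rule_to_pairs, cf_P2):
    # q_i: summed collection frequency of the historical words a rule generates.
    q = {rule: sum(cf_P2.get(h, 0) for (_, h) in pairs)
         for rule, pairs in rule_to_pairs.items()}
    # t_i: the same sum over all rules sharing the same LHS (l_c, c, r_c).
    t_by_lhs = defaultdict(int)
    for rule, q_i in q.items():
        t_by_lhs[rule[:3]] += q_i
    # p_i = q_i / t_i.
    return {rule: (q_i, q_i / t_by_lhs[rule[:3]] if t_by_lhs[rule[:3]] else 0.0)
            for rule, q_i in q.items()}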

3.3

Applying the translation resources with regard to IR

As outlined in Section 3.2, a rule has the structure (l_c, c, r_c, r), where l_c, c and r_c together represent the condition (with consonant-vowel generalization for the left and right contexts) and r the replacement, with an associated precision and frequency score. For weighting these rules we use the associated precision of a rule, p, whose computation is explained in Section 3.2.3.

For generating the expanded query we first split the query string into words, and for each word we apply the rules for each time period. The query is viewed as a bag of words, and the generated terms have as precision score, or weighting factor, the precision of the rule they were generated with. If a term is generated by multiple rules for a given time period, it gets the precision of the rule with the highest precision.

Because our generated rules are specifically tied to translating from one certain time period to another (older) one, the rules will furthermore have associated with them a tuple pair denoting the years of the time periods they apply to, i.e. ((s_y2, e_y2), (s_y1, e_y1)) where e_y1 > s_y1 = e_y2 > s_y2. When we have multiple consecutive pairs of time periods ((s_y2,i, e_y2,i), (s_y1,i, e_y1,i)) and ((s_y2,j, e_y2,j), (s_y1,j, e_y1,j)), where (s_y2,i, e_y2,i) = (s_y1,j, e_y1,j) – e.g. ((1930, 1947), (1947, 1970)) and ((1920, 1930), (1930, 1947)) – the rules associated with period pair j will be applied to the terms generated by the rules of period pair i. This leads to the generation of terms by application of rules in a waterfall-like manner. If a term gets generated by application of different rules from multiple time periods, it is given the precision of the rule with the lowest precision.
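A minimal sketch of this cascading expansion for a single query term, assuming rule sets ordered from the newest to the oldest period pair and a function apply_rules that yields (generated term, rule precision) pairs for one period pair (all names are illustrative):

# Illustrative sketch of cascading ("waterfall") term expansion over consecutive
# period pairs, ordered from newest to oldest.
def cascade_expand(term, rule_sets_new_to_old, apply_rules):
    expansions = []                # one {generated term: weight} dict per period pair
    current = {term: 1.0}          # terms valid on the newer side of the next pair
    for rules in rule_sets_new_to_old:
        generated = {}
        for src, src_weight in current.items():
            for variant, precision in apply_rules(src, rules):
                # Within one period: keep the highest rule precision per term.
                # Across periods: the weight is capped by the lowest precision on the path.
                weight = min(src_weight, precision)
                generated[variant] = max(generated.get(variant, 0.0), weight)
        expansions.append(generated)
        # Generated terms (plus the original term) feed the next, older period pair.
        current = dict(generated)
        current[term] = 1.0
    return expansions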

The generated terms per time period are incorporated into a boolean query that, through the use of a filter, only applies to documents whose attributed year of origin falls within that specific time period. These boolean queries are finally combined into one big query covering the time span of the document collection, where time spans for which no rules exist simply use the original query terms with a precision of 1.

We use the TF-IDF scoring function for scoring in our search engine. We edit the TF-IDF scoring function to use as inverse document frequency (IDF) not the IDF of the generated term, but the IDF of the word that the generated term was generated from. This is because some of the generated terms are likely to be rare (such as OCR error corrections), and according to the theory of term frequency/inverse document frequency, the document with the erroneous spelling would then be scored as much more relevant than a document containing the correct spelling.
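A minimal sketch of this adapted scoring for a single expanded term, assuming a basic TF-IDF formulation (the scoring function of the actual search engine will differ; names are illustrative):

# Illustrative sketch: an expanded (generated) term is scored with its own term
# frequency in the document, but with the IDF of the modern source term it was
# generated from, weighted by the precision of the rule that generated it.
import math

def idf(term, doc_freq, n_docs):
    return math.log(n_docs / (1 + doc_freq.get(term, 0)))

def score_expanded_term(generated_term, source_term, precision,
                        tf_in_doc, doc_freq, n_docs):
    return precision * tf_in_doc.get(generated_term, 0) * idf(source_term, doc_freq, n_docs)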

3.4

Summary

In this Chapter we have outlined our approach towards the three methodologically successive steps of our HDR system (Figure 1.1): word variant matching, translation rule generation and the application of the generated translation rules in a query expansion step.

Word variant matching for two consecutive periods is accomplished using a combination of an n-gram and context based distance measure, as well as frequency statistics (Section 3.1, Algorithm 3.1).

In Section 3.2 we described how we generate rules from a word variant matching with our so-called SPE approach, using the substeps of substring pair extraction, (generalized) context incorporation and pruning. We also described the LCS baseline, which shares the same pruning and context generalization capabilities, but our SPE approach uses a different algorithm for extracting the substring pairs from which rules are generated (Algorithm 3.2).

In Section 3.3 we have outlined how we apply the generated rules for multiple consecutive period pairs in a cascading manner during query expansion, so that the query is adapted to the time period that a given document belongs to. This completes the design of our CasTeR system for HDR and allows for evaluating retrieval performance. In the following Chapter we describe the experimental setup used for evaluating the retrieval performance of our system, as well as the performance of the word variant matching and rule generation steps.


Chapter 4

Experimental setup

We will first present our research questions, then discuss our methodology in answering them, and finally we will briefly outline our data and its setup.

4.1

Research questions

Recall that we want to examine Historical Document Retrieval (HDR) performance given the rules generated by our Cascading Temporal Rules (CasTeR) approach as outlined in Section 1.4.

This final retrieval performance, however, depends on the performance of a previous step of rule generation, which, in turn, depends on the performance of a first step of word variant matching. Therefore, we have research questions for each of these three steps.

Our main experiment concerns the final step of application of the rules in an IR setting, in which we investigate the retrieval performance of our approach using these rules, compared to two other baseline retrieval methods. We have two auxiliary experiments for the other steps: For the word variant matching step, we are concerned with the optimal settings of parameters for mapping modern to historical words. For the rule generation step, we use manually generated word variant mappings as ground truth, and examine how several parameter settings as well as a baseline method perform in generating rules.

For the word variant matching step, we have the following Research Question:

RQ: What feature value thresholds for the frequency, character and context-based word features used in Algorithm 3.1 result in the best modern-to-historical word variant matching between words from two consecutive time periods (one “modern” and the other “historical”)?

(4.1)

The Research Question related to the second rule generation step is:

RQ: Given a word matching between modern and historical words, what rule generation mechanism performs best, using what parameter settings?

(4.2)

The setup for Research Questions 4.1 and 4.2 is discussed in Section 4.2.1.

Finally, for our main experiment in which we evaluate retrieval performance when applying our generated rules within an IR system, we have two research questions. The first is:

RQ: Does the consecutive application of period-pair specific, automatically derived, probabilistic modern-to-historical rewrite rules (as generated by our CasTeR word variant matching and rule generation steps) on contemporary-language queries, using rule sets related to consecutive (new-to-old) periods up to and including the rule set targeting a specific historical time period, improve retrieval performance on that time period?

(4.3)

Another research question pertaining to IR application is:

RQ: Does the consecutive application of period-pair specific, automatically derived, probabilistic modern-to-historical rewrite rules (as generated by our CasTeR word variant matching and rule generation steps) on contemporary-language queries, using rule sets related to consecutive (new-to-old) periods, improve retrieval performance when querying over all time frames?

(4.4)

This is a separate question because one problem we would like to tackle is that currently, for most topics, the majority of the top results are documents written in modern time periods, since there simply are many more documents using contemporary spelling available in the more recent time periods compared to earlier, historical times. We discuss the setup for this experiment in Section 4.2.2.

4.2

Methodology

4.2.1 Auxiliary experiments

Word variant matching

For evaluating the word variant matching, i.e. answering Research Question 4.1, we will examine the optimal performance in relation to the feature parameters described in Section 3.1.

In this experiment we will align words from a modern period with words from a historical period immediately before it. That is, we use a single period pair ((s_y2, e_y2), (s_y1, e_y1)) where e_y1 > s_y1 = e_y2 > s_y2.

The mapping will produce, for a discovered “new” word, several old word candidates. A ground truth is constructed by pooling all (new word, old word) pairs generated under various parameter threshold settings for the primary features. These pairs are then annotated with various labels (for a detailed overview, see Appendix Table B.1) denoting the type of match.

We calculate the performance of the word matching by defining a subset of the annotation labels as “correct” (1) and the rest as “incorrect” (0). The most important annotation in constructing these subsets is whether the matched old word constitutes (part of) a historical translation of the new word. Annotation subtypes are furthermore defined for the cases where this holds but the old word contains an OCR error (including whether it was a deletion or substitution); where the old word is a less frequent variant (synonym) of the normally correct old word; and where the old word is a grammatical variant (diminutive, superlative, comparative form or inflection) of the old word that would have been correct. As an example of the last case, in the mapping pair

(‘lopen’,‘loopende’) the old word “loopende” is an inflection of the correct historical word “loopen”.

If the old word is in no way a historical match to its paired new word, we still annotate the type of error, in order to investigate how the approach performs on OCR error detection alone.

Aside from the precision (positive predictive value) and recall (sensitivity), we are interested in their combination in the F1-score.
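
For reference, the F1-score is the harmonic mean of precision P and recall R:

F_1 = \frac{2 \cdot P \cdot R}{P + R}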

Construction of translation resources

Here we answer Research Question 4.2. Given a word matching, we generate rules from the word pairs. We compare our own SPE approach to the baseline LCS approach of Ernst-Gerlach (2013) (see Section 3.2 for their differences), and investigate various parameter settings (which apply to both approaches) to see which one performs best.

We investigate the following settings:

• Basic rule detection procedure: SPE (our approach) or LCS (baseline)
• Collection frequency score threshold for rules: min_q


• Precision threshold for rules: min_p 1

• Type of consonant-vowel context generalization: Either inwardly progressing, IPCG, or uniform, UCG, context generalization (see Section 3.2.2).

We always use a fixed maximum source substring (i.e. the part that is replaced) size of 8 letters and a minimum size of 1. The maximum size for the left and right contexts is set to 3 and the minimum to 0.

The word pairs to generate rules from are taken from the same annotated word pair set as used in the previous word variant matching step. We investigate the same categories of matches. An evaluation sample is defined by each pair of new word and generated translation. A true positive occurs when the pair of new word and translation is in the correct word mappings set. A false positive is defined as a translation that is not in the correct word mappings set. Note that this is a conservative definition, as such translations might actually be correct, but simply absent from the pool generated by the various settings of the word variant matching procedure.

As metrics we will look at precision, recall and the F1 score. We also look at the number of generated rules, because the number of rules affects the computational complexity of the query expansion algorithm.
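
A minimal sketch of how this evaluation could be computed, assuming both the generated translations and the annotated correct mappings are available as sets of (new word, translation) pairs (function and variable names are ours):

def evaluate_translations(generated_pairs, correct_pairs):
    # Precision, recall and F1 over (new word, generated translation) pairs,
    # using the annotated correct word mappings as ground truth.
    generated, correct = set(generated_pairs), set(correct_pairs)
    tp = len(generated & correct)
    precision = tp / len(generated) if generated else 0.0
    recall = tp / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1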

4.2.2 Applying the generated translation resources with regard to IR

This is our main experiment in which we answer Research Questions 4.3 and 4.4, i.e. investigate the retrieval performance of our approach to historic document retrieval using the modern-to-historical translation rules generated with the previous steps.

Baselines

For both research questions we use the same baselines. As a first baseline we consider a system without any rewrite rules or query expansion using the vector space model with TF-IDF scoring (Arguello, 2013). This way we can see if our approach actually improves upon such a standard setup at all.

As second baseline we add to the previous system a query expansion method that generates word variants but does not take into account dates of documents or historical language differences: Elasticsearch's built-in Fuzzy Like This (FLT) query 2. It is the most obvious choice when trying to find words that need at most a certain number of character modifications, and the only choice for query expansion for that task aside from suggesters (Cholakian, 2013). The standard fuzzy expansion method as implemented in Lucene 3 generates for each query term all possible variants that are within a maximum Levenshtein edit distance and exist in the index. Improvements over standard retrieval using fuzzy queries on normal data sets have been found by e.g. Singh et al. (2015). However, simply fuzzifying query terms under TF-IDF scoring leads to a strong bias towards documents containing rarer terms such as misspellings, because these will have a higher IDF score. Therefore, we use the FLT query, which uses for each generated variant term the IDF of the original term when scoring. This is also what we do in our own approach outlined in Section 3.3.
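
For illustration, an FLT query body could look roughly as follows (shown as a Python dict; the field names, query text and fuzziness value are assumptions, and the parameter names should be checked against the FLT documentation referenced in footnote 2):

flt_query = {
    "query": {
        "fuzzy_like_this": {
            "fields": ["title", "content"],  # indexed fields searched against
            "like_text": "suiker sneeuw",    # the modern-language query text
            "fuzziness": 2,                  # maximum Levenshtein edit distance
        }
    }
}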

Query subjects (topics)

Query subjects or topics can be divided into two sets: query subjects that will have many or most of their relevant hits from documents in a certain historical time period (this is likely to be the case with topics referring to historical events that took place in that time period); and general query subjects that will have results in all time periods. We explicitly separate the query topics in these two categories so that we are able to investigate if the results of the two experiments show a difference between these types of topics.

1

As the LCS approach does not generate identity rules, the LHS-occurrence score needed for the calculation of precision in the LCS approach is calculated by exhaustively examining to which word pairs a rule could be applied.

2

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-flt-query.html

3

(23)

Aside from selecting queries with words of which we know that they have a historical counterpart, we want to account for the fact that our rewriting methods might rewrite terms that do not need to be rewritten at all. These translations could be actual vocabulary terms, thereby yielding non-relevant documents – for example, the term “japan” might be translated to “japon” (gown), resulting in documents about dresses rather than the nation in East Asia. But even fictitious terms could skew the results by matching OCR-garbled terms in historical documents – for example, if “verbod” (prohibition) is translated to the non-existent term “veerbood”, this could match a garbled version of “veerboot” (ferryboat). Therefore, we use queries composed of at least two terms of which at least one does not have to be rewritten, and also a few queries that do not have to be rewritten at all. To determine if a term needs rewriting to match historical spelling conventions, we use resources pertaining to the history of Dutch spelling.

The final queries used are listed in Appendix C. The period-specific historical queries from Appendix C.2 were created based on topics listed on Wikipedia pages about the decades within that period 4. The general queries in Appendix C.1 were created based on conceived general topics, for which we have checked beforehand that results from all periods under consideration exist. We have aimed to create queries such that the possible historical rewrites would be as diverse as possible, regardless of the period used – for example, not only rewrites of the form o → oo but also s → sch and e → ee.

Annotation and evaluation

We use graded relevance scores with grades zero (not relevant), one (somewhere between irrelevant and relevant) and two (relevant). For binary metrics, which are not applicable to graded relevance scores, we convert the three-graded relevance scores to binary scores in the two possible ways (i.e. a score of 1 is either treated as a score of 0 or as a score of 2) and take the average of both metric scores. This averaging over all possible relevance threshold levels, proposed and evaluated by e.g. Kekäläinen and Järvelin (2002) (“generalized” precision/recall) and Scheel et al. (2011) (µMAP), ensures that marginally relevant documents are also taken into account, while at the same time ensuring that higher relevance levels still dominate the final score more than lower relevance levels.
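
A minimal sketch of this averaging for a single query, assuming a binary metric that takes a ranked list of 0/1 relevance labels (the names and the example metric are ours):

def averaged_binary_metric(binary_metric, ranked_grades):
    # Evaluate the metric under both binarizations of the graded judgments
    # (grade 1 counted as irrelevant, and grade 1 counted as relevant),
    # then average the two results.
    strict = [1 if g >= 2 else 0 for g in ranked_grades]
    lenient = [1 if g >= 1 else 0 for g in ranked_grades]
    return (binary_metric(strict) + binary_metric(lenient)) / 2.0

def precision_at_5(rels):
    return sum(rels[:5]) / 5.0

# averaged_binary_metric(precision_at_5, [2, 1, 0, 1, 2, 0]) == 0.6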

Considering the size of our collection and our limited resources, it is infeasible to judge all relevant documents for a given query. Therefore, as in TREC Ad Hoc runs, we only judge the top n documents, with n = 25. Of documents with duplicate contents within one result set, the first one is annotated and the rest are ignored.

As IR evaluation metrics we use the Normalized Discounted Cumulative Gain (NDCG, which incorporates graded relevance) and the binary relevance metrics Mean Average Precision (MAP, to capture the trade-off between precision and recall), Precision@5, Precision@10 and the Mean Reciprocal Rank (MRR, based on the rank of the first relevant document).
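
For reference, one common formulation of the (N)DCG at cutoff k, with rel_i the graded relevance of the document at rank i (a linear-gain variant replaces 2^{rel_i} - 1 by rel_i), is

DCG_k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad NDCG_k = \frac{DCG_k}{IDCG_k}

where IDCG_k is the DCG_k of the ideal ranking of the judged documents.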

For testing significance, we use the non-parametric Randomization test (also called Permutation test) (Fisher, 1935). The null hypothesis here is that our system is equal to the baseline system under consideration, so that arbitrarily swapping the observed per-query retrieval scores between our system and the baseline system would, on average, result in a difference in mean retrieval score as large as the one actually observed between the two systems. For more details and a comparison with other significance tests for IR, see Smucker et al. (2007). For each test, we run through all permutations of labeling a query's scores as belonging to either our system or the baseline system.
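
A minimal sketch of this exact paired randomization test over per-query scores (the names are ours; for large query sets one would typically sample labelings rather than enumerate all of them):

from itertools import product

def randomization_test(scores_a, scores_b):
    # Exact test on the absolute difference in mean score: every per-query
    # score pair is either kept or swapped (a sign flip of the difference),
    # and the p-value is the fraction of labelings whose absolute mean
    # difference is at least as large as the observed one.
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / n)
    extreme = 0
    for signs in product((1, -1), repeat=n):
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        if permuted >= observed:
            extreme += 1
    return extreme / 2 ** n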

Data

For our own approach, we chose four consecutive time period pairs, within each of which the word variant matching and rule generation towards the earlier period take place. (See Section 3.3 for details on how the modern-to-historical rewrite rules are used during query expansion.) Our time periods are, from newest to oldest: (1947-1960), (1934-1947), (1917-1934), (1900-1917), and (1870-1900). From these, four period pairs are generated: ((1947-1960) & (1934-1947)), ((1934-1947) & (1917-1934)), etc. For each period pair, we apply the word variant matching step to at most 15,000 words, according to the best settings found in the experiments for that step (see Section 4.2.1), and then apply the rule generation step according to the best approach and settings found in the experiments for that step (see Section 4.2.1), to generate the rules.

4.3

Data setup

The dataset we use is from the Dutch Royal Library (Koninklijke Bibliotheek). About 9 million Dutch historic newspaper articles from the period 1618-1995 have been digitized within its project “Databank Digitale Dagbladen”. All these documents are Dutch, but use a spelling variant of Dutch specific to the time in which they were written.

For our experiments we use the Elasticsearch (ES) 5 engine set up with the default TF-IDF similarity module. Each document in the index has its original date of publishing attached, and its title and content fields are the indexed fields we search against. For the query expansion part of our approach, we incorporate our generated terms with their associated precision and custom IDF by indexing the number of terms in each document beforehand and using Elasticsearch's Function Score Query 6 in each part of the query that maps to a specific time period.
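
A rough sketch of what such a period-specific Function Score sub-query could look like (the field names, the client-side computation of the original-term IDF, and the way precision and IDF are combined are all illustrative assumptions; the sketch also ignores term frequency, which the actual scoring described in Section 3.3 retains):

def scored_period_query(weighted_terms, idf_of_original, start_year, end_year):
    # weighted_terms: {generated term: rule precision}
    # idf_of_original: {generated term: IDF of the modern term it came from}
    return {
        "function_score": {
            "query": {
                "bool": {
                    "filter": {"range": {"year": {"gte": start_year, "lt": end_year}}},
                    "should": [{"term": {"content": t}} for t in weighted_terms],
                    "minimum_should_match": 1,
                }
            },
            "functions": [
                {"filter": {"term": {"content": term}},
                 "weight": precision * idf_of_original[term]}
                for term, precision in weighted_terms.items()
            ],
            "score_mode": "sum",      # add up the per-term contributions
            "boost_mode": "replace",  # ignore the sub-query's default TF-IDF score
        }
    }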

For storing the terms and term statistics of this collection needed for our experiments we set up a PostgreSQL database. We extract random documents per time period via the ES engine. Aside from storing the terms and n-grams, we store for each extracted document its year. In another table we store, for each word in a document, its position in the document (this is used for retrieving co-occurrence counts). We also use a table mapping each word to its n-grams. The main tables in our database are shown in Figure 4.1. For details about how the database is used in the word variant matching step, see Section 3.1.2.
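
A sketch of table definitions matching this description (the actual table and column names are those of Figure 4.1; the ones below are assumptions for illustration):

import psycopg2

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    doc_id SERIAL PRIMARY KEY,
    year   INTEGER NOT NULL            -- year of publication from the ES index
);
CREATE TABLE IF NOT EXISTS word_positions (
    doc_id   INTEGER REFERENCES documents(doc_id),
    word     TEXT NOT NULL,
    position INTEGER NOT NULL          -- used for co-occurrence counts
);
CREATE TABLE IF NOT EXISTS word_ngrams (
    word  TEXT NOT NULL,
    ngram TEXT NOT NULL                -- character n-grams of the word
);
"""

# conn = psycopg2.connect("dbname=caster_terms")  # example connection string
# with conn, conn.cursor() as cur:
#     cur.execute(SCHEMA)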

Figure 4.1: The main tables in the database, used for conducting the word variant matching step, and storing the result of the rule generation step.

5 http://www.elasticsearch.org
6 https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
