
MSc Artificial Intelligence

Master Thesis

Improving historical language modelling

using Transfer Learning

by

Konstantin Todorov

12402559

August 10, 2020

48 EC Nov 2019 - Aug 2020

Supervisor:

Dr G Colavizza

Assessor:

Dr E Shutova

Institute of Logic, Language and Computation

University of Amsterdam


Contents

Page

Abstract iv

Acknowledgements v

List of Figures vi

List of Tables vii

Abbreviations viii

1 Introduction 1

1.1 Motivation and Problem statement . . . 1

1.2 Contributions . . . 2

1.3 Structure . . . 3

2 Background 4

2.1 Machine learning . . . 4

2.2 Natural language processing . . . 5

2.3 Transfer learning . . . 6

2.3.1 Multi-task learning . . . 7

2.3.1.1 Introduction . . . 7

2.3.1.2 Benefits . . . 7

2.3.2 Sequential transfer learning . . . 8

2.3.2.1 Motivation . . . 8

2.3.2.2 Stages . . . 9

2.3.3 Other . . . 10

2.4 Optical Character Recognition . . . 10

2.5 Historical texts . . . 12

2.6 Transfer learning for historical texts . . . 13

3 Empirical setup 14

3.1 Model architecture . . . 14

3.1.1 Input . . . 14

3.1.2 Embedding layer . . . 14

3.1.3 Task-specific model and evaluation . . . 16

3.2 Problems . . . 16

3.2.1 Named entity recognition . . . 16

3.2.1.1 Motivation . . . 16

3.2.1.2 Tasks . . . 16

3.2.1.3 Data . . . 17

3.2.1.4 Evaluation . . . 19

3.2.1.5 Model . . . 20

3.2.1.6 Training . . . 21


3.2.2 Post-OCR correction . . . 24

3.2.2.1 Motivation . . . 24

3.2.2.2 Data . . . 25

3.2.2.3 Evaluation . . . 26

3.2.2.4 Model . . . 26

3.2.2.5 Training . . . 28

3.2.3 Semantic change . . . 29

3.2.3.1 Motivation . . . 30

3.2.3.2 Tasks . . . 30

3.2.3.3 Data . . . 30

3.2.3.4 Evaluation . . . 31

3.2.3.5 Model . . . 31

3.2.3.6 Training . . . 32

4 Results 33

4.1 Named entity recognition . . . 33

4.1.1 Main results . . . 33

4.1.2 Convergence speed . . . 37

4.1.3 Parameter importance . . . 38

4.1.4 Hyper-parameter tuning . . . 40

4.2 Post-OCR correction . . . 41

4.2.1 Main results . . . 41

4.2.2 Convergence speed . . . 43

4.2.3 Hyper-parameter tuning . . . 44

4.3 Semantic change . . . 45

4.3.1 Main results . . . 45

4.3.2 Hyper-parameter tuning . . . 46

5 Discussion 47

5.1 Data importance . . . 47

5.2 Transfer learning applicability . . . 48

5.3 Limitations . . . 49

5.4 Future work . . . 49

6 Conclusion 51


“Those who do not remember the past are condemned to repeat it.”


Abstract

Transfer learning has recently delivered substantial gains across a wide variety of tasks. In Natural Language Processing, mainly in the form of pre-trained language models, it has proven beneficial as well, helping the community push forward many low-resource languages and domains. Thus, naturally, scholars and practitioners working with OCR'd historical corpora are increasingly exploring the use of pre-trained language models. Nevertheless, the specific challenges posed by documents from the past, including OCR quality and language change, call for a critical assessment of the use of pre-trained language models in this setting.

We consider three shared tasks – ICDAR 2019 (post-OCR correction), CLEF-HIPE-2020 (Named Entity Recognition, NER) and SemEval 2020 (Lexical Semantic Change, LSC) – and systematically assess the use of pre-trained language models with historical data in French, German and English for the first two, and English, German, Latin and Swedish for the third.

We find that pre-trained language models help with NER but not with post-OCR correction. Furthermore, we show that this improvement does not come from the increase in network size but from the transferred knowledge itself. We further show how multi-task learning can speed up training on historical data while achieving similar results for NER. In all challenges, we investigate the importance of data quality and size, which emerge as among the factors that most hinder progress in the historical domain. Moreover, for LSC we see that, due to the lack of standardised evaluation criteria and the bias introduced during annotation, important encoded knowledge can be left out. Finally, we share with the community our modular implementation, which can be used to further assess the applicability of transfer learning to historical documents.

In conclusion, we emphasise that pre-trained language models should be used critically when working with OCR'd historical corpora.


Acknowledgements

Working on this thesis proved to be a challenge of great value for me, and also one of great joy. I am truly pleased and immensely proud to have finalised this important step of my life.

None of this would have been possible without, first and foremost, Giovanni Colavizza, whom I would like to thank for his constant advice, great support and outstanding supervision throughout the last nine months. His experience and insight helped me gain invaluable experience and deliver more than I ever imagined.

Furthermore, I want to thank my family for everything that they have ever done for me, all of which resulted in this achievement. I thank my family from the bottom of my heart – my mother and father, Ренета and Живко Тодорови, and my sister, Деница – without whom I would never have achieved what I have achieved so far in my life. Thank you for everything you have ever done for me; I sincerely hope that this document makes you proud. My sincere gratitude goes also to two of my colleagues at Talmundo, Reinder Meijer and Esther Abraas, who throughout the last months reached out to check that I was doing okay more often than I did myself. I would not have been able to finalise this work in its current state if it were not for their understanding and the freedom they gave me over my time.

Finally, I also want to thank Veronika Hristova for supporting me throughout this difficult period, for motivating me to do more and for taking huge leaps of faith because of me. Thank you!


List of Figures

2.1 General OCR flow . . . 11

3.1 General model architecture for assessing transfer learning on a variety of tasks . . . 15

3.2 Amount of tokens per decade and language . . . 17

3.3 Amount of tag mentions per decade and tag . . . 18

3.4 Non-entity tokens per tag and language in the training datasets . . . 18

3.5 NERC base model architecture . . . 20

3.6 NERC multi-task model architecture . . . 22

3.7 Post-OCR correction model, encoder sub-word and character embedding concatenation . . . 27

3.8 Post-OCR correction model, decoder pass . . . 28

4.1 Levenshtein edit distributions per language . . . 44


List of Tables

3.1 NERC sub-task comparison . . . 17

3.2 NERC entity types comparison . . . 19

3.3 NERC hyper-parameters . . . 23

3.4 ICDAR 2019 Data split . . . 25

3.5 ICDAR 2019 Data sample . . . 25

3.6 Post-OCR correction hyper-parameters . . . 29

3.7 SemEval 2020 corpora time periods per language . . . 30

4.1 NERC results, French, multi-segment split . . . 34

4.2 NERC results, French, document split . . . 35

4.3 NERC results, German, multi-segment split . . . 36

4.4 NERC results, German, document split . . . 37

4.5 NERC results, English, segment split . . . 37

4.6 NERC, convergence speed (averaged per configuration). . . 38

4.7 NERC, parameter importance, French . . . 38

4.8 NERC, parameter importance, German . . . 39

4.9 NERC, parameter importance, English . . . 40

4.10 NERC hyper-parameter configurations . . . 40

4.11 Post-OCR correction results, French . . . 41

4.12 Post-OCR correction results, German . . . 42

4.13 Post-OCR correction results, English . . . 43

4.14 Post-OCR correction – convergence speed (averaged, in minutes). . . 43

4.15 Post-OCR correction hyper-parameter configuration . . . 45


Abbreviations

General notation

et al. . . . et alia (en: and others)

e.g. . . . exempli gratia (en: for example)

etc. . . . et cetera (en: and other similar things)

i.e. . . . id est (en: that is)

Machine learning

CNN . . . . Convolutional Neural Network

CRF . . . . Conditional Random Field

DL . . . . Deep Learning

GRU . . . . Gated Recurrent Unit

IR . . . Information Retrieval

LSTM . . . . Long Short-Term Memory

ML . . . . Machine Learning

MTL . . . . Multi-Task Learning

RNN . . . . Recurrent Neural Network

STL . . . Sequential Transfer Learning

Natural language processing

BOW . . . Bag-of-Words

LSC . . . Lexical Semantic Change

NE . . . . Named Entity

NEL . . . . Named Entity Linking

NER . . . . Named Entity Recognition

NERC . . . . Named Entity Recognition and Classification

NLI . . . Natural Language Inference

NLP . . . . Natural Language Processing


Chapter 1

Introduction

Language is one of the most fundamental parts of human interaction. It is what binds us and defines us as humans, enabling us to communicate and store information about countless events. It allows us to preserve current knowledge throughout history for future generations – sometimes thousands of years later. As a consequence, it is naturally one of the most fundamental areas of past and present research in many different scientific fields. Machine learning is not an exception, and Natural Language Processing (NLP) underpins many of the current advances in text processing. Further, using language is crucial for advancing other fields such as Computer Vision and Reinforcement Learning – an understandable fact, considering that we use language, and in particular text, to explain almost everything around us.

Despite the countless advances in the direction of true text processing, little has been done about our cultural heritage and ancient texts in particular. Many old manuscripts and newspapers are collecting dust, long forgotten in libraries around the world, waiting to tell us the stories they contain. For without knowing the past, it is impossible to understand the true meaning of the present and the goals of the future.

1.1

Motivation and Problem statement

Developing systems that can understand human language, interact seamlessly and offer information retrieval portals has been a lifelong dream for scientists and practitioners alike for more than one generation now. Early symbolic text representations were rule-based but proved inefficient, mostly because of their narrow focus on the particular domain they had been designed for (Winograd, 1972). They failed to generalise to unseen data, which eventually made them obsolete (Council et al., 1966).

Ultimately, statistical handling of data proved to be the most robust way to deal with different types of data, including text (Manning and Schutze, 1999). At the core of such applications stand mathematical models that automatically learn data features. The advancements in Deep Learning inevitably improved systems all around, and direct human participation decreased even more. Many problems, once considered unsolvable, are now solved. Many domains, once considered unexplorable and incompatible with artificial intelligence, are now at the centre of numerous research efforts. Areas which deep learning in general, and NLP in particular, have influenced range from machine translation and spell checking to automatic summarisation and sentiment analysis. Moreover, the advances in NLP opened up new fields of study and introduced tools such as chatbots, while also helping to advance previously 'stuck' ones such as optical character recognition (OCR). Precisely because of the progress made in OCR, researchers were able to achieve partial success by applying it to some non-standard documents in other domains. One such domain, inspired by Digital Humanities, is the historical domain, consisting of many old books, newspapers and magazines. These documents are an important part of our cultural heritage, and many organisations are entrusted with their preservation – a task whose difficulty grows as time passes. Thus, full digitisation of historical texts should become a critical goal to aim for.

Unfortunately, there are several factors that make the general accessibility of collections of digitised historical records, and their use as data for research, challenging (Piotrowski, 2012; Ehrmann et al., 2016; Bollmann, 2019). First and foremost, Optical or Handwritten Character Recognition is still error-prone during the extraction of text from images. Challenges also include the drastic language variability over time and the lack of linguistic resources for automatic processing. This all imposes the need to integrate, but at the same time question, modern NLP techniques in order to improve the state of digitisation of historical records. One of the most promising techniques in this regard is transfer learning.

Transfer learning focuses on sharing knowledge learned from one problem with another, often related, problem. It is heavily used in modern-day linguistic problems, offering many benefits, among which are (i) faster convergence; (ii) lower compute requirements; (iii) overcoming a lack of linguistic resources, including annotated data; and (iv) higher levels of generalisation compared to traditional learning.

Due to its nature, transfer learning promises to alleviate many issues which exist in historical corpora. Recently, it has started to be applied to historical collections in a variety of ways. Examples include measuring lexical semantic change of specific words (Shoemark et al., 2019; Giulianelli et al., 2020) and extracting named entity information (Labusch et al., 2019). Unfortunately, its applications are sporadic and unsystematic. One of the biggest problems hindering progress is the lack of a standardised assessment of its characteristics when applied to historical texts. The community is still uncertain when and how it can be successfully applied to this domain. As most of the currently available models are pre-trained on modern texts, their application must be questioned and analysed in detail. Thus we take a first step towards providing initial knowledge of how historical digitisation can be improved using transfer learning.

1.2

Contributions

This work provides several contributions to the community, which we consider the beginning of fuller hands-on research on the problem.

• We introduce a modular architecture which is meant to ease the addition or removal of modules in an existing neural network setup. We provide the code freely and open-source it (§3.1).

• We demonstrate how transfer learning can be used in several challenges that are important for Digital Humanities, using historical documents as data. We focus on named entity recognition and classification (NERC) (§3.2.1), post-OCR correction (§3.2.2) and lexical semantic change (LSC) (§3.2.3). We further show its two sides: (i) it can improve setups significantly (§4.1.1) and encode information otherwise unavailable through alternative deep learning methods (§4.3.1), but it can also (ii) produce only marginal benefits, and sometimes none at all (§4.2.1). We also give examples of the increase in training time it requires (§4.2.2 and §4.1.2), concluding that researchers should not blindly follow a trend and assume that transfer learning will always work better and immediately produce benefits.

• We further provide results for experiments commonly conducted in transfer learning studies, but applied to historical documents. We show that, contrary to common belief, fine-tuning BERT is also questionable and often gives no benefit on historical texts (§4.2.1 and §4.1.1). Additionally, we use multi-task learning for our NERC setup (§3.2.1) and compare it against a single-task learning configuration, showing that we can achieve similar or even better results while reducing training time.

• We uncover the severe data limitations currently hindering progress in the area, stemming mostly from the overall poor quality of the OCR'd documents, but also from the lack of large single-origin datasets. We obtain our best results for tasks that have both a large amount of data and data originating from the same historical domain (§5.1).

For the purpose of this work, we participate actively in two open challenges: (i) CLEF-HIPE-2020, where we work on NERC and rank second overall for French and German, using the team name Ehrmama (Ehrmann et al., 2020b; Todorov and Colavizza, 2020b), and (ii) SemEval 2020 (Schlechtweg et al., 2020), where we rank 22nd and 24th in the two sub-tasks. We analyse the latter and show that a serious bias is introduced in current evaluation measures of lexical semantic change (§5.1). Finally, our contributions and findings are summarised in a paper that resulted from this thesis (Todorov and Colavizza, 2020a).

1.3

Structure

In Chapter 2, we briefly explain the main concepts of Machine Learning (ML) and Natural Language Processing (NLP) that are used throughout this work. Afterwards we discuss the current state of transfer learning and its different applications and types. We explain Optical Character Recognition (OCR) in a similar manner, together with the problems hindering the digitisation of historical texts. Finally, we combine these problems and show how transfer learning can be beneficial for OCR on historical texts.

Further, in Chapter 3, we first present the generic modular setup that we use throughout this thesis. Then, we describe the specifics of the challenges that we participate in, the different modules and task-specific models that we use in each one, and the motivation behind these.

Chapter 4 presents our results for each challenge and the analysis we conduct. For post-OCR correction and named entity recognition, we additionally report the differences in convergence time between our configurations. Finally, we also provide the hyper-parameter configurations that we selected from the extensive search spaces available.

We discuss our main findings in Chapter 5 and show the limitations of our work, as well as those of the current state of Digital Humanities and in particular the historical domain. We also provide guidelines for future work, which we believe can build on top of this study.

Lastly, in Chapter 6 we conclude this work and try to motivate the community, aiming to attract more interest in this area.


Chapter 2

Background

This chapter provides background on the methods that we use throughout the thesis, as well as an overview of the research that has been done on them over the years. We show how the field has changed and how it is currently a very dynamic area of study.

We briefly discuss the origins of Machine Learning and why it is so important (§2.1). We then present Natural Language Processing (NLP), its linguistics-focused sub-field (§2.2). Further, we show the importance of Transfer Learning and its branches (§2.3) and then do the same for Optical Character Recognition (OCR) (§2.4). Finally, we discuss the digitisation of historical texts and the problems they pose (§2.5) before combining everything and discussing how those problems can be addressed (§2.6).

2.1

Machine learning

Machine learning is the basis of our work and is a term used to identify computer algorithms that improve automatically through experience. More specifically, it builds mathematical models from data and uses probability and information theory to assign probabilities to possible outcomes of a specific event.

In machine learning, input data is usually represented as a vector x ∈ R^d of d features, where each feature contains a particular attribute of the data. For example, in the context of text, these features could simply be the characters of the sequence or, when working with images, the pixel values, and so on. Furthermore, the learning process of a model which uses these vectors is split into supervised learning – where for every input x_i there is an output value, typically a separate label y_i – and unsupervised learning, where no designated labels are available.

Moreover, machine learning tasks are split into different categories, one of the most common being classification. In classification, the label y_i comes from a predefined set of classes. Depending on this set, we have binary classification, multi-class classification and multi-label classification. The first always deals with two classes, usually representing a Boolean value, whereas the second deals with more than two. In the third case, unlike the most common setting where every input x_i has only one correct corresponding label y_i, we instead have more than one. The multiple labels may also overlap for different input entries.

An ultimate goal of such models is to eventually achieve generalisation – that is, a state of the model which allows it to be applied to previously unseen data with results as good as those observed previously. We will frequently revisit the topic of generalisation throughout this work, as it is a fundamental part of our research too.

For the purpose of generalisation, the available data is usually split into different sets which then play different roles. A bigger part, called the training set, is used for training the model. A smaller part is then reserved to evaluate the model after training. It is called the test set and is used to mimic unseen data, thus giving a glimpse of the generalisation ability of the model. While training, we compute the training error over the training set. More important, however, is the test error, which shows the amount of generalisation that is achieved. Thus, first of all, it is important for the split into training and test sets to be such that the latter is indeed unseen during training. Secondly, this highlights the main difference between optimisation and machine learning algorithms: while the first seeks to minimise the training error, the second aims to minimise the generalisation (test) error.
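As a minimal illustration of this split and of the two error measures, the sketch below trains a simple classifier on a held-out split; it assumes the scikit-learn library and its toy digits dataset, which are not part of the data used in this thesis.

```python
# Minimal sketch of the training/test split and the two error measures,
# assuming scikit-learn and its toy digits dataset (not data used in this thesis).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # each row of X is a feature vector x_i, each entry of y a label y_i

# Keep a held-out test set that mimics unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)                    # optimisation: minimise the training error

train_error = 1 - model.score(X_train, y_train)
test_error = 1 - model.score(X_test, y_test)   # estimate of the generalisation error
print(f"training error: {train_error:.3f}, test (generalisation) error: {test_error:.3f}")
```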

Machine learning remains a fundamental field, combining many areas and techniques. Naturally, it is at the core of this study too, in particular through its linguistics-focused sub-field.

2.2

Natural language processing

A crucial part of machine learning is natural language processing (NLP) – a field which combines active areas of research on digital texts and their modelling. One of its main goals is to close the distance between computers and humans in understanding natural languages. Due to the contrast between different languages, generalisation is usually achieved by taking an abstract view and mapping languages to similar linguistic models and tasks (Smith, 2011). The linguistic models that help with this abstraction in NLP are language models. As a key element of the field, they represent one of the earliest and biggest milestones in the progress towards truly autonomous text processing and are nowadays one of the core components of all related architectures. They are mathematical machine learning models that use sequences of text (words, characters or other units) as the input vectors x_i. They aim to learn the mapping towards the output labels y and also provide a low-level generalisation of text-based systems which allows us to compare and formulate them.

Many studies have been performed on how exactly to formulate such models. The earliest attempts were human rule-based systems, which required people to manually create rules that would then be used to map the data. These were a good first step but had many flaws, the biggest of which was their inability to generalise well on unseen data (Committee, 1966). Over the past 20 years, focus has shifted towards mathematical models which automatically learn representations from the input data (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006). This allowed human effort to be spent on creating features instead, which were supposed to teach the model what is important and what is not. Unfortunately, requiring human interaction once again proved to be an error-prone approach and one mostly limited by human understanding. Not only was it time-consuming, it also reduced generalisation: even though humans most often knew a certain field very well, they failed to foresee unexpected circumstances outside of their specialisation.

With the latest developments in recent years, deep neural networks have become the most widely used models (Goodfellow et al., 2016). Their main difference is the automated learning of data features. As a result, they reduce and in some cases fully remove the need for human supervision. In the case of supervised learning, human participation is required only for labelling the data, whereas for unsupervised learning the model is fully automated.

Being an important component of language models, the different ways of representing and labelling the input data have changed over the years too. In the beginning, as previously mentioned, task-specific representations created by human annotators were the most widely used. After these, a better, more autonomous approach of storing data was proposed, called bag-of-words (BOW). This makes use of a vocabulary V which contains all words that occur in the data corpus and the number of occurrences of each. This frequency number was sometimes additionally weighted with term frequency-inverse document frequency (tf-idf), which helped to weight non-important words (e.g. is, and, etc.) lower compared to task-meaningful but rare words. Original BOW models used unigrams – that is, sequences of one word – but some also made use of other n-grams, such as bigrams, which are sequences of two words. It is important to note that this approach ignores grammar and word order, therefore losing information about the context of the words. This has a major drawback whereby occurrences of words like bank are treated equally, irrespective of the fact that they may appear in the context of withdrawing money from my bank versus I sat on the river bank to admire the water.
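As a minimal sketch of these count-based representations, the example below builds unigram counts and tf-idf weights for two invented sentences; it assumes the scikit-learn library, which is not used elsewhere in this thesis.

```python
# Minimal bag-of-words / tf-idf sketch, assuming scikit-learn; the two
# sentences are invented and only illustrate the loss of word-order information.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "I withdrew money from my bank",
    "I sat on the river bank to admire the water",
]

bow = CountVectorizer(ngram_range=(1, 1))   # unigram counts; (1, 2) would also add bigrams
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())          # the vocabulary V
print(counts.toarray())                     # occurrence counts per document

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)         # words occurring in every document are down-weighted
print(weights.toarray().round(2))           # note: 'bank' maps to the same column in both documents
```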

These problems paved the way for the introduction of word embeddings – a representation that boosted NLP progress tremendously and is arguably one of its most important novelties. It converts words into embedding vectors, which can differ in dimensionality and later allow words to encode contextual information. This approach did not exclude the usage of BOW; indeed, some of the earliest research used both and only replaced the vocabulary words with their corresponding vectors, thus creating a word matrix (Levy and Goldberg, 2014; Goldberg and Levy, 2014; Joulin et al., 2016).

One of the most widely recognised approaches, which is still used today, is Word2Vec (Mikolov et al., 2013). It uses skip-gram-based training – that is, given a target word, the model learns to predict the words in a context window of N words around it, using the vocabulary. The downside of this approach is its inability to encode much context, since the window is usually small, up to two words, and increasing it makes training more time-consuming and, more importantly, introduces a lot of unwanted noise. Still, many studies proved that word embeddings are an important piece of the NLP puzzle, showing how they can even be used to compare different words, precisely by using their vectors. What is more, one can add the vectors of words such as Russia and river, resulting in a vector close to that of the word Volga.
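The sketch below shows what such a skip-gram setup looks like in code; it assumes the gensim library and a tiny invented corpus, so the resulting vectors are only illustrative.

```python
# Minimal skip-gram Word2Vec sketch, assuming the gensim library and a toy
# tokenised corpus; a realistic setup would train on millions of sentences.
from gensim.models import Word2Vec

sentences = [
    ["the", "volga", "is", "a", "river", "in", "russia"],
    ["the", "danube", "is", "a", "river", "in", "europe"],
    ["moscow", "is", "a", "city", "in", "russia"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word embeddings
    window=2,         # small context window around the target word
    min_count=1,
    sg=1,             # 1 = skip-gram, 0 = CBOW
    epochs=50,
)

# Vector arithmetic in the embedding space, in the spirit of Russia + river.
print(model.wv.most_similar(positive=["russia", "river"], topn=3))
```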

GloVe (Pennington et al., 2014) was later introduced as an improvement, combining global matrix factorisation with local context window methods.

All these pre-trained word representations quickly became key components in many neural language architectures. However, they all lacked encoding of higher-level knowledge of their surroundings and were therefore still not perfect. Some improvements were made over the years, such as the inclusion of sub-word information (Wieting et al., 2016; Bojanowski et al., 2017) and the addition of separate vectors for each sense of a word (Neelakantan et al., 2015), but they were not as significant. To this end, and due to the advances in Deep Learning (DL), ELMo (Peters et al., 2017, 2018a) was introduced, which provided the first deep contextualised representations, building on top of these improvements. The model contained much more complex characteristics, using a function of the entire input sentence as a representation while also including sub-word information. This advanced the state of the art for most NLP challenges.

However, it was shown that feature-based systems like ELMo could be improved even further if they were fine-tuning-based instead (Fedus et al., 2018). The introduction of the transformer architecture (Vaswani et al., 2017) allowed significantly faster training compared to traditional recurrent- or convolutional-based systems, which at the time were state of the art for many tasks. Using this new architecture, BERT was unveiled (Devlin et al., 2019). Although much bigger and more power- and time-consuming than the alternatives, it proved itself many times over (Goldberg, 2019; Clark et al., 2019), having one key benefit similar to ELMo – there is no need to train the model from scratch. Since then, many alternatives have been proposed, both in different languages (Martin et al., 2020) and with slightly different architectures or training setups (Liu et al., 2019; Yang et al., 2019). They all share a common benefit, namely the ability to simply take the pre-trained representations from the model and use them to extract features. This all established the transfer of representations between tasks as a highly promising option.
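As a minimal sketch of this feature-extraction use of a pre-trained language model, the example below obtains contextual token vectors from a multilingual BERT checkpoint; it assumes the HuggingFace transformers library and the publicly available bert-base-multilingual-cased model, which stand in here for whichever pre-trained model a practitioner would pick.

```python
# Minimal sketch of extracting contextual features from a pre-trained BERT,
# assuming the HuggingFace transformers library and the publicly available
# bert-base-multilingual-cased checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

sentence = "I sat on the river bank to admire the water."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():                       # feature extraction only, no fine-tuning
    outputs = model(**inputs)

# One contextual vector per sub-word token; unlike static embeddings,
# the vector for 'bank' now depends on the surrounding sentence.
token_features = outputs.last_hidden_state  # shape: (1, num_tokens, hidden_size)
print(token_features.shape)
```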

2.3

Transfer learning

In standard learning methods, we train randomly initialised models. In cases where we work on multiple tasks, separate training runs are performed for each task, even if they use the exact same model architecture underneath. Many works have since proven the importance of transferring information instead. BERT solidified this knowledge, providing better generalisation and additional bidirectional information about the input data, as well as encouraging the usage of the same pre-trained embeddings for a wide range of different tasks (Sun et al., 2019). Results such as these gave more weight to one of the most important areas of study of the last decade – transfer learning (Pan and Yang, 2010).

Similarly to humans, where parents continuously share existing knowledge with their children from a young age, networks also prove to benefit greatly from such sharing. This approach provides many benefits, among them (i) significantly faster convergence, due to systems not starting the learning from scratch; (ii) less computational power required, due to the faster convergence and far fewer parameters to train; and, arguably most important, (iii) better generalisation, coming from the fact that models now contain information from multiple domains and are thus better prepared for unseen data. In addition, research showed that transfer learning helps when there is a lack of available data, which is often observed in many setups (Chronopoulou et al., 2019).

The mentioned gains helped the community in almost all areas of Machine Learning, including Natural Language Processing. When it comes to the source task or source domain – that is, the area from which we take existing knowledge – a decision needs to be made on what information exactly to use and apply to the target task or target domain, i.e. the area to which we apply the knowledge. Most methods use a general-purpose source task. In addition, transfer learning is mostly used, and found helpful, when a linguistically under-resourced target task is at hand, i.e. when we have too small an amount of data to train independently or even a lack of linguistic resources. Even when this is not the case, benefits such as saving compute resources are observed, because less training is needed when applying a pre-trained model. As to what exactly to transfer – usually we would like to encode as much of the learned experience as possible. This can take various forms depending on the area of application. In the case of NLP, and throughout our work, this comes down to the text representations learned by neural network systems.

Having shown how powerful transfer learning can be, we now turn towards how exactly to transfer, and the multiple ways of using it across different tasks. We focus on the approaches used in this thesis and only briefly mention the remaining ones.

2.3.1

Multi-task learning

Transfer learning does not always have to depend on a model trained on a source task before carrying over the knowledge. Instead, we can transfer knowledge during training. This approach is called multi-task learning (MTL) and is what many systems use nowadays to encode knowledge from a broader context and multiple domains. It has been shown to be helpful in many applications of machine learning, including natural language processing (Collobert and Weston, 2008), speech recognition (Deng et al., 2013), computer vision (Girshick, 2015) and even drug discovery (Ramsundar et al., 2015).

2.3.1.1 Introduction

Typically, systems set up a model which is trained to perform a single task. This model may or may not be transferred to a different target task later on. However, focusing on multiple source tasks rather than a single one during the initial stages has several advantages: (i) we reduce the risk of forgetting information which is not relevant to the source task but is relevant to the target task and, more importantly, (ii) our model can generalise the encoded knowledge much better (Caruana, 1998).

This type of learning is also motivated by a number of real-life examples. Looking at human biology, it is natural for babies to learn new tasks by applying knowledge they learn simultaneously elsewhere, for example when they first learn to recognise faces before applying this knowledge to recognise other, more complex scenarios (Wallis and Bülthoff, 1999). It can also be seen in teaching: when people try to learn new skills, they will first learn something simple which provides them with the abilities necessary to master more advanced techniques (Brown and Lee, 2015). Going back to machine learning, multi-task learning has also been proven to coach models into preferring hypotheses that explain more than one task (Caruana, 1993).

2.3.1.2 Benefits

We will look at the most important gains of using multi-task learning, following Caruana (1998), who first explained them. For simplicity, we assume two tasks A and B in all examples.

Generalisation is arguably the most important perk. Learning in a multi-task approach effectively increases the training data at our disposal. And because all tasks contain some amount of noise, even if it is marginally small, increasing the data size helps when dealing with it. Learning by focusing on either task A or B alone introduces the risk of overfitting, while a joint approach lets us encode better, generalisation-focused representations.


Focus is a very important aspect of machine learning and as such can make a crucial difference, especially during the training process. If the data of a certain task contains a lot of noise, this can hinder the model's ability to distinguish relevant from irrelevant features. Multi-task learning helps by introducing more data and directing the focus to the important cross-domain features.

Eavesdropping allows a model to learn some features required for task A by focusing on task B. Examples include features that are represented in a more complicated way in task A, or cases where other features specific to task A are in the way.

Representation generalisation allows the model to learn in such a way that the representations it produces will generalise to cover multiple tasks, which is something sought after in a multi-task approach. This also helps to cover new tasks in the future, assuming they come from the same environment (Baxter, 2000).

Regularisation occurs when using an MTL approach, which helps the model to avoid overfitting as well as reduce its Rademacher complexity, i.e. its ability to fit random noise (Søgaard and Goldberg, 2016).

Multi-task learning is a natural fit for domains where we are originally interested in learning multiple tasks at once. Examples are (i) finance or economics forecasting, where we might need to predict multiple, likely related, indicators; (ii) marketing, where considerations for multiple consumers are taken at once (Allenby and Rossi, 1998); (iii) drug discovery, where many active compounds should be predicted together; and (iv) weather forecasting, where predictions are often dependent on multiple conditions and the situation in neighbouring cities.

However, multi-task learning can also be beneficial when there is one main task. In such cases, auxiliary tasks are introduced to help better learn the desired representations. These can range from tasks predicting features similar to the desired ones to a task learning the inverse of the original one.
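A common way to implement this is hard parameter sharing: one encoder shared between tasks, with a small task-specific head per task. The sketch below, in PyTorch, is only a minimal illustration with a hypothetical per-token main task A and a sentence-level auxiliary task B; the sizes, task names and dummy data are invented for the example.

```python
# Minimal sketch of hard parameter sharing for multi-task learning in PyTorch:
# a shared text encoder feeds two task-specific heads (a hypothetical
# per-token main task A and a sentence-level auxiliary task B).
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=128,
                 num_tags_a=9, num_classes_b=2):
        super().__init__()
        # Shared layers: their weights receive gradients from both tasks.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # Task-specific heads.
        self.head_a = nn.Linear(2 * hidden, num_tags_a)     # per-token labels
        self.head_b = nn.Linear(2 * hidden, num_classes_b)  # per-sentence label

    def forward(self, token_ids):
        states, _ = self.encoder(self.embedding(token_ids))
        logits_a = self.head_a(states)               # (batch, seq_len, num_tags_a)
        logits_b = self.head_b(states.mean(dim=1))   # (batch, num_classes_b)
        return logits_a, logits_b

model = SharedEncoderMTL()
tokens = torch.randint(0, 1000, (4, 12))             # dummy batch of token ids
tags = torch.randint(0, 9, (4, 12))                  # dummy per-token labels for task A
labels = torch.randint(0, 2, (4,))                   # dummy sentence labels for task B

logits_a, logits_b = model(tokens)
loss = nn.functional.cross_entropy(logits_a.reshape(-1, 9), tags.reshape(-1)) \
       + nn.functional.cross_entropy(logits_b, labels)   # joint loss over both tasks
loss.backward()                                           # the shared encoder receives both gradients
```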

In NLP, tasks that have been used as auxiliary tasks include speech recognition (Toshniwal et al., 2017), machine translation (Dong et al., 2015; Zoph and Knight, 2016; Johnson et al., 2017; Malaviya et al., 2017), multilingual tasks (Duong et al., 2015; Ammar et al., 2016; Gillick et al., 2016; Yang et al., 2016; Fang and Cohn, 2017; Pappas and Popescu-Belis, 2017; Mulcaire et al., 2018), semantic parsing (Guo et al., 2016; Fan et al., 2017; Peng et al., 2017; Zhao and Huang, 2017; Hershcovich et al., 2018), representation learning (Hashimoto et al., 2017; Jernite et al., 2017), question answering (Choi et al., 2017; Wang et al., 2017), information retrieval (Jiang, 2009; Liu et al., 2015; Katiyar and Cardie, 2017; Yang and Mitchell, 2017; Sanh et al., 2019) and chunking (Collobert and Weston, 2008; Søgaard and Goldberg, 2016; Ruder et al., 2019).

2.3.2

Sequential transfer learning

We now turn towards the oldest, most widely known and arguably most frequently used transfer learning scenario, namely sequential transfer learning (STL). Here we train each task separately, instead of jointly as was the case in multi-task learning. The goal is to transfer some amount of information from the training performed on the source task in order to improve performance on the target task. This has also led to the term model transfer (Wang and Zheng, 2015).

2.3.2.1 Motivation

Compared to the MTL approach, sequential transfer learning often requires initial model convergence before transferring information. There are also studies where the transfer happens before final convergence on the source task, resulting in the less common curriculum learning (Bengio et al., 2009). This type of sequential transfer learning imposes an order on the different tasks at hand and selects when each one is to be used, most often based on a pre-defined difficulty scale.

To further show how important this approach is, we look at some situations where using it seems like a natural choice, that is, when (i) data is not available for all tasks at the same time, (ii) there is a significant data imbalance between the different tasks or (iii) we want to adapt the trained model to multiple target tasks.

Looking at compute requirements, sequential transfer learning is usually much more expensive during training on the source task. Afterwards, however, during application to the target task, it is much faster compared to MTL, which, while faster to train on the source task, might be expensive when applied to the target task.

2.3.2.2 Stages

This type of transfer learning consists of two main stages, corresponding to the time periods of the learning process – pre-training and adaptation. During pre-training we perform training on the source task data. This is the operation that is usually expensive in computational terms. After this, in the adaptation phase, the learned information is transferred to the target task. The latter stage – being time-efficient and relatively easy to perform – is what gives sequential transfer learning its compelling benefit over other types of learning, as systems can build on top of this pre-trained knowledge instead of starting their training from scratch. For this benefit to be maximal, the source task must be chosen carefully and in such a way that it combines as many general domain features as possible. Ideally, we would be able to produce universal representations which encode information about the whole area of interest. Even though the no free lunch theorem (Wolpert and Macready, 1997) states that fully universal representations will never be available, pre-trained ones are still significantly better compared to a learn-from-scratch approach.

The pre-training stage can be further split into three sub-types, depending on the type of supervision applied during training.

First, there is unsupervised pre-training which, following unsupervised machine learning, does not require labelled data and relies only on raw text. This is closer to the way humans learn (Carey and Bartlett, 1978; Pinker, 2013) and, in theory, more general than other types. Because there is no labelling, we remove any unwanted human-introduced bias from the data and rely on the model to learn important features on its own, producing so-called self-taught learning (Raina et al., 2007) or simply unsupervised transfer learning (Dai et al., 2008).

The second sub-type, called distantly supervised pre-training (Mintz et al., 2009), uses heuristic functions and domain knowledge to obtain labels from unlabelled data on its own. Tasks used in this approach include sentiment analysis (Go et al.), word segmentation (Yang et al., 2017) and predicting unigrams or bigrams (Ziser and Reichart, 2017), conjunctions (Jernite et al., 2017) and discourse markers (Nie et al., 2019).

Lastly, supervised pre-training covers traditional supervised methods which require manual labelling of the available data. The data used here usually already exists, and sometimes existing tasks related to the target task are chosen. Examples include training a machine translation model on a high-resource language and then transferring it to a low-resource language (Zoph et al., 2016), or training a part-of-speech (POS) tagging model and then applying it to a word segmentation task (Yang et al., 2017). Other options which emerged recently are pre-training on large datasets and tasks which help the model learn generic representations. Examples are dictionaries (Hill et al., 2016), natural language inference (Conneau et al., 2018), translation (McCann et al., 2017) and image captioning (Kiela et al., 2018).

Combining some or all of these three approaches can lead to so-called multi-task pre-training. This is done by leveraging some of the heuristics mentioned in §2.3.1 and pre-training on multiple tasks at once. This can provide the resulting representations with further generalisation, as well as reduce the noise they contain.

The adaptation stage covers the different ways of adapting the pre-trained knowledge. From the discussion so far, we see that the gains of transferring existing knowledge are indisputable. However, based on the task at hand, different methods of applying this knowledge might be more suitable than others.

After initial convergence, a final decision remains as to how exactly to use the pre-trained information. One can decide to utilise the knowledge in a feature-based approach, where the transferred model weights are kept 'frozen' and the whole pre-trained setup is only used to extract features. These are then used as input for another model or plugged into the current one using some heuristic merging (Koehn et al., 2003). In the case of NLP, an example of feature extraction is the use of pre-trained word representations (Turian et al., 2010). The second approach is referred to as fine-tuning and requires continuing the training of the existing model, this time on the target domain.


Both adaptations have their own benefits. Feature extraction allows us to use existing models easily and repeatedly. On the other hand, fine-tuning allows us to further specialise in the target domain, which has been proven to perform better in most cases (Kim, 2014). However, there are exceptions where fine-tuning is not as good and can even hinder performance, such as when the training set is very small, when the test set contains many OOV tokens (Dhingra et al., 2017) or when the training data is very noisy (Plank et al., 2018).

Most works point towards fine-tuning being the more promising choice (Gururangan et al., 2020). However, while it has been shown that fine-tuning works best when the source and target tasks are similar, if this is not the case it can easily be overcome by simple feature extraction (Peters et al., 2019). This also corresponds to one of the drawbacks of fine-tuning – it can introduce specific focus on certain tokens, which might lead to others becoming stale.
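The difference between the two adaptation strategies is easiest to see in code. The sketch below assumes PyTorch, the HuggingFace transformers library and the publicly available bert-base-multilingual-cased checkpoint, and contrasts freezing the pre-trained encoder (feature extraction) with continuing to train it (fine-tuning); the task head and learning rates are invented for the example.

```python
# Minimal sketch contrasting the two adaptation strategies, assuming PyTorch
# and HuggingFace transformers with the bert-base-multilingual-cased checkpoint.
import torch
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

# (a) Feature extraction: keep the pre-trained weights 'frozen' and only
#     train whatever task-specific layers sit on top of the extracted features.
for param in encoder.parameters():
    param.requires_grad = False

task_head = torch.nn.Linear(encoder.config.hidden_size, 9)   # e.g. a hypothetical 9-tag NERC head
feature_optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-3)

# (b) Fine-tuning: continue training the whole pre-trained model, usually
#     with a much smaller learning rate, on the target-domain data.
for param in encoder.parameters():
    param.requires_grad = True

fine_tune_optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(task_head.parameters()), lr=2e-5
)
```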

2.3.3

Other

One other, less well-known transfer learning type is lifelong learning (Thrun, 1996, 1998). It differs from sequential transfer learning, which originally uses two tasks – a source and a target one. In lifelong learning we can have many tasks which are learned sequentially, with no clear distinction between which tasks are sources and which are targets.

Another studied approach is called domain adaptation. This field seeks to learn representations and functions which will be beneficial for the target domain, instead of trying to generalise. Finally, a separate type is reserved for cross-lingual learning, which combines methods for learning representations across languages. These come down to cross-lingual word representations and rest on the hypothesis that many different models trained separately actually optimise for similar objectives and are in fact closely related, differing only in their optimisation and hyper-parameter strategies.

2.4

Optical Character Recognition

We now turn towards another major technology that has impacted our lives for the better, namely optical character recognition (OCR) (Mori et al., 1999). This represents a group of techniques whose goal is to convert, i.e. digitise, images of typed or printed text into machine-encoded text. It can also be applied to handwritten text, resulting in so-called handwritten character recognition (HCR). The conversion is made through an optical mechanism; one can consider the way humans read to be optical recognition too, with the eyes being the optical mechanism. The technology is of great importance, as most of the search and mining performed on digitised collections is done using OCR'd texts.

The field has been an active area of development for more than a century (Mori et al., 1992; Rikowski, 2011) and may be traced all the way back to 1914, when Emanuel Goldberg developed a machine that reads characters and converts them into standard telegraph code. There are many applications for digitised text – the first was creating devices that can help the blind. Nowadays, the technology is advancing, and recognised texts can be indexed and used in search engines, which can also help with one of the biggest problems in NLP and ML in general – the lack of data. Even simple outcomes, such as being able to access the text digitally, word by word, pave the way for more important scenarios like the ability to read books out loud, which can make the difference for visually impaired people.

We identify several main steps during the process of recognising optical characters. These can all be seen in Figure 2.1. We describe them briefly.

Scanning involves converting the paper documents, which can be books, newspapers, magazines, etc. into digital images. This step can be skipped if we already have the images and not the original source.

Pre-processing applies some amount of processing to the scanned images. This can include 'de-skewing' the image, i.e. aligning it in case it was not originally aligned; 'despeckling', i.e. removing very bright or very dark spots and smoothing edges; binarisation of the image, making it black-and-white for example; cropping edges; removing lines; and more (Mori, 1984). This step aims to make it easier for the system to detect the text in the later steps.


Figure 2.1: General OCR flow. Blue steps use Computer Vision techniques, while red steps use NLP ones. During the recognition step, both CV and NLP approaches can be used. Not all steps are necessary, and some might be merged together or skipped depending on the requirements.

The segmentation step aims to find the important segments in the image which contain the text (Hirai and Sakai, 1980; Jeng and Lin, 1986), also called 'regions of interest' (ROI) of the image. These are usually represented as the coordinates of the starting pixel of a ROI together with the width and height of the box that wraps the region.

Recognition can combine both NLP and Computer Vision (CV) techniques while trying to recognise the text, in terms of characters and words, from the regions of interest resulting from the previous step. This usually produces text that can contain grammatical or lexical errors, which is why we often need to apply the next step in the flow.
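In practice, the recognition step is often delegated to an off-the-shelf engine. The sketch below assumes the pytesseract wrapper around the Tesseract OCR engine and a hypothetical scanned page image 'page.png'; it only illustrates what such an engine returns and is not the setup used in this thesis.

```python
# Minimal sketch of the recognition step, assuming the pytesseract wrapper
# around the Tesseract OCR engine; 'page.png' is a hypothetical scanned image.
from PIL import Image
import pytesseract

image = Image.open("page.png")

# Plain recognised text (likely containing OCR errors on historical material).
text = pytesseract.image_to_string(image, lang="fra")
print(text)

# Word-level output with bounding boxes and confidences, usable as
# regions of interest and as input for the later post-processing step.
data = pytesseract.image_to_data(image, lang="fra", output_type=pytesseract.Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip():
        print(word, conf)
```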

Post-processing groups the different strategies for fixing errors produced in the recognition step. This can involve character- or word-vocabulary checks, post-OCR correction, probability verification and more. It is an important step because many documents have an extremely low quality of preservation and are therefore very hard to digitise directly (Nguyen et al., 2019).

Evaluation represents a final step which is used to train better recognition systems. Assuming we have a ground truth of the text we are trying to recognise, we can compare it to the output of the OCR system using a specific metric, usually accuracy. At this step we can also compare different OCR systems based on their speed and flexibility, i.e. how sensitive the system is to changes in the input.
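One commonly used metric for this comparison is the character error rate (CER): the Levenshtein edit distance between the OCR output and the ground truth, normalised by the length of the ground truth. A minimal, self-contained sketch with invented example strings:

```python
# Minimal sketch of character error rate (CER) evaluation: Levenshtein edit
# distance between OCR output and ground truth, normalised by the ground truth
# length. The example strings are invented.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(ocr_output: str, ground_truth: str) -> float:
    return levenshtein(ocr_output, ground_truth) / max(len(ground_truth), 1)

print(cer("Tbe qnick brown f0x", "The quick brown fox"))  # ~0.16
```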

While in the previous century the focus was mostly on improving the results of OCR setups, recently it has been shifting towards the usability of the output and applications in other domains. There is less room for improvement now, to the point of the challenge being considered a solved problem (Hartley and Crumpton, 1999). Many OCR systems report accuracy rates in the high nineties. That said, those results are not always consistent and are subject to a wide variety of pre-conditions, the biggest one being the quality of the original document (van Strien et al., 2020). This hinders advances in the field, and the impact of using OCR'd text remains relatively unknown (Milligan, 2013; Cordell, 2017, 2019; Smith and Cordell, 2018). As a result, there have been many works related to OCR quality and its development (Alex and Burns, 2014).

With the advances of Deep Learning in recent years, some of its methods have naturally been applied to OCR too. Research has analysed its effect on most of the aforementioned steps. Post-processing benefited greatly from applications of machine translation models during the post-OCR correction phase (Nastase and Hitschler, 2018), some of which were used to de-noise the OCR'd text (Hakala et al., 2019). Due to the nature of digitised texts, it is important to be able to process data without having a ground truth – a requirement that is uncommon for most ML tasks. There have been significant advances in this regard, with many models currently able to correct OCR outputs in an unsupervised manner (Dong and Smith, 2018; Hämäläinen and Hengchen, 2019; Soni et al., 2019).

In general, due to the complex nature of OCR systems, which combine many different fields of study, a large share of the novelties introduced in machine learning is directly applicable to one or more of the architecture steps. However, as we will see next, problems arise when the domain on which OCR is applied changes, as most of the known techniques have been applied solely to modern, day-to-day text corpora.

2.5

Historical texts

Historical documents have long been neglected when it comes to NLP and digitisation efforts. In recent years, research efforts have increased, mainly due to significant advances in OCR which have allowed some of the thousands of historical texts to be successfully digitised (Piotrowski, 2012), in multiple different languages (Smith and Cordell, 2018). This is an important first step towards cultural heritage preservation; however, it remains only a first step and there are still many challenges blocking progress towards successful full digitisation. And even then, digitisation is only a necessary beginning. To enable intelligent search and navigation, the texts must be processed and enriched, which is harder for historical ones because the OCR output is often of extremely low quality, at times making it unreadable. We describe several main differences between historical corpora and modern-day texts that hinder OCR performance.

Quality of historical documents is often much worse compared to their modern counterparts. Very often the paper of a document starts degrading once it is centuries old, which can result in parts of the pages falling apart, ink starting to fade, and so on. This all leads to additional noise in the OCR process, usually enough to deceive current modern workflows.

Linguistic change is another major complication, stemming from the fact that all languages change over the years. Nowadays, many systems are able to return exploitable results for documents in modern-day English. However, when applied to manuscripts and newspapers in English originating from more than one or two centuries ago, most fail to grasp any meaningful knowledge from them. Changes also happen in the typefaces themselves, with many old books printed in Gothic type – unknown to current systems. Looking at linguistic change, it is natural to consider the old version of a language as a separate language and attempt to apply current techniques such as machine translation to bridge the gap. Unfortunately, this leads us to the next key difference from languages of our time.

Lack of data is blocking many advances, as current deep learning systems rely on significant amounts of raw data. Digitised resources, and in particular quality, clean ground truth corpora, are still scarce and insufficient for the demands of current large NLP architectures, thus slowing down necessary advances.

Fortunately, more and more projects are being launched to help narrow one or more of these gaps. Similar to modern-day text collections such as Project Gutenberg (Project Gutenberg, 2020), many initiatives have been launched with the sole goal of improving the state of historical corpora and their accessibility. They bring us an unprecedented amount of digitised historical books and newspapers. Examples are (i) the NewsEye research project (NewsEye, 2020), which not only stores digitised historical texts but also supports competitions aimed at advancing research in this area; (ii) the IMPRESSO project (Ehrmann et al., 2020a), which aims to enable critical text mining of newspaper archives; (iii) the HIMANIS project (Bluche et al., 2017), aiming at developing cost-effective solutions for querying large sets of handwritten document images; and (iv) the IMPACT dataset (Papadopoulos et al., 2013), which contains over 600,000 historical document images, among others.

As of this date, the community still lacks the required knowledge for a general approach to processing historical texts. There have already been many works which successfully deal with such texts; however, all of them focus specifically on the problem at hand and are unable to generalise when taken out of their original context. Gezerlis and Theodoridis (2002) and Laskov (2006) develop systems for recognising characters used in Christian Orthodox Church music notation, whereas Ntzios et al. (2007) presents a segmentation-free approach for recognising old Greek handwritten documents, especially from the early ages of the Byzantine Empire. Many systems have also been proposed which extract information from digitised historical documents, empowering document experts to focus on improving the input quality (Droettboom et al., 2002, 2003; Choudhury et al., 2006). Research has also been performed on languages which are not actively used nowadays, such as Latin (Springmann et al., 2014).

Because of the amount of errors in historical OCR, attention has naturally also been directed at analysing their impact and the different ways they can be overcome (Holley, 2009; van Strien et al., 2020). Hill and Hengchen (2019) aimed to quantify the impact OCR has on the quantitative analysis of historical documents, Jarlbrink and Snickars (2017) investigated the different ways newspapers are transformed during the digitisation process, and Traub et al. (2018) showed the effects of correcting OCR errors during the post-processing stage.

Being a necessity for information retrieval, named entity recognition (NER) is one of the important tasks for digitised historical texts. Ehrmann et al. (2016) analysed the complications: this task, while achieving near-perfect results on modern texts, is far from reaching even compelling results when applied to their historical counterparts (Vilain et al., 2007).

2.6 Transfer learning for historical texts

There is much to be desired when it comes to digitising historical texts and, due to the great number of obstacles, progress is slow. However, transfer learning (§2.3) holds much promise, mostly because the problems it was designed to address also occur in historical OCR'd corpora.

Research is already benefiting from the power that knowledge transfer provides, with works focusing on measuring and representing semantic change (Shoemark et al., 2019; Giulianelli et al., 2020) and extracting named entity information (Labusch et al., 2019). Other works focused on transfer learning using CNN feature-based networks, such as Tang et al. (2016), who investigated the applicability of transfer learning to historical Chinese character recognition, while Granet et al. (2018) investigated handwriting recognition. Recently, Cai et al. (2019) showed that applying a pre-trained generative adversarial network (GAN) can be useful, resulting in the TH-GAN architecture, which proved effective for historical Chinese character recognition.

Nevertheless, while there have been minor advances in the field, some questions remain open, mostly due to the challenges which historical corpora pose. It is still unclear when (i.e., for which tasks, languages, etc.) and how (i.e., which approach) transfer learning can be successfully applied on historical collections. Given that most pre-trained language models have been trained on modern-day, high-resource languages (e.g., Wikipedia in English), their applicability to historical collections is not straightforward.

In this work, we start bridging this gap by systematically assessing transfer learning for historical textual collections. To this end, we consider three tasks which help us analyse this further: the SemEval 2020 challenge, aimed at lexical semantic change evaluation (Schlechtweg et al., 2020), the ICDAR2019 Competition on Post-OCR Text Correction (Rigaud et al., 2019) and the CLEF-HIPE-2020 challenge on Named Entity Recognition, Classification and Linking (Ehrmann et al., 2020b). These tasks are of importance to practitioners as they directly influence the usability and accessibility of digitised historical collections.


Chapter 3

Empirical setup

Our aim is to assess the importance of pre-trained language models and whether such representations can be of use when applied to historical text corpora. We analyse different scenarios and environments and look into some of the most pressing challenges that the community is currently facing. To this end, we introduce a generic representation that can help future research on historical corpora. This includes a highly modular embedding layer which allows us to activate and deactivate different embeddings, working on different levels of the data, so that we can further analyse their influence. We use task-specific inner and output layers and keep core parts as similar as possible, which enables better tracking of results and progress. The general architecture of our model is explained in detail in §3.1.

In §3.2 we explain some of the current problems that historical text representations face. We use these throughout the thesis as they allow us to test our hypotheses in distinct scenarios, requiring us to rely on various modern-text state-of-the-art Natural Language Processing features and test them in this different domain.

3.1 Model architecture

Our modular architecture allows us to compare results consistently across settings. We stick to one general architecture when designing our models, which can be seen in Figure 3.1. This gives us the freedom to test different linguistic representations while still keeping core parts of the model uniform across different tasks and languages. We further split the key sections of the model into four distinguishable parts.

3.1.1 Input

This layer simply represents the data that is being used. It is dependent on the challenge itself and can represent a masked or unmasked sequence of text tokens or characters.

3.1.2 Embedding layer

We group the modules which are responsible for building representations of the input data in the embedding layer. These are further split into two sub-groups based on their origin: (i) pre-trained (transferred) representations and (ii) newly trained representations. The former groups modules which use pre-trained models to generate embeddings. These can either have frozen weight matrices and be used only for feature extraction, or have their weights trained (more specifically, fine-tuned) along with the rest of the model. The other sub-type, newly trained representations, covers embedding matrices which are usually initialised randomly and learned from scratch, along with other parts of the model at hand. These do not leverage existing knowledge and try to learn task-specific features from scratch.
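To make this split concrete, the sketch below shows a minimal, illustrative embedding module in PyTorch (not the exact implementation used in this thesis), combining a pre-trained transformer, optionally frozen and used as a feature extractor, with a randomly initialised embedding matrix trained from scratch. The model name and dimensions are placeholders, and for simplicity both sub-modules are assumed to share the same sub-word token ids.

```python
# Minimal sketch of the two embedding sub-groups; assumes PyTorch and the
# HuggingFace `transformers` package. The model name is illustrative only.
import torch
import torch.nn as nn
from transformers import AutoModel


class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size: int, new_dim: int = 64,
                 pretrained_name: str = "bert-base-multilingual-cased",
                 freeze_pretrained: bool = True):
        super().__init__()
        # Pre-trained (transferred) representations
        self.pretrained = AutoModel.from_pretrained(pretrained_name)
        if freeze_pretrained:
            # Feature-extraction mode: weights are not updated during training
            for p in self.pretrained.parameters():
                p.requires_grad = False
        # Newly trained representations, randomly initialised, learned from scratch
        self.learned = nn.Embedding(vocab_size, new_dim)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden): contextual embeddings from the pre-trained model
        contextual = self.pretrained(input_ids).last_hidden_state
        # (batch, seq_len, new_dim): task-specific embeddings
        fresh = self.learned(input_ids)
        # Concatenate both sources along the feature dimension
        return torch.cat([contextual, fresh], dim=-1)
```

Setting `freeze_pretrained=False` switches from pure feature extraction to fine-tuning, which corresponds to the second option described above.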


Figure 3.1: General model architecture for assessing transfer learning on a variety of tasks

To give more flexibility to these two sub-groups, each of them can further present different granularities of the input data. This depends on the specific task and on whether or not the specific type would be helpful. The three granularity types we use, motivated by previous research and following existing knowledge, are (i) word-, (ii) sub-word- and (iii) character-level embedding types (Mikolov et al., 2012; Kudo, 2018; Kudo and Richardson, 2018). They can be used on their own or in combination with each other. Whenever more than one approach is used at the same time, we introduce different processes for merging the different information. For simplicity, we group these processes based on the desired output type, i.e. based on the task itself. Generally, there are tasks where we want to work on the character, word or sub-word level. Here we explain how the merging occurs based on this organisation.

Character-level output works in a similar manner to the sub-word approach. We concatenate sub-word and word representations to each character representation, often repeating one sub-word or word token multiple times, thus enriching the final representation with more information.

Sub-word-level output allows us to use the same approach as previously explained for word embeddings, repeating one word representation over multiple sub-word ones. As for the characters, in cases where character embeddings come encoded from a neural network, we simply use the same character output size as the sub-word embedding size. In all other cases, we take the average of the embeddings of all characters that occur in a specific sub-word token. In the end, we always concatenate the character embeddings to the sub-word embedding representation.

Word-level output requires us to represent one word by multiple sub-word and character embeddings respectively. In such cases, we take the average of the character and the sub-word embeddings of the word respectively and concatenate them to the existing word representation.
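As an illustration, the following sketch (plain PyTorch tensors with illustrative dimensions, not the thesis code) shows the word-level merging described above: the sub-word and character embeddings belonging to a word are averaged, and the averages are concatenated to the word embedding.

```python
# Minimal sketch of word-level merging with hypothetical inputs: one word embedding,
# the sub-word embeddings of that word, and the character embeddings of that word.
import torch


def merge_word_level(word_emb: torch.Tensor,
                     subword_embs: torch.Tensor,
                     char_embs: torch.Tensor) -> torch.Tensor:
    """word_emb: (d_word,), subword_embs: (n_subwords, d_sub), char_embs: (n_chars, d_char)."""
    # Average the multiple sub-word and character embeddings belonging to this word
    subword_avg = subword_embs.mean(dim=0)   # (d_sub,)
    char_avg = char_embs.mean(dim=0)         # (d_char,)
    # Concatenate the averages to the existing word representation
    return torch.cat([word_emb, subword_avg, char_avg], dim=-1)


# Usage with random tensors standing in for real embeddings
merged = merge_word_level(torch.randn(300), torch.randn(3, 768), torch.randn(7, 64))
print(merged.shape)  # torch.Size([1132])
```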

3.1.3 Task-specific model and evaluation

This part is, as the name suggests, highly specific to the task itself. For each task we incorporate a state-of-the-art approach, which allows us to quickly reach promising results and to focus on embeddings and model transfers.

We explain exactly which models are used at this stage for the different challenges, and the motivation behind our choices, in §3.2.

We also present the grouping of the outputs of our system, where we evaluate the different characteristics of the architecture and analyse how helpful the different parts are. The target values and predictions are compared and then analysed for performance changes when using different bits of information, where those bits might come from a pre-trained model or from the addition of another part to the model.

3.2 Problems

We use challenges that the community is currently facing when working with historical corpora. Most current approaches achieve very promising results when applied to modern-day texts, but when turned towards past documents, which are prone to OCR errors and linguistic change, performance drops significantly. We pick three of the most pressing research challenges focused on historical NLP – Named Entity Recognition and Classification (NERC) (§3.2.1), post-OCR correction (§3.2.2) and Lexical Semantic Change (LSC) (§3.2.3). All of these were shared tasks, of which we actively participate in two – NERC and LSC. We do this so that we can better compare our results and models, not just against baseline models and scoring systems, but also against other competitors.

3.2.1 Named entity recognition

Named entity (NE) processing has become an essential component of the Natural Language Processing field since being introduced as a problem some decades ago. Recently, the task has seen major improvements thanks to the inclusion of novel deep learning techniques and the usage of learned representations (embeddings) (Akbik et al., 2018; Lample et al., 2016; Labusch et al., 2019).

3.2.1.1 Motivation

Named entity recognition and classification (NERC) is also of great importance in the digital humanities and cultural heritage. However, applying existing NERC techniques is made challenging by the complexities of historical corpora (Sporleder, 2010; van Hooland et al., 2015). Crucially, transferring NE models from one domain to another is not straightforward (Vilain et al., 2007) and in many cases performance is consequently greatly impacted (van Strien et al., 2020). In an attempt to address some of these problems, this year's CLEF-HIPE-2020 challenge (Ehrmann et al., 2020b) was organised, in which we actively participate. It aims to tackle three of the most pressing problems in NE processing, particularly:

• Strengthening the robustness of existing approaches on non-standard input;

• Enabling performance comparison of NE processing on historical texts; and, in the long run,
• Fostering efficient semantic indexing of historical documents in order to support scholarship on digital cultural heritage collections.

3.2.1.2 Tasks

The organisers provide two tasks, one focused on Named Entity Recognition and Classification (NERC) and one on Named Entity Linking (NEL). For the purpose of our research, we decide to focus only on the first one, as it allows us to analyse the effects of using information from pre-trained networks, whereas the latter belongs to the field of Information Retrieval (IR), which remains beyond the scope of this work.


The task which we evaluate our models on is further split into two sub-tasks, which correspond to the difficulty and the class types used. Sub-task 1 concerns coarse-grained types, more specifically the recognition and classification of entity mentions according to coarse-grained types. Those are the most general representations of the entities, most often a grouping of multiple sub-types – for example, loc represents all location entities and sub-entities. This sub-task includes the literal sense and, when it applies, also the metonymic one. Metonymy is a figure of speech in which a specific thing is referred to by the name of something closely related to it. Sub-task 2 is aimed at fine-grained entity types. Following the previous example, we now have detailed sub-entities such as loc.adm.town, which corresponds to an administrative town, or loc.adm.nat, which corresponds to an administrative nationality, etc. In the second sub-task, recognition and classification of fine-grained types must be performed, again including the literal and, when it applies, the metonymic sense. Additionally, detection and classification of nested entities of depth one and of entity mention components (title, function, etc.) is required. Table 3.1 shows a comparison of the two sub-tasks and the difference in expected predictions.

                                    Sub-task 1   Sub-task 2
NE mentions with coarse types       yes          yes
NE mentions with fine types         no           yes
Consideration of metonymic sense    yes          yes
NE components                       no           yes
Nested entities of depth one        no           yes

Table 3.1: NERC sub-task comparison
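For illustration only (the example sentence and labels below are invented and not taken from the shared task data), the difference between the two sub-tasks can be pictured as predicting coarse versus fine IOB-style labels for the same tokens, reusing the loc and loc.adm.town tags mentioned above:

```python
# Illustrative contrast between coarse- and fine-grained labels for one token sequence.
tokens        = ["Born", "in", "Geneva", "in", "1848", "."]
coarse_labels = ["O",    "O",  "B-loc",          "O", "O", "O"]  # sub-task 1
fine_labels   = ["O",    "O",  "B-loc.adm.town", "O", "O", "O"]  # sub-task 2

for tok, coarse, fine in zip(tokens, coarse_labels, fine_labels):
    print(f"{tok:10s} {coarse:10s} {fine}")
```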

3.2.1.3 Data

The data consists of Swiss, Luxembourgian and American historical newspapers, written in French, German and English respectively, and organised in the context of the IMPRESSO project (Ehrmann et al., 2020a).

Figure 3.2: Amount of tokens per decade and language

For each newspaper, articles were randomly sampled among articles that (i) belong to the first years of a set of predefined decades covering the life-span of the newspaper, and (ii) have a title, have more than 50 characters, and belong to any page (no restriction to front pages only). For each decade, the set of selected articles is additionally manually triaged in order to keep journalistic content only. The time span of the whole dataset goes from the 1790s until the 2010s, and the OCR quality corresponds to a real-life setting, i.e. it varies according to digitisation time and archival material. Information about the amount of tokens per decade and per language can be seen in Figure 3.2. We have in total 569 articles and 1 894 741 characters, with a vocabulary of 151 unique characters. Additionally, the amount of mentions, i.e. entity occurrences per decade, is visible in Figure 3.3, broken down per coarse-grained type.
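Such corpus statistics are straightforward to recompute. The sketch below assumes a hypothetical layout of one plain-text file per article named <year>_<id>.txt in a folder called hipe_articles (this is not the official shared task format) and aggregates article, character and token counts per decade.

```python
# Rough sketch for computing corpus statistics under the hypothetical layout above.
from collections import Counter
from pathlib import Path

tokens_per_decade = Counter()
char_vocab = set()
n_articles = 0
n_characters = 0

for path in Path("hipe_articles").glob("*.txt"):
    year = int(path.stem.split("_")[0])          # file name starts with the year
    text = path.read_text(encoding="utf-8")
    n_articles += 1
    n_characters += len(text)
    char_vocab.update(text)                      # character-level vocabulary
    tokens_per_decade[(year // 10) * 10] += len(text.split())

print(n_articles, n_characters, len(char_vocab))
print(sorted(tokens_per_decade.items()))
```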
