
MSc Artificial Intelligence

Master Thesis

Multi- And Cross-Lingual Document

Classification:

A Meta-Learning Approach

by

Niels van der Heijden

12397768

December 4, 2020

48 EC, February 2020 - November 2020
Supervisor: Dr. E. Shutova
Assessor: Dr. W. Aziz


Acknowledgements

First of all, I would like to express my deepest gratitude towards my direct supervisors, Ekaterina Shutova, Helen Yannakoudakis, and Pushkar Mishra, for the intriguing discussions, continuous support, and trust in my work.

This work has been supported by Deloitte Risk Advisory B.V. in the Netherlands. I would like to thank my superiors for allowing me to do what I love: exploring and expanding the limits of the theory of Artificial Intelligence to create value in general and business settings. This combination of tangible, impactful results and the intellectual challenge inherent to research sparked my desire to follow this path in the first place and continues to fuel it. I would also like to thank my direct colleagues, family, friends, and everyone around me for their continuous interest and support during the creation of this thesis. Your presence, albeit digital, made the highs of this journey higher and the lows more bearable.

It’s impossible.

It’s impossible until some crazy person has the audacity to believe that no matter what the experts or the doctors say, ’it is possible’.


Abstract

The great majority of languages in the world are considered under-resourced for the successful application of deep learning methods. In this work, we propose a meta-learning approach to document classification in low-resource languages and demonstrate its effectiveness in two different settings: few-shot, cross-lingual adaptation to previously unseen languages; and multilingual joint training when limited target-language data is available during training. We conduct a systematic comparison of several meta-learning methods, investigate multiple settings in terms of data availability and show that meta-learning thrives in settings with a heterogeneous task distribution. We propose a simple, yet effective adjustment to existing meta-learning methods which allows for better and more stable learning, and set a new state-of-the-art on several languages while performing on-par with the state-of-the-art for the remaining languages, using only a small amount of labeled data.

Keywords – Meta-learning, document classification, few-shot learning, cross-lingual, multilingual


Contents

1 Introduction
  1.1 Multilingual Natural Language Processing
  1.2 Meta-learning
  1.3 Research question and contributions
  1.4 Reading guide

2 Background
  2.1 Multilingual NLP
    2.1.1 Multilingual word embeddings
    2.1.2 Multilingual language models / General-purpose multilingual representations
      2.1.2.1 BERT
      2.1.2.2 LASER
      2.1.2.3 XLM
      2.1.2.4 XLM-R
    2.1.3 Multilingual Multi-Task Learning
    2.1.4 Current applications
      2.1.4.1 Machine translation
      2.1.4.2 Token classification
      2.1.4.3 Document classification
      2.1.4.4 Natural language understanding
  2.2 Meta-learning
    2.2.1 Metric-based
      2.2.1.1 Convolutional Siamese Networks
      2.2.1.2 Prototypical Networks
    2.2.2 Model-based
      2.2.2.1 Memory-Augmented Neural Networks
      2.2.2.2 Simple Neural Attentive Meta-Learners
    2.2.3 Optimization-based
      2.2.3.1 LSTM Meta-Learner
      2.2.3.2 Model-Agnostic Meta-Learning
  2.3 Meta-learning in (multilingual) NLP
    2.3.1 Meta-Learning for Low-Resource Neural Machine Translation
    2.3.2 Learning to Few-Shot Learn Across Diverse Natural Language Classification Tasks
    2.3.3 Zero-Shot Cross-Lingual Transfer with Meta Learning

3 Data
  3.1 MLDoc
  3.2 Amazon Sentiment Polarity

4 Methodology
  4.1 Episodic Meta-learning procedure
    4.1.1 Meta-training
    4.1.2 Base-learner
  4.2 Meta-update methods
    4.2.1 MAML
      4.2.1.1 Per-step Layer Normalization weights
      4.2.1.2 Per-layer per-step learnable inner-loop learning rate
      4.2.1.3 Cosine annealing of outer-loop learning rate
    4.2.2 Reptile
    4.2.3 Prototypical Networks
    4.2.4 ProtoMAML
  4.3 Baselines
    4.3.1 Zero-shot baseline
    4.3.2 Non-episodic baseline

5 Analysis
  5.1 Experimental setup
    5.1.1 Cross-lingual adaptation
    5.1.2 Multilingual joint training
    5.1.3 Specifics per dataset
      5.1.3.1 Amazon Sentiment Polarity
      5.1.3.2 MLDoc
  5.2 Training setup and hyper-parameters
  5.3 Results and Analysis
    5.3.1 Cross-lingual adaptation
    5.3.2 Joint training
    5.3.3 Qualitative comparison of meta-learning methods
  5.4 Ablations
    5.4.0.1 foProtoMAMLn
    5.4.0.2 Initializing from a monolingual classifier

6 Conclusion
  6.1 Future work

List of Figures

2.1 The MLM and TLM for BERT and XLM, respectively
2.2 Encoder-decoder setup for training LASER
2.3 Two examples of MTL setups
2.4 Convolutional Siamese networks for 1-shot classification
2.5 Training strategy of MANN
2.6 Architecture of SNAIL
2.7 MAML algorithm and visualization
2.8 An intuitive illustration in which solid lines represent the learning of the initialization and dashed lines show the path of fine-tuning. Figure taken from Gu et al. (2018).
5.1 Validation accuracy for 3 seeds for the original foProtoMAML and our new method, foProtoMAMLn.

List of Tables

5.1 Search range per hyper-parameter. We consider the number of update steps in the inner loop (Num inner-loop steps), the (initial) learning rate of the inner loop (Inner-loop lr), the factor by which the learning rate of the classification head is multiplied (Class-head lr multiplier), and, if applicable, the learning rate with which the inner-loop optimizer is updated (Inner-optimizer lr). The chosen value is underlined.
5.2 Accuracy on unseen target languages in the half-split setting. Reported results are the average of 5 different seeds. Non-episodic corresponds to the standard supervised learning baseline. ∆ corresponds to the average accuracy across test languages.
5.3 Accuracy on target languages in the leave-one-out split for MLDoc. We report the average accuracy of 5 different seeds on the test sets of the target languages. Non-episodic corresponds to the standard supervised learning baseline. ∆ corresponds to the average accuracy. Diff corresponds to the average difference compared to the low-resource half-split setting.
5.4 Accuracy on target languages in the leave-one-out split for Amazon. We report the accuracy of 5 different seeds on the unseen target languages. Data indicates the low- and high-resource setting. ∆ corresponds to the average accuracy across test languages.
5.5 Accuracy on target languages in the joint training setting for Amazon and MLDoc. We report the average accuracy over 5 different seeds. ∆ corresponds to the average accuracy across test languages.
5.6 Average accuracy of 5 different seeds on unseen target languages using the original/unnormalized foProtoMAML model. ∆ corresponds to the average accuracy across test languages. Diff: difference in average accuracy ∆ compared to using foProtoMAMLn (Tables 5.4 and 5.2).
5.7 Accuracy on target languages in the leave-one-out split for Amazon when initializing from a monolingual classifier in lsrc. We report the average accuracy of 5 different seeds on unseen target languages. Data indicates the low- and high-resource setting. ∆ corresponds to the average accuracy across test languages. Diff: difference in average accuracy ∆ compared to initializing from the XLM-RoBERTa language model (Table 5.4).

1 Introduction

1.1 Multilingual Natural Language Processing

There are more than 5000 languages around the world and, of them, around 6% account for 94% of the population. Even among the 6% most spoken languages, very few possess adequate resources for Natural Language Processing (NLP) and, when they do, resources in different domains are highly imbalanced. Additionally, human language is dynamic in nature: new words and domains emerge continuously and hence no model learned at a particular point in time will remain valid forever.

With the aim of extending the global reach of NLP technology, much recent research has focused on the development of multilingual models and methods to efficiently transfer knowledge across languages. Among these advances are multilingual word vectors, which aim to give word-translation pairs a similar encoding in some embedding space (Mikolov et al., 2013b; Lample et al., 2017). There has also been a lot of work on multilingual sentence and word encoders that either explicitly utilize corpora of bitexts (Lample and Conneau, 2019) or jointly train language models for many languages in one encoder (Devlin et al., 2018; Artetxe and Schwenk, 2019; Conneau et al., 2019). These encoders are applied to a task by extending them with a task-specific head, such as a fully connected linear layer followed by a softmax classifier for classification tasks. Most commonly, in order to perform cross-lingual classification tasks, these models (the language encoders extended with a task-specific head) are trained using data in one language, typically English, and subsequently applied to the target language (Devlin et al., 2018; Artetxe and Schwenk, 2019; Conneau et al., 2019). The implicit assumption this strategy makes is that training a model in one language does not degrade performance in others, which is not necessarily true.

The scope of this thesis is limited to the classification of documents in multiple languages. For a cross-lingual document classification method to obtain perfect performance, it has to fulfil two requirements. First, the knowledge obtained in the language that is trained on, the source language, must be perfectly transferable to the target language. A perfect machine translation system would, for instance, satisfy this requirement. Yet, language is only a tool for people to describe concepts. The way a concept is described might differ vastly based on the time of writing, culture, and so on. For instance, a news article on government policies in a Canadian journal might have very different contents from one in a Russian journal, or a Chinese customer writing a product review might express their opinion differently than their Dutch counterpart. Following the nomenclature of Lai et al. (2019), we call this phenomenon, where differences in the expression of concepts are not attributable to the language itself, domain drift. The second requirement is that the method needs to be able to account for domain drift.

Although great progress has been made in cross-lingual transfer learning, these methods either do not close the gap with performance in a single high-resource language (Artetxe and Schwenk, 2019; van der Heijden et al., 2019; Eisenschlos et al., 2019), or are impractically expensive due to, among other things, the need to finetune a large language model per language/task combination (Lai et al., 2019) – more on this in Section 2.

1.2 Meta-learning

Meta-learning, or learning to learn (Schmidhuber, 1987; Bengio et al., 1990; Thrun and Pratt, 1998), is a learning paradigm that focuses on quick adaptation of a learner to new tasks. The idea is that by training a learner to adapt quickly and from a few examples on a diverse set of training tasks, the learner can also generalize to unseen tasks at test time. Meta-learning has recently emerged as a promising technique for few-shot learning on a wide array of tasks, such as image classification (Triantafillou et al., 2020), image segmentation (Hendryx et al., 2019), image synthesis (Fontanini et al., 2019), tracking (Wang et al., 2020), and reinforcement learning (Wang et al., 2016; Duan et al., 2016; Alet et al., 2020). Meta-learning has also found promising applications within NLP: Dou et al. (2019) investigate meta-learning for few-shot learning across a diverse set of Natural Language Understanding tasks, such as paraphrase detection and natural language inference. In addition, promising results have been obtained for machine translation (Gu et al., 2018), relation extraction (Obamuyide and Vlachos, 2019a) and text classification (Yu et al., 2018). To the best of our knowledge, no previous work has investigated meta-learning as a framework for multi- and cross-lingual document classification. The only current study on meta-learning for cross-lingual few-shot learning is the one by Nooralahzadeh et al. (2020), focusing on natural language inference and multilingual question answering. In their work, the authors apply meta-learning to learn to adapt a monolingually trained classifier to new languages. In contrast, we show that, in many cases, it is more favourable not to initialize the meta-learning process from a monolingually trained classifier, but rather to reserve its respective training data for meta-learning instead.

1.3 Research question and contributions

In this work, we focus on document classification in multiple languages. We are especially interested in learning to perform document classification in a previously unseen language quickly and from few samples, given data in a small set of other languages. The main research question hence is:

Can we apply meta-learning to improve few-shot performance on cross-lingual document classification tasks?

We hypothesize that by applying meta-learning it is possible to obtain a model that can generalize well to previously unseen languages in a few-shot setting since it can be applied using any (arbitrarily good) language model while having a natural fit for handling domain drift.

Secondly, we are interested in a low-resource scenario where little data is present for multiple languages.

Can we use meta-learning to improve performance over traditional supervised learning in a low-resource, multilingual joint-learning setting?

We hypothesize that by explicitly modelling multiple languages as different tasks using meta-learning, a model can better learn what knowledge to share across tasks and how to 'specialize' in one of these respective tasks, ultimately yielding better performance.

In summary, the contributions of our work are as follows:

• We propose a meta-learning approach to few-shot cross-lingual and multilingual adaptation and demonstrate its effectiveness on document classification tasks over traditional supervised learning;

• We analyse the effectiveness of meta-learning under several different parameter initializations and multiple settings in terms of data availability, and show that meta-learning can effectively learn from few examples and diverse data distributions;

• We provide an empirical argument for focusing on the diversity of data rather than sheer volume in the data collection process when a new task must be learned across many languages;

• We introduce a simple yet effective modification to existing meta-learning methods and empirically show that it stabilizes training and converges faster to better local optima;

• We advance the state-of-the-art on several languages and perform on par with it for the remaining languages using only a small amount of data.

1.4 Reading guide

First, we will present a literature review of related work in Section 2. In Section 3, the datasets used in our experiments are introduced. Then, an overview of all meta-learning methods and baselines is given in Section 4. Here, the general framework of meta-learning, its taxonomy, and the relevant algorithms will be formally introduced. After this, all experiments and results will be covered in Section 5, accompanied by an elaborate analysis on the effectiveness of meta-learning in multi- and cross-lingual document classification followed by a qualitative comparison of the used meta-learning methods. The section concludes with an ablation study on the effectiveness of our proposed algorithm and the effect of different parameter initializations on the used meta-learning methods. Finally, we refer back to our research questions and draw general conclusions in Section 6.

2 Background

In this section, we start by reviewing the main drivers of progress in the field of multilingual NLP and give an overview of its applications. Subsequently, we introduce the concept of meta-learning and provide examples of promising applications within the field of NLP.

2.1 Multilingual NLP

2.1.1 Multilingual word embeddings

After the successful application of word embeddings (Mikolov et al., 2013c; Pennington et al., 2014) in monolingual NLP tasks, a natural extension is to align these word embeddings across languages. Doing so enables us to compare meanings of words across languages, which is key for machine translation and cross-lingual information retrieval, amongst other tasks. It also enables model transfer across languages by providing a shared representation space. Cross-lingual word embedding models can be broadly categorized along the following two dimensions (Ruder et al., 2019):

• Type of alignment: the level of the supervision signal: word, sentence or document.

• Comparability: the degree of alignment required: some methods require exact translations whereas others only require data that is semantically comparable.

The first work on cross-lingual word embeddings uses parallel word-aligned data in two or more languages (Mikolov et al., 2013b; Faruqui et al., 2014). Early work (Mikolov et al., 2013b; Faruqui and Dyer, 2014) requires a predefined bilingual dictionary, whereas more recent work has shifted attention to obtaining the bilingual dictionary using either a small seed dictionary or a fully unsupervised setting (Lample et al., 2017; Artetxe et al., 2018).

Sentence alignment-based methods use corpora similar to those used in machine translation. Some only use the parallel sentences for training (Hermann and Blunsom, 2014; Lauly et al., 2014) whereas others also use word-level alignment information wherever possible (Zou et al., 2013; Shi et al., 2015). Most commonly, these methods either straightforwardly extend monolingual methods for word embedding learning, such as bilingual skip-gram models (Gouws et al., 2015) or bilingual autoencoder models (Lauly et al., 2014), or use compositional sentence models where word representations are used to construct sentence representations and the model is trained to create representations of parallel sentences close to each other (Hermann and Blunsom, 2014).

2.1.2 Multilingual language models / General-purpose multilingual representations

In the monolingual setting, the state-of-the-art on many tasks has moved away from static word embeddings to contextualized word embeddings and pretrained language models. These pretrained language models can be successfully applied to many NLP tasks with little to no architectural adjustment. For instance, the pioneer in the field of deep contextualized word embeddings, ELMo (Peters et al., 2018), beat the state-of-the-art on a wide array of tasks such as Question Answering, Named Entity Recognition and Semantic Role Labeling by a large margin, ranging from 5.8% to 21% relative improvement. Similarly, the BERT (Devlin et al., 2018) model improved the state-of-the-art on eleven NLP tasks, including the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), on which it improved by 7.7 percentage points. BERT, just like many developments afterwards (Yang et al., 2019; Sun et al., 2019; Radford et al., 2018), is based on the Transformer architecture (Vaswani et al., 2017).

In this section, an overview of some key architectures and pretraining methods for multilingual language models is given, followed by an example of their impact on cross-lingual transfer learning.

2.1.2.1 BERT

BERT is a language representation model based on the Transformer architecture (Vaswani et al., 2017). The authors introduce a new pretraining method they call the Masked Language Model (MLM), see Figure 2.1, which essentially masks a certain percentage of the tokens (15%) in a sequence and asks the model to predict the original token given its context. This objective is complemented with the next sentence prediction task which is framed as a binary classification task determining whether two sentences are consecutive sentences in their original corpus.
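To make the objective concrete, the following is a minimal sketch (not the authors' implementation) of the masking step: roughly 15% of positions are selected, replaced by a mask token, and only those positions contribute to the prediction loss. The 80/10/10 replacement heuristic of the original paper is omitted, and mask_token_id is a placeholder for whichever tokenizer is used.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mlm_prob=0.15):
    """Minimal MLM-masking sketch: select ~15% of positions, replace them with the mask
    token, and predict the original ids only at those positions."""
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, mlm_prob)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100           # positions ignored by the cross-entropy loss
    masked_inputs = input_ids.clone()
    masked_inputs[masked_indices] = mask_token_id
    return masked_inputs, labels
```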


Figure 2.1: The MLM and TLM for BERT and XLM respectively

English BERT uses WordPiece embeddings (Johnson et al., 2017) with a vocabulary size of 30k tokens and is trained on a concatenation of a Wikipedia dump and the BookCorpus (Zhu et al., 2015), totalling approximately 3,300M words. Its multilingual counterpart, mBERT, has a vocabulary size of 110k tokens and is trained on the entire Wikipedia dump of the 100 languages with the largest Wikipedias. In order to counter the imbalance of dump sizes across languages, exponential smoothing is applied to the sampling of languages.
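As an illustration of exponentially smoothed language sampling, the sketch below raises the relative corpus sizes to a power alpha < 1 before renormalizing, which up-samples low-resource languages; alpha = 0.7 is an illustrative value rather than the exact exponent used for mBERT.

```python
import numpy as np

def smoothed_sampling_probs(corpus_sizes, alpha=0.7):
    """Exponentially smoothed sampling: p_i is proportional to q_i**alpha, where q_i is the
    relative corpus size of language i. alpha < 1 up-samples low-resource languages."""
    sizes = np.asarray(corpus_sizes, dtype=np.float64)
    q = sizes / sizes.sum()
    p = q ** alpha
    return p / p.sum()

# Example: a 100x larger corpus is sampled only ~25x more often with alpha = 0.7.
print(smoothed_sampling_probs([1_000_000, 10_000]))
```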

Although mBERT is not explicitly trained to align representations across languages, its multilingual capabilities are surprisingly good (Pires et al., 2019; van der Heijden et al., 2019).

2.1.2.2 LASER

Whereas BERT is pretrained solely on monolingual data, Artetxe and Schwenk (2019) try to explicitly align sentence representations across languages with Language-Agnostic SEntence Representations (LASER). Their model consists of a 5-layer LSTM (Hochreiter and Schmidhuber, 1997) and uses a subword vocabulary of 50k tokens based on byte-pair encoding (BPE) (Sennrich et al., 2016) learned on their training corpus.

The authors use a dataset of translated sentences obtained by combining the Europarl, United Nations, Global Voices, OpenSubtitles2018, Tatoeba and Tanzil corpora (Tiedemann, 2012). All sentence pairs are translations from some language to either English or Spanish, and the encoder is trained in an encoder-decoder setup on the task of Machine Translation (MT), see Figure 2.2.

Figure 2.2: Encoder-decoder setup for training LASER

Since a shared encoder is used for all languages and the decoder receives no information about the source language (only about the language to decode into), the encoder is forced to create language-agnostic sentence representations.

Although it has been empirically shown that the performance drop in zero-shot cross-lingual transfer is relatively small (Artetxe and Schwenk, 2019; van der Heijden et al., 2019), its monolingual capabilities are not on par with e.g. BERT.

2.1.2.3 XLM

In order to further improve the multilingual capabilities of BERT, Cross-lingual Language Model (XLM) pretraining (Lample and Conneau, 2019) was introduced. In addition to the monolingual MLM objective, the authors introduce the Translation Language Model (TLM) objective (see the lower part of Figure 2.1) to exploit multilingual parallel data and better align language representations. The TLM is an extension of the MLM where a sentence and its translated counterpart are considered at the same time and, again, random words are masked. Since both sentences are masked independently of each other and the model can attend to both at the same time, this method encourages sharing information across languages and hence aligning representations.


2.1.2.4 XLM-R

In their work, Liu et al. (2019c) perform a replication study of BERT pretraining and quantify the impact of the key hyper-parameters and training data size. The authors find that BERT was significantly undertrained and propose a modified pretraining strategy, referred to as RoBERTa, which matches or exceeds the performance of all post-BERT methods. Their main proposed modifications are 1) training the model longer, on more data, with larger batches; 2) removing the Next Sentence Prediction objective; 3) dynamically changing the positions of the mask tokens for the MLM objective; and 4) training on longer sequences.

Inspired by the monolingual improvements of RoBERTa, Conneau et al. (2019) perform a similar analysis on the trade-offs and limitations of multilingual language models at scale. Most notably, the authors use over 2.5TB of cleaned textual data in 100 languages, a couple of orders of magnitude more than is used for mBERT, for the pretraining. The resulting model, named XLM-RoBERTa or XLM-R, outperforms mBERT on cross-lingual classification by up to 23% accuracy on low-resource languages. It outperforms the previous state-of-the-art by 9.1% average F1-score on MLQA (Lewis et al., 2019), a cross-lingual question answering dataset, and 5.1% average accuracy on XNLI (Conneau et al., 2018), the go-to benchmark for cross-lingual natural language inference. XLM-R performs on par with RoBERTa on monolingual tasks, showing for the first time that it is possible to have a single large model for many languages without sacrificing per-language performance.

2.1.3 Multilingual Multi-Task Learning

Multi-Task Learning (MTL) is a learning paradigm aiming to combine knowledge from multiple related tasks that are learned simultaneously to improve performance on all of them. This is done by sharing (part of) a neural network architecture while training on multiple related tasks, which are alternated using some sampling strategy, e.g. uniform sampling.

More formally, adopting the definition of Li et al. (2019), there is a set of $T$ tasks $\{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_T\}$ with corresponding data $D_{\mathcal{T}_k}$ for $k \in \{1, 2, \ldots, T\}$. Let $\Theta$ be the total set of parameters used in the MTL model; the loss for task $\mathcal{T}_k$ is then denoted as $\mathcal{L}_{\mathcal{T}_k}(D_{\mathcal{T}_k}, \Theta)$. The aim of MTL is to find $\Theta^*$ such that

$$\Theta^* = \arg\min_{\Theta} \sum_{k=1}^{T} \lambda_{\mathcal{T}_k} \mathcal{L}_{\mathcal{T}_k}(D_{\mathcal{T}_k}, \Theta), \qquad (2.1)$$

where $\lambda_{\mathcal{T}_k}$ is the weight of task $\mathcal{T}_k$.
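A minimal sketch of a single training step implementing Eq. 2.1, with a shared encoder, task-specific heads and per-task weights; all module and variable names are illustrative.

```python
import torch

def mtl_step(shared_encoder, task_heads, task_batches, task_weights, optimizer):
    """One multi-task learning step for Eq. 2.1: sum the task losses, weighted by lambda_k,
    over a shared encoder with task-specific heads."""
    optimizer.zero_grad()
    total_loss = 0.0
    for name, (inputs, targets) in task_batches.items():
        features = shared_encoder(inputs)            # shared parameters
        logits = task_heads[name](features)          # task-specific parameters
        loss = torch.nn.functional.cross_entropy(logits, targets)
        total_loss = total_loss + task_weights[name] * loss
    total_loss.backward()
    optimizer.step()
    return float(total_loss)
```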

Collobert and Weston (2008) train a model to perform Part-of-Speech tagging, Named Entity Recognition, Chunking, Semantic Role Labeling, Language Modeling and Synonym Detection, and steadily improve performance on Semantic Role Labeling, a relatively complex task, as more auxiliary tasks are added to the training.

More recently, MTL has been shown to improve performance on many NLP tasks, see for instance Collobert and Weston (2008) and Liu et al. (2019b). Figure 2.3 depicts the setups of these approaches.

Figure 2.3: Two examples of MTL setups

Despite the seemingly successful application of MTL in a monolingual setup, little work has been done on MTL for cross-lingual transfer learning, depending on the definition of MTL. One might qualify the pretraining setup for BERT, where the MLM objective is optimized for multiple languages simultaneously and the next sentence prediction task is used, as an MTL setup in which each MLM/language combination qualifies as a task.

A more traditional application of MTL to the cross-lingual transfer learning domain is seen in Singla et al. (2018). The authors combine the multilingual Skip-gram model (Luong et al., 2015) with a cross-lingual sentence similarity task to learn word and sentence embeddings in a bilingual setting.

2.1.4 Current applications

In this section, we give an overview of the recent developments of some of the main applications of multilingual NLP.

2.1.4.1 Machine translation

The history of Machine Translation (MT) is both long and rich, with early work dating back over half a century (Slocum, 1985). Here we give a short overview of the progress in deep-learning-aided MT, often referred to as Neural Machine Translation (NMT). From a probabilistic perspective, MT can be formulated as the task of generating the target sequence $T = (t_1, t_2, \ldots, t_m)$ of length $m$ that maximizes the conditional probability given the source sequence $S = (s_1, s_2, \ldots, s_n)$ of length $n$:

$$T^* = \arg\max_T P(T \mid S) \qquad (2.2)$$

The first NMT methods typically use an encoder-decoder architecture as in Figure 2.2, in which the encoder network encodes the source sequence into a fixed, low-dimensional feature space from which the decoder network attempts to generate the target sequence. The encoder and decoder initially consisted mainly of Recurrent Neural Networks (RNNs) (Auli et al., 2013; Auli and Gao, 2014) and Convolutional Neural Networks (CNNs) (Kalchbrenner and Blunsom, 2013; Kaiser and Bengio, 2016). Although great progress has been made since the application of deep learning to MT, such systems still suffer a large performance reduction when the source sentence becomes too long. This is due to the fact that the encoder has to compress an arbitrarily long sequence into a fixed-length vector, resulting in information loss. Attention mechanisms emerged as a remedy to this problem. Initially, attention was used as a supplemental technique that can provide extra word alignment information, but this changed with the work of Vaswani et al. (2017). In their work, the authors show that their new model architecture, the Transformer, which is solely based on attention mechanisms, outperforms all previous approaches in terms of translation quality and computational efficiency.


Apart from obtaining high-quality translation systems between language pairs using parallel corpora, more recent research has also focused on unsupervised machine translation using only monolingual corpora (Lample et al., 2017; Artetxe et al., 2019) and creating MT systems which directly translate from many to many languages, instead of relying on English as an intermediate step (Fan et al., 2020).

2.1.4.2 Token classification

Token classification is the task of classifying the constituents of a sequence, e.g. words in a sentence. The most well-known token classification tasks are Named Entity Recognition (NER), the task of finding direct mentions of a set of predefined entities such as persons, locations and organizations, and Part-of-Speech (PoS) tagging, the task of classifying tokens according to their grammatical function.

Early work on cross-lingual transfer learning exploited external knowledge bases as a means of feature engineering or used parallel corpora to create cross-lingual word clusters (Täckström et al., 2012). Subsequently, many approaches shifted from using multilingual word embeddings (Xie et al., 2018; Chen et al., 2018; Artetxe et al., 2018) to strong multilingual language models, such as XLM-R (Conneau et al., 2019).

It has also been shown that token classification methods benefit from multilingual joint learning when code-switching is present (Adel et al., 2013), when either the target or all languages are resource-lean (Khapra et al., 2011) and even in high-resource settings (Mulcaire et al., 2018).

2.1.4.3 Document classification

Early work on cross-lingual document classification relies on aligned word embeddings which are aggregated, e.g. by simple averaging (Mikolov et al., 2013a,b), also known as the Continuous Bag-Of-Words (CBOW) method, or by using them as the input of a neural network which transforms the arbitrarily long sequence into a fixed-size vector (Kim, 2014; Zhou et al., 2016). Later, attention shifted to exploiting corpora of aligned sentences to train a shared encoder that creates multilingual sentence representations (Schwenk and Douze, 2017; Artetxe and Schwenk, 2019). More recent approaches (Eisenschlos et al., 2019; Lai et al., 2019) use a form of self-training where first a teacher model is trained by finetuning a multilingual language model, e.g. LASER or XLM-R, in English and using it to generate pseudo-labels in the target language. Then, a student model is first finetuned on unlabeled (in-domain) target-language data using a language modeling objective such as BERT's MLM and finally trained for the task at hand using the generated pseudo-labels. Most state-of-the-art results on cross-lingual document classification tasks are held by Lai et al. (2019). In their method, the teacher model is trained using the semi-supervised learning technique called Unsupervised Data Augmentation (UDA) (Xie et al., 2019). Let $p_\theta$ be a classifier parameterized by $\theta$; their objective function can be written as

$$\min_{\theta} \; \mathbb{E}_{(x,y) \in L_{src}}[-\log(p_\theta(y|x))] + \lambda \, \mathbb{E}_{x \in U_{tgt}}[D_{KL}(p_\theta(y|\hat{x}) \,\|\, p_\theta(y|x))] \qquad (2.3)$$

where $L_{src}$ is the labeled data in the source language and $U_{tgt}$ is the unlabeled in-domain data in the target language. In the second term, $\hat{x}$ is an augmented sample generated by a predefined augmentation function $q$, such that $\hat{x} = q(x)$. In the case of UDA, the augmented samples are obtained by applying back-translation (Sennrich et al., 2015). The objective can be interpreted as maximizing the likelihood of the classifier in the source language while simultaneously enforcing consistent predictions in the target language. Although the method has achieved promising results, an obvious downside is its computational complexity. For every language and domain combination 1) a machine translation system has to be run on a large number of unlabeled samples; 2) the UDA method needs to be applied to obtain a teacher model to generate pseudo-labels on the unlabeled in-domain data; 3) a language model must be finetuned, which involves forward and backward computation of a softmax function over a large output space (e.g., 110K tokens for mBERT and 250K tokens for XLM-R). The final classifier is then obtained by 4) training the finetuned language model on the pseudo-labels generated by the teacher.
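The sketch below spells out the two terms of Eq. 2.3 for a single batch; back_translate stands in for the augmentation function $q$, and the sharpening and stop-gradient tricks used by UDA in practice are omitted.

```python
import torch
import torch.nn.functional as F

def uda_loss(model, labeled_src, unlabeled_tgt, back_translate, lam=1.0):
    """Sketch of Eq. 2.3: supervised loss on labeled source-language data plus a KL
    consistency term between predictions on an unlabeled target-language sample x and its
    back-translated augmentation x_hat = q(x)."""
    x_src, y_src = labeled_src
    sup_loss = F.cross_entropy(model(x_src), y_src)

    x = unlabeled_tgt
    x_hat = back_translate(x)                          # augmentation q(x)
    p_hat = F.softmax(model(x_hat), dim=-1)
    log_p = F.log_softmax(model(x), dim=-1)
    # F.kl_div(log_p, p_hat) computes KL(p(y|x_hat) || p(y|x)), as written in Eq. 2.3.
    consistency = F.kl_div(log_p, p_hat, reduction="batchmean")
    return sup_loss + lam * consistency
```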

2.1.4.4 Natural language understanding

The course of development of multilingual representations described in Sections 2.1.1 and 2.1.2 is embodied by the progress on the Cross-lingual Natural Language Inference (XNLI) (Conneau et al., 2018) dataset, a widely used benchmark for multilingual transfer learning. The first results were obtained by using aligned word embeddings (Lample et al., 2017) and aggregating these embeddings into a sentence embedding using an LSTM encoder followed by max-pooling.

Not too long after the release of the original monolingual version of BERT, its multilingual counterpart, mBERT, was released. Without any explicit alignment of languages (only a shared subword vocabulary), mBERT is pretrained on more than 100 languages; it outperforms the cross-lingual word embedding based methods in both the translate-train and translate-test settings and performs on par with them in the zero-shot setting. Then, Artetxe and Schwenk (2019) improved the state-of-the-art even further with LASER. When applying their model to a task, the LSTM encoder is kept static to preserve multilingual capabilities and only a task-specific classification head is trained on top of it. Using this method the authors improve performance for 13 out of 14 languages in the zero-shot setting and set the all-round state-of-the-art for 10 out of 14 languages. Although the model generalizes well across languages, its monolingual performance lags behind the Transformer-based BERT by 7.5 percentage points accuracy.

The next intuitive step is to improve the alignment of the language representations of a Transformer-based model in order to combine high performance in the source language with good generalization across languages. This is exactly what XLM (Lample and Conneau, 2019) tries to achieve. By generalizing the Masked Language Model (Devlin et al., 2018) to the multilingual domain, it sets a new state-of-the-art in zero-shot transfer on all languages in the XNLI dataset, surpassing the performance of LASER by 4.9 percentage points on average. Finally, Conneau et al. (2019) apply the improved BERT pretraining approach of RoBERTa (Liu et al., 2019c) to multilingual language modelling. The combination of improved pretraining techniques and increasing the amount of data used by orders of magnitude results in the current state-of-the-art. Their method, XLM-R, outperforms XLM by 5.1% accuracy on average.

2.2 Meta-learning

Meta-learning, or learning to learn (Bengio et al., 1990; Thrun and Pratt, 1998), aims to create models that can learn new skills or adapt to new environments rapidly from a few training examples. Traditional machine learning is known for requiring large amounts of data to perform well at a task, whereas humans can learn new concepts and skills from a few examples by starting from skills learned in the past and reusing approaches that worked well (Lake and Baroni, 2017). Meta-learning aims to bridge the gap between human and machine performance in learning new skills and concepts quickly from a few examples. In this work, we focus on the application of meta-learning to few-shot classification, which belongs to the field of supervised learning.

Unlike traditional machine learning, datasets for either training or testing, referred to as meta-train and meta-test datasets, comprise many tasks sampled from a distribution of tasks $p(\mathcal{D})$ rather than individual data points. Each task is associated with a dataset $\mathcal{D}$ which contains both feature vectors and ground-truth labels and is split into a support set and a query set, $\mathcal{D} = \{S, Q\}$. The support set is used for fast adaptation and the query set is used to evaluate performance and compute a loss with respect to the model parameter initialization. Generally, some model $f_\theta$ parameterized by $\theta$, often referred to as the base-learner, is considered. A cycle of fast adaptation on a support set followed by updating the parameter initialization of the base-learner based on the loss on the query set is called an episode.

In few-shot classification, the goal is to minimize the prediction error on data samples with unknown labels given a small support set for learning. Each dataset $\mathcal{D}$ contains pairs of feature vectors and labels, $\mathcal{D} = \{(x_i, y_i)\}$, and all labels belong to a known label set $\mathcal{L}$. The base-learner $f_\theta$ outputs the probability of a data point belonging to class $y$ given feature vector $x$, $P_\theta(y|x)$. The optimal parameters then maximize the probability of the true labels across multiple batches $Q \subset \mathcal{D}$:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{Q \subset \mathcal{D}}\Big[\sum_{(x,y) \in Q} P_\theta(y|x)\Big] \qquad (2.4)$$

In the following sections, an overview of the three common approaches in meta-learning is given. Finally, some notable efforts of meta-learning within the field of NLP are reviewed.

2.2.1 Metric-based

The core idea in metric-based meta-learning is similar to that in kernel density estimation and nearest-neighbour algorithms such as k-Nearest Neighbours. Some kernel function $k_\theta$, parameterized by $\theta$, is learned to measure the similarity between two samples. The predicted label is a weighted sum of all labels in the support set $S$:

$$P_\theta(y|x, S) = \sum_{(x_i, y_i) \in S} k_\theta(x, x_i) \, y_i \qquad (2.5)$$

Figure 2.4: Convolutional Siamese networks for 1-shot classification

2.2.1.1 Convolutional Siamese Networks

Koch et al. (2015) proposed a method to perform one-shot image classification using Siamese neural networks. A Siamese Network uses a shared neural network to encode multiple samples using the same weights and subsequently compares the produced encodings for the task at hand. In the case of Koch et al. (2015), the network was trained to predict whether two images belong to the same class. At test time, the network processes all possible pairs of the test image and the images in the support set. The final prediction is the class corresponding to the support image with the highest output probability. See Figure 2.4 for a visualization.

2.2.1.2 Prototypical Networks

Prototypical Networks (Snell et al., 2017) use an embedding function $f_\theta$ to map each input onto an $M$-dimensional space. Subsequently, a prototype feature is defined for every modelled class $c$ by taking the mean of the mapped support samples of that class:

$$v_c = \frac{1}{|S_c|} \sum_{(x_i, y_i) \in S_c} f_\theta(x_i) \qquad (2.6)$$

Then, the probability distribution over the output classes is computed as the softmax of the negative distances between the prototype features and the test input $x$:

$$P(c|x) = \frac{\exp(-d_\varphi(f_\theta(x), v_c))}{\sum_{c' \in C} \exp(-d_\varphi(f_\theta(x), v_{c'}))} \qquad (2.7)$$

where $d_\varphi$ is some differentiable distance function, squared Euclidean distance in the case of the original paper.
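A minimal sketch of Eqs. 2.6 and 2.7 for one episode, assuming a generic encoder that maps inputs to M-dimensional embeddings; feeding the returned logits into a cross-entropy loss over the query labels recovers Eq. 2.7.

```python
import torch

def prototypical_logits(encoder, support_x, support_y, query_x, num_classes):
    """Sketch of Prototypical Networks: class prototypes are mean support embeddings; query
    samples are scored by negative squared Euclidean distance to each prototype."""
    z_support = encoder(support_x)                      # [n_support, M]
    z_query = encoder(query_x)                          # [n_query, M]
    prototypes = torch.stack(
        [z_support[support_y == c].mean(dim=0) for c in range(num_classes)]
    )                                                   # [num_classes, M]
    distances = torch.cdist(z_query, prototypes) ** 2   # squared Euclidean distances
    return -distances                                   # softmax(-d) gives P(c | x)
```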

2.2.2 Model-based

In model-based meta-learning, no assumptions are made on the form of $P_\theta(y|x)$. Instead, the design of the model inherently allows for fast adaptation.

2.2.2.1 Memory-Augmented Neural Networks

A natural fit for "remembering" information across many tasks and experiences is provided by Memory-Augmented Neural Networks (MANN) (Santoro et al., 2016). These networks, such as Neural Turing Machines (NTM) (Graves et al., 2014) and Memory Networks (Weston et al., 2014), utilize an external storage buffer/memory to read from and write to during training and inference, as opposed to RNN-based networks, which utilize an internal memory only.

MANN takes the Neural Turing Machine, an architecture that couples a neural network with external memory analogous to the Turing Machine (Minsky, 1967), as the base model and adapts the read and write functionality to attend to useful information from previously seen tasks and to store useful information for future fast adaptation, respectively. During training, the authors force the network to hold on to information across episodes by presenting the ground-truth labels with a one-step offset, i.e. $(x_{t+1}, y_t)$, see Figure 2.5.


Figure 2.5: Training strategy MANN

2.2.2.2 Simple Neural Attentive Meta-Learners

Mishra et al. (2017) propose a general-purpose architecture family for meta-learning (amongst other applications). Their method, the Simple Neural AttentIve meta-Learner (SNAIL), comprises a combination of causal temporal convolutions (TC) (Oord et al., 2016), i.e. dilated 1D convolutions over the temporal dimension that are causal such that the current value is only influenced by past ones, and self-attention (in this case inspired by Vaswani et al. (2017)) to be able to distill knowledge from previous experiences.

TC blocks allow direct, high-bandwidth access to past information within a fixed temporal context, depending on the dilation rate, whereas self-attention allows for using information from a potentially infinitely-large context. Hence, by interleaving the two, SNAIL achieves high-bandwidth access to past experiences without constraints on how many previous experiences it can utilize, see Figure 2.6.

2.2.3 Optimization-based

Optimization-based approaches view meta-learning as a two-stage process: a base-learner $f_\theta$ is the "student" model which is trained to complete a certain task, while an optimizer/teacher $g_\varphi$ learns how to update the student model's parameters based on the support set $S$, such that $\theta' = g_\varphi(\theta, S)$.


Figure 2.6: Architecture of SNAIL

2.2.3.1 LSTM Meta-Learner

The first class of optimization-based meta-learning approaches explicitly models the optimization algorithm, such as in Ravi and Larochelle (2016). In their work, the authors use an LSTM to model the updates to be made to the base-learner during fast adaptation. Let $\alpha_t, \mathcal{L}_t$ be the learning rate and loss function at optimization step $t$. With regular (stochastic) gradient descent, the parameter update is performed as

$$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t, \qquad (2.8)$$

whereas an LSTM updates its cell state as

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t. \qquad (2.9)$$

When setting $f_t = 1$, $i_t = \alpha_t$, $c_t = \theta_t$ and $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$, we recover the gradient descent update of Eq. 2.8:

$$\theta_t = 1 \cdot \theta_{t-1} + \alpha_t \cdot (-\nabla_{\theta_{t-1}} \mathcal{L}_t). \qquad (2.10)$$

Yet, keeping $f_t$ and $i_t$ fixed might give suboptimal performance, hence the authors perform the update as

$$\tilde{\theta}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t \qquad (2.11)$$
$$f_t = \sigma(W_f \cdot [\tilde{\theta}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}] + b_f) \qquad (2.12)$$
$$i_t = \sigma(W_i \cdot [\tilde{\theta}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}] + b_i) \qquad (2.13)$$
$$\theta_t = f_t \odot \theta_{t-1} + i_t \odot \tilde{\theta}_t \qquad (2.14)$$

where Equations 2.12 and 2.13 control how much of the previous parameters to forget and what learning rate to use per parameter, respectively.

This method allows for a lot of flexibility, as the base-learner is not restricted to any specific model family, but in the naive setup the memory requirements might be problematic, as $W_f$ and $W_i$ contain one parameter per parameter of the base-learner.
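A sketch of a single update of this meta-learner (Eqs. 2.11-2.14), treating the base-learner parameters as one flat vector; the gate parameters W_f, b_f, W_i, b_i are assumed to be learned in an outer loop that is not shown, and the feature preprocessing of Ravi and Larochelle (2016) is omitted.

```python
import torch

def lstm_meta_update(theta_prev, grad, loss, f_prev, i_prev, W_f, b_f, W_i, b_i):
    """One update of the LSTM meta-learner: per-parameter forget and input gates decide how
    much of theta_{t-1} to keep and which step size to apply to the candidate update -grad."""
    theta_tilde = -grad                                            # Eq. 2.11
    loss_col = torch.full_like(theta_prev, float(loss))
    # One feature row per parameter: [theta_tilde, loss, theta_prev, previous gate value].
    f_feats = torch.stack([theta_tilde, loss_col, theta_prev, f_prev], dim=-1)
    i_feats = torch.stack([theta_tilde, loss_col, theta_prev, i_prev], dim=-1)
    f_t = torch.sigmoid(f_feats @ W_f + b_f)                       # Eq. 2.12
    i_t = torch.sigmoid(i_feats @ W_i + b_i)                       # Eq. 2.13
    theta_t = f_t * theta_prev + i_t * theta_tilde                 # Eq. 2.14
    return theta_t, f_t, i_t
```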

2.2.3.2 Model-Agnostic Meta-Learning

The second class of optimization-based meta-learning approaches optimizes a model directly to be suitable for fast adaptation, such as Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017). The authors propose a generic method which can be used for classification, regression and reinforcement learning with any model that is trained using backpropagation; see Figure 2.7 for the algorithm and a visualization (taken from the original paper). Despite its simplicity, MAML and its many descendants (Nichol et al., 2018; Antoniou et al., 2018; Yin et al., 2019; Raghu et al., 2019; Triantafillou et al., 2020) obtain promising results across many tasks. This is the approach used in this thesis, and we provide more details in Section 4.

Figure 2.7: MAML algorithm and visualization

2.3 Meta-learning in (multilingual) NLP

Application of meta-learning to the field of NLP is still somewhat novel but shows promising results. Bansal et al. (2019) apply meta-learning to a wide range of NLP tasks within a monolingual setting and show superior performance for parameter initialization over self-supervised pretraining and multi-task learning; Obamuyide and Vlachos (2019b) apply meta-learning on the task of relation extraction; Obamuyide and Vlachos (2019a) apply lifelong meta-learning for relation extraction; Chen et al. (2019) apply meta-learning for few-shot learning on missing link prediction in knowledge graphs; Gu et al. (2018) apply meta-learning to Neural Machine Translation (NMT) and show its advantage over strong baselines such as cross-lingual transfer learning; Holla et al. (2020) use meta-learning for few-shot word sense disambiguation.

Next up, we provide an in-depth review of three notable works on meta-learning in NLP.

2.3.1 Meta-Learning for Low-Resource Neural Machine Translation

Gu et al. (2018) apply meta-learning to the task of Neural Machine Translation (NMT) and show its benefit over strong baselines such as cross-lingual transfer learning. By viewing each language pair as a task, the authors apply MAML to obtain competitive NMT systems. The proposed method, MetaNMT, is trained using 18 European languages and evaluated on 5 target languages (including non-European ones such as Korean). The authors investigate the influence of the number of languages/tasks used during meta-training and show that including more (diverse) languages improves final performance, which is intuitively illustrated in Figure 2.8. Their method achieves BLEU scores ranging from 1/2 to 2/3 of the score achieved by models trained on the full training set while using as few as 600 parallel sentences.


Figure 2.8: An intuitive illustration in which we use solid lines to represent the learning of initialization, and dashed lines to show the path of fine-tuning. Figure taken from (Gu et al., 2018).

2.3.2 Learning to Few-Shot Learn Across Diverse Natural Language Classification Tasks

Bansal et al. (2019) apply meta-learning to a wide range of NLP tasks and show the superior performance of meta-learning for parameter initialization over self-supervised pretraining and multi-task training. Their method, called LEOPARD, is an adaptation of MAML where a text encoder, BERT, is coupled with a parameter generator that learns to generate task-dependent initializations of the classification head, such that meta-learning can be performed across tasks with disjoint label spaces.

The authors use a subset of the GLUE benchmark tasks (Wang et al., 2018) for training and perform data augmentation by phrasing each task during meta-training as classification between two labels, in order to create more diversity in the task distribution. In addition, to be able to perform phrase-level classification tasks, the authors rephrase the sentiment classification task (SST2) in GLUE by giving both the phrase and the sentence as input, meaning that the same sentence can contain multiple phrases to be classified.

LEOPARD is evaluated on a set of 17 target NLP tasks which are unseen during meta-training, including entity typing on the CoNLL-2003 dataset (Sang and De Meulder, 2003), MIT-Restaurant rating classification (Liu et al., 2013) and multiple social media text classification datasets. For k = 4, 8, 16, LEOPARD gains 14.4%, 10.8% and 10.9% accuracy on average, respectively, showing the potential of meta-learning to adapt to previously unseen tasks.

2.3.3 Zero-Shot Cross-Lingual Transfer with Meta Learning

To our knowledge, X-MAML (Nooralahzadeh et al., 2020) is the first application of meta-learning for cross-lingual few-shot learning on classification tasks. The authors study the application of their MAML-descendant method on the XNLI (Conneau et al., 2018) and Multilingual Question Answering (MLQA) (Lewis et al., 2019) benchmarks in both a cross-domain and cross-language setting.

X-MAML works by pretraining some model $M$ on a high-resource task $h$ to obtain initial model parameters $\theta$. Subsequently, a set $L$ of one or more auxiliary languages is taken and MAML is applied to achieve fast adaptation of $\theta$ for $l \in L$. Intuitively, X-MAML aims to learn how to quickly transfer domain knowledge to a new language. In their experiments, the authors use either one or two auxiliary languages and evaluate their method in both the zero- and few-shot setting. It should be noted that in the few-shot setting, the full development set is used to finetune the model: "For few-shot learning, meta-learning in X-MAML is performed by fine-tuning on the development set (2.5k instances) of target languages, and then evaluating on the test set" (Nooralahzadeh et al., 2020), which is not in line with other work on few-shot learning such as Bansal et al. (2019). Also, there is a discrepancy between the training data used for the baselines and that used for their proposed method. All reported baselines are either zero-shot evaluations of $\theta_{mono}$ or of $\theta_{mono}$ finetuned on the development set of the target language, whereas their proposed method additionally uses the development set in either one or two auxiliary languages during meta-training.

For XNLI, the proposed method improves by 3.65 and 1.04 accuracy points on average over the mBERT baseline in the zero-shot and few-shot settings, respectively.

3 Data

3.1 MLDoc

Schwenk and Li (2018) published an improved version of the Reuters Corpus Volume 2 (Lewis et al., 2004) with balanced class priors for all languages. The original RCV2 corpus is a collection of 487k news stories in 13 languages, independently written by local reporters. The MLDoc dataset considers 8 of those 13 languages: English, German, Spanish, French, Italian, Russian, Japanese and Chinese. Each news story is manually classified into one of four groups: Corporate/Industrial, Economics, Government/Social and Markets. The training sets contain 10k samples and the test sets contain 4k samples per language.

3.2 Amazon Sentiment Polarity

Another widely used dataset for cross-lingual document classification is the Amazon Sentiment Analysis dataset (Prettenhofer and Stein, 2010). The dataset is a collection of product reviews in English, French, German and Japanese in three categories: books, DVDs and music. Each sample consists of the original review accompanied by metadata such as the rating of the reviewed product, expressed as an integer on a scale from one to five. In this work, we consider the sentiment polarity task, where we distinguish between positive (rating > 3) and negative (rating < 3) reviews. When all product categories are concatenated, the dataset consists of 6K samples per language per split (train, test). We extend this with Chinese product reviews in the cosmetics domain from JD.com (Zhang et al., 2015), a large e-commerce website in China. The Chinese train and test sets contain 2K and 20K samples, respectively.

4 Methodology

4.1 Episodic Meta-learning procedure

In this section, we specify the main ingredients of our meta-learning approach. First, we introduce the framework for training, then the requirements of the base-learner and our choice for it. Next, we specify the procedure for sampling tasks, and finally, the meta-learning algorithms compared in this work are introduced as update methods for the main training framework.

4.1.1 Meta-training

Algorithm 1 Meta-training procedure.

Require: $p(\mathcal{D})$: distribution over tasks
Require: $f_\theta$: base-learner
Require: $\alpha, \beta$: step-size hyper-parameters
  Initialize $\theta$
  while not done do
    Sample batch of tasks $\{\mathcal{D}_l\} = \{(S_l, Q_l)\} \sim p(\mathcal{D})$
    for all $(S_l, Q_l)$ do
      Initialize $\theta_l^{(0)} = \theta$
      for all steps $k$ do
        Compute $\theta_l^{(k+1)} = \theta_l^{(k)} - \alpha \big(\nabla_{\theta_l^{(k)}} \mathcal{L}_{S_l}(f_{\theta_l^{(k)}})\big)$
      end for
    end for
    Update $\theta = \theta - \beta \big(\text{MetaUpdate}(f_{\theta_l^{(K)}}, Q_l)\big)$
  end while

In this work, we approach meta-learning using the episodic framework. We define an episode as a cycle of fast adaptation on a support set followed by updating the parameter initialization of the base-learner based on the loss on the query set. Meta-training (Algorithm 1) consists of updating the parameters of the base-learner by performing many such episodes until some stopping criterion is reached. In the episodic framework, the definition of the optimal parameters can be extended to include fast adaptation based on the support set, see Eq. 4.1; the expectation over tasks and the conditioning on the support set $S_l$ mark the difference between traditional supervised learning (Eq. 2.4) and meta-learning.

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{l \subset \mathcal{L}}\Big[\mathbb{E}_{S_l \subset \mathcal{D},\, Q_l \subset \mathcal{D}}\Big[\sum_{(x,y) \in Q_l} P_\theta(y|x, S_l)\Big]\Big] \qquad (4.1)$$

After meta-training, evaluation (meta-testing) is done by performing fast adaptation on one batch of samples (from the training set) of the target task $l$ to obtain $\theta_l^{(k)}$. The obtained checkpoint is subsequently used to evaluate performance on the test set of the target task.
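A sketch of the meta-training loop of Algorithm 1, with the MetaUpdate left as a pluggable callable so that any of the methods in Section 4.2 can be substituted; task_sampler is an assumed helper that yields (support, query) batches, and the hyper-parameter values are placeholders.

```python
import copy
import torch

def meta_train(base_learner, task_sampler, meta_update, inner_lr=1e-5, inner_steps=5,
               meta_batch_size=4, num_episodes=1000):
    """Sketch of Algorithm 1: clone the base-learner per task, take a few gradient steps on
    the support set (fast adaptation), then let `meta_update` change the initialization."""
    for _ in range(num_episodes):
        adapted, query_batches = [], []
        for _ in range(meta_batch_size):
            support, query = task_sampler()
            learner = copy.deepcopy(base_learner)          # theta_l^(0) = theta
            opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
            for _ in range(inner_steps):                   # inner-loop fast adaptation
                x_s, y_s = support
                loss = torch.nn.functional.cross_entropy(learner(x_s), y_s)
                opt.zero_grad()
                loss.backward()
                opt.step()
            adapted.append(learner)
            query_batches.append(query)
        meta_update(base_learner, adapted, query_batches)  # outer-loop update of theta
    return base_learner
```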

4.1.2 Base-learner

An advantage of the episodic framework for meta-learning is that one can use any base-learner as long as we can differentiate through it. Since we are specifically applying meta-learning for cross-lingual document classification, an additional soft constraint for the base-learner is for it to be able to handle all considered languages. Intuitively, when one attempts to perform a task in some language, proficiency in that specific language is an advantage. Motivated by this reasoning, we choose XLM-R (Conneau et al., 2019) as our base-learner throughout all experiments due to its strong capabilities in many languages – but any multilingual language model suffices.

Since a language model on its own cannot perform a classification task, we extend it with a linear layer followed by a softmax classifier for all methods except Prototypical Networks. This linear layer maps the fixed-length document encoding produced by the language model to a space whose dimensionality equals the number of modelled classes. Many Transformer-based encoders model their input using extra separator tokens to indicate the beginning and end of a passage, most often [CLS] and [SEP], respectively. In line with previous work (Devlin et al., 2018; Conneau et al., 2019), we use the hidden state of the final layer corresponding to the [CLS] token as the representation of the whole sequence.
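A minimal sketch of such a base-learner using the Hugging Face transformers library (exact API details vary between library versions); the classification head is a single linear layer on top of the first-token representation.

```python
import torch
from transformers import XLMRobertaModel

class DocumentClassifier(torch.nn.Module):
    """Sketch of the base-learner: XLM-R encoder plus a linear classification head on the
    hidden state of the first token (the [CLS]/<s> position)."""
    def __init__(self, num_classes, model_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained(model_name)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls_repr = hidden[:, 0]        # first-token representation of the document
        return self.head(cls_repr)     # logits; softmax/cross-entropy applied outside
```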

4.1.3 Episode generation

During training, batches of tasks are considered to compute gradients for the meta-update. A task considers one language at a time and is constructed by uniformly sampling data points per considered class, resulting in a stratified set of samples. The support and query set are always disjoint. The language from which the task is constructed is chosen proportionally to the log of the number of available samples.
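A sketch of this episode-generation procedure; the per-class support and query sizes are illustrative, and data_by_language is assumed to map each language to a list of (text, label) pairs.

```python
import math
import random
from collections import defaultdict

def sample_episode(data_by_language, n_support=8, n_query=8):
    """Pick a language with probability proportional to the log of its number of samples,
    then build disjoint, class-stratified support and query sets."""
    languages = list(data_by_language)
    weights = [math.log(len(data_by_language[l])) for l in languages]
    language = random.choices(languages, weights=weights, k=1)[0]

    by_class = defaultdict(list)
    for x, y in data_by_language[language]:
        by_class[y].append((x, y))

    support, query = [], []
    for _, samples in by_class.items():                # uniform per class -> stratified sets
        picked = random.sample(samples, n_support + n_query)
        support.extend(picked[:n_support])
        query.extend(picked[n_support:])
    return support, query
```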

4.2 Meta-update methods

4.2.1 MAML

Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is an optimization-based method that uses the following objective function

$\theta^* = \arg\min_\theta \sum_{\mathcal{D}_l \sim p(\mathcal{D})} \mathcal{L}_l\big(f_{\theta_l^{(k)}}\big)$   (4.2)

$\mathcal{L}_l(f_{\theta_l^{(k)}})$ is the loss on the query set after updating the base-learner for k steps on the support set. Hence, MAML directly optimizes the base-learner such that fast adaptation of $\theta$, often referred to as inner-loop optimization, results in task-specific parameters $\theta_l^{(k)}$ which generalize well on the task. Setting B as the batch size, MAML implements its MetaUpdate, which is also referred to as outer-loop optimization, as

$\theta = \theta - \beta \frac{1}{B} \sum_{\mathcal{D}_l \sim p(\mathcal{D})} \nabla_\theta \mathcal{L}_l\big(f_{\theta_l^{(k)}}\big)$   (4.3)

Such a MetaUpdate requires computing second-order derivatives and, in turn, holding $\theta_l^{(j)} \;\forall j = 1, \ldots, k$ in memory – which might require unfeasible amounts of memory in a practical setting. A first-order approximation of MAML (foMAML), which ignores second-order derivatives, can be used to bypass this problem:

$\theta = \theta - \beta \frac{1}{B} \sum_{\mathcal{D}_l \sim p(\mathcal{D})} \nabla_{\theta_l^{(k)}} \mathcal{L}_l\big(f_{\theta_l^{(k)}}\big)$   (4.4)
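A minimal sketch of a single foMAML episode corresponding to Eq. 4.4 is given below; the plain-SGD inner loop is a simplification (for instance, the learned per-layer per-step learning rates introduced in Section 4.2.1.2 are omitted), so this is illustrative rather than our exact implementation.

```python
import copy
import torch
import torch.nn.functional as F

def fomaml_episode(model, support, query, inner_lr=1e-5, inner_steps=5):
    """One foMAML episode: returns per-parameter meta-gradients (Eq. 4.4 sketch)."""
    x_s, y_s = support
    x_q, y_q = query

    adapted = copy.deepcopy(model)                       # theta -> theta_l^(0)
    for _ in range(inner_steps):                         # inner-loop optimization
        loss = F.cross_entropy(adapted(x_s), y_s)
        grads = torch.autograd.grad(loss, adapted.parameters())
        with torch.no_grad():
            for p, g in zip(adapted.parameters(), grads):
                p -= inner_lr * g                        # theta_l^(j+1)

    query_loss = F.cross_entropy(adapted(x_q), y_q)
    # First-order approximation: gradients w.r.t. theta_l^(k) are applied to theta.
    meta_grads = torch.autograd.grad(query_loss, adapted.parameters())
    return meta_grads

# The MetaUpdate averages these gradients over a batch of B episodes and takes a
# step on the original parameters theta with the outer-loop learning rate beta.
```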

Previous work has shown that “vanilla” MAML suffers from a variety of problems which lead to instability during training, restrict the learner’s capability to generalize well and increase the computational overhead (Antoniou et al., 2018). We therefore follow this work and adopt the following improvements in our framework for all MAML-based methods:


4.2.1.1 Per-step Layer Normalization weights

Layer normalization weights and biases are not updated in the inner loop; instead, a separate set of layer normalization parameters is learned for each inner-loop step and updated only in the MetaUpdate. Sharing a single set of weights and biases across all inner-loop steps would implicitly assume that the feature distribution between layers stays the same at every step of the inner optimization.

4.2.1.2 Per-layer per-step learnable inner-loop learning rate

Instead of using a shared learning rate for all parameters, the authors propose to initialize a learning rate per layer and per step and to jointly learn their values in the MetaUpdate steps; a sketch of this mechanism is given after Section 4.2.1.3.

4.2.1.3 Cosine annealing of outer-loop learning rate

Annealing the outer-loop learning rate with some annealing function has been shown to be crucial for obtaining state-of-the-art results (Loshchilov and Hutter, 2016).
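To illustrate the second of these improvements (Section 4.2.1.2), the per-layer per-step learning rates can be held as learnable parameters that are used in the inner loop and updated only in the MetaUpdate. The sketch below is a simplified rendering of this idea, not the exact implementation of Antoniou et al. (2018).

```python
import torch
import torch.nn as nn

class PerStepLearningRates(nn.Module):
    """One learnable inner-loop learning rate per named layer and per inner step."""

    def __init__(self, model, num_inner_steps, init_lr=1e-5):
        super().__init__()
        # ParameterDict keys may not contain '.', hence the replacement.
        self.lrs = nn.ParameterDict({
            name.replace(".", "_"): nn.Parameter(torch.full((num_inner_steps,), init_lr))
            for name, _ in model.named_parameters()
        })

    def lr_for(self, name, step_idx):
        # Learning rate for this layer at this inner-loop step; these values are
        # updated jointly with the model initialization in the MetaUpdate.
        return self.lrs[name.replace(".", "_")][step_idx]

# Inner-loop usage (sketch): p = p - lr_table.lr_for(name, step) * grad
```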

4.2.2 Reptile

Reptile (Nichol et al., 2018) is a first-order, optimization-based meta-learning algorithm designed to move the weights towards a manifold of the weighted averages of the task-specific parameters $\theta_l^{(k)}$:

$\theta = \theta + \beta \frac{1}{B} \sum_{\mathcal{D}_l \sim p(\mathcal{D})} \big(\theta_l^{(k)} - \theta\big)$   (4.5)

Despite its simplicity, Reptile has shown performance competitive with or superior to MAML, e.g., on Natural Language Understanding tasks (Dou et al., 2019).
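A sketch of the Reptile MetaUpdate of Eq. 4.5 over a batch of episodes is given below; as before, the inner loop is simplified to plain SGD on the support set, so this is illustrative rather than our exact implementation.

```python
import copy
import torch
import torch.nn.functional as F

def reptile_step(model, episodes, inner_lr=5e-5, inner_steps=5, beta=3e-5):
    """One Reptile MetaUpdate over a batch of episodes (Eq. 4.5 sketch)."""
    deltas = [torch.zeros_like(p) for p in model.parameters()]

    for (x_s, y_s), _ in episodes:                      # the query set is unused by Reptile
        adapted = copy.deepcopy(model)
        for _ in range(inner_steps):                    # obtain theta_l^(k)
            loss = F.cross_entropy(adapted(x_s), y_s)
            grads = torch.autograd.grad(loss, adapted.parameters())
            with torch.no_grad():
                for p, g in zip(adapted.parameters(), grads):
                    p -= inner_lr * g
        for d, p_new, p_old in zip(deltas, adapted.parameters(), model.parameters()):
            d += p_new.detach() - p_old.detach()        # theta_l^(k) - theta

    with torch.no_grad():
        for p, d in zip(model.parameters(), deltas):
            p += beta * d / len(episodes)               # move theta towards theta_l^(k)
```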

4.2.3 Prototypical Networks

Our first metric-based method, Prototypical Networks – see Section 2.2.1.2 for a full introduction – is the most unconventional one in the context of our episodic training framework, and of Algorithm 1 in particular. It does not perform any inner-loop optimization steps, i.e. k = 0, but instead directly classifies the samples in the query set by performing nearest-neighbour classification against the prototypes constructed from the support set.

Wang et al. (2019) show that, despite their simplicity, Prototypical Networks can perform on par with or better than other state-of-the-art meta-learning methods when all sample encodings are centered around the overall mean of all classes and subsequently L2-normalized. We also adopt this strategy.
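The resulting classification step can be sketched as follows, where the support and query encodings are assumed to come from the base-learner's [CLS] representation; in this sketch, the centering mean is computed over the support encodings.

```python
import torch
import torch.nn.functional as F

def prototypical_logits(support_emb, support_labels, query_emb, num_classes):
    """Nearest-prototype classification with centering and L2 normalization (sketch).

    support_emb: [N_s, d] encodings of the support set
    query_emb:   [N_q, d] encodings of the query set
    Returns logits as negative squared Euclidean distances to the prototypes.
    """
    mean = support_emb.mean(dim=0, keepdim=True)
    support_emb = F.normalize(support_emb - mean, p=2, dim=-1)   # center, then L2-normalize
    query_emb = F.normalize(query_emb - mean, p=2, dim=-1)

    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])                                                            # [C, d] class prototypes
    return -torch.cdist(query_emb, prototypes) ** 2               # [N_q, C]
```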

4.2.4 ProtoMAML

Triantafillou et al. (2020) introduce ProtoMAML as a meta-learning method that combines the complementary strengths of Prototypical Networks and MAML by leveraging the inductive bias of prototype-based initialization of the final linear layer of the network instead of random initialization. Snell et al. (2017) show that Prototypical Networks are equivalent to a linear model when the Euclidean distance is used. Using the definition of prototypes $\mu_c$ as per Eq. 2.6, the weights $w_c$ and bias $b_c$ corresponding to class c can be computed as follows:

$w_c = 2\mu_c, \qquad b_c = -\mu_c^{T}\mu_c$   (4.6)

ProtoMAML is defined as the adaptation of MAML in which the final linear layer is parameterized as per Eq. 4.6 at the start of each episode using the support set. Due to this initialization, it allows modelling a varying number of classes per episode.

Inspired by Wang et al. (2019), we propose a simple yet effective adaptation to ProtoMAML, referred to as ProtoMAMLn, in which L2 normalization is applied to the prototypes themselves, and, again, use a first-order approximation (foProtoMAMLn). We demonstrate that doing so leads to a more stable, faster and more effective learning algorithm at only constant extra computational cost (O(1)).
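The prototype-based initialization of the classification head, including our L2 normalization of the prototypes, can be sketched as follows; the returned weight and bias simply replace the randomly initialized parameters of the final linear layer at the start of each episode.

```python
import torch
import torch.nn.functional as F

def init_classifier_from_prototypes(support_emb, support_labels, num_classes,
                                    normalize=True):
    """Prototype-based initialization of the linear classification head (sketch).

    With normalize=True this corresponds to our ProtoMAMLn variant; with
    normalize=False it is vanilla ProtoMAML (Triantafillou et al., 2020).
    """
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])
    if normalize:
        prototypes = F.normalize(prototypes, p=2, dim=-1)    # L2-normalize each prototype

    weight = 2 * prototypes                                  # w_c = 2 * mu_c      (Eq. 4.6)
    bias = -(prototypes * prototypes).sum(dim=-1)            # b_c = -mu_c^T mu_c
    return weight, bias                                      # used to initialize nn.Linear
```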

4.3 Baselines

We introduce baselines trained in a standard supervised, non-episodic fashion. Again, we use XLM-RoBERTa-base as the base-learner in all models.


4.3.1 Zero-shot baseline

This baseline assumes sufficient training data for the task to be available in one language lsrc (English). The base-learner is trained in a non-episodic manner using mini-batch gradient descent with cross-entropy loss. Performance is monitored during training on a held-out validation set in lsrc; the model with the lowest loss is selected and then evaluated on the same task in the target languages.

4.3.2 Non-episodic baseline

The second baseline aims to quantify the exact impact of learning a model through the meta-learning paradigm versus standard supervised learning. The model learns from exactly the same data as the meta-learning algorithms, but in a non-episodic manner: i.e., merging support and query sets in laux (and lsrc when included) and training using mini-batch gradient descent with cross-entropy loss. During testing, the trained model is independently finetuned for 5 steps on the support set (one mini-batch) of each target language ltgt.


5 Analysis

5.1 Experimental setup

We quantify the strengths and weaknesses of meta-learning as opposed to traditional supervised learning in both a cross-lingual transfer setting and a multilingual joint training setting with limited resources. Throughout all experiments, both the support and the query set comprise 16 samples.

5.1.1 Cross-lingual adaptation

Here, the available data is split into multiple subsets: the auxiliary languages laux, which are used in meta-training; the validation language ldev, which is used to monitor performance; and the target languages ltgt, which are kept unseen until meta-testing. Two scenarios in terms of the amount of available data are considered. A small sample of the available training data of laux is taken to create a low-resource setting, whereas all available training data of laux is used in a high-resource setting. The chosen training data per language is split evenly and stratified over two disjoint sets, from which the meta-training support and query samples are sampled, respectively. For meta-testing, one batch (16 samples) is taken from the training data of each target language as support set, while we test on the whole test set per target language (i.e., the query set).

5.1.2 Multilingual joint training

We also investigate meta-learning as an approach to multilingual joint training in the same low-resource setting as previously described for the cross-lingual experiments. The difference is that instead of learning to generalize to ltgt ≠ laux from few examples, here ltgt = laux. If one can learn many similar tasks across languages from few examples per language, using a total number of examples of the same or a smaller order of magnitude than is needed to train a monolingual classifier with “traditional” supervised learning, this might be an incentive to change data collection processes in practice.

For both experimental settings above, we examine the influence of additionally using all training data from a high-resource language lsrc during meta-training: English.


5.1.3 Specifics per dataset

5.1.3.1 Amazon Sentiment Polarity

The fact that the Amazon dataset (augmented with Chinese) comprises only five languages has some implications for our experimental design. In the cross-lingual experiments, where laux, ldev and ltgt should be disjoint, only three languages, including English, remain for training. As we consider two languages too little data for meta-training, we do not experiment with leaving out the English data. Hence, for meta-training, the data consists of lsrc = English, as well as two languages in laux. We always keep one language unseen until meta-testing and alter laux such that we can meta-test on every language. We set ldev = French in all cases, except when French is used as the target language; then, ldev = Chinese. In the low-resource setting, a total of 128 samples per language in laux is used.

For the multilingual joint training experiments, there are enough languages available to quantify the influence of English during meta-training. When English is excluded, it is used for meta-validation. When included, we average results over two sets of experiments: one where ldev = French and one where ldev = Chinese.

5.1.3.2 MLDoc

Throughout all experiments performed on MLDoc we set lsrc = English and ldev = Spanish. As data in sufficient languages is available, we consider two splits of the remaining languages. The first split, the half-split, is in line with the experiments on the Amazon dataset in the sense that three languages are used as auxiliary and the remaining ones as target, e.g.: laux = {German, Italian, Japanese} and ltgt = {French, Russian, Chinese}. In order to further quantify the influence of more diversity in the data, we also consider a leave-one-out split, e.g.: laux = {German, Italian, Japanese, French, Russian} and ltgt = {Chinese}. The latter split is only considered in a low-resource setting due to the computational expenses involved and the fact that we deem the low-resource setting of more importance for practical applications.

In the low-resource setting, we randomly sample 64 samples per language in laux. As before, we also examine the influence of augmenting the training set laux with the high-resource source language lsrc, English.

5.2 Training setup and hyper-parameters

We use the Ranger optimizer (Zhang et al., 2019; Yong et al., 2020), an adapted version of Adam (Kingma and Ba, 2014) with improved stability at the beginning of training – by accounting for the variance in adaptive learning rates (Liu et al., 2019a) – and improved robustness and convergence speed (Zhang et al., 2019; Yong et al., 2020). The batch size is set to 16 and the learning rate, to which we apply cosine annealing, to 3e-5. For meta-training, we perform 100 epochs, each consisting of 100 update steps, where each update step processes a batch of 4 episodes; after each epoch we evaluate with 5 different seeds on the meta-validation set. Early-stopping with a patience of 3 epochs is performed to avoid overfitting. For the non-episodic baselines, we train for 10 epochs on the auxiliary languages while validating after each epoch. All models are created using the PyTorch library (Paszke et al., 2017) and trained on a single 24GB NVIDIA Titan RTX GPU.

We perform a grid search on MLDoc in order to determine optimal hyper-parameters for the MetaUpdate methods. For this we consider a low-resource setting with laux = {German, French, Italian, Japanese, Russian, Chinese} and ldev = Spanish. The hyper-parameters resulting in the lowest loss on ldev are used in all experiments. The number of update steps in the inner loop is 5; the (initial) learning rate of the inner loop is 1e-5 for MAML and ProtoMAML and 5e-5 for Reptile; the factor by which the learning rate of the classification head is multiplied is 10 for MAML and ProtoMAML and 1 for Reptile; when applicable, the learning rate with which the inner-loop optimizer is updated is 6e-5. See Table 5.1 for the considered grid.


MetaUpdate Method | Num inner-loop steps | Inner-loop lr      | Class-head lr multiplier | Inner-optimizer lr
Reptile           | 2, 3, 5*             | 1e-5, 5e-5*, 1e-4  | 1*, 10                   | -
foMAML            | 2, 3, 5*             | 1e-5*, 1e-4, 1e-3  | 1, 10*                   | 3e-5, 6e-5*, 1e-4
foProtoMAMLn      | 2, 3, 5*             | 1e-5*, 1e-4, 1e-3  | 1, 10*                   | 3e-5, 6e-5*, 1e-4

Table 5.1: Search range per hyper-parameter. We consider the number of update steps in the inner loop (Num inner-loop steps), the (initial) learning rate of the inner loop (Inner-loop lr), the factor by which the learning rate of the classification head is multiplied (Class-head lr multiplier), and, if applicable, the learning rate with which the inner-loop optimizer is updated (Inner-optimizer lr). The chosen value is marked with an asterisk (*).
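For convenience, the hyper-parameters selected above can be summarized in a single configuration object; the variable name and structure are illustrative only.

```python
# Hyper-parameters selected via grid search on MLDoc (Section 5.2).
META_CONFIG = {
    "batch_size": 16,
    "outer_lr": 3e-5,                 # cosine-annealed outer-loop learning rate
    "episodes_per_update": 4,
    "inner_steps": 5,
    "inner_lr": {"foMAML": 1e-5, "foProtoMAMLn": 1e-5, "Reptile": 5e-5},
    "class_head_lr_multiplier": {"foMAML": 10, "foProtoMAMLn": 10, "Reptile": 1},
    "inner_optimizer_lr": 6e-5,       # learning rate for the learnable inner-loop lrs
}
```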

lsrc = en | Method        | Low-resource setting                | High-resource setting
          |               | de   fr   it   ja   ru   zh   ∆     | de   fr   it   ja   ru   zh   ∆
Excluded  | Non-episodic  | 82.0 86.7 68.3 71.9 70.9 81.0 76.8  | 95.3 90.9 80.9 82.9 74.5 89.6 85.7
          | ProtoNet      | 90.5 85.0 76.6 75.0 69.6 82.0 79.8  | 95.5 91.7 82.0 82.2 76.6 87.4 85.9
          | foMAML        | 89.7 85.5 74.1 74.1 74.0 83.2 80.1  | 95.0 91.4 81.4 82.7 76.9 87.8 86.1
          | foProtoMAMLn  | 90.6 86.2 77.8 75.6 73.6 83.8 80.7  | 95.6 92.1 82.6 83.1 77.9 88.9 86.7
          | Reptile       | 87.9 81.8 72.7 74.4 73.9 80.9 78.6  | 95.0 90.1 81.1 82.7 72.5 88.7 85.0
Included  | Zero-shot     | 92.4 92.1 80.3 81.0 71.7 89.1 84.4  | 92.4 92.1 80.3 81.0 71.7 89.1 84.4
          | Non-episodic  | 93.7 91.3 81.5 80.6 71.1 88.4 84.4  | 93.7 92.9 82.4 82.3 72.1 90.1 85.6
          | ProtoNet      | 93.4 91.9 79.1 81.3 72.2 87.8 84.5  | 95.0 91.7 81.1 82.7 72.0 88.0 85.9
          | foMAML        | 95.1 91.2 79.5 79.6 73.3 89.7 84.6  | 94.8 93.2 79.9 82.4 75.7 90.6 86.1
          | foProtoMAMLn  | 94.9 91.7 81.5 81.4 75.2 89.9 85.5  | 95.8 94.1 82.7 83.0 81.2 90.4 87.9
          | Reptile       | 92.3 91.4 79.7 79.5 71.8 88.1 83.8  | 94.8 91.0 80.2 82.0 72.7 89.9 85.1

Table 5.2: Accuracy on unseen target languages in the half-split setting. Reported results are the average of 5 different seeds. Non-episodic corresponds to the standard supervised learning baseline. ∆ corresponds to the average accuracy across test languages.

5.3 Results and Analysis

5.3.1 Cross-lingual adaptation

Tables 5.2, 5.3 and 5.4 show the accuracy scores on the target languages in the half-split and leave-one-out settings. We start by noting the strong multilingual capabilities of XLM-RoBERTa as our base-learner: adding the full training datasets in three extra languages (i.e., comparing the zero-shot with the non-episodic baseline in the high-resource, ‘Included’ setting) results in a mere 1.2 percentage points increase in accuracy on average for MLDoc and 0.6 percentage points for Amazon. Although the zero-shot[4] and non-episodic baselines are strong, a meta-learning approach improves performance in the majority of cases. This holds especially for our version of ProtoMAML (ProtoMAMLn), which achieves the highest average accuracy in all considered settings.

The substantial improvements for Russian on MLDoc and Chinese on Amazon indicate

[4] The zero-shot baseline is only applicable in the ‘Included’ setting, as the English data is not available under ‘Excluded’.
