
MSc Artificial Intelligence
Master Thesis

What does it mean to be language-agnostic?
Probing multilingual sentence encoders for typological properties

by
Rochelle Choenni
10999949

August 10, 2020
48 EC, November 2019 - July 2020

Supervisor: Dr. Ekaterina Shutova
Assessor: Dr. Willem Zuidema

Institute for Logic, Language and Computation

Abstract

Multilingual sentence encoders have seen much success in cross-lingual model transfer for downstream NLP tasks. However, while many research efforts continue to focus on further improving these models, we know relatively little about the properties of individual languages or the general patterns of linguistic variation that they currently encode. Hence, this thesis aims to shed light on the question of what it means for a model to be language-agnostic by studying what, where and how linguistic typological properties are encoded by multilingual encoders. Moreover, these multilingual encoders have been trained using different types of pretraining strategies, i.e. using monolingual or cross-lingual training objectives, or a combination thereof. Interestingly, models trained on monolingual tasks exhibit performance competitive with those that rely on cross-lingual objectives and parallel data. We hypothesize that the type of pretraining task influences the linguistic organization within these encoders. To investigate these questions, methods are proposed for probing sentence representations from four state-of-the-art multilingual encoders (i.e. LASER, M-BERT, XLM and XLM-R) with respect to a range of linguistic typological properties pertaining to lexical, morphological and syntactic structure. More concretely, using these methods, we investigate 1. what typological properties are encoded, 2. how this information is distributed across all layers of the models, 3. whether and how the properties that different types of models encode vary systematically, 4. whether these properties are jointly encoded across languages, and 5. what role the multi-head attention mechanism plays in the encoding of typological information. The main results show that different encoders capture typological properties to a varying extent, and moreover that interesting differences arise in the encoding of linguistic variation associated with different pretraining strategies. In addition, we find that typological properties are jointly encoded across languages.


Contents

1 Introduction 9
1.1 Research questions 11
1.2 Contributions 12
1.3 Thesis structure 12

2 Background and related work 13
2.1 From distributional to contextualized representations 13
2.2 Multilingual NLP 14
2.2.1 Language transfer and multilingual joint learning 15
2.2.2 Multilingual representations 16
2.3 The field of linguistic typology 18
2.3.1 The use of typological information in multilingual NLP 20
2.4 Previous work on probing 21
2.4.1 BERTology 22

3 Multilingual encoders 25
3.1 Pretrained models 25
3.1.1 Language Agnostic Sentence Representations (LASER) 25
3.1.2 Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) 26
3.1.3 Cross-lingual language model (XLM) 27
3.1.4 Cross-lingual robustly optimized version of BERT (XLM-R) 27
3.2 Training objectives 28
3.3 Multi-head attention mechanism 29

4 Probing for typological information 32
4.1 Languages and typological features 32
4.2 Probing task set-up 33
4.3 Probing classifier 35
4.4 Randomized encoders 36
4.5 Top-layer probing experiments 36
4.6 Probing and analysis across layers 39
4.7 Universality vs. language-specific information 40
4.8 Pretraining objectives 42

5 Joint encoding of linguistic typological properties 45
5.1 Restructuring the vector space 45
5.2 Neutralizing language-specific properties 46
5.2.1 Self-neutralizing 47
5.2.2 Neutralizing with respect to different languages 47
5.3 Inside the multi-head attention mechanism 55
5.3.1 Attention head probing 55
5.3.2 Attention weights contribution 60

6 Conclusion 62
6.1 Limitations and future work 62

A Generalization to unseen languages 76
B Reproducibility details 78


List of Figures

2.1 Encoder-decoder sequence transduction paradigm for multilingual neural machine translation. 17
3.1 Visualization of the LASER architecture for learning sentence representations 26
3.2 Illustration of the stacked Transformer architecture. 30
4.1 Example of the annotation of a typological feature (71A) as specified in WALS. Different languages may allow for different types of linguistic constructions. 33
4.2 Heatmaps of the performance in accuracy per multilingual encoder broken down per language 38
4.3 Macro-averaged F1-scores when probing from the activation of different layers in LASER, M-BERT, XLM and XLM-R 41
4.4 Learned mixing weights sτ for each encoder and the corresponding KL divergence K(s) for all 25 tasks 43
4.5 Visualization of the layer activation from LASER, M-BERT, XLM and XLM-R 44
5.1 Visualization of the top-layer representations from different languages produced by M-BERT (right), and the same representations after applying the self-neutralizing method (left) 46
5.2 The performance on Spanish after self-neutralizing the Spanish LASER representations 48
5.3 The performance on Spanish after self-neutralizing the Spanish M-BERT representations 49
5.4 Visualization of the top-layer representations from different languages produced by XLM (right), and the same representations after applying the self-neutralizing method (left). 50
5.5 Heatmaps of the change δ in accuracy for M-BERT when systematically neutralizing with a different language each time. The performance δ is shown per language for all 25 probing tasks. 51
5.6 The change in performance for all test languages when neutralizing M-BERT representations with a language-centroid computed from the Spanish sentences. Languages are categorized by whether they had the same or a different feature value from that of Spanish for the respective tasks. 52

List of Tables

3.1 Summary statistics of the multilingual encoder architectures 28
4.1 Specifications of the 25 WALS features used for probing 34
4.2 Indication of the variation of feature values represented in our dataset 35
4.3 Macro-averaged F1-scores on the test set per typological feature 37
5.1 The average change in performance per task τ and neutralizing language (nlg) for M-BERT and XLM-R. Results are reported as the average change for languages with the same feature value as the neutralizing language and those with a different value for the task. 53
5.2 The average change in performance per task τ and neutralizing language (nlg) for LASER and XLM. Results are reported as the average change for languages with the same feature value as the neutralizing language and those with a different value for the task. 54
5.3 The attention-head probing results per language for M-BERT. Per language, the mean accuracy and standard deviation across all 12 heads is reported 58
5.4 The attention-head probing results per language for XLM 60
A.1 Macro-averaged F1-scores for LASER and XLM computed separately over the set of XNLI languages that are not supported by XLM (Ukrainian, Polish and Marathi) (non-XNLI) 77
C.1 The attention difference vector probing results per language for M-BERT 81
C.2 The attention difference vector probing results per language for XLM 83


Chapter 1

Introduction

The field of Natural Language Processing (NLP) has seen rapid improvements over recent years, leading to substantial performance boosts on a wide range of downstream NLP tasks. This success can largely be ascribed to the development of large-scale pretraining methods for word representations (Pennington et al., 2014) and sentence encoders (Devlin et al., 2019; Peters et al., 2018b). These techniques produce linguistically informed priors that allow for fine-tuning to more task-specific word and sentence representations. These pretrained representations are, however, monolingual and, moreover, the pretraining strategies require a vast amount of training data. As a result, many of these models, along with the success they bring to NLP technology, are in practice limited to a handful of high-resource languages only. Aiming to extend the benefits of large-scale pretraining to low-resource languages with less economic clout, many recent studies have focused on the development of models with a wider cross-lingual applicability, giving a new surge to the field of multilingual NLP. Research in this field has, thus far, led to the development of multilingual word embeddings (Ammar et al., 2016b; Chen and Cardie, 2018) and sentence encoders, such as LASER (Artetxe and Schwenk, 2019), Multilingual BERT (M-BERT) (Devlin et al., 2019) and XLM (Lample and Conneau, 2019). These multilingual encoders are jointly trained to perform NLP tasks for multiple languages by aiming to project semantically similar words and sentences from multiple languages into a shared multilingual semantic space. Consequently, they produce encodings such that words and sentences with similar meaning obtain similar representations, irrespective of their source language. Hence, they aim to capture semantic meaning more universally.

While this work has met with success, enabling effective model transfer across a wide range of languages (Wu and Dredze, 2019), much effort is still focused on further improving the language-agnosticism of multilingual encoders, e.g. through methods such as linear projections, adversarial fine-tuning and re-centering representations (Libovický et al., 2019). The intuition behind this is that more language-agnostic representations can further boost performance on NLP tasks such as parallel sentence retrieval, where signals of cross-lingual structural variation from the source languages hinder the task (Gerz et al., 2018). However, while research efforts into testing and boosting universality continue, little is known about the linguistic properties of individual languages that such models currently encode. Nor do we understand to what extent these models capture the patterns of cross-lingual similarity and variation. This raises interesting questions on what it means for a model to be language-agnostic: do such models still encode language-specific properties, and how do they share information across a large set of typologically diverse languages? While encoders may induce shared common underlying patterns of different languages in a data-driven manner, at the surface level these languages vary considerably, and thus understanding to what extent these models capture and integrate the patterns of cross-lingual similarity and variation to enable such generalization is a non-trivial question. Thus far, however, little research has paid attention to investigating the linguistic properties of individual languages that pretrained multilingual representations encode. Rather than investigating language-specific properties, related studies have mostly kept to universal semantic and syntactic properties instead (Pires et al., 2019; Ravishankar et al., 2019a,b; Şahin et al., 2019). Yet, research from the field of linguistic typology, which studies and documents structural and semantic variation across languages, offers much inspiration for research on the interpretation of multilingual models, as well as the development of more effective multilingual designs (Bender, 2009; Ponti et al., 2019). Therefore, this thesis specifically focuses on probing state-of-the-art multilingual encoders for linguistic typological information.

Moreover, multilingual models have been realized through different types of neural architectures (e.g. recurrent neural networks and Transformers) and training strategies, i.e. using monolingual (M-BERT, XLM-R) or cross-lingual (LASER) training objectives, or a combination thereof (XLM). Whereas models trained with cross-lingual objectives exploit parallel data for supervision (e.g. LASER), unsupervised models like M-BERT and XLM-R have offered a valuable solution to side-step the scarcity of parallel resources and rely on monolingual data only. Interestingly, these models trained on monolingual tasks exhibit performance competitive with those that rely on cross-lingual objectives and parallel data (LASER, XLM). Having been trained on many languages, both types of multilingual encoders have been successfully applied to perform zero-shot cross-lingual transfer in downstream NLP tasks, such as part of speech (POS) tagging and named entity recognition (NER) (van der Heijden et al., 2019), dependency and constituency parsing (Kim and Lee, 2020; Tran and Bisazza, 2019), text categorization (Nozza et al., 2020), cross-lingual natural language inference (XNLI) and question answering (XQA) (Lauscher et al., 2020). Yet, the incorporation of cross-lingual objectives remains a popular approach, with Huang et al. (2019) recently introducing Unicoder, which incorporates 4 cross-lingual tasks. Improving on M-BERT and XLM on XNLI and XQA, the authors claim that the tasks help learn language relationships from more perspectives. This raises the question of whether multilingual encoders capture linguistic typological properties differently depending on the type of pretraining tasks. Hence, the multilingual encoders under investigation in this thesis exemplify different neural architectures and pretraining strategies, and the hypothesis is that the type of pretraining task influences the linguistic organization within these multilingual encoders.

Lastly, one of the main developments that led to the vast improvements in the field of NLP was that of the Transformer model (Vaswani et al., 2017), whose strength is often ascribed to its multi-head attention mechanism. As a result, many works have focused on understanding the contribution of attention in neural network models, with some studies finding that the attention heads capture several linguistic notions of syntax and coreference (Clark et al., 2019; Vashishth et al., 2019; Vig and Belinkov, 2019). Therefore, this thesis includes some additional analysis into the contribution of the multi-head attention mechanism in encoding typological information in a multilingual setting. Specifically, we are interested in finding whether certain attention heads capture specific typological features better and/or are trained to attend to different languages.

1.1 Research questions

In this thesis, we extend probing methods to the realm of linguistic typology in order to shed light on the extent to which multilingual models encode linguistic typological information. In particular, we investigate several questions that can contribute to our broader understanding of what typological information is encoded in representations from these models, where this is encoded, and how this information is shared across languages. More concretely, this boils down to the following questions:

RQ1: What typological properties pertaining to lexical, morphological and syntactic structure are captured in multilingual sentence representations?

RQ2: In which layer(s) of the multilingual encoders are these linguistic typological properties encoded?

RQ3: Do the properties that different types of models encode vary systematically, and how may this be an effect of differences in design decisions such as model architectures and pretraining strategies?

RQ4: How are typological properties shared across languages?

RQ5: What role does the multi-head attention mechanism play in encoding typological information in Transformers?

Given that this research area has remained relatively unexplored, we draw inspiration from both the field of typological linguistics and the growing line of research on the interpretation of neural networks to create methods for investigation. Following previous studies on the interpretation of neural networks (Belinkov et al., 2017; Conneau et al., 2018a; Linzen et al., 2016; Tenney et al., 2019b), we use a probing classification approach to predict the typological properties of a set of typologically diverse languages from sentence representations produced by four state-of-the-art multilingual encoders, i.e. LASER (Artetxe and Schwenk, 2019), M-BERT (Devlin et al., 2019), XLM (Lample and Conneau, 2019) and XLM-R (Conneau et al., 2019). This probing classification approach relies on the assumption that if we are able to predict specific linguistic properties from representations, it must entail that information about this property is encoded in those representations. In order to probe for variation along a wide range of linguistic typological properties, we create 25 typological probing tasks by leveraging typological information from the World Atlas of Language Structures (WALS) database (Dryer and Haspelmath, 2013).
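To make this set-up concrete, the sketch below shows a minimal probing task in Python. It is an illustrative assumption only: the choice of a scikit-learn logistic-regression probe, the variable names and the data layout are not necessarily the exact implementation used in this thesis, although macro-averaged F1 matches how results are reported here.

# Minimal probing-task sketch (illustrative; not the thesis's exact classifier).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def probe_wals_feature(train_embs, train_langs, test_embs, test_langs, wals_values):
    """Predict a single WALS feature value from sentence representations.

    train_embs/test_embs: dict mapping language -> (n_sentences, dim) array
    wals_values: dict mapping language -> feature value (e.g. 'SVO', 'SOV')
    Train and test languages are disjoint, so the probe cannot simply
    memorize a language identity.
    """
    X_train = np.vstack([train_embs[l] for l in train_langs])
    y_train = np.concatenate([[wals_values[l]] * len(train_embs[l]) for l in train_langs])
    X_test = np.vstack([test_embs[l] for l in test_langs])
    y_test = np.concatenate([[wals_values[l]] * len(test_embs[l]) for l in test_langs])

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test), average="macro")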


1.2 Contributions

In this thesis the following main contributions are made:

• We propose methods for probing multilingual encoders to investigate a wide range of linguistic typological properties.

• We find that all encoders successfully capture information related to word order, negation and pronouns; however, M-BERT and XLM-R outperform LASER and XLM for a number of lexical and morphological properties.

• We find that typological properties are persistently encoded across layers in M-BERT and XLM-R, but are more localizable in lower layers of LASER and XLM.

• We find that the incorporation of a cross-lingual objective contributes to the model learning an interlingua, while the use of monolingual pretraining tasks results in a partitioning into language-specific subspaces.

• We argue that our results indicate a negative correlation between the universality of a model and its ability to retain language-specific information, regardless of architecture.

• We find that typological properties are mainly encoded in the relative positioning of the language-specific subspaces learnt by multilingual encoders, and moreover, that their feature values are jointly encoded across languages.

• We find that the outputs of certain attention heads encode some typological properties of specific languages better than others. This provides preliminary evidence that some attention heads might be better at capturing certain typological properties and/or are trained to attend to specific languages.

1.3 Thesis structure

This thesis lies at the intersection of different research areas, studying models from multilingual NLP through methods proposed in the fields of linguistic typology and the interpretation of neural models. Thus, in Chapter 2 an overview is given of the core concepts from each of these research areas applicable to this thesis, and it is explained how the work in this thesis is positioned in relation to prior work. Furthermore, we describe the neural architectures and pretraining objectives of the multilingual encoders under investigation in Chapter 3. The experimental part of this thesis is divided into two parts, described in Chapters 4 and 5 respectively:

1. Chapter 4: ‘Probing for typological information’, where the methods, experimental set-up and results for RQ1, RQ2 and RQ3 are explained and discussed. These results lay the groundwork for the experiments conducted in Chapter 5.

2. Chapter 5: ‘Joint encoding of linguistic typological properties’, where RQ4 and RQ5 are investigated.

Lastly, in Chapter 6 the main findings are summarised, the limitations of the research approach are discussed, and ideas to explore in future work are proposed.


Chapter 2

Background and related work

In this chapter, an overview is given of a number of core concepts used in this thesis. Since this research lies at the intersection of work in the fields of multilingual NLP, the interpretation of neural networks, and linguistic typology, a broad overview will be given of the relevant topics from each.

2.1 From distributional to contextualized representations

Language is a complex and ever-changing phenomenon whose meaning is largely shaped by human judgement. Consequently, semantics is not sufficient to grasp the full ambiguous and nuanced nature of languages. Lexical meaning is often heavily influenced by its context, which can lead to a different pragmatic meaning. As a result, developing adequate methods to numerically quantify the meaning of linguistic units such as words, sentences and phrases, without the need for lexicography, is an active research area. Various methods have been proposed that computationally model the meaning of words and sentences in the form of n-dimensional continuous vector representations v ∈ R^n, bridging the gap between the human understanding of language and that of machines. Most of these methods rely on the fundamental assumption that learning representations such that words and sentences with similar semantic meaning are closer together in the semantic embedding space allows for the data-driven induction of new meaning and concepts.

Research on this topic first gained real ground in the field of distributional semantics with the development of unsupervised word representation models such as Word2Vec (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014). The former method is based on a skip-gram model that predicts neighbouring words, while the latter relies on aggregated global word-word co-occurrence statistics. Both methods build on the distributional hypothesis, which suggests that words often occurring in the same contexts tend to purport similar meanings (Harris, 1954), and that consequently these distributional properties of words can be leveraged to encode lexical semantics, i.e. the meaning of words. An undesirable property of these distributional representations is, however, that the dynamic contexts of words are not accounted for, resulting in static, context-independent representations. These methods are therefore not effective in modelling polysemous words. Moreover, this inability to handle polysemy at the word level would result in problems disambiguating between different meanings at the sentence level as well. While notable efforts were made to overcome these shortcomings of traditional representations, e.g. by learning separate representations per word sense (Neelakantan et al., 2014), or by using subword information to enrich them (Bojanowski et al., 2017; Wieting et al., 2016), these bottlenecks strongly limited further success in this line of research.
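As a minimal illustration of the skip-gram approach, the sketch below trains static word vectors on a toy corpus with gensim; this assumes the gensim 4.x API, and the corpus and hyperparameters are purely illustrative.

# Toy skip-gram example (assumed gensim 4.x API; not from the cited works).
from gensim.models import Word2Vec

corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=1 selects the skip-gram objective: predict context words from the target word.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

vec = model.wv["dog"]                 # a static, context-independent vector
print(model.wv.most_similar("dog"))   # nearest neighbours in the embedding space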

The next ground-breaking improvements were made when a paradigm shift took place to deep learning approaches. These methods enabled the learning of context-dependent representations through the tasks of language modelling (Devlin et al., 2019; Howard and Ruder, 2018; Peters et al., 2018b; Radford et al., 2018, 2019) and neural machine translation (Artetxe and Schwenk, 2019; McCann et al., 2017). In these methods, different mechanisms are deployed to dynamically model the context in which words appear. From learning deep contextualized word representations, e.g. ELMo (Peters et al., 2018b), this was soon extended to longer sequences such as sentences and paragraphs by models such as BERT (Devlin et al., 2019) and XLM (Lample and Conneau, 2019) that produce representations at the token as opposed to the word level. Although widely successful, these deep contextualized representations also suffer from a major drawback, namely that they have come at the cost of interpretability. It has become increasingly unclear what linguistic relationships are captured in these pretrained models. For instance, to not only capture lexical, but also compositional semantics, these models have to be capable of composing unseen word combinations and learning grammatical constructions. It is unclear how exactly these models facilitate such learning. At the same time, the need for explainable AI is becoming more prevalent in society, now that these methods are being integrated more widely across applications.

As a result, these deep contextualized word and sentence representations have given rise to an extensive suite of both intrinsic (Mikolov et al., 2013a; Rogers et al., 2018; Tsvetkov et al., 2015) and extrinsic (Ling et al., 2015; Nayak et al., 2016) evaluation methods that aim to gain a better understanding of how natural language is encoded. In this thesis, the aim is to evaluate (multilingual) contextualized sentence representations through the use of a more recently developed evaluation approach, i.e. probing tasks.

2.2 Multilingual NLP

After the wide success of monolingual word and sentence representations, NLP researchers became eager to transfer these pretraining methods to other languages as well. This, however, has proven to be a problematic task due to the huge data requirements of these methods (O'Horan et al., 2016). One of the biggest challenges in multilingual NLP is data scarcity in low-resource languages. Many state-of-the-art methods rely on supervised learning, and therefore their performance is dependent on the availability of manually annotated datasets (e.g. parallel corpora) and/or high-quality automated NLP tools such as dependency parsers. As it quickly became clear that training these monolingual models for many low-resource languages is infeasible, NLP researchers collectively turned to multilingualism for a solution.

The motivation behind producing multilingual representations is manifold and predates the development of word embeddings. Some of these motivations include the idea that languages with limited resources can benefit from joint training over many languages, the desire to perform zero-shot transfer of NLP models across languages, and the possibility to handle code-switching (Artetxe and Schwenk, 2019). Two of the approaches that the NLP community widely explored to side-step the need for large corpora and annotated datasets are language transfer and joint multilingual learning. The former enables the transfer of models or data from high-resource to low-resource languages, hence leveraging information across languages, while the latter jointly learns models from annotated examples in multiple languages in an attempt to leverage language interdependencies. In this section, a brief overview is given of the developments of these two key techniques, i.e. language transfer and joint multilingual learning, that paved the way to achieving (mostly) bilingual learning, and of the more recent techniques that led to the development of the state-of-the-art universal sentence encoders of today.

2.2.1 Language transfer and multilingual joint learning

The inspiration for language transfer was drawn from the fact that, despite having significantly different lexica and syntactic structures, languages still tend to exhibit strong similarity in dependency patterns that can be exploited. Identifying and leveraging these similarities is, however, a complicated challenge, as these NLP systems need to learn mappings between sequences from source and target languages with vastly different structures (Ponti et al., 2018). Hence, in order to leverage useful information from a source language, this information typically needs to be manipulated to better suit the properties of the target language first (Ponti et al., 2019). Different methods have been developed to enable such language transfer, including annotation projection, (de)lexicalized model transfer, and machine translation (Agić et al., 2014; Tiedemann, 2015).

Taking an annotation projection approach, cross-lingual studies have resorted to word-alignment projection techniques to facilitate homogeneous use of treebanks (Ganchev et al., 2009; Hwa et al., 2005; Smith and Eisner, 2005; Yarowsky et al., 2001). In these studies, word alignments are extracted from parallel corpora such that annotations for the source language (often automatically produced by a monolingual model) can be transferred to the target language accordingly. As a result, this newly created annotated dataset for the target language can then be used to train a supervised model. In model transfer, on the other hand, studies attempt to train a model on a source language, delexicalize it to solve for incompatible vocabularies, and then directly apply this model to a target language instead (Zeman and Resnik, 2008). This delexicalization has been realized in different ways, for example by taking language-agnostic (Nivre et al., 2016) or harmonized (Zhang et al., 2012) features as input. In later studies, different augmentation techniques, including multilingual representations, were integrated to better bridge the vocabulary gap (Täckström et al., 2013). The last approach is to machine translate from source to target language, thus in essence first creating synthetic parallel corpora, and then projecting annotations accordingly. Subsequently, following the annotation projection paradigm, similar projection heuristics can be used to transfer annotations to the target language and train a supervised model (Banea et al., 2008; Tiedemann et al., 2014). All these methods, however, still rely on the assumption that high-quality resources do exist at least for source languages. Consequently, these methods can only be useful when attempting to transfer knowledge from high-resource languages.


An alternative approach to leveraging information from different languages is multilingual joint learning. In this research avenue, models learn information for multiple languages simultaneously, with the hope that the languages can support each other and thereby jointly enhance each other's quality (Ammar et al., 2016a; Navigli and Ponzetto, 2012). Hence, in contrast to language transfer, this approach can also be beneficial in cases where both languages suffer from data scarcity (Khapra et al., 2011). There are two main techniques through which this type of learning is often realized: parameter sharing and language vector integration. Parameter sharing, commonly used in multi-task and multimodal learning, shares certain (otherwise private) representations within a neural network framework, e.g. word embeddings (Guo et al., 2016), hidden layers (Duong et al., 2015b) or attention mechanisms (Pappas and Popescu-Belis, 2017), across languages. The sharing can be realized by tying the parameters of specific components of the network, which has, for example, been done by enforcing minimization constraints on the distance between parameters (Duong et al., 2015a) or latent representations (Zhou et al., 2015). Another method is to induce language-specific properties to help guide joint models towards certain languages by using input language vectors (Guo et al., 2016). These are two methods in which the integration of typological information has proven useful in the past, both to guide the selection of which network components to share between which languages and to help construct language vectors (Ponti et al., 2019).

2.2.2 Multilingual representations

Earlier work on constructing multilingual word and sentence representations consisted of methods such as mapping and joint models, similar to the ones described in the previous sections. While the former aim to project representations from the semantic space of one language to that of another, the latter simultaneously learn representations by using parallel training corpora and a joint monolingual and cross-lingual objective function (Ruder et al., 2019). These approaches, however, remained inherently bilingual rather than multilingual. With the rise of deep learning techniques, recent studies have been able to make significant advances in this area, leading to the first universal models that are widely applicable across a large set of typologically diverse languages. Since the focus of this thesis lies on sentence representations, two key state-of-the-art approaches to training multilingual sentence encoders, the encoder-decoder paradigm and the Transformer architecture, are explained next.
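To make the mapping idea concrete, the sketch below aligns one embedding space to another with an orthogonal (Procrustes) linear map learned from dictionary-aligned vector pairs. This is a generic illustration of mapping-based approaches under that assumption, not a specific method from the cited work.

# Orthogonal mapping between two monolingual embedding spaces (Procrustes sketch).
import numpy as np

def learn_orthogonal_map(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """src_vecs[i] and tgt_vecs[i] embed the two sides of a translation pair.
    Returns an orthogonal W such that src_vecs @ W approximates the target space."""
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

# Random data standing in for dictionary-aligned embeddings (illustrative only).
rng = np.random.default_rng(0)
src = rng.normal(size=(500, 300))   # e.g. source-language vectors for dictionary entries
tgt = rng.normal(size=(500, 300))   # the corresponding target-language vectors
W = learn_orthogonal_map(src, tgt)
mapped = src @ W                    # source vectors projected into the target space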

Encoder-decoder framework

Machine translation posed as a sequence transduction problem is often used as a training objective to learn multilingual representations. The use of the encoder-decoder architectural set-up was first proposed by Cho et al. (2014) and Sutskever et al. (2014) for statistical machine translation, and later adapted to a neural machine translation setting by Bahdanau et al. (2015). The framework consists of two separate models, i.e. the encoder and decoder, that have different objectives but are jointly trained on parallel corpora in an end-to-end manner. These encoder and decoder models can both be realized as different types of recurrent neural networks (RNNs), e.g. long short-term memory (LSTM) (Graves, 2013) or gated recurrent unit (GRU) (Cho et al., 2014) networks, and be combined with different types of attention mechanisms (Bahdanau et al., 2015).


Figure 2.1: Encoder-decoder sequence transduction paradigm for multilingual neural machine translation.

The encoder receives an input sequence in a source language and learns to encode its semantic meaning into a fixed-length continuous representation vector. This intermediate representation is then fed to the decoder model, which aims to reconstruct the meaning of the sequence in a target language from it. In essence, the decoder thus functions as a feedback generator to the encoder, penalizing representations that do not sufficiently capture the needed information, i.e. pivotal linguistic features, to allow for accurate translation into the target language. After training stabilizes, this decoder can therefore be discarded and the representations from the encoder can be used as pretrained universal sentence representations.
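A minimal PyTorch sketch of this encoder-decoder idea follows; it is a generic GRU-based toy model with illustrative dimensions, not the architecture of any specific system discussed here.

# Toy sequence-to-sequence encoder-decoder (PyTorch sketch): the encoder compresses
# a sentence into a fixed-length vector that conditions the decoder.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_ids):                  # (batch, src_len)
        _, h = self.rnn(self.embed(src_ids))     # h: (1, batch, hid_dim)
        return h                                 # fixed-length sentence representation

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt_ids, sent_repr):       # teacher-forced decoding
        out, _ = self.rnn(self.embed(tgt_ids), sent_repr)
        return self.out(out)                     # logits over the target vocabulary

enc, dec = Encoder(vocab_size=1000), Decoder(vocab_size=1000)
src = torch.randint(0, 1000, (4, 12))            # a batch of source token-id sequences
tgt = torch.randint(0, 1000, (4, 10))            # the corresponding target sequences
logits = dec(tgt, enc(src))                      # train with cross-entropy on shifted targets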

While this procedure was initially only used to train between single language pairs, producing cross-lingual representations (Ruder et al., 2019), several studies quickly aimed to extend this approach to multilingual neural machine translation in different ways (Artetxe and Schwenk, 2019; Johnson et al., 2017). Earlier work tried to achieve multilingualism, for example, by adding separate decoders for each target language (one-to-many system) (Dong et al., 2015), by adding separate encoders for each source language (many-to-one system) (Schwenk and Douze, 2017), by adding both separate encoders and decoders for each supported source and target language (many-to-many system) (Luong et al., 2015), or by incorporating multiple modalities besides text into the encoder-decoder framework (Caglayan et al., 2016).

More recently, Peters et al. (2018b) showed that bidirectional LSTMs (BiLSTMs) achieve good performance combined with a language modelling task. While a unidirectional LSTM (Hochreiter and Schmidhuber, 1997) processes its input sequentially from the beginning of the input sequence to the end, a bidirectional LSTM runs the input sequence in both directions, i.e. once from the beginning to the end and once from the end backwards to the beginning (Schuster and Paliwal, 1997). The results from the separate runs in both directions are then concatenated into one representation that consequently incorporates information for each token given both past and future context. This variant of LSTMs has proven to gain a more comprehensive understanding of context compared to its unidirectional counterpart, and subsequently Schwenk (2018) was able to show that using a single joint BiLSTM encoder shared across all input languages enables robust and competitive sentence representations. In this work, we use one state-of-the-art pretrained multilingual encoder named LASER, which uses this sequence-to-sequence encoder-decoder approach with a single BiLSTM encoder and decoder that are both shared across 93 languages (Artetxe and Schwenk, 2019) (this specific model is described in more detail in Section 3.1).

Transformer architecture

While the previous line of study uses an encoder-decoder framework in combination with RNNs, Vaswani et al. (2017) proposed a new competitive model architecture to solve sequence-to-sequence problems without the need for recurrent connections, i.e. the Transformer model. Prior to this work, RNNs, and more specifically LSTMs, were firmly established as the best choice for sequence modelling and transduction problems, as they accumulate and integrate data sequentially, e.g. in a word-by-word manner, and are equipped with memory mechanisms to handle long-term dependencies. One major drawback of RNNs in general, despite memorization mechanisms such as those deployed in LSTMs, is that it becomes increasingly difficult to accurately represent longer sequences when the representation vector remains of fixed length. Moreover, while RNNs can be deployed both in a forward and backward manner, the processing is done directionally in an autoregressive manner. This means that information acquired further in the past tends to be forgotten, resulting in a bias to focus on the relatively more direct context of tokens. It could, however, be argued that to gain a better understanding of the context of a word in a sequence, we need to process all other words in that sequence concurrently instead. Unfortunately, RNNs do not allow for such parallelization.

Transformer models alleviate the need for recurrent connections to capture the complex dynamics within the temporal ordering of sequences. This is done by proposing a combination of positional encodings, which signal the token order, and a multi-head self-attention mechanism. This model architecture enables concurrent processing by applying fully connected layers directly to all tokens in a sequence, and draws global dependencies between input and output by relying entirely on attention. The attention mechanism facilitates the model in capturing the most important parts of a sentence when different aspects are considered, thereby reducing the amount of irrelevant information significantly (Wang et al., 2016). Following the general procedure described in the previous section, Transformers take an encoder-decoder approach. The main difference, however, is that both the encoder and decoder are realized by stacked self-attention and point-wise fully connected feed-forward networks instead of RNNs. In this thesis, we probe representations from three Transformer models, the multilingual variant of BERT, XLM and XLM-R (these models are described in more detail in Section 3.1).
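As a reference point for the later attention-head analyses, the sketch below shows the core scaled dot-product, multi-head attention computation in PyTorch. It is a generic illustration only, not a full Transformer layer (residual connections, layer normalization and the feed-forward sublayer are omitted), and all dimensions are illustrative.

# Scaled dot-product attention with several heads (minimal sketch).
import math
import torch

def multi_head_attention(x, w_q, w_k, w_v, n_heads):
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_model) projections."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)   # (batch, heads, seq, seq)
    weights = torch.softmax(scores, dim=-1)                 # one attention distribution per head
    context = weights @ v                                   # weighted sum of value vectors
    # concatenate the heads back into a single representation per token
    return context.transpose(1, 2).reshape(batch, seq_len, d_model), weights

x = torch.randn(2, 7, 64)                                   # toy batch of token vectors
w = [torch.randn(64, 64) for _ in range(3)]
out, attn = multi_head_attention(x, *w, n_heads=8)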

2.3 The field of linguistic typology

Specifying what officially constitutes a ‘language’ is in part an arbitrary question, as its criteria are grounded in mutual intelligibility, which in turn is subject to subjectivity (Ponti et al., 2019). As a result, there is no real consensus on the number of languages in the world today, but this number is often estimated to be well over 8000 (Hammarström et al., 2017). While many of these languages may be grounded in common underlying principles and deploy similar mechanisms, these are often not evident at the surface level, posing many challenges for applications in machine translation and multilingual NLP. Linguistic typology is a discipline that aims to study, categorize and document the variation in the world's languages through systematic cross-linguistic comparisons (Croft, 2002). These categories are not set in stone, as they emerge inductively from the comparison of languages and are prone to change with the discovery of new languages (Ponti et al., 2019).

One well-established sub-area in linguistic typology is that of word order typology. This branch studies the order of syntactic constituents in a language. For instance, it categorizes the grammatical structure of languages based on their dominant relative ordering of the Subject, Verb and Object (SVO) in clauses (Dryer, 2013). From this it follows that there are 6 dominant orders that can be ascribed to a language, from most to least common: SOV, SVO, VSO, VOS, OVS and OSV. English, like many other European languages, is grouped under the category of SVO languages. For clauses to be grammatically correct in English, the subject should precede the verb, while the object follows, see Example 2.1. While in this particular case the subject and object can be swapped without resulting in a grammatical error, it is evident that this would change the semantic meaning of the clause.

$$\text{SVO: } \underbrace{\text{the dog}}_{\text{Subject}}\ \underbrace{\text{chased}}_{\text{Verb}}\ \underbrace{\text{the cat}}_{\text{Object}} \tag{2.1}$$

On the other hand, many Asian languages, e.g. Urdu, Bengali, Hindi, Japanese and Korean, dominantly deploy the SOV structure. In English this would translate to:

$$\text{SOV: } \underbrace{\text{the dog}}_{\text{Subject}}\ \underbrace{\text{the cat}}_{\text{Object}}\ \underbrace{\text{chased}}_{\text{Verb}} \tag{2.2}$$

Likewise, there are many structural language characteristics specified by linguistic typology, at different levels of granularity, that help distinguish and group languages based on these varying features. Continuing in the line of word order typology, for example, typologists study correlations between orders in syntactic sub-domains, e.g. the order of modifiers (adjectives, numerals, demonstratives, possessives, and adjuncts) in noun phrases and the order of adverbials.

Note, however, that language is a complex phenomenon and attempting to neatly fit a multitude of languages into well-defined categories is an impossible task. There are languages, e.g. Russian, in which multiple relative orderings would technically be accepted as correct, although one order might be more dominantly used in the language. In other languages, the correct relative order can depend on different parameters. For example, French is predominantly an SVO language, with the exception that the SOV structure has to be used in the specific case of the object being a pronoun. Different approaches can be taken to handle such corner cases, such as defining a ‘No dominant order’ category, or simply ascribing the order most prevalent in the language. Needless to say, this is an empirical science that often reflects tendencies as opposed to strict rules. Many of these typological features should thus not be taken as an absolute measure, but rather as soft constraints for typological guidance (O'Horan et al., 2016).

2.3.1 The use of typological information in multilingual NLP

If the aim in multilingual NLP is to develop truly language-independent systems, why would we want these systems to encode linguistic typological properties?

The answer to this question is ambiguous and strongly depends on the architectural design and end goal of these models. On the one hand, creating universal representations that do not contain the faintest signal of their source language can be beneficial when we are only interested in the core concepts that they encode, without ever having to use them in a natural language setting. For instance, in information retrieval a search engine has the following task: given a corpus of documents D and a search query q, return {d ∈ D | d relevant to q} ranked according to relevance order. In order to do so, it only needs a good semantic understanding of the search query (Zuccon et al., 2015). On the other hand, however, pretrained representations are in practice often used on downstream NLP tasks such as parsing, named entity recognition and POS tagging, which require models to pick up on and reconstruct the underlying syntactic and semantic mechanisms of these typologically diverse languages. As a result, pretraining these representations such that other models can deduce the linguistic properties of the source language can improve performance down the line. As the importance of linguistic typology for developing truly language-independent systems has been advocated (Bender, 2009), the hypothesis in this thesis is that state-of-the-art multilingual encoders indeed implicitly rely on encoding and sharing some typological properties of languages. Various studies already show that faint signals of languages are left in universal representations (Beinborn and Choenni, 2019; Bjerva et al., 2019; Rabinovich et al., 2017), and others have demonstrated that the explicit use of typological knowledge can significantly boost performance on a variety of tasks.

Several surveys outline how typological information has successfully been integrated in multilingual NLP systems in the past (O'Horan et al., 2016; Ponti et al., 2019) to, for instance, guide multilingual dependency parsing (Naseem et al., 2012) and POS tagging (Naseem et al., 2012; Zhang et al., 2016). Naseem et al. (2012) successfully exploit word order information in multilingual dependency parsing by enabling selective sharing between source and target languages (see Section 2.2.1 for joint models). This sharing mechanism selects source languages based on the aspects that are most relevant to the specific target language. The intuition behind this is that some syntactic properties are universal across languages, while others are influenced by language-specific typological features. Therefore, in some cases, using this typological information to more carefully select between languages that share similar properties for language transfer allows for more effective applications.


2.4 Previous work on probing

Probing tasks are quantitative hypothesis-driven methods for the purpose of finer-grained analysis. These classification tasks are designed to isolate specific phenomena and were proposed by multiple studies under different names, e.g. auxiliary prediction tasks and diagnostic classification (Adi et al., 2017; Shi et al., 2016; Veldhoen et al., 2016). These tasks can probe for different properties and be applied to different units such as token, word and sentence representations, but also hidden states (Belinkov et al., 2018; Qian et al., 2016a), from a fully-trained model with frozen weights. This is done by associating such linguistic properties with corresponding representations, and subsequently training a classifier to predict the property of interest from a new set of such representations. In this thesis, we borrow this probing classification approach and attempt to predict typological properties from multilingual representations.

Monolingual probing tasks  Several studies have emerged that contribute to the rapidly growing line of research on the interpretation of neural models with the aim to probe monolingual (English) word (Belinkov et al., 2017; Linzen et al., 2016; Peters et al., 2018a; Tenney et al., 2019b) and sentence (Adi et al., 2017; Conneau and Kiela, 2018; Conneau et al., 2018a; Nangia et al., 2017) representations. Similar methods have recently also been extended to multilingual representations (Pires et al., 2019; Ravishankar et al., 2019a,b; Şahin et al., 2019). However, while these studies analyze multilingual representations, they mostly disregard typological information. For instance, Ravishankar et al. (2019a,b) apply traditional probing tasks, originally proposed by Conneau et al. (2018a), to probe multilingual sentence representations for universal properties such as sentence length, word count and tree depth, without directly probing for typological information.

In a similar vein, Pires et al. (2019) investigate how M-BERT generalizes across languages by testing zero-shot cross-lingual transfer in traditional downstream tasks. They only briefly touch on typology by testing generalization across typologically diverse languages in POS tagging and NER, and find that cross-lingual transfer is more effective across similar languages. More specifically, the results show that the performance is best when transferring between languages that share word order features, suggesting that M-BERT does not learn systematic transformations of structures to accommodate languages with unseen word orders. They ascribe this effect to word-piece overlap, arguing that similar success on distant languages might require a cross-lingual objective. On the contrary, Karthikeyan et al. (2020) show that cross-lingual transfer can also be successful with zero lexical overlap, arguing that M-BERT's cross-lingual effectiveness stems from its ability to recognize language structure and semantics instead. Similarly, Conneau et al. (2020) study the emerging cross-lingual structures in M-BERT and XLM and also find that shared vocabularies contribute little to learning cross-lingual representations, but instead ascribe this ability to parameter sharing. In this work, we aim to take a closer look at these language structures learnt by multilingual encoders by directly probing the models for linguistic properties.

Probing for language structures  In a different line of research, there are studies that already probe for typological information by reconstructing phylogenetic trees from multilingual word and sentence representations to analyze the language relations preserved in these models (Beinborn and Choenni, 2019; Bjerva et al., 2019). Bjerva et al. (2019) investigate how well genetic, geographical and structural differences in languages are reflected in multilingual representations by studying the phylogenetic trees that can be reconstructed from them. Since the assumption is made that multilingual encoders produce similar representations for ‘similar’ languages, this provides useful information about the terms in which this language similarity is defined. From their research they conclude that structural distance is a better predictor for language similarity than genetic distance, which had previously been claimed by Rabinovich et al. (2017). The implications of these studies are, however, substantially different from those using a probing classification approach. While reconstructing phylogenetic trees provides useful insights on how languages are captured by the encoder relative to one another, using probing tasks allows for more fine-grained analysis of typological properties for each language directly.

Cross-lingual probing tasks  Qian et al. (2016b) first proposed a range of probing tasks for multilingual word representations based on nominal (e.g. case, definiteness and gender) and verbal (e.g. mood, tense and aspect) features. In their work, they analyze multilingual word representations from polyglot (Al-Rfou et al., 2013) learned with a C&W model (Collobert et al., 2011), a Word2Vec model (Mikolov et al., 2013b) and a character-based LSTM autoencoder. They find that typological diversity in languages, especially the specific word order type and morphological complexity, influences how linguistic information is encoded in these word embeddings. In addition, they speculate that it is possible, and probably even more effective for certain inflectional languages, to decode grammatical function from the word form only. Moreover, Chi et al. (2020) extended the structural probe from Hewitt and Manning (2019) to a multilingual setting and provided evidence that M-BERT shares portions of its representation space between languages at a syntactic level, and cross-linguistically clusters grammatical relations. To the best of my knowledge, the work in this thesis comes closest to that of Şahin et al. (2019), who propose different multilingual probing tasks for non-contextualized word representations. They integrate typology by investigating the correlation between probing task performance on linguistic properties such as case marking, gender system and grammatical mood and the corresponding performance on downstream NLP tasks. In this thesis, however, I considerably expand on this work by proposing methods to probe multilingual sentence encoders and by investigating a wider range of typological properties pertaining to lexical, morphological and syntactic structure. In addition, since such models are inclined to learn a language identity (Wu and Dredze, 2019), we also propose a paired language evaluation set-up, evaluating on languages unseen during training.

2.4.1 BERTology

There is a growing number of works specifically concerned with understanding the inner mechanisms of large-scale Transformer models such as BERT, sometimes grouped under the name of BERTology (Rogers et al., 2020).

In one such study, Tenney et al. (2019a) investigate how different layers of the model contribute to the encoding of different linguistic information, in an attempt to quantify where in the network this information is captured. In previous studies, it had already been established that the hierarchically lower layers of a language model capture more local syntax, while more complex semantics tend to be represented in the higher layers of the model (Blevins et al., 2018; Peters et al., 2018a). Using a suite of edge probing experiments (Tenney et al., 2019b) and two complementary metrics, i.e. scalar mixing weights learned from the training set and cumulative scoring measured on the test set, Tenney et al. (2019a) show that a consistent ordering emerges in line with the findings from Peters et al. (2018a). Basic syntactic information seems to appear earlier in the network, while high-level semantic information appears at higher layers of BERT. In addition, they find that syntactic information is more localizable within the network, as the scalar mixing weights for syntax-related tasks tend to be concentrated on only a few layers, while information related to semantic tasks seems to be scattered across many layers within the network. In this thesis, we take a similar approach to test where in M-BERT language-specific information emerges, and whether this happens in a more localized fashion, exposing a localized set of features that encode these typological properties, or whether this is rather spread across the entire network.
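For reference, a scalar mix over layer activations can be written as a softmax-normalized, learned weighting; the sketch below follows the general ELMo-style formulation and is an assumption about the shape of such a module, not necessarily the exact implementation of Tenney et al. (2019a) or of the layer-wise experiments in Chapter 4.

# ELMo-style scalar mix over layer activations (minimal PyTorch sketch).
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, n_layers):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_layers))  # one learnable scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_activations):
        # layer_activations: (n_layers, batch, dim)
        s = torch.softmax(self.w, dim=0)               # mixing weights s_l, summing to 1
        mixed = (s.view(-1, 1, 1) * layer_activations).sum(dim=0)
        return self.gamma * mixed                      # weighted combination fed to the probe

mix = ScalarMix(n_layers=13)                           # e.g. embeddings plus 12 Transformer layers
h = torch.randn(13, 4, 768)                            # illustrative layer activations
pooled = mix(h)                                        # (4, 768)

After training, inspecting the learned weights s indicates which layers the probe relies on most, which is how concentrated versus scattered mixing weights can be read.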

Similar to Pires et al. (2019), who investigated the cross-lingual effectiveness of zero-shot transfer on traditionally monolingual morphological and syntactic tasks, Libovický et al. (2019) aim to assess semantic cross-lingual properties of M-BERT. The results indicate that the representations of M-BERT are not very language-neutral, even after language-agnostic fine-tuning, meaning that they should at least capture some language-specific information. They conclude that the contextual representations of M-BERT capture similarities between languages and cluster the languages by their families accordingly. They found that the identity of the language can be approximated as a constant shift in the representation space, suggesting that these representations consist of a language-neutral and a language-specific component. Thus, M-BERT's representation space seems to contain some systematic language mappings that enable ‘translation’ between languages by shifting the representations by the average parallel-sentence offset for a given language pair. In this thesis, we aim to exploit this property in order to restructure the vector space, and investigate whether typological information of a language solely arises from the relative position of its representations in the vector space.
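The constant-shift view can be sketched as follows: estimate a per-language centroid from a sample of sentence representations and subtract it, which resembles the neutralizing set-up explored later in Chapter 5. This is an illustrative computation with synthetic data, not Libovický et al.'s exact procedure.

# Approximating language identity as a constant shift in the representation space (sketch).
import numpy as np

def language_centroid(sentence_reprs: np.ndarray) -> np.ndarray:
    """sentence_reprs: (n_sentences, dim) representations of one language."""
    return sentence_reprs.mean(axis=0)

def neutralize(sentence_reprs: np.ndarray, centroid: np.ndarray) -> np.ndarray:
    """Remove the estimated language-specific component by re-centering."""
    return sentence_reprs - centroid

# Random data standing in for, e.g., M-BERT sentence representations of Spanish.
rng = np.random.default_rng(0)
spanish_reprs = rng.normal(loc=0.3, scale=1.0, size=(1000, 768))
neutral = neutralize(spanish_reprs, language_centroid(spanish_reprs))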

On a different note, Clark et al. (2019) focused on investigating the behavior of the attention mechanism used by BERT instead. They find that the different attention heads exhibit common surface-level patterns in their behavior, e.g. attending broadly over the full sentence or attending to fixed positional offsets, and show that attention heads within the same layer tend to exhibit similar patterns. More interestingly, however, they probe individual attention heads for linguistic phenomena to investigate what aspect of language a head has learned. Using each head as a trained classifier, they evaluate the head's ability to classify various syntactic relations. Despite not finding heads that perform well on a large set of relations, they do find ones that perform particularly well on specific relations. For example, they find heads that detect direct objects, determiners of nouns, objects of prepositions, and objects of possessive pronouns, indicating that particular heads are specialized in certain syntactic aspects. In addition, they propose an attention-based probing classifier that takes an attention-score matrix as input and show that BERT's attention weights also encode a substantial amount of syntactic information. While several studies have pointed out that it is not possible to equate attention with explanation (Abnar and Zuidema, 2020; Jain and Wallace, 2019; Pruthi et al., 2019; Serrano and Smith, 2019), others have argued for their potential for meaningful interpretation (Vashishth et al., 2019; Wiegreffe and Pinter, 2019). As a general consensus on the usefulness of analyzing attention mechanisms has not been reached, we do not aim to directly explain token interactions through attention in this thesis. Instead, we aim to investigate whether the different heads play any part in the encoding of typological information, and more specifically whether the output from different heads can tell us more about specific languages.


Chapter 3

Multilingual encoders

In this thesis, LASER, M-BERT, XLM and XLM-R are selected for investigation as they rely on different architectures and training objectives, and yet all are successful and widely used. While M-BERT and XLM-R are trained with monolingual pretraining tasks only, LASER and XLM both incorporate a cross-lingual objective using parallel data. Due to these differences, we hypothesize that they will exhibit some variation in the linguistic properties they capture. Moreover, given that the integration of typological information has been shown to boost downstream performance in the past (Ponti et al., 2019), their downstream success might partly be ascribed to the extent to which they capture typological information.

3.1 Pretrained models

3.1.1 Language Agnostic Sentence Representations (LASER)

LASER is a five-layer stacked BiLSTM sentence encoder that is trained with an encoder-decoder architecture and a machine translation (MT) objective (Artetxe and Schwenk, 2019). The encoder generates 512-dimensional representations at the token level and performs max-pooling over the last states to produce 1024-dimensional (forward+backward) sentence representations:

sent_embed = max_pool([h_l^{→}(t_0) + h_l^{←}(t_0), ..., h_l^{→}(t_n) + h_l^{←}(t_n)])    (3.1)

where l = 5 and n is the number of tokens in the sentence.

The input embedding size of the encoder is set to 320 dimensions. The auxiliary decoder LSTM is then initialized with the following embeddings and trained on the task of generating sentences in the target language (see Figure 3.1):

• Sentence embedding: the decoder is initialized with the sentence embedding as produced by the encoder through a linear transformation.

• BPE embedding: the input BPE embeddings at every time step.

• Language ID embedding: the decoder must be told in which target language to reconstruct the sentence representation, and thus takes a 32-dimensional language identity embedding.


Figure 3.1: Illustration of the LASER architecture for learning sentence representations (image taken from Artetxe and Schwenk (2019)).

Hence, the decoder takes the concatenation of these embeddings, resulting in 1376-dimensional input embeddings. For each mini-batch, an input language is randomly chosen and the network is trained to translate the sentences into a target language t ∈ {English, Spanish}. Both the encoder and decoder are shared across all languages to encourage language-agnostic representations; accordingly, sentences are tokenized with a joint byte-pair encoding (BPE) vocabulary built with 50K operations and learned on the concatenation of all training corpora.
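The following is a minimal sketch of the max-pooling in Equation 3.1 and of the resulting decoder input, assuming PyTorch tensors in which each token state is already the 1024-dimensional (forward+backward) output of the last BiLSTM layer. It is meant to illustrate the dimensions involved and is not the original LASER implementation.

import torch

def laser_sentence_embedding(bilstm_states: torch.Tensor) -> torch.Tensor:
    # bilstm_states: (seq_len, 1024), forward and backward states of the last BiLSTM layer
    return bilstm_states.max(dim=0).values  # element-wise max-pooling over tokens -> (1024,)

states = torch.randn(7, 1024)                  # a hypothetical 7-token sentence
sent_embed = laser_sentence_embedding(states)  # (1024,)

# Decoder input at one time step: sentence embedding + BPE embedding + language ID embedding
bpe_embed = torch.randn(320)
lang_embed = torch.randn(32)
decoder_input = torch.cat([sent_embed, bpe_embed, lang_embed])
assert decoder_input.shape[0] == 1024 + 320 + 32  # = 1376 dimensions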

We use the pretrained multilingual model, which is available for 93 languages (https://github.com/facebookresearch/LASER). This model is trained on a combination of different parallel text corpora from the Opus website (http://opus.nlpl.eu/): the Europarl, United Nations, OpenSubtitles2018, Global Voices, Tanzil and Tatoeba corpora. The dataset contained 223 million sentences in total.

3.1.2 Multilingual Bidirectional Encoder Representations from Transformers (M-BERT)

M-BERT is a 12-layer multi-headed bidirectional Transformer encoder (Vaswani et al., 2017) that uses the Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) tasks as training objectives to produce 768-dimensional token representations (Devlin et al., 2019). Apart from being trained on the Wikipedia dumps of multiple languages and using a shared WordPiece vocabulary for tokenization, M-BERT is identical to its monolingual counterpart and does not contain a mechanism to explicitly encourage language-independent representations. We use the pretrained Multilingual Cased version that is trained on 104 languages.

Each sequence starts with the special [CLS] token, whose hidden activation is often used as a sentence representation for classification tasks after fine-tuning. However, in this research we are interested in what typological properties are encoded in the pretrained models, and consequently do not fine-tune the model on a downstream task. Therefore, using the hidden states from the [CLS] tokens as sentence representations is unsuitable for this approach. Instead, we mean-pool over the hidden states of each token from the final encoding layer to obtain fixed-length sentence representations:





sent_embed = mean([h_l(t_0), ..., h_l(t_n)])    (3.2)

where l = 12 and n is the number of tokens in the sentence.

As explained in Section 2.2.2, M-BERT is a Transformer and thus requires positional embeddings to indicate token order. Moreover, M-BERT processes sentences in pairs, separated by a special [SEP] token. As a result, each input token in M-BERT is represented as a combination of the following embeddings:

1. WordPiece token embeddings (Wu and Dredze, 2019) with a shared vocabulary of size |V| = 110,000. Words from a training corpus are segmented into |V| character n-grams using a WordPiece model such that the size of the tokenized corpora is restricted. Furthermore, exponential smoothing was used to under-sample high-resource languages and over-sample low-resource languages. The special symbol ## is used to indicate the beginning of non-initial word segments.

2. Positional embeddings with a maximum supported sequence length of 512, i.e. V = {0, 1, ..., 512}.

3. Segment embeddings that distinguish which input tokens belong to the first and second sentence in the training pair.
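As an illustration of the mean-pooling in Equation 3.2, the following sketch extracts sentence representations from the pretrained Multilingual Cased model via the Hugging Face transformers library. The variable names are ours, a recent version of the library is assumed, and the special tokens are included in the mean here for simplicity.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

sentence = "Each language has its own difficulties."
# WordPiece splits rare words into segments, with non-initial segments prefixed by ##
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs)[0]              # (1, seq_len, 768), last encoder layer

sent_embed = hidden_states.mean(dim=1).squeeze(0)   # mean-pool over tokens -> (768,)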

3.1.3 Cross-lingual language model (XLM)

XLM is a 12-layer encoder based on BERT (Lample and Conneau, 2019). Apart from producing 1024-dimensional token representations and using 8 instead of 12 attention heads, this model operates similarly to M-BERT. We use the pretrained version that uses BERT's MLM and introduces a new variant of this task, translation language modelling (TLM), to stimulate language-agnostic representations. XLM uses its own variant of BPE, i.e. fastBPE, and is trained on the 15 XNLI languages (Conneau et al., 2018b): English, Bulgarian, French, Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. These languages do not cover all languages used for probing in this work, which allows us to test XLM's ability to generalize to unseen languages. For training on the MLM objective it uses sentences from the Wikipedias of each language; for TLM it leverages parallel sentences from MultiUN, the IIT Bombay corpus, the EUbookshop corpus, OpenSubtitles, Tanzil and GlobalVoices.

3.1.4 Cross-lingual robustly optimized version of BERT (XLM-R)

XLM-R(oBERTa) is another 12-layer encoder producing 768-dimensional representations (Conneau et al., 2019), based on a robustly optimized version of BERT in terms of training regime (RoBERTa) (Liu et al., 2019). RoBERTa is trained with vastly more compute power and data retrieved from CommonCrawl, omits the NSP task and introduces dynamic masking, i.e. masked tokens change with training epochs. This XLM variant is trained on 100 languages and uses a SentencePiece model (SPM) for tokenization.


Model     tokenization     L    dim    H    params    V      task       lgs
LASER     BPE              5    1024   -    52M       50K    MT         93
M-BERT    WordPiece        12   768    12   172M      110K   MLM+NSP    104
XLM       BPE              12   1024   8    250M      95K    MLM+TLM    15
XLM-R     SentencePiece    12   768    12   270M      250K   MLM        100

Table 3.1: Summary statistics of the model architectures: tokenization method, number of layers L, dimension of the sentence representations dim, number of attention heads H, number of parameters, vocabulary size V, pretraining tasks used and number of pretraining languages lgs.

Note that, unlike XLM, XLM-R does not use the cross-lingual TLM objective, but is rather trained with the monolingual MLM task, similar to M-BERT.

3.2 Training objectives

Here we give a more in-depth explanation of the four training tasks used by the models, grouped into inherently monolingual and cross-lingual tasks.

Monolingual tasks

Both monolingual tasks used by our models were originally proposed by BERT and only require monolingual data.

1. Section 2.2.2 explains that the main novelty of Transformers over RNNs is that they allow for concurrent instead of autoregressive processing of words. Consequently, this is a key property of Transformers that BERT aims to exploit. Masked language modelling (MLM), also known as the cloze task (Rosenfeld, 2000), offers precisely such an alternative to the autoregressive nature of language modelling (LM). Learning to predict the next word would only allow the self-attention layer to attend to previous tokens, diminishing the Transformer's competitive advantage of attending to all words in parallel. Thus, MLM instead tasks the model with predicting a randomly masked word, thereby allowing the self-attention layer to leverage information from both the left and right context of a word.

LM task: each language has its own _?

MLM task: each language [MASK] its own difficulties

To generate this task, 15% of the tokens in each sentence are selected and masked according to the following procedure (a sketch of both monolingual tasks follows at the end of this list):

• 80% of the time, the [MASK] token is used

• 10% of the time, a random word replaces the masked token
• 10% of the time, the observed word stays fixed

2. The second objective proposed by BERT is next sentence prediction (NSP). Language modelling tasks focus on capturing word relationships within sentences.



NSP was added to model longer-distance dependencies that span across sentence boundaries instead. This is done to better accommodate downstream tasks that work at the sentence level, e.g. Question Answering (QA) and Natural Language Inference (NLI). In NSP, the model is trained to predict whether two subsequent sequences could naturally follow each other in the original text.

Input: (X) [CLS] each language [MASK] its own difficulties [SEP]
       (Y) but knowing a few [MASK] it easier to learn a new language [SEP]
Label: IsNext::True

Input: (X) [CLS] each language [MASK] its own difficulties [SEP]
       (Y) to get a bottle of [MASK] at the store [SEP]
Label: IsNext::False

This task is trivially generated from the training corpus: for each sample, when choosing a pair of sentences (X,Y), 50% of the time Y is a sentence that naturally follows X, and 50% of the time it is a random sentence from the corpus.
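The sketch below illustrates how both monolingual training examples can be generated from raw text: the 80/10/10 masking scheme for MLM and the 50/50 sentence-pair sampling for NSP. It follows the procedures described above rather than BERT's actual preprocessing code, and the toy vocabulary is ours.

import random

MASK = "[MASK]"
TOY_VOCAB = ["language", "store", "bottle", "learn"]     # stand-in for the WordPiece vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """MLM: select 15% of the tokens and apply the 80/10/10 replacement scheme."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                           # token the model must predict
            r = random.random()
            if r < 0.8:
                inputs.append(MASK)                      # 80%: replace by [MASK]
            elif r < 0.9:
                inputs.append(random.choice(TOY_VOCAB))  # 10%: replace by a random word
            else:
                inputs.append(tok)                       # 10%: keep the observed word
        else:
            inputs.append(tok)
            labels.append(None)                          # not predicted
    return inputs, labels

def make_nsp_pair(doc_sentences, corpus_sentences, i):
    """NSP: 50% of the time return the true next sentence, otherwise a random one."""
    x = doc_sentences[i]
    if random.random() < 0.5 and i + 1 < len(doc_sentences):
        return x, doc_sentences[i + 1], True             # IsNext::True
    return x, random.choice(corpus_sentences), False     # IsNext::False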

Cross-lingual tasks

1. Models trained with an MT objective exploit parallel data for supervision, as they learn alignments between parallel sentences. The objective is straightforward: the model is simply tasked with generating a translation in a target language from sentences in varying source languages.

2. In translation language modelling, introduced by XLM, two parallel sentences are concatenated and words in both the target and source sentence are randomly masked as in MLM. The model is then tasked with predicting the masked tokens in both languages simultaneously. Consequently, the model can leverage information from the context in either language to predict the words, thereby encouraging the alignment of representations in both languages.

TLM task: (Eng) [/s] each language [MASK] its own difficulties [/s]
          (Spa) [/s] cada idioma tiene sus [MASK] dificultades [/s]
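A TLM instance can thus be sketched as follows; the masking is simplified here to plain [MASK] replacement (the full 80/10/10 scheme of MLM applies in practice), and the boundary symbol follows the [/s] notation used above.

import random

def simple_mask(tokens, mask_prob=0.15):
    # Simplified masking: replace roughly 15% of the tokens by [MASK]
    return ["[MASK]" if random.random() < mask_prob else tok for tok in tokens]

eng = "each language has its own difficulties".split()
spa = "cada idioma tiene sus propias dificultades".split()

# Concatenate the two masked parallel sentences into one input sequence; the model
# predicts the masked tokens in both languages jointly, so it can rely on context
# from either side of the pair.
tlm_input = ["[/s]"] + simple_mask(eng) + ["[/s]"] + simple_mask(spa) + ["[/s]"]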

3.3 Multi-head attention mechanism

As explained in Section 2.2, the multilingual encoders under investigation in this thesis are realized through different types of neural architectures, i.e. a BiLSTM (LASER) and BERT-based Transformers (M-BERT, XLM and XLM-R). Hence, a brief introduction to the benefits and workings of the Transformer architecture was already given. As we investigate the role of the multi-head self-attention in the encoding of typological information in Section 5.3, however, we will now first lay the necessary groundwork for the experiments conducted in that chapter. More concretely, we will first explain the general architecture of a Transformer block and then the multi-head self-attention mechanism.


Figure 3.2: Illustration of the stacked Transformer architecture.


Our BERT-based Transformer models all contain 12 layers, each of which comprises a full Transformer block (Vaswani et al., 2017). As can be seen in Figure 3.2, each block consists of two sub-layers:

1. A multi-head self-attention mechanism f(x):

   Attention(Q, K, V) = softmax(QK^T / √d_k) V    (3.3)

2. A fully connected feed-forward layer consisting of two linear transformations and a non-linearity:

   ReLU(xW_1 + b_1)W_2 + b_2    (3.4)

To both sub-layers, residual connections (He et al., 2016) and layer normalisation g(·) (Ba et al., 2016) are applied. Thus, the final output of each sub-layer is given by g(x + sub-layer(x)).
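Putting the two sub-layers together, a single block can be sketched as follows in PyTorch (assuming a recent version that supports batch_first in nn.MultiheadAttention). The hyperparameters follow M-BERT's configuration, but the module is a simplified illustration rather than the actual BERT implementation.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention f(x), wrapped as g(x + f(x))
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: position-wise feed-forward layer, wrapped in the same way
        return self.norm2(x + self.ff(x))

block = TransformerBlock()
out = block(torch.randn(1, 6, 768))   # (batch, seq_len, d_model) in and out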

As we see in Equation 3.3, the self-attention mechanism requires three input matrices Q, K and V, consisting of the query q_i, key k_i and value v_i vectors for each input vector respectively. These vectors q_i, k_i and v_i are created by multiplying the input vectors by three matrices learned during training. Now take for example the following input sequence of length n = 6: 'Each language has its own difficulties'. Given 768-dimensional input vectors, each of the input matrices will be of size (6, 768). To compute the self-attention for the word i = 2, 'language', we first score this word against itself and every other word in the sequence j ≠ i based on similarity. This is simply done by computing the dot product between the query and key vectors: q_2 · k_1, q_2 · k_2, ..., q_2 · k_n. Hence, these scores are computed for each word by QK^T in Equation 3.3, resulting in a (6, 6) matrix. We then divide these attention scores by the square root of the dimension of the key vectors and pass the result through a softmax operation, i.e. softmax(· / √d_k). The first step leads to more stable gradients and the softmax function ensures that all scores are positive and sum up to 1. This leaves us with the final attention-score matrix, which we then multiply by V, whose vectors represent the original input vectors. Thus, in essence, the output of the self-attention layer, Z ∼ Attention(Q, K, V), simply consists of a weighted sum of the input vectors.
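The computation above can be written out in a few lines of PyTorch for the six-token example; the projection matrices are randomly initialised here (in a trained model they are learned), and the full 768 dimensions are used for a single head, as in the walk-through.

import torch

n, d = 6, 768
X = torch.randn(n, d)                         # 'Each language has its own difficulties'

W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each of size (6, 768)

scores = Q @ K.T / d ** 0.5                   # (6, 6) matrix of scaled q_i · k_j scores
attn = torch.softmax(scores, dim=-1)          # rows are positive and sum to 1
Z = attn @ V                                  # weighted sum of the value vectors, (6, 768)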

The multi-head attention is merely an extension of the self-attention as explained above. This mechanism uses h heads in parallel that all rely on the same computation in Equation 3.3:

Multi-Head(Q, K, V) = Concat(head_1, ..., head_h) W^O,  where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (3.5)

The difference is that instead of feeding the full input vectors to one self-attention layer (Q, K, V), the input vectors are split into h chunks through linear projections using learned matrices W_i^Q, W_i^K and W_i^V respectively. Each chunk is then fed to a separate head i. Thus, given m-dimensional input vectors, each head receives an m/h-dimensional input. The output of the h heads is then concatenated and linearly transformed through W^O to obtain the expected dimensions. Note that attention heads do not share parameters, meaning that each head is supposed to learn a distinct attention pattern. The intuition behind this is that the multi-head attention mechanism allows the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017).
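The following sketch mirrors Equation 3.5 with M-BERT's dimensions (d_model = 768, h = 12, so 64 dimensions per head); slicing the columns of a single large projection matrix is used here as an equivalent way of giving each head its own learned projections W_i^Q, W_i^K and W_i^V.

import torch

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_head = d_model // h
    heads = []
    for i in range(h):
        cols = slice(i * d_head, (i + 1) * d_head)
        Q, K, V = X @ W_Q[:, cols], X @ W_K[:, cols], X @ W_V[:, cols]  # (n, d_head) each
        attn = torch.softmax(Q @ K.T / d_head ** 0.5, dim=-1)           # (n, n) pattern per head
        heads.append(attn @ V)
    return torch.cat(heads, dim=-1) @ W_O                               # back to (n, d_model)

n, d_model, h = 6, 768, 12
X = torch.randn(n, d_model)
W_Q, W_K, W_V, W_O = (torch.randn(d_model, d_model) for _ in range(4))
Z = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)                      # (6, 768)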
