
MSc Artificial Intelligence

Master Thesis

Compressing Sequences of Semantic Representations

by

Sergi Castella i Sapé

12201537

July 2020

48 ECTS November 2019 - June 2020

Supervisor:

Dr. P.T. Groth

Second Reader:

Dr. P. Bloem

Company Supervisors:

Jakub Zavrel,

Marzieh Fadaee

Universiteit van Amsterdam &

Zeta Alpha Vector


Abstract

We investigate the problem of compressing sequences of representations into shorter sequences in the context of Natural Language Processing. We propose a framework consisting of two steps: chunking (i.e. finding groups of representations, chunks) and generating (i.e. compressing each chunk into a single representation); by which we compose fine representations into coarse ones. Geometric Compression is a variant we propose for chunking, where we use clustering techniques on the embedding space to find groups that should be optimally compressed.

To evaluate the compressed representations, we use downstream task performance as a proxy for representation quality. Given that the compression ratio of this process is controllable, we evaluate the performance-compression trade-off and base our discussion on it. Our experiments include several variations of chunking and generation, both parametrized and non-parametrized. We find that different chunking strategies do not result in substantial differences in the performance-compression trade-off, including our Geometric Compression proposal, while generating strategies do yield meaningful differences. We also find relevant differences between compressing contextualized and non-contextualized representations, where the latter suffer worse performance degradation.


Acknowledgements

This thesis encompasses much more than what can be read in the following pages: countless lessons learned on many fronts –professional and personal– have made this journey worthwhile and given purpose to many struggles and frustrations lived along the way. I would like to acknowledge the institutions and especially the people who have made this project possible despite the unusual circumstances we have found ourselves in during the first half of 2020.

First, to Zeta Alpha Vector as the company that provided an excellent environment to develop my research, and to Jakub Zavrel and Marzieh Fadaee, who have personally supervised my work with dedication and thoughtful ideas and feedback. To the University of Amsterdam, for being such a vibrant institution to learn from, providing access to crucial resources such as professors and raw compute; and also to my supervisor Paul Groth, who has given me deeply valuable advice at every step. Also, to all the people who have been kind enough to dedicate time to conversations about my research, even outside their professional obligations: Jelle Zuidema, Samira Abnar, Peter Bloem and Boris Reudenik; I am truly grateful.

Personally, to my family: my brother, mum and dad, who, despite my not seeing them in a long time due to the extraordinary circumstances, have made sure I felt the warmth of home. To my friends, who from here or from a distance have helped keep my morale high; and finally, to my partner, Tjitske, who has stood by my side all these extraordinary months and made them more enjoyable. Thank you.


Contents

List of Definitions
1 Introduction
  1.1 Representing Language
  1.2 Research Question
2 Background
  2.1 Semantic Representations
    2.1.1 Character-Level
    2.1.2 Wordpiece-Level
    2.1.3 Sentence-Level
    2.1.4 Paragraph-Level
  2.2 Evaluation literature
3 Methodology
4 Modeling Compression
  4.1 Compression and Information Theory
  4.2 Geometric Compression
  4.3 Chunking
    4.3.1 Non-Parametrized
    4.3.2 Parametrized
  4.4 Generation
    4.4.1 Non-parametrized
    4.4.2 Parametrized
5 Results
  5.1 Quantitative Experiments
    5.1.1 Non-parametrized chunking, non-parametrized generation
    5.1.2 Non-parametrized chunking, parametrized generation
  5.2 Qualitative Results
  5.3 Synthetic Retrieval Tasks
6 Discussion
7 Conclusion
A Information Theory
  A.1 Information
  A.2 Ensemble X
  A.3 Source Coding Theorem
    A.3.1 Source Coding Theorem for symbol codes
B Implementation
  B.1 Transformer
  B.2 Task Specific Networks
    B.2.1 SST2
    B.2.2 QQP, MNLI
C Experiments
  C.1 Initial classifiers training
    C.1.1 Figure 5.1
    C.1.2 Gold performances: Table 5.1
  C.2 Non-parametrized chunking, non-parametrized generation
    C.2.1 Zero-shot compression: Figures 5.2, 5.3, 5.4
    C.2.3 Skip Ablation: Figure 5.10
  C.3 Non-parametrized chunking, parametrized generation
    C.3.1 LSTM and Conv+Att generation: Figures 5.11, 5.12

List of Definitions

Representation (Rep.) Symbol or object, such as a vector, used as a surrogate of an external entity.

Representation Granularity Level of detail of a representation. Representing a document by assigning each word its own representation is a finer granularity than assigning a representation to every paragraph, which is coarser.

Fine Representation Representation that encodes small atomic parts. Non-compressed representation refers to a fine representation that is later compressed.

Coarse Representation Representation that encodes larger subcomponents. Compressed representation refers to a coarse representation that has been obtained by composing finer ones.

Granularity Transition The compression process by which fine granularity representations are composed into coarser versions.

Chunk Group of representations to be compressed into a single representation.

Generation Process that generates a single representation out of a chunk.

Non-contextualized Rep. Representation that is independent from the context of the entity being represented. It is often pre-computed.

Contextualized Rep. Representation that is dependent on the context of the entity being represented.

Token Base unit of language for a Language Model, consisting of a word or a subpart of a word.

Compression Ratio Value between 0 and 1 indicating how much compression has been performed; defined as the fraction of the compressed sequence length over the original one: compressed length / original length.


Chapter 1

Introduction

Artificial Intelligence and Representations

One of the main goals of Artificial Intelligence (AI) research is to improve our understanding of representations. We use 'representations' in the broadest possible sense: any mathematical object used to encode information, such as vectors, data structures, functions or symbols, which we can refer to as a surrogate: "[...] a substitute for the thing itself, that is used to enable an entity to determine consequences by thinking rather than acting" (Davis et al., 1993).

The relevance of representations in the grand quest of AI has been historically supported (Pitt, 2020; Thomason, 2020) but not without challenge (Brooks, 1991). Do 'good representations' arise as a consequence of intelligence, or could it be the other way around? For instance, could we humans ever be intelligent without the evolution of language, one of the most crucial knowledge representation systems for us? Leading voices in the AI and linguistics communities believe that language and human cognition are so codependent that they cannot be understood in isolation (Bengio, 2017; Chomsky, 2005).

While the answers to these questions remain speculative, many of the latest advancements across Machine Learning areas are tightly coupled to learning good ways to represent things. How can we preserve the relevant information something contains, while identifying the features that best define this something, such that the representation also has nice properties for later manipulation? Although the definition of what makes a good representation is elusive in the context of Machine Learning, it can be intuitively understood as "that which makes a subsequent learning task easier" (Goodfellow et al., 2016).


Inquiring into the topic of representing information and knowledge raises the question: what should be the scope of a representation? If we define representations as a finite and discrete set of objects (such as vectors), it seems crucial to understand what constitutes enough or too much to be encoded in each of these representations. In this work we refer to this concept as representation granularity: a fine granularity means that small atomic units are encoded in each representation, whereas a coarse granularity encodes bigger chunks of information into a single representation. For instance, assigning a vector to each word in a sentence represents the sentence at a finer granularity than encoding it all in a single vector, which is a coarser representation.

The importance of representation granularity and compression

Take the human conscious thought process as an inspiration: when we process information and reason about it (such as reading and comprehending this very text), we necessarily manipulate concepts at different levels of detail. As you read these lines, the mental circuitry that retains the information necessary for capturing an overarching narrative needs to dynamically adjust from sub-word level to arbitrarily long spans, requiring more or less processing based on the novelty and complexity of the information processed at a given time (Rescorla, 2020; Smolensky, 1988). This navigation between levels of representation granularity can be understood as a process of compression or decompression: generating coarser representations from finer ones is compressing them, and vice versa.

Compression is often proposed as a fundamental lens to understand the nature of intelligence: to understand is to find structure in data, which enables its effective compression (Hutter, 2005). This fundamental idea about the importance of compression and its relationship to understanding is the underlying theme that motivates this research.

1.1 Representing Language

In the field of Natural Language Processing (NLP), researchers have experimented with different notions of what can be represented, such as characters (Zhang and LeCun, 2015), words (Mikolov et al., 2013) or sentences (Reimers and Gurevych, 2019). Under our current understanding, different levels of representations seem the natural choice to perform different tasks: for instance, Named Entity Recognition (NER) of single words will benefit from using word representations, while embedding-based document retrieval will benefit from coarser representations, such as paragraphs.

In this work, the branch of representations that will be studied is distributional semantics (Harris, 1954; Firth, 1957): the family of representations that assume that language is a system where linguistic items with similar distributions (occurrence patterns) have similar meanings.

A relevant example of a widely used kind of distributional representation is word embeddings, such as (Mikolov et al., 2013), where each word is represented as a high dimensional vector in Euclidean space; a space whose characteristics more closely resemble those of semantics, attempting to reflect properties such as relatedness and analogy making.

Word embeddings are far from perfect; for instance, the embedding of a word is pre-computed and fixed regardless of its context, which clearly cannot capture all utterances of language: how would a machine know from just a word embedding what we mean by ‘characters’ in the first sentence of this section? As in symbols or as in fictional people in a novel? Hence the importance of context.

Contextualized representations take this idea one step further by representing the meaning of words within a context, such that the same word in different contexts will have different representations. These representations (such as pre-training outputs from transformer-like Language Models) have proven to be extremely useful for a wide range of downstream tasks with just a little fine tuning –compared to training from scratch– and can capture important information about language: semantics, word relationships, topic modeling, language modeling, etc. (Devlin et al., 2018; Vaswani et al., 2017; Peters et al., 2018). Still, these representations live at the word level: each token has its own representation. Their granularity is fixed.

In this work we are interested in modeling the compression of semantic representations in language. Compression is done by a 'granularity transition', the process by which fine representations can be compressed into coarser ones. As we will see in Section 2, several works have focused on how to obtain semantic representations for different granularity levels such as sentences or documents. However, it is common to learn new representations from scratch, where interoperability between representation levels is omitted. Some works do model this transition explicitly, but generally the decision on what representations should be merged into a coarser version is based on fixed heuristics, such as compressing any sentence into a single vector.

The contribution of this thesis is precisely in this space: improving our understanding of semantic representations by compressing representations from fine to coarser granularities, which effectively means composing them from tokens into few-word spans and sentences. The framework to do so is detailed in the following sections.

1.2 Research Question

We focus on the following research question that captures our proposed interests and motivations:

Can sequences of semantic representations be compressed effectively into shorter sequences?

Compression in this work always refers to the length of the sequence and not the size of the representations; thus the 'compression ratio' is defined as the compressed length over the original length, compressed length / original length, and is bounded between 0 and 1. Compression effectiveness is defined as how much performance is affected on downstream tasks that use compressed representations as input; the rationale for this is explained in Section 3.

In this formulation and from now on, 'representation' is used as an equivalent of 'embedding': a high dimensional vector in Euclidean space, given that it is the most widely used form of representation in machine learning models for NLP tasks.

We introduce constraints on the representations we study:

• Compressed/coarse representations should be constructed by composing finer ones, understanding composition in the broadest sense possible: coarse representations should result as a function of finer ones.

The motivation behind this is threefold. First, the hypothesis about the importance of the 'granularity transition' points towards this constraint. Second, compositionality generally induces desirable properties in engineering systems, such as improved modularity and explainability (Davies et al., 2012). Finally, Natural Language is widely regarded as a compositional system (Szabó, 2017), so it seems reasonable to impose this bias by design.


• Compressed representations should remain in the same 'semantic-structural' space as the original ones, as much as possible. This means that not only should the vector dimensionality be preserved, but that in some way the abstract meaning/purpose of their components (or features) should remain as close as possible to the original ones.

The motivation behind this constraint is to enable representations to be used across granularities, maximizing their applicability in engineering problems. For instance, in many language tasks, one needs to perform reasoning about concepts that exist at different levels of granularity, such as comparing what a full sentence conveys and what a word means.

• Determining what to compress should be controllable and not sequence-length dependent. This means that –unlike most existing approaches– the compression target is not fixed by a heuristic but by modeling what can be compressed most effectively. Intuitively speaking, not all words in language contribute equally to the amount of information being transmitted. Following this intuition, a long sentence that conveys little information should end up more compressed than a short sentence that has high information density.

Although compression is achieved in the bit-sense as well, it is not the main objective of this work, so no specialized compression techniques such as model pruning or embedding quantization are explored (Han et al., 2015).

The main components of this work, which together address the proposed research question, can be outlined as follows:

• We design and implement a framework for exploring the compression of sequences of semantic representations into shorter ones by evaluating the performance on downstream tasks that use these representations as inputs, presented in Section 3.

• We propose a framework for compression based on geometric clustering in an embedding space, along with several baselines based on heuristics developed in Section 4.

• In Section 5 we evaluate several configurations of the proposed approaches. We also perform qualitative experiments that help us explain our findings.


• All empirical implications of our results along with their intuitive interpretations are discussed in Section 6.


Chapter 2

Background

In this section, we present relevant literature on the topic of semantic representations and their evaluation. We focus on recent works and do not cover previous influential literature from before around 2010, given its lesser direct relevance for this work.

The academic literature on the topic of semantic representations such as word, sentence or document embeddings is large (Camacho-Collados and Pilehvar, 2018). The most common strategy to achieve coarse representations defines the granularity level as a fixed target, such as words, sentences, or documents, whereas the interest of this work is in modeling the granularity of our representations. Despite not tackling the exact same problem, the existing research provides valuable insights on the properties and quality of representations.

2.1 Semantic Representations

In the following sections we present previous works that have focused on modeling the semantics of language at different levels of granularity: character, word, sentence and paragraph levels.

2.1.1 Character-Level

Most relevant work initially emerged between 2015 and 2016 as an attempt to improve on pretrained word embeddings (Mikolov et al., 2013; Pennington et al., 2014) and overcome their limitations as 'non contextualized' representations, unable to capture variations in meaning arising from context or morphology. Three relevant examples of these approaches are (Zhang and LeCun, 2015), which uses convolutions to learn all the way from characters to meaningful abstractions for downstream tasks; (Ling et al., 2015), which uses BiLSTMs (Schuster and Paliwal, 1997) to generate word embeddings from their characters; and (Wieting et al., 2016), which constructs word and sentence representations following the same compositional principle, albeit from a character n-gram building block.

(Peters et al., 2018) builds from character-level representations up to rich contextualized word-level embeddings, which at the time of publication (early 2018) surpassed many standard NLP benchmarks.

The main strength of these models is their ability to represent previously unseen words, given their compositional nature. However, the impact of the attention-based Transformer architecture from late 2018 onwards quickly shifted interest away from RNN-based Language Models and character-based representations.

Probably the most relevant and persistent idea across this literature is that of the compositionality constraint and two main architectures to achieve it: Convolutions and Recurrent Networks, where both architecture families can model the impact of neighboring representations.

2.1.2 Wordpiece-Level

Word level representations can seem a priori the finest granularity at which semantic representations make sense. Generally speaking, words¹ in language represent concepts that have little relationship to the characters that form them: dog and god are close in character representation, but not close in meaning. Under this intuition, many works have tried to represent words in the most meaningful way possible, capturing ideas such as the fact that whatever 'dog' means, it is closer to 'cat' than it is to 'human', but closer to 'human' than it is to 'complexity'.

The history of trying to achieve such representations dates back to 1986 (G.E. Hinton, 1986), but the first work widely recognized to successfully achieve them was (Mikolov et al., 2013) with the CBOW and skip-gram models, on which most future work built. The key insight that enabled these models is the distributional hypothesis, originally proposed by (Harris, 1954) and later popularized by (Firth, 1957) as "words are largely characterized by the company they keep". Thus, using word co-occurrence as a learning signal, a semantic representation can be calculated. Soon after, GloVe (Pennington et al., 2014) improved upon previous work by carefully analyzing the limitations of existing methods. Still, both of these influential approaches result in uncontextualized representations.

¹ By words we informally refer to a lexeme, the unit of lexical meaning, which would be a more precise term.

The most relevant literature for this work starts with the introduction of Transformers by (Vaswani et al., 2017), in which Attention is the essential building block for Sequence-to-Sequence models, instead of the typical Convolutional or Recurrent Networks that were used previously. This kind of architecture, known as the Transformer, starts by learning uncontextualized embeddings (word or subword level) and uses self-attentive mechanisms to contextualize these representations, making them more meaningful given their context. The output of a Transformer encoder is generally the same shape as the input but with vectors that contain more informative features than their non-contextualized counterparts. A full understanding of the attention mechanism and the Transformer architecture is not necessary to understand this work, so we do not provide a deep explanation; should that be needed, we refer to the original work of (Vaswani et al., 2017).

A key insight from recent literature has been the power of self-supervised pre-training of Language Models, where vast amounts of text are used for tasks that do not require labeling but still require knowledge about language. Masked Language Modeling (MLM) is the most prominent example of such a task, in which the model learns to predict masked words from text. After pre-training, contextualized representations can be used for downstream tasks, or the model can be fine-tuned on a given downstream task with significantly less labelled data in comparison to what would be needed when training from scratch.

The units of language that are represented in these Language Models are usually wordpieces, first presented by (Wu et al., 2016). These subword units –often called tokens– are defined by the segmentation of a corpus of text into the smallest number of wordpieces given a fixed vocabulary size, typically around tens of thousands. In practice, this results in common full words being assigned a wordpiece and rare words being split into more common subword units.
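
As a small illustration of this behaviour, the following sketch uses the HuggingFace tokenizer library referenced later in this thesis; the model name and the example sentence are our own assumptions, not taken from the thesis:

```python
from transformers import AutoTokenizer

# Hypothetical example: load a BERT wordpiece tokenizer and inspect the segmentation.
# Common words typically map to a single wordpiece; rarer or longer words are split
# into several subword units, marked with the '##' continuation prefix.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Compressing sequences of semantic representations")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))  # each wordpiece maps to a vocabulary id
```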

In this work, we use pre-trained Transformer embeddings as our base representations of wordpieces, as we explain in Section 3. (Devlin et al., 2018) is one of the most widespread variants of the Transformer architecture, known as BERT, which will be used in this work as a standard Transformer encoder.


2.1.3 Sentence-Level

Recent works have focused on composing sentence-level representations from word-level embeddings, either in a non-parametrized or parametrized way.

The first relevant conclusion from the literature is that simple methods of composition, such as mean or max-pooling, have proven to be reasonably effective at combining the semantics of word embeddings. Moreover, non-parametrized methods have been shown to be competitive based only on corpus statistics and geometrical analysis such as Principal Component Analysis (PCA) (Arora et al., 2017).

Early parametrized approaches used methods preceding Transformer architectures and contextualized word representations: for instance, (Lin et al., 2017) used self-attentive mechanisms to obtain sentence representations from uncontextualized word embeddings. In a similar vein, (Kalchbrenner et al., 2014) used convolutions and k-max pooling (a generalization of the max pooling operation to k elements) to obtain sentence representations for variable-length input sequences. Finally, (Conneau et al., 2017) presents research on a wide variety of architectures to learn sentence embeddings in a supervised setting.

The Transformer architecture has also inspired techniques to obtain sentence-level representations; however, these either rely on operations such as pooling over contextualized embeddings, or on fine-tuning the whole network in task-specific settings where a special token is used as the sentence representation. (Reimers and Gurevych, 2019) presents an overview of these techniques and proposes a fine-tuning approach that implements pooling over the word representations to obtain a sentence representation. Despite the success of this approach, the fact that fine-tuning is required on the whole Transformer model makes it less relevant for this work, as we impose the use of pre-trained contextualized embeddings as features, not as part of the trainable model.

2.1.4 Paragraph-Level

A task where paragraph- or document-level representations are useful is Information Retrieval (IR), where documents are typically ranked by relevance with respect to a query. Classical ranking engines such as BM25² mostly use probabilistic techniques, such as term frequencies and document frequencies, to determine the importance of query and document word matching (Robertson and Zaragoza, 2009). Despite the robustness of these approaches, they do not explicitly model the semantics of queries and documents, so these engines fail to retrieve relevant documents that do not contain the same keywords as the query, which is common for long queries. This can be partly overcome by tricks such as adding synonyms and related words to the keyword search, but the fundamental problem remains: the search is performed in character space instead of meaning space.

² Okapi BM25 is a ranking function developed in the 1970s and 1980s primarily by Stephen Robertson and Karen Spärck Jones.

To tackle this problem, some approaches to generate representations for paragraphs have been researched. An example of relevant work on this topic is (Le and Mikolov, 2014), an extension of (Mikolov et al., 2013), where dense paragraph representations are learned in a similar fashion to their word counterparts: the paragraph representation learns to maximize the probability of a word appearing in its context.

More recently, (Chang et al., 2020) have used the Transformer architecture coupled with paragraph-level self-supervised tasks in order to obtain rich paragraph-level representations to use in retrieval tasks. The most important of these tasks is the Inverse Cloze Task (ICT), which consists of masking sentences from paragraphs and making the model learn to determine which sentence belongs in the gap from a set of negative samples.

2.2 Evaluation literature

Evaluating the quality of semantic representations is not straightforward (Bakarov, 2018), given that the notion of 'quality' can only be loosely defined objectively. As explained in (Goodfellow et al., 2016), learning representations often yields a trade-off between preserving information and achieving desirable properties in the representation, such as disentanglement between components. In practice, this means that different ways of measuring quality will be biased towards one side or the other of this trade-off.

That said, some proxies can be used to infer how good a representation is; generally speaking, two main currents can be identified (Bakarov, 2018):

• Extrinsic Evaluation: this approach evaluates quality by using representations as features for a model in a downstream task and measuring their performance. Examples of tasks are Named Entity Recognition (NER), Sentiment Analysis, Semantic Role Labeling, Text Classification, Natural Language Inference (NLI), etc. The strength of this approach is how straightforward it is to compare different representations; however, this evaluation tends to be biased towards the task that is being solved and there is no guarantee that the results will be replicable for other tasks. This effect can be mitigated by evaluating on multiple standardized tasks, such as the GLUE Benchmark (Wang et al., 2018).

• Intrinsic Evaluation: this method focuses on comparing the human assessment of word qualities with those from the representation space; in other words, it quantifies how well the vectors in the embedding space map to meaningful linguistic and semantic properties. A classical example of this kind of evaluation is how well the embedding space encodes analogy making as arithmetic operations, which is one of the tasks used to showcase the properties of Word2Vec (Mikolov et al., 2013). This approach provides a more comprehensive and explainable evaluation and is grounded in cognitive principles of how humans use language. However, it is rather subjective and biased towards the specific qualities being tracked, and it does not always correlate with performance on downstream tasks.

In this work we will focus mainly on extrinsic evaluation, using downstream task performance as a proxy for representation quality, as discussed in Section 3.

In this section we have reviewed recent efforts on obtaining semantic representations for language at different granularities and strategies to evaluate them.


Chapter 3

Methodology

The purpose of this work is to explore different forms of compressing sequences of semantic representations into shorter sequences by composing fine granularity representations into coarser ones: from tokens to sentences.

As briefly explained in the previous section, defining a measure of representation quality is not trivial; it is in fact a central part of this work. The proposed method for experimenting with and evaluating compressed representations is based on extrinsic metrics.

Obtaining sequences of semantic representations

The first methodological step is to define how we obtain semantic representations at the token level. To do so, we run input sequences of text through a pre-trained Transformer, BERT (Devlin et al., 2018), which outputs general purpose contextualized embeddings that have already been shown to perform well on several downstream tasks.

An advantage of this method is that Transformer models generate representations at different layers of their architecture, so we can use the same model to obtain contextualized and non-contextualized embeddings. If we use features from the zeroth layer of the model, these features will be non-contextualized, because they are obtained from a pre-computed lookup table or embedding matrix; whereas if we use features from the last layer of the model, these will be contextualized, because each token representation will be the result of on-the-fly modeling with its context.
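
A minimal sketch of how these layer-wise features can be obtained with the HuggingFace Transformers library used in this work (the model name and the choice of layer index here are illustrative assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("A short example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states      # tuple: embedding layer + one tensor per encoder layer
non_contextualized = hidden_states[0]      # 'layer 0': output of the embedding lookup
contextualized = hidden_states[10]         # e.g. layer 10: contextualized token features
print(non_contextualized.shape, contextualized.shape)  # both (1, seq_len, hidden_dim)
```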


Figure 3.1: Full model architecture to obtain compact representations from sub-word to sentence level and evaluate them on downstream tasks.

Compressing sequences of representations

Our main objective is now to compress the representations from the BERT output and evaluate them. Once we have token-level representations, we divide the compression procedure into two steps: chunking and generation. Here is a brief description of these concepts, which are presented in more depth in Section 4:

• Chunking: this is the process of dividing the representations in a sequence into 'buckets' or 'groups' to be compressed; we call these groups chunks. Generally, these groups will be sequential, so this process can be conceptualized as placing brackets in a sequence around elements that can be compressed into a single representation.

• Generation: this second step consists of generating a new representation for each chunk we want to compress, which means modeling a few-to-one relationship, with variable input size in the sequence length dimension.


Evaluating compressed representations

The next step in the evaluation pipeline is to feed these new compressed representations into specific networks (i.e. classifiers) that perform downstream tasks. During training, the training signal for the classifiers is obtained by combining the task losses into a global loss and performing backpropagation.

If the compression stage is non-parametrized, the compressed representation remains completely task agnostic, given that no task-specific signal informs its behaviour. When, instead, the compression stage is parametrized, a learning signal must be obtained by either:

• Downstream Task Loss: a combination of the losses from the downstream tasks. This approach can induce task-specific optimization; to minimize this effect, it is important to choose more than one downstream task, preferably capturing different aspects of language.

• Self-supervised or unsupervised objective: applying objectives that do not rely on the performance in downstream tasks, such as the Inverse Cloze Task (ICT) presented in Section 5.3.

Under this setup, different compression approaches will be tested and evaluated, and given that the compression ratio is controllable, we obtain a generic compression-performance trade-off that makes approaches comparable with one another, as shown in Figure 3.2. As mentioned in the introduction, the 'compression ratio' is defined as the compressed length over the original length and is never related to the size of each representation. Most of our quantitative analyses are based on analysing this compression-performance trade-off under a combination of settings. Also, we always normalize the performance of compressed representations with respect to the original non-compressed performance, which enables fair comparison of compression approaches with different original performance.
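
For concreteness, the two quantities that the later plots are built from can be computed as follows (a trivial sketch with our own variable names):

```python
def compression_ratio(compressed_length: int, original_length: int) -> float:
    # Fraction of the original sequence length that remains after compression, in (0, 1].
    return compressed_length / original_length

def normalized_performance(compressed_score: float, original_score: float) -> float:
    # Downstream-task score obtained with compressed representations,
    # relative to the score obtained with the original (non-compressed) ones.
    return compressed_score / original_score
```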

Downstream tasks for evaluation

We choose a subset of 3 tasks from the GLUE Benchmark along with their datasets (Wang et al., 2018); the choice of tasks aims to cover a variety of complexity levels to maximize the interpretability of results.


Figure 3.2: Comparison method for different strategies of compression and generation, based on performance on downstream tasks. The compression ratio is obtained by dividing the length of the compressed sequence by the length of the original sequence.

• Sentiment Analysis on Stanford Sentiment Treebank v2 (SST2) (Socher et al., 2013): SST2 is a binary text classification task for sentiment, either positive or negative. Pieces of text to classify are gathered from a movie reviews corpus. This task is the simplest one explored, and current SOTA is on par with human performance, at a nearly perfect +97%.

• Language Inference on Quora Question Pairs (QQP): Language Inference is the task of determining the relationship between texts. In this case, the QQP dataset task consists of determining whether pairs of questions are semantically equivalent, which means this is a binary classification task over sequence pairs.

• Language Inference on Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2018): the MNLI task consists of determining the relationship between a premise and a hypothesis: entailment, neutrality or contradiction. This dataset comes in two flavours: matched, where topics in the training corpus are the same as those in the test set, and mismatched, where topics from the test set are different from those in the training set. In this work, we use the matched dataset.

Task | Train | Test | Sample | Labels
SST2 | 67k | 1.8k | "The mesmerizing performances of the leads keep the film grounded and keep the audience riveted." | Positive, Negative
QQP | 364k | 391k | "What is crossing-over in meiosis?" ?= "What is the process of crossing over in meiosis?" | Duplicate, Not Duplicate
MNLI | 393k | 20k | "It happened here about 2.1 billion years ago." ⇒? "The event was last week." | Entailment, Contradiction, Neutrality


Chapter 4

Modeling Compression

In this chapter, we explain the considerations and design choices made throughout this work to explore compression.

4.1 Compression and Information Theory

We now provide an overview of the classical problem of compression and Information Theory. The goal of this section is not to present an exhaustive overview, but to help build an intuition around the concepts of entropy, information, uncertainty and compression, which will be useful for later discussion.

Information can be understood as the resolution of uncertainty (Bishop, 2006, Chapter 1.6). The amount of information revealed upon knowing the outcome of a stochastic event depends on how surprising it is: the more surprising, the more informative. For instance, the observation that it is raining in Amsterdam is less informative than the observation that it is raining in Barcelona, because the latter is more surprising (i.e. it has lower probability). Similarly, the observation that the sun will rise tomorrow carries no information, because we are certain of it and thus it is completely redundant. Shannon's information content of an outcome x is then formally defined in terms of the probability of this outcome as h(x) = −log P(x) (Shannon, 1948). See Appendix A.1 for its formal derivation.

Now let us consider an Ensemble X, a set of stochastic variables whose outcome possibilities are finite, for which we can enumerate all possible outcomes $A_X$. The entropy H(·) of an Ensemble X is then defined as its expected information content, $H(X) := \mathbb{E}[h(x)] = -\sum_{x \in A_X} P(x) \log P(x)$.

Shannon's Source Coding Theorem establishes formal constraints on how briefly messages from a stochastic source (i.e. an Ensemble) can be encoded with a guarantee of not losing information (see the formal Theorem in Appendix A.3). The key takeaway from this theorem is that optimal compression is lower bounded by the entropy of the source of information, which is determined by the probability distribution of its variables (i.e. $H(X) = -\sum_{x \in A_X} P(x) \log P(x)$). It is useful to think of this as the fact that very sharp distributions –where the probability of one outcome is much higher than the rest– present lower entropy, thus carry less information and can be compressed into shorter codings.
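
A small numeric sketch of these quantities (our own illustrative code, not part of the thesis pipeline):

```python
import math

def information_content(p: float) -> float:
    # Shannon information of an outcome with probability p, in bits: h(x) = -log2 P(x)
    return -math.log2(p)

def entropy(distribution) -> float:
    # H(X) = -sum_x P(x) log2 P(x): the expected information content of the ensemble
    return sum(-p * math.log2(p) for p in distribution if p > 0)

# A sharp distribution carries little information and can be encoded briefly on average;
# a uniform distribution over the same outcomes is maximally entropic.
print(entropy([0.97, 0.01, 0.01, 0.01]))  # roughly 0.24 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2 bits
```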

The Source Coding Theorem has precise implications for symbol codes, where we encode messages as strings of discrete symbols, such as binary codings of 0s and 1s. This states (extracted from (MacKay, 2003, Chapter 5.4), originally (Shannon, 1948)):

Theorem 4.1.1 (Source Coding Theorem for symbol codes) There exists a variable-length encoding C of an ensemble X such that the average length of an encoded symbol, L(C, X), satisfies L(C, X) ∈ [H(X), H(X) + 1). The average length is equal to the entropy H(X) only if the codelength for each outcome is equal to its Shannon information content.

Given that a source's entropy H(X) is determined by the probability distribution of its variables, if we want to optimize how many bits we spend to encode each message from the Ensemble X, the codelength for a given message needs to be equal to its information content. To create such a compression scheme, we need a 'model' of the variables involved in the ensemble; that is, a function of all joint probabilities from which we can recover marginals and conditionals.

In real-life problems, however, perfect 'models' of the probability space of possible messages are intractable (i.e. it is unfeasible to exactly calculate the information of a full document as the probability of that exact string of text). Instead, approximations are used, such as assuming parts of the message are independent, which breaks messages into smaller pieces that do not require modeling the probability space of the message as a whole, but as a set of smaller parts. For instance, Huffman Coding (Huffman, 1952) generates codings for sequences of symbols that are optimal under the assumption that the symbols are independent and identically distributed.


Dictionary Codings

A practical framework for compressing data is Dictionary Coding, which identifies chunks of symbols that can be compressed by referencing them as an address in a data structure (a dictionary), such as (Ziv and Lempel, 1977). Our framework of chunking and generating (Section 3) draws inspiration from this family of algorithms, where groups of representations (fine representations) are identified (chunks) that are then compressed by generating coarser representations.

4.2 Geometric Compression

In this work, we explore a narrow perspective on compression, which we call Geometric Compression, in which we attempt to use the geometry of the representations as a proxy for a model of the information interactions between variables, which casts compression as a clustering problem.

The motivation and intuition behind focusing on Geometric Compression is the following: distributed semantic representations are defined by their geometry in an inner-product space, that is, with a well defined distance metric. This space is learned through pre-training tasks based on information theoretic interactions of these representations (Kong et al., 2020); for example, Masked Language Modeling as originally used in BERT (Devlin et al., 2018) can be formally expressed as maximizing a lower bound on the Mutual Information objective (the InfoNCE (van den Oord et al., 2018) objective, Equation 4.1) between a context and masked tokens:

$$I(A, B) \geq \mathbb{E}_{p(A,B)}\left[f_\theta(a, b)\right] - \mathbb{E}_{q(\tilde{B})}\left[\log \sum_{\tilde{b} \in \tilde{B}} \exp f_\theta(a, \tilde{b})\right] + \log |\tilde{B}| \quad (4.1)$$

where $I(\cdot, \cdot)$ is the mutual information, $a$ and $b$ are different views of an input sequence (such as a word and its context), $\tilde{B}$ is a set of positive and negative samples drawn from a proposal distribution $q(\tilde{B})$, and $f_\theta \in \mathbb{R}$ is a function parametrized by $\theta$.

We hypothesize that we can use clustering on the learned embedding space to find groups of representations –the chunks introduced in Section 3– that represent meaningful ‘units of information’ and that can be compressed more effectively in the context of sequences of semantic representations.

The idea of Geometric Compression is also related to notions from kernel methods, in which a kernel function $k(x, x') = \phi(x)^T \phi(x')$ defines a pairwise metric for points in raw representation space (Bishop, 2006, Chapter 6). In this case, one could conceptualize the idea of geometric compression as finding kernel functions k that estimate the Mutual Information between high dimensional variables (embeddings), or simply a metric that is an adequate proxy for their compression. In this sense, the idea of geometric compression could be achieved by only learning kernel functions and clustering based on their metrics instead of explicitly learning representations, although we do not pursue this approach in this work.

The constraint of formulating compression as a clustering task is also motivated by the extensive work on this front for other applications (Jain et al., 1999).

The two main procedures for compression that this work is concerned with are presented in the following sections: chunking and generation.

4.3 Chunking

Most generally, chunking is the task of segmenting n representations into m ≤ n chunks, where m is not necessarily fixed but a result of the chunking function. For instance, one simple chunking function segments a sequence into a predefined number of segments S, and another segments a sequence into segments of length l, a predefined chunk size. Mathematically, this can be formulated as a function that returns a label (chunk membership) for each element in a sequence:

$$\text{chunk}(X) = y, \qquad X \in \mathbb{R}^{L \times D},\; y \in \mathbb{N}^{L \times 1} \quad (4.2)$$

where y(i) is the chunk assignment for embedding i of the sequence X, with i = 1, · · · , L and L the length of the sequence of representations. This formulation is convenient because it can be easily cast into a clustering formulation. Although clustering is often presented without a formal definition, it is widely understood as the "unsupervised classification of patterns into groups" (Jain et al., 1999), which effectively means the assignment of a "cluster label" to each datapoint in a set of points. Thus, the chunking process is equivalent to a clustering function which returns a cluster label for each datapoint. Unlike most clustering problems, however, we assume our datapoints (embeddings) to be sequences instead of sets, where ordering is present, so our clustering (chunking) functions will impose constraints based on the sequentiality of elements in clusters (chunks), either explicitly (by a connectivity matrix) or implicitly (by function design).

Figure 4.1: Diagram of chunking, best seen in color.

4.3.1 Non-Parametrized

The most elementary version of non-parametrized chunking is often implicitly done by most neural networks within the operation of pooling, whenever a set of representations is aggregated into a single one. In this case, the chunking function assigns the same single chunk to all representations in a set or sequence. In this work, we consider five main forms of non-parametrized chunking, explained below.

Baselines

• Fixed Output chunking: this baseline consists of dividing a sequence into N predefined chunks of equal size, taking into account that the chunk sizes must be natural numbers. For instance, a sequence of length L = 6, [v_1, v_2, v_3, v_4, v_5, v_6], with Fixed Output chunking of N = 3 will result in a segmentation of y = [c_1, c_1, c_2, c_2, c_3, c_3], where c_i is the cluster label. The segmentation must be as close to equal size as possible, with the smaller 'remainder' chunk at the end of the sequence.

• Fixed Size chunking: this next baseline consists of fixing the size of the chunks and segmenting a sequence with chunks of that size. For instance, a sequence of length L = 5, [v_1, v_2, v_3, v_4, v_5], and a fixed chunk size of N = 3 will be segmented into y = [c_1, c_1, c_1, c_2, c_2]. As with Fixed Output chunking, whenever exact equal-size segmentation is not possible, the smaller 'remainder' chunk will be the last one.

• Random Chunking: to establish a baseline on the relevance of chunking itself, we perform some of our experiments using random chunking. This approach is a modification of fixed size chunking where, given a sequence and a base span s ∈ ℕ, we repeatedly draw a span from a uniform distribution of integer values U[1, 2s) and generate a chunking until the end of the sentence is reached. We choose this mechanism to make the expected compression of a sequence equivalent to that of the original fixed size chunking, given that $\mathbb{E}_{s' \sim U[1, 2s)}[s'] = s$. A minimal sketch of these three baselines is given after this list.
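
The following is a minimal sketch of these three baselines as label-producing functions in the sense of Equation 4.2 (our own illustrative code; details such as how the remainder chunk is handled may differ from the thesis implementation):

```python
import random

def fixed_output_chunking(length: int, num_chunks: int) -> list[int]:
    # Split `length` positions into `num_chunks` contiguous chunks of (near) equal size;
    # the smaller 'remainder' chunk ends up at the end of the sequence.
    size = -(-length // num_chunks)  # ceiling division
    return [min(i // size, num_chunks - 1) for i in range(length)]

def fixed_size_chunking(length: int, chunk_size: int) -> list[int]:
    # Contiguous chunks of a predefined size; the remainder chunk is the last one.
    return [i // chunk_size for i in range(length)]

def random_chunking(length: int, base_span: int) -> list[int]:
    # Span lengths drawn from U[1, 2*base_span); the expected span equals base_span.
    labels, label, position = [], 0, 0
    while position < length:
        span = random.randint(1, 2 * base_span - 1)
        labels.extend([label] * min(span, length - position))
        position += span
        label += 1
    return labels

print(fixed_output_chunking(6, 3))  # [0, 0, 1, 1, 2, 2]
print(fixed_size_chunking(5, 3))    # [0, 0, 0, 1, 1]
```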

Proposed

• Frequency based chunking: this chunking approach models each representation as marginally independent, and chunks a sequence following a probabilistic budget, which can also be formulated as a self-information budget. This means that we assign a probability to each representation based on its frequency in a corpus, its marginal likelihood P(v_i), and, assuming that sequential representations are marginally independent, $P(v_i \mid v_{k \neq i}) = P(v_i)$, we model the probability of a sequence as the product of their likelihoods:

$$P(v_{i:j}) = \prod_{k=i}^{j} P(v_k) \quad (4.3)$$

Whenever this joint likelihood drops below a threshold, we consider the chunk to be at 'full capacity', as its likelihood is below the budget. For numerical stability and ease of computation, we formulate this budget in terms of negative log-likelihood, where the joint log-likelihood is obtained as the sum of log-marginals (instead of their product), and the probabilistic threshold is defined as a log-probability:

$$\log P(v_{i:j}) = \sum_{k=i}^{j} \log P(v_k) \quad (4.4)$$

This can also be conceived as a chunk's self-information budget, where we impose that each chunk's self-information $-\log P(v_{i:j})$ must be below a given threshold. In terms of algorithmic implementation, the model sums the log-likelihoods of the representations iteratively and assigns the same chunk label until the information budget is reached. Then, a new chunk is assigned to that representation and the process is restarted. This process does not guarantee an optimal use of the information budget, given that small 'remainder' chunks can be left at the end of the sequence; however, we discarded a fully optimal 'sequential sum clustering' that also maximizes the uniformity of chunk self-information because of its implementation complexity.

We choose not to use frequencies from our datasets and rely instead on the pre-training data, to avoid any task-specific optimization. However, token frequencies from the pre-training corpus are not available; the tokenizer implementation (Wolf et al., 2019) only provides their ranking according to frequency, so we use the empirical Zipf Law approximation for the English language (Powers, 1998) from Equation 4.5, and use it for the inverse probability weighting used in (Arora et al., 2017) (Equation 4.6):

$$P(w) \propto \frac{1}{r_w} \;\;\rightarrow\;\; \log P(w) = -\alpha \log r_w + C \quad (4.5)$$

where $r_w$ is the rank of word w and α is the exponent coefficient (empirically determined to be close to 1). The log-likelihood can be derived up to an additive constant C, which theoretically represents the "base frequency" of the word at rank 1, which in the English case is the word the, sitting at P(the) ≃ 0.07, resulting in approximately C = log P(the) ≃ −2.7. Nevertheless, this approximation is known to overestimate the probabilities of frequent words (Powers, 1998), which means that the optimal fitting C will not correspond to the one calculated here, so we treat this additive constant as a hyperparameter of the chunking function. A minimal sketch of this budget-based chunking is given after this list.

• Hierarchical Clustering chunking: hierarchical clustering is a method of cluster analysis that performs clustering by building a hierarchical tree over a set of points. Given a set of datapoints, the clustering process works either (1) by iteratively aggregating datapoints from the bottom up (agglomerative clustering) or (2) by a top-down approach of iteratively dividing the set into smaller sets. Aggregating or splitting datapoints and sets of points is performed according to a distance metric and a linkage strategy. The distance is the function that defines a metric between pairs of individual datapoints, and the linkage criterion is the equivalent distance metric for sets of points, usually defined in terms of the distance metric, which enables the merging or splitting of the closest sets at each iteration. In this work, we focus on bottom-up Agglomerative clustering as a chunking function.

Two properties make Agglomerative Clustering adequate for our purpose of Geometric Compression. First, the stopping criterion for the algorithm can be defined in terms of a distance-based threshold instead of a predefined number of desired clusters. This aligns with the proposed third constraint that determining what to compress should be controllable and not dependent on length. Secondly, sequentiality constraints can be imposed by providing a connectivity matrix of the datapoints, which specifies which merges are allowed. For instance, imposing full sequentiality means that the datapoint in position 2 cannot be in the same cluster as datapoint 5 unless datapoints 3 and 4 are also in the same cluster.
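
Below is the minimal sketch of the frequency-based (self-information budget) chunking referenced above; it is our own illustrative code, the budget value and the toy frequency ranks are assumptions, and in the thesis the ranks come from the tokenizer rather than being hard-coded:

```python
import math

def zipf_log_prob(rank: int, alpha: float = 1.0, c: float = -2.7) -> float:
    # Empirical Zipf approximation of Equation 4.5: log P(w) = -alpha * log(r_w) + C
    return -alpha * math.log(rank) + c

def frequency_budget_chunking(token_ranks: list[int], budget: float) -> list[int]:
    # Keep adding tokens to the current chunk while its self-information
    # (-sum of log-marginals, Equation 4.4) stays within the budget;
    # a token that would exceed it starts a new chunk.
    labels, label, info = [], 0, 0.0
    for rank in token_ranks:
        token_info = -zipf_log_prob(rank)
        if info > 0 and info + token_info > budget:
            label, info = label + 1, 0.0
        labels.append(label)
        info += token_info
    return labels

# Frequent tokens (low rank) add little self-information, so several of them share
# a chunk, while rare (high-rank) tokens quickly exhaust the budget on their own.
print(frequency_budget_chunking([1, 2, 5, 20000, 3, 50000, 4], budget=12.0))
```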

4.3.2 Parametrized

Parametrized chunking can be formulated as a metric learning problem in which an embedding space is transformed into a new space where pairwise embedding distances learn to satisfy an objective function, either supervised or self-supervised. After the space transformation, the method of hierarchical clustering discussed for non-parametrized chunking is applicable. In this work we only do representation learning in the synthetic retrieval experiments, in which we apply a contrastive learning objective, developed in Section 5.3.

4.4 Generation

Generation is the process by which a single representation is produced from a chunk. As with chunking, generation is implicitly done in many neural architectures under the name of pooling, such as mean pooling or max pooling.


Figure 4.2: Diagram of the generation procedure, given the original embeddings (pale blue) and their chunkings (color framings).

4.4.1 Non-parametrized

We consider the following non-parametrized generating strategies:

• Random generation: similarly to random chunking, we perform some experiments using 'random generation' as a baseline to assess the relevance of generation itself. Given a chunk C of length L, this function randomly samples an index from a uniform distribution of natural numbers, i ∼ U[1, L], and returns the representation at index i as the compressed representation for the chunk C.

• Mean and max pooling generation: Neural Networks have included pooling operations for many years as a means to model an N-to-one downsizing of features. The mean-pooling operation consists of averaging values along the dimension that is being compressed, and max-pooling consists of taking the maximum value along the compression dimension. In computer vision and convolutional architectures, 2D max-pooling is ubiquitous, and in Natural Language Processing, mean pooling and max pooling are often used to compress sequences of features into a single feature vector. However, unlike in Computer Vision applications, we apply pooling over embeddings without previous ReLU activation functions, so our max-pooling implementation actually pools based on absolute values.


• Frequency-based generation: as proposed by (Arora et al., 2017), inverse frequency weighting can be a powerful non-parametrized tweak to aggregate word embeddings, although in their work they experiment with non-contextualized embeddings and it is unclear how much this effect translates to contextualized embeddings. As discussed before, token frequencies from the pre-training corpus are not directly available to us, but we can use the frequency approximation from the Zipf Law in Equation 4.5, as with frequency based chunking, and generate the compressed embedding G(v_c) as a weighted sum of its parts:

$$G(v_c) = \sum_{w \in c} \frac{a}{p(w) + a} v_w, \qquad v_c \in \mathbb{R}^{L \times D} \quad (4.6)$$

where $v_c = [v_{w_1}, v_{w_2}, \cdots, v_{w_L}]$ are the embeddings in a chunk c of size L and a is a hyperparameter shaping the weight-probability function, with the weight bounded between 0 and 1. We set a = 0.001 in accordance with (Arora et al., 2017). A minimal sketch of these non-parametrized generation strategies is given after this list.
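
A minimal sketch of these non-parametrized generators, operating on a single chunk of embeddings (our own illustrative NumPy code; `probs` would come from the Zipf approximation of Equation 4.5):

```python
import numpy as np

def random_generation(chunk: np.ndarray) -> np.ndarray:
    # chunk: (chunk_len, dim); return one randomly chosen member representation
    return chunk[np.random.randint(len(chunk))]

def mean_pool(chunk: np.ndarray) -> np.ndarray:
    return chunk.mean(axis=0)

def abs_max_pool(chunk: np.ndarray) -> np.ndarray:
    # max pooling on absolute values (no preceding ReLU), keeping the signed entry
    idx = np.abs(chunk).argmax(axis=0)
    return chunk[idx, np.arange(chunk.shape[1])]

def frequency_weighted(chunk: np.ndarray, probs: np.ndarray, a: float = 0.001) -> np.ndarray:
    # Equation 4.6: weighted sum with weights a / (p(w) + a), following Arora et al. (2017)
    weights = a / (probs + a)
    return (weights[:, None] * chunk).sum(axis=0)
```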

4.4.2 Parametrized

We experiment with two different parametrized architectures for generation:

• 1D Convolution + Attention: We implement a simple Neural Network architecture that models an N-to-one relationship, composed first of a 1D convolution along the sequence dimension, followed by a single-head Attention layer (Bahdanau et al., 2015) in which the context is the original sequence and the query is a set of pre-defined learnable tensors, thus making the output of fixed size. Given a context $c \in \mathbb{R}^{D \times L}$ and a query $q_\theta \in \mathbb{R}^{D \times N}$ (with D the dimensionality, L the context length and N the fixed query length, parametrized by θ), the output $h \in \mathbb{R}^{D \times N}$ is the weighted sum of the context with the attention weights $\alpha_{ij}$, which are the softmax of the dot product between element i of the context and element j of the query. C is a linear transformation applied to the context c, and the output has an added non-linearity (tanh).

$$\alpha_{ij} = \frac{\exp(q_j^T C c_i)}{\sum_{k}^{L} \exp(q_j^T C c_k)} \quad (4.7)$$

$$h_j = \tanh \sum_{i}^{L} \alpha_{ij} c_i \quad (4.8)$$

Figure 4.3: Parametrized Generator architecture.

Figure 4.4: LSTM-based Generator architecture.

As shown in Figure 4.3, we add a final linear layer, avoiding the constraint on the output to lie between −1 and 1 that the tanh activation function would otherwise impose.

• Long Short Term Memory (LSTM): The second parametrized generation approach we implement is based on a recurrent network, to cover a wider range of architecture families. LSTMs (Hochreiter and Schmidhuber, 1997) are one of the most widespread variants of recurrent networks; in this work we use them to process the representations in a chunk, from which we obtain hidden representations, and then perform max pooling to downscale them to a single representation, the output of the generator function (see Figure 4.4).
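The following is a minimal sketch of the 1D convolution + attention generator described by Equations 4.7 and 4.8; the kernel size, the use of a single learnable query (N = 1) and the layer sizes are illustrative assumptions rather than the exact thesis configuration:

import torch
import torch.nn as nn

class AttentionGenerator(nn.Module):
    def __init__(self, dim, n_queries=1, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.query = nn.Parameter(torch.randn(n_queries, dim))   # learnable query q_theta
        self.C = nn.Linear(dim, dim, bias=False)                 # linear map applied to the context
        self.out = nn.Linear(dim, dim)                           # final linear layer (Figure 4.3)

    def forward(self, chunk):                                    # chunk: (L, D)
        c = self.conv(chunk.t().unsqueeze(0)).squeeze(0).t()     # 1D convolution over the sequence
        scores = self.query @ self.C(c).t()                      # (N, L): q_j^T C c_i
        alpha = torch.softmax(scores, dim=-1)                    # attention weights (Equation 4.7)
        h = torch.tanh(alpha @ c)                                # weighted sum + tanh (Equation 4.8)
        return self.out(h)                                       # (N, D); with N = 1, one vector per chunk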


This chapter has presented our approach to modeling compression, which is the backbone of our contributions, along with all the different strategies we consider in our experiments, whose results are presented in the next chapter.


Chapter 5

Results

We develop all our models primarily using GPU-accelerated PyTorch (Paszke et al., 2017) via its Python API and the HuggingFace Transformers library (Wolf et al., 2019). The code is publicly available at this GitHub repository.

5.1 Quantitative Experiments

We now present the results of the experiments performed on downstream tasks according to our methodology section.

Preliminary considerations

In the introduction chapter we provide an informal definition of what makes a good representation: “that which makes subsequent learning tasks easier” (Goodfellow et al., 2016). In accordance with this principle, we constrain our models and experiments precisely to gain insights on the representation side rather than the modeling one:

• We limit the complexity of the classifier networks, implementing simple existing architectures, specified in Appendix B.2. The rationale for this constraint is that we want to assess the quality of the representations, not of the classifier itself.


• We mask the special representations produced by BERT (<cls> and <sep> tokens), given that these special tokens capture sentence-level semantics that we want to prevent the model from using (a minimal sketch of this masking is given after this list).
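As an illustration, the special-token positions can be masked using the tokenizer’s special token ids; this is only a sketch, assuming batched tensors for the embeddings and input ids, not the exact thesis implementation:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def mask_special_tokens(embeddings, input_ids):
    # embeddings: (batch, seq_len, dim); input_ids: (batch, seq_len)
    keep = (input_ids != tokenizer.cls_token_id) & (input_ids != tokenizer.sep_token_id)
    return embeddings * keep.unsqueeze(-1)   # zero out the <cls> and <sep> representations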

As mentioned in Section 3, most of the presented results are meant to be analyzed comparatively within this work, not to maximize performance, which is why most results are presented as plots instead of tables.

Agglomerative Chunking

For the agglomerative chunking approach, we tested cosine similarity with average linkage and found that this combination produced unstable clusters, where big clusters tend to ‘capture’ the rest of the datapoints. We settled on the Ward linkage strategy (Ward, 1963), which joins the clusters that minimize the within-cluster variance (the sum of squared distances to the cluster center, \sum_i ||\bar{x} - x_i||^2) and is known to be more stable with respect to cluster sizes. Along with it, we use the Euclidean distance metric on normalized embeddings, given that our implementation, based on Python’s scipy library, only allows Euclidean distance with Ward linkage.
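A minimal sketch of this clustering step using scipy’s hierarchical clustering on normalized embeddings follows; the function name and the way chunks are finally derived from the cluster labels are assumptions, not the exact thesis code:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def agglomerative_labels(embeddings, n_chunks):
    # embeddings: array of shape (L, D); normalize so Euclidean distances are taken on unit vectors
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    Z = linkage(X, method="ward")                         # Ward linkage on Euclidean distances
    return fcluster(Z, t=n_chunks, criterion="maxclust")  # one cluster label per wordpiece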

Performance of Transformer Layers

We start with a comparison of performance on the tasks depending on which Transformer output layer we use as features for the downstream tasks, shown in Figure 5.1. We perform this experiment to justify our selection of the layer whose contextualized embeddings we use in further experiments. The zeroth layer (non-contextualized) performs substantially worse than the contextualized ones, and the very last layers slightly underperform the preceding contextualized ones; this is a known effect, presumably because the last layers specialize in modeling fine-grained syntactic phenomena instead of semantics (Devlin et al., 2018). Based on this experiment, we decide to conduct our following experiments with features from layers 0 and 10 as non-contextualized and contextualized embeddings respectively. According to the literature (Devlin et al., 2018; Camacho-Collados and Pilehvar, 2018), using layers close to the last but not the last yields optimal performance in downstream tasks, so we have no reason to believe that using features from other deep layers such as 9 or 11 would change the results substantially.
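For reference, per-layer features can be obtained from the HuggingFace Transformers library as sketched below; this assumes a library version that exposes hidden_states on the model output (older versions return them inside a tuple), and the example sentence is arbitrary:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("A short example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states   # 13 tensors: embedding output + 12 Transformer layers
layer0_features = hidden_states[0]      # non-contextualized features
layer10_features = hidden_states[10]    # contextualized features used in the experiments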

From our choice of layers 10 and 0, we retrain each task classifier until full convergence (+200k steps) and use those classifier checkpoints for the following experiments; the specifics of this run are given in Appendix C.1.2.


Figure 5.1: Non-compressed performance (accuracy) on tasks using each layer of the Transformer’s output as features for the task specific network, trained for 24k steps. Results from 5 different training runs with different random seeds, with the line as the average and shaded area the standard deviation.

              SST2              MNLI              QQP
Features      dev     test      dev     test      dev     test
Layer 0       0.867   0.862     0.589   0.567     0.833   0.787
Layer 10      0.923   0.904     0.736   0.711     0.888   0.853

Table 5.1: Gold accuracies on the development and test sets of the different datasets, using features from layers 0 and 10 of pre-trained bert-base-uncased BERT (Devlin et al., 2018).

From now on, we refer to these as the gold or non-compressed checkpoints and performances; their accuracies are presented in Table 5.1. The boldface results are used to normalize the compression-performance trade-off figures in the following sections.
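For clarity, this normalization can be written as follows (this reading is consistent with the worked examples later in this section, e.g. 0.5 × 0.787 for QQP and 0.6 × 0.567 for MNLI, both of which use the test accuracies from Table 5.1):

\text{relative performance}(r) = \frac{\text{accuracy of the compressed representations at ratio } r}{\text{gold (non-compressed) accuracy}}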

5.1.1 Non-parametrized chunking, non-parametrized generation

We start by compressing sequences with both non-parametrized chunking and generation. The compression-performance trade-off when using layer 10 is compared to that of the non-contextualized embeddings obtained from layer 0 of the same model, across the different strategies. Throughout the results, red is used to present contextualized results and blue to present uncontextualized ones.

Zero-shot compression

Figures 5.2, 5.3 and 5.4 show the loss in accuracy with respect to the non-compressed representations in a zero-shot transfer setting (see footnote 3), i.e. where compression is performed without retraining the classifier network. A useful lens for interpreting these results is as a degree of invariance of the classifier networks under the different chunking and pooling strategies.

From these experiments we highlight the following phenomena:

1. Despite no fine-tuning of classifiers, performance loss is not catastrophic in most of the cases, in the sense that performance is comfortably above a random baseline and the decrease in performance is considerably smooth, showing clear trends.

2. Differences between chunking strategies are in general minimal, with some exceptions where agglomerative chunking marginally underperforms the rest. However, generator functions do make a considerable difference, and mean pooling is clearly the most robust generating strategy overall in the zero-shot setting, though with some notable exceptions. This is the case, for instance, for the SST2 task with layer 0 embeddings, where mean pooling introduces a high degree of degradation and frequency-based and max pooling mitigate that phenomenon. However, these same two pooling strategies introduce a mode of failure for the QQP task, where performance ultimately drops below what a random baseline would achieve for non-contextualized embeddings (e.g. a relative performance of 0.5 corresponds to an absolute accuracy of 0.5 × 0.787 ≈ 0.39, below the 0.5 random baseline).

This effect could be related to the specific architecture of the SST2 classifier, which is based on a recurrent network, whereas the classifiers for the other two tasks, QQP and MNLI, are mainly based on an attention mechanism.

3. Generally, the harder the task, the steeper the degradation we observe. This is especially true for contextualized representations, whereas non-contextualized representations show more irregular behaviour.

Footnote 3: The term ’zero-shot’ in Machine Learning usually refers to a task where a model is evaluated on identifying classes that have not been seen during training; however, we borrow the term here to emphasize the fact that the model is evaluated on out-of-distribution data.



4. Non-contextualized embeddings tend to degrade to a greater degree, which is also an important, non-trivial effect given the normalization of the results: not only do the non-contextualized embeddings do worse in absolute terms, but their relative rate of degradation is also higher. This effect has its exceptions, such as max pooling on SST2, where contextualized embeddings seem to enter a mode of high failure at high compression rates while non-contextualized embeddings do not experience this effect. Also, for the MNLI task, the non-contextualized performance is very weak to begin with (an accuracy of 0.567), which naturally flattens the slope of degradation because there is little margin left to decrease: at a relative performance of 0.6, the absolute accuracy would be equivalent to that of a random classifier (0.6 × 0.567 ≈ 0.34).

Finally, in Figure 5.5, we observe the effects of performing random chunking in comparison to hard chunking, both with mean pooling, in the same zero-shot transfer setting. The degradation of contextualized embeddings is mild in comparison to that of hard chunking, whereas non-contextualized compression suffers slightly more from the random chunking.

Finetuning of classifier network

Figures 5.6, 5.7 and 5.8 show the results when fine-tuning the classifier networks for 10k steps, comparing hard and agglomerative chunking with mean, max and frequency pooling respectively. We also add random chunking to observe the base relevance of chunking, and we introduce a new chunking variant where features from layer 0 are chunked according to the agglomerative clustering obtained from layer 10. Fine-grained specifications of the configuration for these runs can be found in Appendix C.2.2.

The results show less degradation across the board (relative performance is maintained above 0.9 in most cases), and some of the observations made before still apply:

1. Mean pooling consistently performs best. Max and frequency pooling perform similarly on SST2 and QQP, but frequency pooling substantially underperforms the rest on MNLI, especially for non-contextualized embeddings.


Figure 5.2: Zero-shot transfer of compression with fixed, hard, frequency-based and agglomerative chunking strategies; panels (a) mean pooling, (b) frequency-based pooling, (c) max pooling. Features from layer 0 (blue dashed) and layer 10 (red dotted) as a function of compression.


Figure 5.3: Zero-shot transfer of compression with fixed, hard, frequency-based and agglomerative chunking strategies; panels (a) mean pooling, (b) frequency-based pooling, (c) max pooling. Features from layer 0 (blue dashed) and layer 10 (red dotted) as a function of compression.


Figure 5.4: Zero-shot transfer of compression with fixed, hard, frequency-based and agglomerative chunking strategies; panels (a) mean pooling, (b) frequency-based pooling, (c) max pooling. Features from layer 0 (blue dashed) and layer 10 (red dotted) as a function of compression.


Figure 5.5: Zero-shot performance with random chunking and mean pooling. We fix the y-axis’ range to facilitate visual comparison.
