Meta-Learning for Few-Shot and Continual Learning in NLP


by

Nithin Holla

12166804

August 10, 2020

48 ECTS
November 2019 - August 2020

Supervisors:
Dr. Ekaterina Shutova
Dr. Helen Yannakoudakis
Pushkar Mishra

Assessor:
Dr. Wilker Aziz


Acknowledgments

The past two years of the master's have been an incredible learning experience. I am extremely grateful to my supervisors Dr. Ekaterina Shutova, Dr. Helen Yannakoudakis and Pushkar Mishra for giving me the opportunity to work on this research topic for the thesis. Their support, guidance and timely feedback have been instrumental in making the thesis materialize into its current form. Thanks also to Dr. Wilker Aziz for being the assessor for the thesis.

I would like to thank Lotta Meijerink, who has been a good friend and a fantastic study partner during the coursework. Interactions with her also helped me get acquainted with Dutch culture and the Dutch way of life. I am very thankful to Shantanu Chandra and Phillip Lippe for the fruitful discussions where we exchanged ideas and shared insights while working on the thesis. I am also grateful to Shantanu, Phillip and Lotta for reviewing the thesis and providing useful feedback. I would like to thank Mahsa Mojtahedi, Bella Nicholson and Fernanda Duarte for some good times during the last two years. Special thanks go out to Ellis van Beek and Shantanu Chandra for making the COVID-19 quarantine a little bit better.

Finally, I would like to thank my parents for their support and being the primary reason behind me pursuing the masters. This thesis is dedicated to their efforts over all these years to make it possible.


Abstract

The ability to learn new tasks rapidly with just a few examples by leveraging past experience, as well as the ability to learn new tasks continuously during a lifetime are hallmarks of human intelligence. In stark contrast, current deep learning models lack this versatility and struggle to learn in the absence of a large amount of task-specific labeled data and suffer from catastrophic forgetting when learning tasks sequentially. In natural language processing, even though large transformer-based language models have achieved great success on downstream tasks, they too overfit on a small number of examples during fine-tuning and are susceptible to catastrophic forgetting. In this work, we demonstrate that meta-learning can be an overarching learning paradigm that can overcome both these shortcomings on language tasks, taking one step closer to achieving general linguistic intelligence. Firstly, we propose a meta-learning framework for few-shot word sense disambiguation. We show that by training on several disambiguation tasks on a set of words, it is possible to effectively disambiguate new words with as few as four examples. Secondly, we present a meta-learning approach combined with sparse experience replay for continual learning on language tasks, achieving state-of-the-art performance on lifelong text classification and relation extraction among comparable methods. We show that our approach is, in addition, efficient in terms of both computation and memory.


List of Figures

2.1 Example meta-learning setup
2.2 Training and test setup for siamese neural networks
2.3 Overview of matching networks
2.4 Illustration of prototypical networks
2.5 Illustration of meta networks
2.6 Illustration of LSTM meta-learner
2.7 Illustration of MAML
3.1 Number of meta-test episodes vs. number of support set senses
3.2 Number of meta-test episodes vs. number of query set senses
3.3 WSD model architecture
3.4 F1 score distribution
3.5 Effect of number of episodes
3.6 Effect of number of senses
3.7 t-SNE visualization with GloVe
3.8 t-SNE visualization with ELMo
4.1 Architecture of OML
4.2 Architecture of ANML
4.3 Task distribution for relation extraction
4.4 Visualization of neuromodulation


List of Tables

3.1 Statistics of our few-shot WSD dataset
3.2 Best hyperparameters for WSD
3.3 Average macro F1 scores of the meta-test words
3.4 Average macro F1 scores of the meta-test words with second-order gradients
3.5 Challenging cases
4.1 Hyperparameters for continual learning
4.2 Test set accuracy on text classification
4.3 Test set accuracy on relation extraction
4.4 Ablation study
4.5 Effect of replay rate


Chapter 1

Introduction

The human brain has the exceptional ability of learning new tasks with little information and sometimes with just a few minutes of experience. For example, we can recognize new objects by seeing only a few instances of them. In language, we can use a new word in conversation or writing after encountering its usage in a few sentences. Furthermore, we are capable of learning new tasks continually during our lifetimes, for instance, first learning to ride a bicycle and then a motorbike or learning languages one after another. In doing so, we do not eventually forget to ride a bicycle or to speak languages that we acquired early on in life. Current deep learning systems, however, have fundamental limitations with regard to both these aspects. Firstly, they require a large amount of data to learn any new task, and secondly, they tend to forget the knowledge gained from previous tasks when learning a new task, a phenomenon known as catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990; French, 1999).

The goal in few-shot learning is to learn tasks with just a handful of training examples, and the goal in lifelong learning or continual learning (Thrun, 1998a) is to accelerate learning by positive transfer between tasks while minimizing interference with respect to network updates (Riemer et al., 2019). Both few-shot and continual learning are important in the real world where task-specific data is often scarce and non-stationarity is inevitable due to the continuously evolving nature of data. Therefore, we need to design more generalizable systems as well as more robust learning mechanisms to deal with interference.

Meta-learning has paved the way to directly optimize for the desired learning behavior instead of combining manually-designed building blocks (Clune, 2019). In this thesis, we show that meta-learning is an overarching learning mechanism that advances both few-shot and continual learning on natural language processing (NLP) tasks. Since few-shot learning on word-level semantic tasks is relatively unexplored, we first develop a meta-learning framework for the challenging task of few-shot word sense disambiguation, where the objective is to learn to disambiguate new words with very few labeled examples. Next, since current methods for continual learning on language tasks involve computational bottlenecks, we develop an algorithm based on meta-learning and sparse experience replay that is efficient in terms of computation and memory capacity.

1.1 Motivation

An important distinction between how learning occurs in humans and in deep learning models is that humans possess a wealth of prior experience that they can leverage to quickly adapt to a new task, whereas deep learning models learn the task from scratch, starting from a random initialization. Transfer learning (Caruana, 1993) is a popular approach to knowledge transfer that uses models pre-trained with a large amount of data from one or a few tasks and fine-tunes them on data from a new task. In NLP, despite the success of large pre-trained models such as BERT (Devlin et al., 2019), they require a lot of in-domain examples for training on a new task and are also prone to catastrophic forgetting, indicating that they are a far cry from general linguistic intelligence (Yogatama et al., 2019). These models, like other deep learning models, overfit on a small number of examples and are unable to learn effectively when faced with non-stationary data distributions.

Meta-learning, commonly called learning to learn (Schmidhuber 1987; Bengio et al. 1991; Thrun 1998b), is a learning paradigm that aims to acquire knowledge from across a set of related tasks so that at test time it can learn a new task with only a few examples. Its objective is to learn the learning process to enable knowledge transfer and fast adaptation. Meta-learning has emerged as a promising approach to few-shot learning. It has achieved success in computer vision (Triantafillou et al., 2020; Fontanini et al., 2019; Wang et al., 2020) and reinforcement learning (Wang et al., 2016; Duan et al., 2016; Alet et al., 2020). It has also recently made its way into NLP, and has been applied to machine translation (Gu et al., 2018), relation classification (Obamuyide and Vlachos, 2019a), text classification (Yu et al., 2018), and sentence-level semantic tasks (Dou et al., 2019; Bansal et al., 2019). However, in NLP, few-shot learning on lexical (word-level) semantic tasks in particular has received little attention. Natural language is inherently ambiguous, with many words having a range of possible meanings. Word sense disambiguation (WSD) is a core lexical semantics task, where the goal is to associate words with their correct contextual meaning from a pre-defined sense inventory. WSD has been shown to improve downstream tasks such as machine translation (Chan et al., 2007) and information retrieval (Zhong and Ng, 2012). However, it is considered an AI-complete problem (Navigli, 2009): it requires an intricate understanding of language as well as real-world knowledge. Since every word has a different number of senses, traditional supervised WSD models typically need a significant number of labeled examples per word. Our key idea of designing a meta-learning framework for few-shot WSD stems from the need to be able to learn from just a few labeled examples per word in order to alleviate this problem.

Most works on continual learning focus on vision tasks, and continual learning on language tasks in particular has not been studied extensively so far (Li et al., 2020). Currently, popular approaches to continual learning are based on manually-designed heuristics such as regularization (Kirkpatrick et al., 2017) or gradient alignment (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019) to mitigate catastrophic forgetting. A recent trend in machine learning is to instead automatically learn generalizable solutions to problems via meta-learning (Clune, 2019). In line with this, meta-learning has been applied with the objective of learning new tasks continually with a relatively small number of examples per task (Javed and White, 2019; Beaulieu et al., 2020) or in a traditional continual learning setup by interleaving with several past examples from a memory component (Riemer et al., 2019; Obamuyide and Vlachos, 2019b). While a high rate of experience replay (Lin, 1992) usually mitigates catastrophic forgetting, it is expensive and does not scale well to realistic scenarios. Therefore, incorporating sparse experience replay into meta-learning can lead to more viable solutions.

1.2 Contributions

Our contributions can be grouped into two parts – designing a meta-learning framework for few-shot WSD and for continual learning on language tasks – which we detail below.


Few-shot WSD We present the first meta-learning approach to few-shot WSD. We propose models that learn to rapidly disambiguate new words from only a few labeled examples. Meta-learning approaches have so far been typically tested in an N-way, K-shot classification setting where each task has N classes with K examples per class. Owing to its nature, WSD exhibits inter-word dependencies within sentences, has a large number of classes, and has inevitable class imbalances, all of which present new challenges compared to the aforementioned controlled setup. To address these challenges we extend three popular meta-learning algorithms to this task: prototypical networks (Snell et al., 2017), model-agnostic meta-learning (MAML) (Finn et al., 2017) and a hybrid thereof, ProtoMAML (Triantafillou et al., 2020). We investigate meta-learning using three underlying model architectures, namely recurrent networks, multi-layer perceptrons (MLP) and transformers (Vaswani et al., 2017), and experiment with a varying number of sentences available for task-specific fine-tuning. We evaluate the models' rapid adaptation ability by testing on a set of new, unseen words, thus demonstrating their ability to learn new word senses from a small number of examples. We show that it is possible to obtain as much as 72% F1 score on average using only four sentences to learn to disambiguate a new word. Since there are no few-shot WSD benchmarks available, we create a few-shot version of a publicly available WSD dataset. We release our code as well as the scripts used to generate our few-shot data setup to facilitate further research.¹

Continual language learning We propose a novel approach to continual learning on language tasks using meta-learning and experience replay that is sparse in time and size. We consider the realistic setting where only one pass over the training set is possible and no task identifiers are available. We extend two algorithms, online meta-learning (OML) (Javed and White, 2019) and a neuromodulatory meta-learning algorithm (ANML) (Beaulieu et al., 2020), to language tasks and beyond their original setup of learning to continually learn such that a new sequence of tasks can be learned during meta-test time by better mitigating catastrophic forgetting. We show that combining a strong language model such as BERT along with meta-learning and sparse replay via an episodic memory module produces state-of-the-art performance on lifelong text classification and relation extraction benchmarks when compared against current methods under the same realistic setting. To the best of our knowledge, this is the first meta-learning approach to continual learning of language tasks that incorporates sparse replay. Through further experiments, we demonstrate that our method is considerably more efficient than previous work in terms of computational complexity as well as memory usage. To facilitate further research in the field, we make our code publicly available.²

We believe that this work brings us one step closer to the grand ambition of achieving general linguistic intelligence. Our work underscores the strength of meta-learning in being a unifying learning paradigm that can solve two of the biggest challenges in NLP (and machine learning in general), namely few-shot learning and continual learning.

1.3 Outline

The thesis is organized into five chapters in total. In Chapter 2, we first discuss meta-learning and continual learning in general to make the rest of the chapters comprehensible. Then, we highlight some related work on meta-learning and continual learning in NLP. We present our work on few-shot WSD in Chapter 3 and our work on continual learning on language tasks in Chapter 4. Finally, we provide concluding remarks and present some ideas for future work in the final chapter.

¹ https://github.com/Nithin-Holla/MetaWSD


Chapter 2

Background and Related Work

In this chapter, we provide an overview of meta-learning as well as continual learning. In Section 2.1, we introduce the concept of meta-learning and discuss the different terminologies. In Sections 2.2, 2.3 and 2.4, we present some of the existing and well-known approaches to meta-learning. Section 2.3 serves to provide a full picture of meta-learning and is not required to follow the rest of the thesis. In Section 2.5, we introduce continual learning formally and describe three well-known approaches. Finally, we discuss previous work on few-shot and continual learning in NLP. In Section 2.6, we highlight works that applied meta-learning to few-shot learning tasks in NLP and in Section 2.7, we highlight some papers on continual learning on language tasks.

2.1 What is meta-learning?

Meta-learning, sometimes referred to as learning to learn (Schmidhuber 1987; Bengio et al. 1991; Thrun 1998b), is a learning paradigm that aims to quickly learn a new task with very few training examples. To achieve this, a model is trained on a large number of related tasks so that it can acquire knowledge from across these tasks that helps in generalizing to an unseen task. Having acquired information on how to learn the different tasks, it would only need a few examples to adapt to a new task. A typical meta-learning setup consists of two components: a learner that adapts to each task from the small amount of training data pertaining to the task, and a meta-learner that guides the learner by acquiring knowledge that is common to all the tasks. While transfer learning fine-tunes a pre-trained model on a particular task, meta-learning seeks to explicitly build a model that allows quick fine-tuning on a larger set of tasks.

At test time, we present a previously unseen task to the trained model and want it to learn to perform the task from a few examples. Thus, at test time, there is a small training set to train on and a test set to evaluate on. For gradient-based learning, this means that the model has to take only a few gradient steps to learn the task. To facilitate this form of learning, we have to match the training conditions to the testing conditions, which leads to a different training setup compared to traditional machine learning. The training set and test set in meta-learning are called the meta-training set and the meta-test set respectively. The meta-training set is made up of a set of tasks instead of just a set of data points. Each task in the meta-training set or meta-test set is presented in the form of an episode consisting of:

• Support set: A small number of training examples for task adaptation.
• Query set: Test examples to evaluate the performance on the task.


Figure 2.1: 2-way, 2-shot few-shot learning setup with two tasks ($\mathcal{T}_1$, $\mathcal{T}_2$) in the meta-training set $\mathcal{D}_{\text{meta-train}}$ and one task ($\mathcal{T}_3$) in the meta-test set $\mathcal{D}_{\text{meta-test}}$. Each task $\mathcal{T}_i$ has a support set $\mathcal{D}^{(i)}_{\text{support}}$ of four examples and a query set $\mathcal{D}^{(i)}_{\text{query}}$ of two examples. Each colour indicates a different class.

More formally, let the meta-training set be denoted as $\mathcal{D}_{\text{meta-train}}$ and the meta-test set as $\mathcal{D}_{\text{meta-test}}$. Both sets consist of episodes created from tasks $\mathcal{T}_i$ drawn from a probability distribution over tasks $p(\mathcal{T})$. For a given task $\mathcal{T}_i$, the support set is $\mathcal{D}^{(i)}_{\text{support}} = \{(x^{(i)}_j, y^{(i)}_j)\}_{j=1}^{|S|}$, where $x^{(i)}_j$ denotes the input examples, $y^{(i)}_j$ denotes the corresponding labels and $|S|$ denotes the size of the support set. Similarly, the query set is $\mathcal{D}^{(i)}_{\text{query}} = \{({x'}^{(i)}_j, {y'}^{(i)}_j)\}_{j=1}^{|Q|}$, where $|Q|$ denotes the size of the query set.

Most works on classification with meta-learning have an $N$-way, $K$-shot setup where $N$ is the number of classes and $K$ is the number of examples per class. In this case, each episode has $N \cdot K$ examples in its support set. However, this is not a necessary requirement but rather a simplified setting. A more realistic setup would have a different number of classes and a different number of examples per class in each episode/task (Triantafillou et al., 2020).
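As an illustration of the episodic setup described above, the following is a minimal sketch of how one N-way, K-shot episode could be sampled from a labeled dataset. The function name `sample_episode`, the `n_query` parameter (query examples per class) and the toy data are assumptions made for this example; they are not taken from the thesis code.

```python
import random
from collections import defaultdict


def sample_episode(examples, n_way, k_shot, n_query, rng):
    """Sample one N-way, K-shot episode from (input, label) pairs.

    Hypothetical helper: returns a support set with n_way * k_shot
    examples and a query set with n_way * n_query examples.
    """
    by_label = defaultdict(list)
    for x, y in examples:
        by_label[y].append(x)
    # Only classes with enough examples for both sets are eligible.
    eligible = sorted(c for c, xs in by_label.items()
                      if len(xs) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        # Sampling without replacement keeps support and query disjoint.
        picked = rng.sample(by_label[c], k_shot + n_query)
        support += [(x, c) for x in picked[:k_shot]]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query


# Toy data: 5 classes with 6 placeholder examples each.
toy = [(f"x{c}_{j}", c) for c in range(5) for j in range(6)]
support, query = sample_episode(toy, n_way=2, k_shot=2, n_query=1,
                                rng=random.Random(0))
print(len(support), len(query))  # 4 2
```

In the more realistic setting mentioned above, `n_way` and `k_shot` would vary from episode to episode instead of being fixed.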

To illustrate the different terminologies introduced so far, we show an example of a 2-way, 2-shot setup in Figure 2.1. The example shows two episodes from tasks $\mathcal{T}_1$ and $\mathcal{T}_2$ and their corresponding support and query sets. The meta-test set consists of a previously unseen task $\mathcal{T}_3$. We want the learned model to use a few examples from the support set of this task and produce a good performance on its query set.

Meta-learning approaches are categorized into three types based on their learning mechanism: (1) metric-based, (2) model-based, and (3) optimization-based. We delve into the details of each of the categories and highlight some important works in the following sections. The extensive literature review provides a comprehensive overview of meta-learning. However, the methods that we use in the thesis are only the ones presented in Sections 2.2.3, 2.4.2, 2.4.3 and


2.2 Metric-based meta-learning

Metric-based methods first embed the examples in each episode into a high-dimensional space and then obtain the probability distribution over labels for all the query examples based on a kernel function that measures the similarity between them and the support examples. For a given query example, the kernel function places higher probability on classes for which there is a higher similarity to their support examples, and vice-versa. The kernel function is typically a pre-defined similarity metric whereas the embedding function is parameterized by a neural network. The goal is then to obtain good embeddings that minimize the loss function defined on the query examples. We now describe three well-known metric-based meta-learning methods, namely siamese neural networks, matching networks and prototypical networks.

2.2.1 Siamese Neural Network for One-Shot Learning

Koch et al. (2015) proposed a siamese neural network architecture for one-shot learning. A siamese network is a twin network which receives a pair of data points as input, produces an embedding of the data points and then computes the distance between these embeddings based on a metric. The distance is then used to compute the probability that the pair belongs to the same class. Just like siamese twins, the embedding networks are identical, sharing the same weights, while the distance function joins the two outputs together. The objective of the training procedure is to produce embeddings with a small distance for pairs from the same class and a large distance for pairs from different classes.

Suppose the embedding network is represented as $f_\theta$ with parameters $\theta$. For the input pair $(x_i, x_j)$, the authors use the L1 distance as the metric to obtain the probability of belonging to the same class:

$$
p(x_i, x_j) = \sigma\left(w^T \left|f_\theta(x_i) - f_\theta(x_j)\right|\right) \tag{2.1}
$$

where $w$ is a vector of additional parameters that controls the weights on each component of the distance and $\sigma$ is the sigmoid activation function. The approach easily extends to other choices of the distance metric. The training proceeds with the standard binary cross-entropy loss function with the label being 1 if the pair comes from the same class and 0 otherwise.

For their experiments, they use the Omniglot dataset (Lake et al., 2011), which contains handwritten characters from 50 different alphabets. They use a subset of the alphabets to train on the verification task, i.e., learning to discriminate between the classes of the image pairs. The remaining alphabets are used along with the trained network in different one-shot tasks. Each task consists of only one labeled image per class. More specifically, suppose there are $C$ classes. A test image $x$ is compared with every other image $x_i$, $i = 1, \ldots, C$, and the class of the image with the highest score $p$ is the prediction:

$$
\hat{y}(x) = y\left(\arg\max_{x_i} p(x, x_i)\right) \tag{2.2}
$$

where $y$ denotes the known true label of an image and $\hat{y}$ is the predicted label. The general setup of the verification and the one-shot tasks is shown in Figure 2.2.
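The verification score and the one-shot prediction rule can be sketched in a few lines. This is only an illustration under stated assumptions: the embeddings are hand-written 2-d vectors standing in for the output of the embedding network, and the weight vector `w` is fixed to negative values (rather than learned) so that a larger distance gives a lower same-class probability. The function names are hypothetical.

```python
import math


def same_class_prob(emb_a, emb_b, w):
    """Eq. 2.1: sigmoid of the weighted component-wise L1 distance."""
    score = sum(w_k * abs(a - b) for w_k, a, b in zip(w, emb_a, emb_b))
    return 1.0 / (1.0 + math.exp(-score))


def one_shot_predict(test_emb, support_embs, support_labels, w):
    """Eq. 2.2: return the label of the highest-scoring support example."""
    probs = [same_class_prob(test_emb, s, w) for s in support_embs]
    best = max(range(len(probs)), key=probs.__getitem__)
    return support_labels[best]


# Assumed values: in the real model, w and the embeddings are learned.
w = [-1.0, -1.0]
support_embs = [[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]  # one per class
support_labels = ["a", "b", "c"]
prediction = one_shot_predict([4.5, 5.2], support_embs, support_labels, w)
print(prediction)  # b
```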

The underlying assumption in this approach is that the representations learned during the verification task are useful for the one-shot learning tasks. However, this only holds if the data in both the tasks is similar. The performance is expected to degrade as the data in these two tasks diverge from each other, for example if they come from different distributions.


Figure 2.2: The general training and test setup of siamese neural networks for one-shot learning. Figure taken from Koch et al. (2015).

2.2.2 Matching Networks

Matching networks were proposed by Vinyals et al. (2016) for the problem of one-shot learning. Their model is a classifier $c_S$ for every support set $S$ consisting of a small number $N$ of input-label pairs $\{(x_i, y_i)\}_{i=1}^{N}$, one for each of the $N$ classes. There are two embedding functions, $f$ and $g$, parameterized by neural networks and used to embed the query examples and the support examples respectively. For example, they can be convolutional neural networks for vision tasks and word embedding networks for language tasks. A high-level representation of the method is shown in Figure 2.3. Given a query example $x$, the probability of it belonging to class $c$ is obtained as:

$$
p(y = c \mid x, S) = \sum_{i=1}^{N} a(x, x_i)\, \mathbb{I}(y_i = c) \tag{2.3}
$$

where $\mathbb{I}$ is the indicator function and $a$ is an attention mechanism, which is a softmax over the cosine similarity $d$ (called cosine distance in the original paper) between the embeddings of the query example and the support examples, defined as

$$
a(x, x_i) = \frac{\exp(d(f(x), g(x_i)))}{\sum_{j} \exp(d(f(x), g(x_j)))} \tag{2.4}
$$

The classifier is essentially an extension of the weighted nearest-neighbor classifier. The embeddings can be obtained in two ways: simple embeddings and full context embeddings.

Simple Embedding In the simple case, both $f$ and $g$ receive a single input example and produce the embedding, i.e., $f$ embeds a query example $x$ whereas $g$ embeds a support example $x_i$.
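With simple embeddings, the classifier of Eqs. 2.3 and 2.4 reduces to an attention-weighted vote over the support labels. The sketch below assumes the embeddings have already been computed (the vectors are made-up placeholders for the outputs of $f$ and $g$) and uses cosine similarity as the score inside the softmax; `matching_probs` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matching_probs(query_emb, support_embs, support_labels):
    """Eqs. 2.3-2.4: softmax attention over the support set, then a
    label-wise sum of the attention weights."""
    exps = [math.exp(cosine(query_emb, s)) for s in support_embs]
    total = sum(exps)
    probs = {}
    for e, label in zip(exps, support_labels):
        probs[label] = probs.get(label, 0.0) + e / total
    return probs


# Placeholder embeddings standing in for g(x_i) and f(x).
support_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
support_labels = ["a", "b", "c"]
probs = matching_probs([0.9, 0.1], support_embs, support_labels)
print(max(probs, key=probs.get))  # a
```

Because cosine similarity is bounded in [-1, 1], the resulting distribution is fairly flat; the prediction is nonetheless the class of the most similar support example.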

Full Context Embeddings The authors note that it is better to have the embedding functions be influenced by the entire support set $S$ instead of embedding each example independently.


Figure 2.3: Overview of matching networks. Figure taken from Vinyals et al. (2016).

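similar to $x_i$ so that the function for embedding is modified. A bidirectional LSTM (Hochreiter and Schmidhuber, 1997) is used as the encoding function $g(x_i, S)$. Suppose $g'(x_i)$ is a feature extractor network (a CNN for images, for instance), then

$$
\overrightarrow{h}_i, \overrightarrow{c}_i = \text{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1}) \tag{2.5}
$$

$$
\overleftarrow{h}_i, \overleftarrow{c}_i = \text{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1}) \tag{2.6}
$$

$$
g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i) \tag{2.7}
$$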

To enable the support set $S$ to influence the embedding of the test image, the function $f$ uses read-attention over $S$:

$$
f(x, S) = \text{attLSTM}(f'(x), g(S), K) \tag{2.8}
$$

where $f'(x)$ is a feature extractor network (such as a CNN for images), $g(S)$ is the embedding function $g$ applied to each $x_i \in S$, and $K$ is the fixed number of processing steps. The read-attention mechanism is based on the set-to-set paradigm proposed by Vinyals et al. (2015) and works as follows:

$$
\hat{h}_k, c_k = \text{LSTM}(f'(x), [h_{k-1}, r_{k-1}], c_{k-1}) \tag{2.9}
$$

$$
h_k = \hat{h}_k + f'(x) \tag{2.10}
$$

$$
r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i) \tag{2.11}
$$

$$
a(h_{k-1}, g(x_i)) = \text{softmax}\left(h_{k-1}^T g(x_i)\right) \tag{2.12}
$$

where $a$ is content-based attention. For $K$ steps of reads,

$$
\text{attLSTM}(f'(x), g(S), K) = h_K \tag{2.13}
$$

Training An important contribution of this work is the proposal that the training condition has to match the test condition. Therefore, training is performed on batches of tasks called episodes, where each task $T$ is a distribution over the possible label sets $L$. An episode is formed by first sampling a label set $L$ and then sampling examples from $L$ to form the support set $S$ and query set $Q$. The model parameters $\theta$ are trained to minimize the error in predicting the labels in the query set $Q$ conditioned on $S$. The training objective is thus:

$$
\theta = \arg\max_{\theta} \mathbb{E}_{L \sim T}\left[\mathbb{E}_{S \sim L, Q \sim L}\left[\sum_{(x, y) \in Q} \log p_\theta(y \mid x, S)\right]\right] \tag{2.14}
$$

2.2.3 Prototypical Networks

Prototypical networks proposed by Snell et al. (2017) tackle the problem of few-shot learning in general instead of one-shot learning. The idea is based on clustering as well as nearest-neighbor classification. They use an embedding network $f_\theta$ parameterized by $\theta$ and compute a prototype vector for every class as the mean of the embeddings of all the support data points of that class. Suppose $S_c$ denotes the subset of the support set containing examples from class $c \in C$. The prototype is

$$
\mu_c = \frac{1}{|S_c|} \sum_{(x_i, y_i) \in S_c} f_\theta(x_i) \tag{2.15}
$$

Given a distance function d defined on the embedding space, the distribution over classes for a query point x is calculated as a softmax over negative distances to the prototypes:

$$
p(y = c \mid x, \theta) = \text{softmax}(-d(f_\theta(x), \mu_c)) = \frac{\exp(-d(f_\theta(x), \mu_c))}{\sum_{c' \in C} \exp(-d(f_\theta(x), \mu_{c'}))} \tag{2.16}
$$

An illustration is provided in Figure 2.4. The method is applicable to any distance function as long as it is differentiable. The loss function for training is the negative log-likelihood of the true class $c^*$:

$$
J(\theta) = -\log p(y = c^* \mid x, \theta) \tag{2.17}
$$

Training episodes are generated by randomly choosing a subset of classes and then sampling some examples for the support and query sets.

Figure 2.4: Illustration of prototypical networks for three classes $c_1$, $c_2$, $c_3$. Figure taken from Snell et al. (2017).

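If the squared Euclidean distance is used, it can be shown that the model is equivalent to a linear model with a particular parameterization:

$$
-\|f_\theta(x) - \mu_c\|^2 = -f_\theta(x)^T f_\theta(x) + 2\mu_c^T f_\theta(x) - \mu_c^T \mu_c \tag{2.18}
$$

The first term is constant with respect to class $c$, so it does not affect the softmax probabilities and can thus be dropped. The remaining terms, $2\mu_c^T f_\theta(x) - \mu_c^T \mu_c$, are linear in $f_\theta(x)$, with weights $2\mu_c$ and bias $-\mu_c^T \mu_c$.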
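The prototype computation and the equivalence to a linear model can be checked numerically. The sketch below is an illustration under assumptions: the embeddings are made-up 2-d vectors standing in for the output of the embedding network, and the function names are hypothetical. It computes class prototypes as in Eq. 2.15, the class distribution of Eq. 2.16 with squared Euclidean distance, and the same distribution from the linear form that remains after dropping the class-independent term in Eq. 2.18.

```python
import math


def prototypes(support_embs, support_labels):
    """Eq. 2.15: prototype = mean of a class's support embeddings."""
    sums, counts = {}, {}
    for emb, c in zip(support_embs, support_labels):
        acc = sums.setdefault(c, [0.0] * len(emb))
        for k, v in enumerate(emb):
            acc[k] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def softmax(logits):
    """Numerically stable softmax over a dict of class logits."""
    m = max(logits.values())
    exps = {c: math.exp(z - m) for c, z in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def distance_probs(q, protos):
    """Eq. 2.16 with squared Euclidean distance."""
    return softmax({c: -sum((a - b) ** 2 for a, b in zip(q, mu))
                    for c, mu in protos.items()})


def linear_probs(q, protos):
    """Same distribution from the linear logits 2 mu^T q - mu^T mu."""
    return softmax({c: 2 * sum(a * b for a, b in zip(mu, q))
                       - sum(b * b for b in mu)
                    for c, mu in protos.items()})


# Placeholder support embeddings: two examples for each of two classes.
support_embs = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [0.0, 6.0]]
support_labels = ["a", "a", "b", "b"]
protos = prototypes(support_embs, support_labels)
p_dist = distance_probs([1.0, 1.0], protos)
p_lin = linear_probs([1.0, 1.0], protos)
print(protos)  # {'a': [1.0, 0.0], 'b': [0.0, 5.0]}
```

The two sets of logits differ only by the class-independent constant $-f_\theta(x)^T f_\theta(x)$, which the softmax cancels, so `p_dist` and `p_lin` agree up to floating-point error.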


2.3 Model-based meta-learning

Model-based approaches try to achieve rapid learning directly through their architectures. To this end, they typically employ external memory so as to remember key examples encountered in the past and thus avoid forgetting. Rapid learning on a new task occurs by swift retrieval of relevant content from memory. Next, we outline two model-based meta-learning methods – memory-augmented neural networks and meta networks.

2.3.1 Memory-Augmented Neural Networks

There exists a general class of neural network architectures that are equipped with an external memory storage. Recurrent modules such as LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014), however, only have internal memory and thus do not belong to this category. The Neural Turing Machine, abbreviated as NTM (Graves et al., 2014), is one such model that consists of two main components: the external memory storage itself and a controller that controls the read and write operations depending on the input.

Since an NTM can encode previously encountered information in its memory, Santoro et al. (2016) propose an NTM-based model for meta-learning which they call Memory-Augmented Neural Network (MANN). While the original NTM implementation used both content-based and location-based memory retrieval, they use purely content-based retrieval. The controller is a feed-forward network or an LSTM network that takes the input $x_t$ at time step $t$ and produces a key vector $k_t$. Suppose $M_t$ is an $N \times M$ memory matrix at step $t$, where $N$ is the number of memory locations and $M$ is the number of dimensions. The memory read is performed using soft attention. The soft attention mechanism produces a read weight vector $w^r_t$ where the weight on each memory row $i$ is computed as a softmax over the cosine similarity between the memory content $M_t(i)$ and the key $k_t$:

$$
w^r_t(i) = \text{softmax}\left(\frac{k_t \cdot M_t(i)}{\|k_t\| \, \|M_t(i)\|}\right) \tag{2.20}
$$

The final retrieved vector, which serves as an input to the classification layer, is a weighted sum of the memory contents:

$$
r_t = \sum_{i=1}^{N} w^r_t(i) M_t(i) \tag{2.21}
$$
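The content-based read of Eqs. 2.20 and 2.21 can be sketched as follows. This is an illustrative sketch, not the MANN implementation: the memory and key are small hand-written lists, memory rows are assumed to be non-zero so that the cosine similarity is defined, and `read_memory` is a hypothetical name.

```python
import math


def read_memory(key, memory):
    """Eqs. 2.20-2.21: softmax over the cosine similarities between the
    key and each memory row, followed by a weighted sum of the rows."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    exps = [math.exp(cosine(key, row)) for row in memory]
    total = sum(exps)
    weights = [e / total for e in exps]
    dims = len(memory[0])
    read = [sum(w * row[d] for w, row in zip(weights, memory))
            for d in range(dims)]
    return weights, read


# Assumed toy memory with N = 3 locations and M = 2 dimensions.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, read = read_memory([2.0, 0.0], memory)
print(weights.index(max(weights)))  # 0
```

The first row has the same direction as the key, so it receives the largest read weight.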

For writing into the memory, they use an access module called Least Recently Used Access (LRUA) that writes either to the least used memory location or to the most recently used memory location. The usage weights $w^u_t$ are updated at each time step by adding the current read weights $w^r_t$ and write weights $w^w_t$ to the decayed usage weights of the previous step:

$$
w^u_t = \gamma w^u_{t-1} + w^r_t + w^w_t \tag{2.22}
$$

where $\gamma$ is the decay parameter. Suppose $m(v, n)$ denotes the $n$th smallest element of a vector $v$. The least-used weights $w^{lu}_t$ are obtained from $w^u_t$ as

$$
w^{lu}_t(i) =
\begin{cases}
0 & \text{if } w^u_t(i) > m(w^u_t, n) \\
1 & \text{if } w^u_t(i) \le m(w^u_t, n)
\end{cases} \tag{2.23}
$$


where $n$ is set to equal the number of reads to memory. The write weights $w^w_t$ are computed as a gated combination of the previous read weights and the previous least-used weights:

$$
w^w_t = \sigma(\alpha) w^r_{t-1} + (1 - \sigma(\alpha)) w^{lu}_{t-1} \tag{2.24}
$$

where $\sigma$ denotes the sigmoid function and $\alpha$ is a gate parameter. The least-used memory location as indicated by $w^{lu}_t$ is set to zero and the memory contents are written as

$$
M_t(i) = M_{t-1}(i) + w^w_t(i) k_t \tag{2.25}
$$

During training, in each episode, the true label $y_t$ is presented to the model with a delay of one time step, i.e., along with the data point $x_{t+1}$, in order to force it to remember. We omit further details about the training procedure for brevity.
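The read and write operations in Equations 2.20 to 2.25 can be sketched in NumPy as follows. This is only an illustrative sketch, not the implementation of Santoro et al. (2016): the function names, the small constant `eps` guarding against empty memory rows, and the choice to erase the slots flagged by the previous least-used weights before writing are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def read(memory, key, eps=1e-8):
    """Content-based read: cosine-similarity attention (Eq. 2.20) and weighted sum (Eq. 2.21)."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key)
    sims = memory @ key / (norms + eps)  # eps (assumption) avoids division by zero for empty rows
    w_r = softmax(sims)
    return w_r, w_r @ memory

def least_used(w_u, n):
    """Eq. 2.23: 1 for slots whose usage is at most the n-th smallest usage, 0 elsewhere."""
    threshold = np.sort(w_u)[n - 1]
    return (w_u <= threshold).astype(float)

def lrua_step(memory, w_u_prev, w_r_prev, w_lu_prev, key, alpha, gamma, n_reads=1):
    """One write-then-read step with Least Recently Used Access."""
    gate = 1.0 / (1.0 + np.exp(-alpha))
    w_w = gate * w_r_prev + (1.0 - gate) * w_lu_prev  # Eq. 2.24
    memory = memory.copy()
    memory[w_lu_prev > 0] = 0.0                       # erase least-used slots (assumed ordering)
    memory = memory + np.outer(w_w, key)              # Eq. 2.25
    w_r, r = read(memory, key)
    w_u = gamma * w_u_prev + w_r + w_w                # Eq. 2.22
    w_lu = least_used(w_u, n_reads)
    return memory, w_u, w_r, w_lu, r
```

With a strongly negative gate parameter the write goes almost entirely to the least-used slot, whereas a strongly positive one overwrites the most recently read slot.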

2.3.2 Meta Networks

Designed for one-shot learning, MetaNet (Munkhdalai and Yu, 2017) consists of two learning components – a base learner that operates at the task level and a meta-learner that operates at a task-agnostic level. The meta-learner has access to a memory bank like in MANN. The base learner provides feedback to the meta-learner in the form of meta-information, which typically consists of the gradients of the task-specific loss functions, and the meta-learner rapidly parameterizes both itself and the base learner using it. In addition to the task-solving components, there is a representation learning component that learns to produce embeddings for each data point. The weights in MetaNet are of two types – slow weights and fast weights. The slow weights are updated with gradient-based optimization on a loss function whereas the fast weights are generated by a different neural network. The architecture is shown in Figure 2.5a.

More formally, we describe the different components of MetaNet below, using the notation from Weng (2018) instead of the original paper:

• Representation function $f$ parameterized by slow weights $\theta$ and fast weights $\theta^+$ that produces embeddings of a data point.

• Base learner $g$ parameterized by slow weights $\phi$ and fast weights $\phi^+$ that actually learns the task.

• An LSTM $F_w$ parameterized by $w$ that produces fast weights $\theta^+$ of $f$.

• A neural network $G_v$ parameterized by $v$ that produces fast weights $\phi^+$ of $g$.

In each episode, some examples are sampled from the support set and the representation learning losses for each example based on the embeddings produced by $f_\theta$ are computed. Fast weights $\theta^+$ are produced by the LSTM $F_w$ using the gradients of these losses with respect to $\theta$. Next, for each example $x_i$ in the support set, $g_\phi(x_i)$ is the class-wise probability distribution from which the task-level loss $\mathcal{L}_i$ can be computed. The gradient of the loss with respect to $\phi$ is the meta-information used to obtain the example-level fast weights $\phi^+_i$:

$$\phi^+_i = G_v(\nabla_\phi \mathcal{L}_i) \tag{2.26}$$

These fast weights $\phi^+_i$ are stored at row $i$ of the value memory $M$. The input $x_i$ is further encoded using both the slow and fast weights, $\theta$ and $\theta^+$, as

$$r_i = f_{\theta,\theta^+}(x_i) \tag{2.27}$$


Figure 2.5: Architecture of meta networks (a) and a layer-augmented MLP (b). Figure taken from Munkhdalai and Yu (2017).

This is achieved by layer augmentation, i.e., in each layer of the network $f$, the outputs obtained from the corresponding slow weights and fast weights are summed together as illustrated in Figure 2.5b. The representation $r_i$ is stored at row $i$ of the key memory $R$. Each example $x'_j$ in the query set is then encoded as

$$r_j = f_{\theta,\theta^+}(x'_j) \tag{2.28}$$

Soft attention weights $a_j$ are computed over $r_j$ and the key memory $R$ with a softmax over cosine similarities. The attention weights are then used to retrieve the fast weights $\phi^+_j$ from the value memory $M$ as

$$\phi^+_j = a_j^T M \tag{2.29}$$

With layer augmentation, the output class probabilities $g_{\phi,\phi^+_j}(x'_j)$ are utilized to compute the task loss again. All parameters $\theta$, $\phi$, $w$, $v$ are updated using gradients from the sum of losses for each example in the query set.

2.4 Optimization-based meta-learning

Optimization-based approaches explicitly include generalizability in their objective function and optimize for the same. Most works are based on one of these two ideas:

• Have a parameterized optimizer that updates the classifier using gradients from the support set, whereas the parameters of the optimizer are updated based on gradients on the query set.

• Have a fixed optimizer but optimize for the initial parameters of the network to enable quick learning from the support set.

We describe the LSTM meta-learner, which falls into the former category, and model-agnostic meta-learning as well as Reptile, which fall into the latter.


2.4.1 LSTM Meta-Learner

Figure 2.6: Illustration of the LSTM meta-learner. The support set is to the left of the dashed line and the query set is to its right. Figure taken from Ravi and Larochelle (2017).

Ravi and Larochelle (2017) proposed an LSTM-based optimizer (meta-learner) that is trained to optimize a neural network classifier (learner). The learner is specific to each task whereas the meta-learner captures knowledge that is common across tasks. The goal of the meta-learner is to make the learner converge to a good solution quickly on each task. Learning occurs in episodes drawn from the meta-train set $\mathcal{D}_{\text{meta-train}}$, where each episode $\mathcal{D}$ in turn comprises a support set $\mathcal{D}_{\text{support}}$ and a query set $\mathcal{D}_{\text{query}}$. The meta-learner uses the learner’s performance on $\mathcal{D}_{\text{support}}$ to update the learner to achieve high performance on $\mathcal{D}_{\text{query}}$. Suppose the learner is parameterized by $\theta$. The gradient descent update at step $t$ using the loss $\mathcal{L}_t$ on $\mathcal{D}_{\text{support}}$ is of the form

$$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t \tag{2.30}$$

where $\alpha_t$ is the learning rate at step $t$. The key observation is that this update resembles the update for the cell state in an LSTM,

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{2.31}$$

if the forget gate $f_t = 1$, the previous cell state $c_{t-1} = \theta_{t-1}$, the input gate $i_t = \alpha_t$ and the candidate cell state $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$. Instead of fixing the forget gate and input gate, it is possible to have an LSTM meta-learner that produces the update rule for the learner network. The input gate, which represents the learning rate, takes the form

$$i_t = \sigma\left(W_I \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}\right] + b_I\right) \tag{2.32}$$

It is thus a function of the current gradient $\nabla_{\theta_{t-1}} \mathcal{L}_t$, the current loss $\mathcal{L}_t$, the current parameters $\theta_{t-1}$ and the previous learning rate $i_{t-1}$. Similarly, the forget gate is

$$f_t = \sigma\left(W_F \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}\right] + b_F\right) \tag{2.33}$$

The initial cell state c0 is also treated as a parameter of the meta-learner and learned. It

corresponds to the initial weights of the learner network so that training begins from a point


To prevent an explosion in the number of meta-learner parameters, the LSTM parameters are shared across each coordinate of the learner parameters. The input to the LSTM is thus a batch of gradient coordinates and loss values $(\nabla_{\theta_{t-1,i}} \mathcal{L}_t, \mathcal{L}_t)$ for each dimension $i$. Hence, the same update rule is used for each coordinate but it is still dependent on the history of each coordinate.

During training, an episode $\mathcal{D} = (\mathcal{D}_{\text{support}}, \mathcal{D}_{\text{query}}) \in \mathcal{D}_{\text{meta-train}}$ is drawn and at each time step $t$, the meta-learner receives the loss on $\mathcal{D}_{\text{support}}$ and the corresponding gradients, $(\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t)$, as input from the learner and produces the updated parameters $\theta_t$. The process is repeated for $T$ steps, after which the learner with final parameters $\theta_T$ is evaluated on $\mathcal{D}_{\text{query}}$, and the resulting loss is then used to update the meta-learner.

It should be noted that $\mathcal{L}_t$ and $\nabla_{\theta_{t-1}} \mathcal{L}_t$ depend on the parameters of the meta-learner and thus the gradients with respect to the meta-learner parameters would need to take this dependency into account. However, to avoid taking second-order gradients, the contribution from this dependency is ignored, yielding a simplifying gradient independence assumption.
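The coordinate-wise update in Equations 2.31 to 2.33 can be sketched as below. This is a simplified illustration rather than the full two-layer LSTM of Ravi and Larochelle (2017): the function name `meta_learner_step` and the use of a single shared weight vector of length four per gate are assumptions made for the example, and the input preprocessing used in the original work is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_learner_step(theta, grad, loss, i_prev, f_prev, W_I, b_I, W_F, b_F):
    """One update c_t = f_t * c_{t-1} + i_t * (-grad), applied to every coordinate.

    theta, grad, i_prev, f_prev: arrays of shape (n,), one entry per learner parameter.
    W_I, W_F: shape (4,) and b_I, b_F: scalars, shared across all coordinates.
    """
    loss_col = np.full(theta.shape[0], loss)
    x_i = np.stack([grad, loss_col, theta, i_prev], axis=1)  # (n, 4) inputs to the input gate
    x_f = np.stack([grad, loss_col, theta, f_prev], axis=1)  # (n, 4) inputs to the forget gate
    i_t = sigmoid(x_i @ W_I + b_I)   # Eq. 2.32: learned per-coordinate learning rate
    f_t = sigmoid(x_f @ W_F + b_F)   # Eq. 2.33: learned per-coordinate weight decay
    theta_new = f_t * theta + i_t * (-grad)  # Eq. 2.31 with c = theta
    return theta_new, i_t, f_t
```

Setting the gate weights to zero, the forget-gate bias to a large positive value and the input-gate bias to the logit of a constant recovers plain gradient descent with that constant as the learning rate (Equation 2.30).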

2.4.2 MAML

Model-agnostic meta-learning (MAML) by Finn et al. (2017), as the name suggests, is a general approach that is applicable to any model trained with gradient descent and to a variety of learning problems such as classification, regression, and reinforcement learning. It is a purely optimization-based technique that does not increase the number of learnable parameters. The optimization goal is to train a model’s initial parameters such that it can perform well on a new task after only a few gradient steps on a small amount of data from that new task. This is illustrated in Figure 2.7. In other words, it seeks to build internal representations that are suitable for many related tasks so that a new task needs only a simple fine-tuning on new data. There is essentially only a learner model that adapts to each of the tasks, while the meta-learner is an optimization process. The learner is represented as $f_\theta$ with parameters $\theta$. During meta-training, tasks are drawn from a distribution of tasks $p(\mathcal{T})$. Each task $\mathcal{T}_i$ is made of a support set $\mathcal{D}^{(i)}_{\text{support}}$ and a query set $\mathcal{D}^{(i)}_{\text{query}}$. The model’s parameters are adapted from $\theta$ to the task $\mathcal{T}_i$ using $\mathcal{D}^{(i)}_{\text{support}}$ to yield $\theta'_i$. The update is performed using one or several steps of gradient descent. This step is referred to as inner-loop optimization. With only one gradient step, the update is:

$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}^s_{\mathcal{T}_i}(f_\theta) \tag{2.34}$$

where $\alpha$ is the learning rate and $\mathcal{L}^s_{\mathcal{T}_i}$ is the loss for the task computed on $\mathcal{D}^{(i)}_{\text{support}}$. Thus, each task $\mathcal{T}_i$ has an updated model $f_{\theta'_i}$. The meta-objective is to have $f_{\theta'_i}$ generalize well across tasks from $p(\mathcal{T})$:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}^q_{\mathcal{T}_i}(f_{\theta'_i}) = \min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}^q_{\mathcal{T}_i}\left(f_{\theta - \alpha \nabla_\theta \mathcal{L}^s_{\mathcal{T}_i}(f_\theta)}\right) \tag{2.35}$$

To achieve generalization, the losses $\mathcal{L}^q_{\mathcal{T}_i}$ are computed from $\mathcal{D}^{(i)}_{\text{query}}$. The optimization is over $\theta$ even though the losses are obtained from the updated parameters $\theta'_i$, which effectively optimizes for the model’s initial parameters so that it can perform a few steps of gradient descent and perform well. The meta-optimization, also called outer-loop optimization, does the update as

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}^q_{\mathcal{T}_i}(f_{\theta'_i}) \tag{2.36}$$


Figure 2.7: Illustration of MAML. Figure taken from Finn et al. (2017).

It can be seen that the meta-optimization involves computing a gradient through a gradient, i.e., the backward pass works through the update step in Equation 2.34, resulting in second-order gradients.
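To make the second-order term concrete, the sketch below computes the exact meta-gradient for a single task with quadratic losses, where the Hessian of the support loss is available in closed form. The quadratic losses, the task tuple layout and the function names are assumptions chosen purely for illustration; for neural networks the same quantity is normally obtained with automatic differentiation.

```python
import numpy as np

def quad_loss(theta, A, center):
    """Toy loss 0.5 * (theta - center)^T A (theta - center), with A symmetric."""
    d = theta - center
    return 0.5 * d @ A @ d

def quad_grad(theta, A, center):
    return A @ (theta - center)

def meta_objective(theta, task, alpha):
    """Query loss after one inner-loop step on the support loss (one term of Eq. 2.35)."""
    A_s, c_s, A_q, c_q = task
    theta_prime = theta - alpha * quad_grad(theta, A_s, c_s)  # Eq. 2.34
    return quad_loss(theta_prime, A_q, c_q)

def maml_meta_gradient(theta, task, alpha):
    """Exact gradient of meta_objective with respect to the initial parameters theta."""
    A_s, c_s, A_q, c_q = task
    theta_prime = theta - alpha * quad_grad(theta, A_s, c_s)
    # d(theta')/d(theta) = I - alpha * Hessian of the support loss: the second-order term.
    jacobian = np.eye(len(theta)) - alpha * A_s
    return jacobian.T @ quad_grad(theta_prime, A_q, c_q)
```

The factor `jacobian` is what backpropagating through Equation 2.34 contributes; it reduces to the identity only when the inner-loop learning rate is zero or the support loss has no curvature.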

2.4.3 FOMAML

First-Order MAML (FOMAML) is a simplified version of MAML, also described in Finn et al. (2017), which ignores the contribution from second-order terms in the outer-loop optimization step. This first-order approximation computes the gradients with respect to the updated parameters $\theta'_i$ rather than the initial parameters $\theta$. The outer-loop optimization step thus reduces to:

$$\theta \leftarrow \theta - \beta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \nabla_{\theta'_i} \mathcal{L}^q_{\mathcal{T}_i}(f_{\theta'_i}) \tag{2.37}$$

2.4.4 ProtoMAML

As shown in Section 2.2.3, prototypical networks with Euclidean distance are equivalent to a linear model with the following parameterization:

$$w_c = 2\mu_c \tag{2.38}$$

$$b_c = -\mu_c^T \mu_c \tag{2.39}$$

where $w_c$ and $b_c$ are the weights and biases for the output unit corresponding to class $c$ and $\mu_c$ is the class prototype. Triantafillou et al. (2020) combine the strengths of prototypical networks and MAML by initializing the final layer of the learner classifier in each episode with these prototypical network-equivalent weights and biases and continuing to learn with MAML. They call their approach ProtoMAML. While updating $\theta$, they allow the gradients to flow through the linear layer initialization. The same method, when trained using FOMAML instead, could be called ProtoFOMAML.
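The equivalence behind Equations 2.38 and 2.39 can be checked numerically. The sketch below assumes squared Euclidean distance, for which $-\|x - \mu_c\|^2 = 2\mu_c^T x - \mu_c^T \mu_c - \|x\|^2$ and the last term is constant across classes, so it cancels in the softmax. The function names are illustrative and not taken from any library.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def proto_linear_init(protos):
    """Output-layer weights and biases equivalent to a prototypical network."""
    W = 2.0 * protos                       # Eq. 2.38, one row per class
    b = -np.sum(protos * protos, axis=1)   # Eq. 2.39
    return W, b

def probs_from_distances(x, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    d2 = ((x[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return softmax(-d2)

def probs_from_linear(x, W, b):
    return softmax(x @ W.T + b)
```

Both functions return the same class probabilities for any embeddings, which is why the linear layer can be initialized this way at the start of each episode and then fine-tuned in the inner loop.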

2.4.5 Reptile

Nichol et al. (2018) introduced yet another first-order optimization-based meta-learning approach called Reptile that is similar in spirit to FOMAML. However, unlike FOMAML, it does not explicitly require a support and query set for each task, but just requires batches to perform


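is done by performing $m$ steps of gradient descent, which results in updated parameters $\theta'_i$:

$$\theta'_i = U(\mathcal{L}_{\mathcal{T}_i}, \theta, m) \tag{2.40}$$

Here, $U$ denotes any gradient-based optimization method such as SGD. It uses the loss from the task, $\mathcal{L}_{\mathcal{T}_i}$, to perform $m$ update steps. For example, for a single update step using SGD with step size $\alpha$:

$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i} \tag{2.41}$$

In the serial version, the initial parameters $\theta$ are updated immediately as:

$$\theta \leftarrow \theta + \beta(\theta'_i - \theta) \tag{2.42}$$

A batched version, on the other hand, first computes $\theta'_i$ for a batch of tasks $\mathcal{T}_i$ ($i = 1, ..., n$) and then updates $\theta$ as:

$$\theta \leftarrow \theta + \beta \frac{1}{n} \sum_{i=1}^{n} (\theta'_i - \theta) \tag{2.43}$$

The meta-gradient in Reptile used during the outer-loop optimization is defined as:

$$g_{\text{Reptile}} = \frac{1}{\alpha} \mathbb{E}_{\mathcal{T}}\left[\theta - \theta'_{\mathcal{T}}\right] \tag{2.44}$$

where $\alpha$ is the step size in the inner-loop optimization. For SGD with a single update step,

$$g_{\text{Reptile},m=1} = \mathbb{E}_{\mathcal{T}}\left[\nabla_\theta \mathcal{L}_{\mathcal{T}}\right] = \nabla_\theta \mathbb{E}_{\mathcal{T}}\left[\mathcal{L}_{\mathcal{T}}\right] \tag{2.45}$$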

This is equivalent to optimizing for the expected loss over the tasks, reducing it to joint training in a multi-task learning setup. But for more than one update step, it converges to a different solution.
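The serial and batched updates in Equations 2.42 and 2.43 can be sketched as follows, assuming plain SGD as the inner optimizer $U$ and a task represented only by a gradient function. The names `sgd_steps`, `reptile_serial_step` and `reptile_batched_step` are illustrative.

```python
import numpy as np

def sgd_steps(theta, grad_fn, alpha, m):
    """Inner loop U: m steps of SGD on one task (Eq. 2.40)."""
    theta_prime = theta.copy()
    for _ in range(m):
        theta_prime = theta_prime - alpha * grad_fn(theta_prime)
    return theta_prime

def reptile_serial_step(theta, grad_fn, alpha, beta, m):
    """Move the initial parameters towards the task-adapted ones (Eq. 2.42)."""
    theta_prime = sgd_steps(theta, grad_fn, alpha, m)
    return theta + beta * (theta_prime - theta)

def reptile_batched_step(theta, grad_fns, alpha, beta, m):
    """Average the parameter differences over a batch of tasks (Eq. 2.43)."""
    deltas = [sgd_steps(theta, g, alpha, m) - theta for g in grad_fns]
    return theta + beta * np.mean(deltas, axis=0)
```

For `m = 1`, the difference `(theta - theta_prime) / alpha` is exactly the task gradient at `theta`, which is the joint-training special case of Equation 2.45.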

2.5 Continual learning

A typical lifelong/continual learning setup consists of a stream of $K$ tasks $\mathcal{T}_1, \mathcal{T}_2, ..., \mathcal{T}_K$. For supervised learning tasks, every task $\mathcal{T}_i$ consists of a set of data points $x_j$ with labels $y_j$, i.e., $\{(x_j, y_j)\}_{j=1}^{N_i}$, that are locally i.i.d., where $N_i$ is the size of task $\mathcal{T}_i$. Regular training on the sequence of tasks leads to catastrophic forgetting. A realistic setup for continual learning is one where the goal is to learn a function $f_\theta$ with parameters $\theta$ by only making one pass over the stream of tasks and with no descriptors of the tasks $\mathcal{T}_i$ available. In multi-task learning (Caruana, 1997), on the other hand, it is possible to draw samples i.i.d. from all the tasks along with training for multiple epochs. Therefore, multi-task learning is an upper bound to continual learning in terms of performance.

Current approaches to prevent catastrophic forgetting can be grouped into one of several categories: (1) constrained optimization-based approaches with or without regularization (Kirkpatrick et al., 2017; Zenke et al., 2017; Chaudhry et al., 2018; Aljundi et al., 2018; Schwarz et al., 2018) that prevent large updates on weights that are important to previously seen tasks; (2) memory-based approaches (Rebuffi et al., 2017; Sprechmann et al., 2018; Wang et al., 2019;


approaches (Shin et al., 2017; Kemker and Kanan, 2018; Sun et al., 2020) that employ a generative model instead of a memory module; (4) architecture-based approaches (Rusu et al., 2016; Chen et al., 2016; Fernando et al., 2017) that either use different subsets of the network for different tasks or dynamically expand the networks; and (5) hybrid approaches that formulate optimization constraints based on examples in memory (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019). More recently, Riemer et al. (2019) proposed an approach based on a first-order optimization-based meta-learning algorithm, Reptile (Nichol et al., 2018), augmented with experience replay. However, it involved interleaving every training example with several examples from memory, leading to a high replay rate.

Next, we elaborate on three popular approaches for continual learning – one based on regularization and the rest based on constrained optimization involving memory.

2.5.1 Elastic Weight Consolidation

Elastic weight consolidation (EWC) (Kirkpatrick et al., 2017) is a regularization-based approach to continual learning. While learning a new task, it seeks to limit the updates on parameters that are important to previously seen tasks. This is achieved by a regularization term that constrains these important parameters to stay close to their old values.

Consider two tasks A and B that are learned consecutively. If $\theta$ is a shared set of parameters and $\theta^*_A$ denotes the optimal parameters found for task A, the loss function that is minimized while learning task B is:

$$\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta^*_{A,i}\right)^2 \tag{2.46}$$

Here, $\mathcal{L}_B$ is the loss on task B only, $\lambda$ is a parameter that controls how important task A is compared to task B and $i$ iterates over the model parameters. The importance of each of the model parameters to task A is captured in the diagonal of the Fisher information matrix $F$.

It should be noted that EWC requires task identifiers during training so that it can maintain a notion of “previous” and “new” tasks and apply the regularization term accordingly.
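The regularized objective in Equation 2.46 can be sketched as below, assuming the diagonal Fisher information for task A has already been estimated and is passed in as a vector. The function names and the callable `loss_b` are illustrative assumptions; estimating the Fisher information itself is not shown.

```python
import numpy as np

def ewc_penalty(theta, theta_star_a, fisher_diag, lam):
    """Quadratic penalty (lambda / 2) * sum_i F_i * (theta_i - theta*_{A,i})^2."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_star_a) ** 2)

def ewc_penalty_grad(theta, theta_star_a, fisher_diag, lam):
    """Gradient of the penalty: parameters with a large F_i are pulled back more strongly."""
    return lam * fisher_diag * (theta - theta_star_a)

def ewc_loss(theta, loss_b, theta_star_a, fisher_diag, lam):
    """Total loss while learning task B (Eq. 2.46); loss_b is a callable for the task B loss."""
    return loss_b(theta) + ewc_penalty(theta, theta_star_a, fisher_diag, lam)
```

A parameter with zero Fisher information contributes nothing to the penalty and is therefore free to move while task B is learned.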

2.5.2 Gradient Episodic Memory

Gradient Episodic Memory (GEM) (Lopez-Paz and Ranzato, 2017) solves a constrained optimization problem to tackle catastrophic forgetting. It consists of an episodic memory $M_k$ which stores some/all examples from task $k$. Thus, it makes use of task identifiers too. Let the model be represented as $f_\theta$ with parameters $\theta$. When it encounters an example $(x, y)$ from task $t$, it seeks to minimize the loss on the example subject to the constraint that the loss on examples stored in the episodic memory does not increase:

$$\min_\theta \mathcal{L}(f_\theta(x), y) \quad \text{s.t.} \quad \mathcal{L}(f_\theta(M_k)) \le \mathcal{L}(f^{t-1}_\theta(M_k)) \quad \forall k < t \tag{2.47}$$

where $f^{t-1}_\theta$ is the state of the predictor after learning task $t - 1$. If $g$ and $g_k$ denote the gradient of the loss on $x$ and $M_k$ respectively with respect to $\theta$, the above problem can be reformulated as finding the gradient $\tilde{g}$ that is closest to $g$ such that its dot product with $g_k$ is greater than or equal to zero for all $k < t$:

$$\min_{\tilde{g}} \frac{1}{2} \|g - \tilde{g}\|_2^2 \quad \text{s.t.} \quad \langle \tilde{g}, g_k \rangle \ge 0 \quad \forall k < t \tag{2.48}$$


Lopez-Paz and Ranzato (2017) note that this can be posed as the following quadratic program:

$$\min_v \frac{1}{2} v^T G G^T v + g^T G^T v \quad \text{s.t.} \quad v \ge 0 \tag{2.49}$$

This is a quadratic program in $t - 1$ variables with $G = \left[g_1 \; g_2 \; ... \; g_{t-1}\right]^T$. After solving this for $v^*$, the required gradient is obtained by a projection as:

$$\tilde{g} = G^T v^* + g \tag{2.50}$$

The primary drawback of GEM is that it is computationally inefficient to train – at each training step, it requires calculating gradients from all the examples in the memory and solving a quadratic program. When the size of the memory and/or the number of tasks is large, it becomes prohibitively expensive.

2.5.3 Averaged Gradient Episodic Memory

Chaudhry et al. (2019) proposed a more efficient version of GEM, called Averaged-GEM (A-GEM). It modifies the initial constraint such that at every training step, the average loss on the previous tasks does not increase, instead of constraining the losses of each of the tasks individually. This corresponds to the following constrained optimization problem:

$$\min_\theta \mathcal{L}(f_\theta(x), y) \quad \text{s.t.} \quad \mathcal{L}(f_\theta(M)) \le \mathcal{L}(f^{t-1}_\theta(M)) \quad \text{where } M = \cup_{k<t} M_k \tag{2.51}$$

Suppose $g_{\text{ref}}$ denotes the gradient computed using a randomly sampled batch from the memory. The optimization can then be formulated as:

$$\min_{\tilde{g}} \frac{1}{2} \|g - \tilde{g}\|_2^2 \quad \text{s.t.} \quad \langle \tilde{g}, g_{\text{ref}} \rangle \ge 0 \tag{2.52}$$

Instead of the $t - 1$ constraints in GEM, it has only a single constraint based on an average gradient of previous tasks. If $g^T g_{\text{ref}} \ge 0$, the constraint is already satisfied and $\tilde{g} = g$. Otherwise, the solution can be obtained by a simple projection:

$$\tilde{g} = g - \frac{g^T g_{\text{ref}}}{g_{\text{ref}}^T g_{\text{ref}}} g_{\text{ref}} \tag{2.53}$$

By replacing the quadratic program involving gradients on all past tasks with a simpler projection step involving gradients on randomly drawn samples, A-GEM offers a more efficient solution compared to GEM.
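The A-GEM update rule fits in a few lines. The sketch below assumes the gradients have been flattened into one-dimensional vectors; the function name `agem_project` is illustrative, and the early return when the constraint already holds follows the description in Chaudhry et al. (2019).

```python
import numpy as np

def agem_project(g, g_ref):
    """Return the gradient closest to g whose dot product with g_ref is non-negative.

    g: gradient on the current example(s); g_ref: gradient on a batch sampled from memory.
    """
    dot = g @ g_ref
    if dot >= 0:
        return g  # constraint of Eq. 2.52 already satisfied, no projection needed
    return g - (dot / (g_ref @ g_ref)) * g_ref  # Eq. 2.53
```

After projection the update is orthogonal to `g_ref`, so to first order it leaves the average loss on the sampled memory batch unchanged.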

2.6 Meta-learning for few-shot learning in NLP

Meta-learning in NLP is still in its nascent stages. Gu et al. (2018) apply meta-learning to the problem of neural machine translation where they meta-train on translating high-resource


et al. (2020) demonstrate that meta-learning enables zero-shot cross-lingual transfer on low-resource languages in natural language inference and question answering.

Obamuyide and Vlachos (2019a) use meta-learning for relation classification where they treat each relation as a task. Chen et al. (2019) consider relation learning by using meta-learning to do few-shot link prediction in knowledge graphs.

Dou et al. (2019) perform meta-training on certain high-resource tasks from the GLUE benchmark (Wang et al., 2018) and meta-test on certain low-resource tasks from the same benchmark. Through this, they show that the representations learned during meta-training are also useful on completely new tasks with their own distinct datasets. Bansal et al. (2019) develop a new method called LEOPARD that builds upon MAML. They note that MAML has so far been used in a fixed N-way, K-shot setting only and thus propose a softmax parameter generator component that enables a different number of classes in the meta-training tasks. They choose the tasks in the GLUE benchmark along with SNLI (Bowman et al., 2015) for meta-training and choose entity typing, relation classification, sentiment classification, text categorization, and scientific NLI as the test tasks. In particular, they study the domain transfer capability of the model with sentiment classification and scientific NLI.

Yu et al. (2018) extend metric-based meta-learning to work with multiple metrics by performing task clustering for few-shot text classification. Geng et al. (2019) develop the Induction Network for the same problem, based on the dynamic routing algorithm in capsule networks (Sabour et al., 2017) combined with an episodic meta-training strategy. A meta-learning approach to few-shot text classification with an attention mechanism was explored in Jiang et al. (2018). Sun et al. (2019) propose hierarchical attention prototypical networks for few-shot text classification. Learning the distribution of emotions from texts in a few-shot setting was proposed by Zhao and Ma (2019).

Wu et al. (2019) employ meta-reinforcement learning techniques for multi-label classification, with experiments on entity typing and text classification. Hu et al. (2019) adopt meta-learning for learning good representations of out-of-vocabulary words by framing it as a regression task. Based on our literature review, it appears that the application of meta-learning to word-level language tasks has been relatively unexplored.

2.7 Continual learning in NLP

Continual learning in NLP has received relatively little attention compared to computer vision and reinforcement learning. Wang et al. (2019) propose an alignment model named EA-EMR that limits the distortion in the embedding space in an LSTM-based architecture for lifelong relation extraction. For the same task, Obamuyide and Vlachos (2019b) show that utilizing Reptile with memory can improve performance and call their method MLLRE. Han et al. (2020) further improve relation extraction with their model, EMAR, through episodic memory activation and reconsolidation. d’Autume et al. (2019) propose a model with episodic memory called MbPA++ which incorporates sparse experience replay during training and local adaptation on K-nearest neighbors from the memory during inference. Through their experiments on sequential learning on multiple datasets of text classification and question answering with BERT, they show that their model can effectively reduce catastrophic forgetting. Sun et al. (2020) present a model based on GPT-2 (Radford et al., 2019), called LAMOL, that simultaneously learns to solve new tasks and to generate pseudo-samples from previous tasks for replay. They perform sequential learning on five tasks from decaNLP (McCann et al., 2018) as well as


Chapter 3

Few-Shot Word Sense Disambiguation

Word sense disambiguation (WSD) is a core task in natural language understanding, where the goal is to associate words with their correct contextual meaning from a pre-defined sense inventory. Approaches to WSD typically rely on (semi-)supervised learning (Zhong and Ng, 2010; Melamud et al., 2016; Kågebäck and Salomonsson, 2016; Yuan et al., 2016) or are knowledge-based (Lesk, 1986; Agirre et al., 2014; Moro et al., 2014). While supervised methods generally outperform the knowledge-based ones (Raganato et al., 2017a), they require data manually annotated with word senses, which is expensive to produce at a large scale. These methods also tend to learn a classification model for each word independently, and hence may perform poorly on words that have a limited amount of annotated data. Yet, alternatives that involve a single supervised model for all words (Raganato et al., 2017b) still do not adequately solve the problem for rare words (Kumar et al., 2019).

To address these issues, we investigate meta-learning as a means to perform few-shot WSD. By meta-training on disambiguating many words, we show that it is possible to disambiguate new words at meta-test time using only a handful of labeled examples.

3.1 Previous work on WSD

Early supervised learning approaches to WSD relied on hand-crafted features extracted from the context words (Lee and Ng, 2002; Navigli, 2009; Zhong and Ng, 2010). Later work used word embeddings as features for classification (Taghipour and Ng, 2015; Rothe and Schütze, 2015; Iacobacci et al., 2016). With the rise of deep learning, LSTM (Hochreiter and Schmidhuber, 1997) models became popular (Melamud et al., 2016; Kågebäck and Salomonsson, 2016; Yuan et al., 2016). While most work trained individual models per word, Raganato et al. (2017b) designed a single LSTM model with a large number of output units to disambiguate all words. Peters et al. (2018) performed WSD by nearest neighbour matching with contextualized ELMo embeddings. Hadiwinoto et al. (2019) used pre-trained contextualized representations from BERT (Devlin et al., 2019) as features. Huang et al. (2019) fine-tune BERT for WSD while also incorporating sense definitions from WordNet (Miller et al., 1990) to obtain the current supervised state-of-the-art F1 score of 77% on the benchmark by Raganato


3.2 Task and dataset

We treat WSD as a few-shot word-level classification problem, where a sense is assigned to a word given its sentential context. As different words may have a different number of senses and sentences may have multiple ambiguous words, the standard setting of N-way, K-shot classification does not hold in our case. Specifically, different episodes can have a different number of classes and a varying number of examples per class – a setting which is more realistic (Triantafillou et al., 2020).

Dataset We use the SemCor corpus (Miller et al., 1994) manually annotated with senses from the New Oxford American Dictionary by Yuan et al. (2016). With 37,176 annotated sentences, this is one of the largest sense-annotated English corpora. However, the corpus does not have a standard train/validation/test split. We group the sentences in the corpus according to which word is to be disambiguated, and then randomly divide the words into disjoint meta-train, meta-validation and meta-test sets with a 60:15:25 split. A sentence may have multiple occurrences of the same word, in which case we make predictions for all of them. We consider four different settings with the support set size |S| = 4, 8, 16 and 32 sentences. We report the number of words, the number of episodes, the total number of unique sentences and the average number of senses for the meta-training, meta-validation and meta-test sets for each of the four setups with different |S| in Table 3.1.

Support sentences   Split             No. of words   No. of episodes   No. of unique sentences   Average no. of senses
4                   Meta-training     985            10000             27640                     2.96
4                   Meta-validation   166            166               1293                      2.60
4                   Meta-test         270            270               2062                      2.60
8                   Meta-training     985            10000             27640                     2.96
8                   Meta-validation   163            163               2343                      3.06
8                   Meta-test         259            259               3605                      3.16
16                  Meta-training     799            10000             27973                     3.07
16                  Meta-validation   146            146               3696                      3.53
16                  Meta-test         197            197               4976                      3.58
32                  Meta-training     580            10000             27046                     3.34
32                  Meta-validation   85             85                4129                      3.94
32                  Meta-test         129            129               5855                      3.52

Table 3.1: Statistics of our few-shot WSD dataset.

Training episodes In the meta-training set, both the support and query sets have the same number of sentences. Our initial experiments using one word per episode during meta-training yielded poor results due to an insufficient number of episodes. Class imbalances and the presence of very frequent senses further hindered performance. To overcome these problems and design a suitable meta-training setup, we instead create episodes with multiple annotated words in them. Specifically, each episode consists of $r$ sampled words $\{z_j\}_{j=1}^{r}$ and $\min(\lfloor |S|/r \rfloor, \nu(z_j))$ senses for each of those words, where $\nu(z_j)$ is the number of senses for word $z_j$. Therefore, each task in the meta-training set is the disambiguation of $r$ words between up to $|S|$ senses. We set $r = 2$ for $|S| = 4$ and $r = 4$ for the rest. Sentences containing these senses are then sampled for the support and query sets such that the classes are as balanced as possible. For example, for $|S| = 8$, we first choose 4 words and 2 senses for each, and then sample one sentence for each word-sense pair. The labels for the senses are shuffled across episodes, i.e., one sense can have a different label when sampled in another episode. This is key in meta-learning as it prevents memorization (Yin et al., 2019). The advantage of our approach for constructing meta-training episodes is that it allows for generating a combinatorially large number of tasks (episodes). Herein, we use a total of 10,000 meta-training episodes.
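The sampling procedure for a meta-training episode can be sketched as follows. This is a simplified illustration under stated assumptions rather than the exact procedure used in our experiments: the corpus layout (a dictionary from word to sense to list of sentences), the function name, and the requirement that every sense has at least two sentences are hypothetical, and the sketch draws exactly one support and one query sentence per word-sense pair without topping up the sets when a word has fewer than $\lfloor |S|/r \rfloor$ senses.

```python
import random

def sample_training_episode(corpus, support_size, r, rng):
    """Sample one episode with r words and up to floor(support_size / r) senses per word.

    corpus: dict mapping word -> dict mapping sense -> list of sentences (hypothetical layout).
    Returns (support, query), each a list of (sentence, word, label) triples.
    """
    words = rng.sample(sorted(corpus), r)
    per_word = support_size // r
    pairs = []
    for word in words:
        senses = sorted(corpus[word])
        k = min(per_word, len(senses))  # min(floor(|S| / r), nu(z_j))
        pairs.extend((word, sense) for sense in rng.sample(senses, k))
    # Shuffle label ids so that a sense gets a different label in different episodes.
    label_ids = list(range(len(pairs)))
    rng.shuffle(label_ids)
    support, query = [], []
    for (word, sense), label in zip(pairs, label_ids):
        s_sent, q_sent = rng.sample(corpus[word][sense], 2)  # assumes >= 2 sentences per sense
        support.append((s_sent, word, label))
        query.append((q_sent, word, label))
    return support, query
```

Calling the function with `support_size=8` and `r=4` reproduces the example above: four words, two senses each, and one support sentence per word-sense pair.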

Figure 3.1: Bar plots of the number of meta-test episodes (y-axis) for different numbers of senses in the meta-test support set (x-axis), for (a) |S| = 4, (b) |S| = 8, (c) |S| = 16 and (d) |S| = 32.

Figure 3.2: Bar plots of the number of meta-test episodes (y-axis) for different numbers of senses in the meta-test query set (x-axis), for (a) |S| = 4, (b) |S| = 8, (c) |S| = 16 and (d) |S| = 32.

Evaluation episodes For the meta-validation and meta-test sets, each episode corresponds to the task of disambiguating a single word. Thus, each episode contains sentences where the correct sense of the given word is annotated. The total number of sentences in the support set is |S|. The number of sentences in the query set is equal to or less than |S|. Allowing fewer than |S| sentences in the query set gives us more episodes because otherwise, we would be excluding many words that do not have a sufficient number of sentences in their query sets. While splitting the sentences into support and query sets, we ensure that senses in the query set are present in the support set. This requires us to exclude words with more than |S| senses so as to accommodate all the senses in the support set. Furthermore, we discard words that have fewer than a total of |S| + 1 sentences since they cannot form a complete episode. Note that, unlike the meta-training tasks, our meta-test tasks represent a natural data distribution, therefore allowing us to test our models in a realistic setting. In Figure 3.1 and Figure 3.2, we present bar plots of the number of meta-test episodes for different numbers of senses in the meta-test support and query sets respectively. They show that the number of episodes drops quite sharply as the number of senses increases.

3.3 Methods

Our models consist of three components: an encoder that takes the words in a sentence as input and produces a contextualized representation for each of them, a hidden linear layer that projects these representations to another space, and an output linear layer that produces the probability distribution over senses. The encoder and the hidden layer are shared across all tasks – we denote this block as $f_\theta$ with shared parameters $\theta$. The output layer is randomly initialized for each task $\mathcal{T}_i$ (i.e., episode) – we denote this as $g_{\phi_i}$ with parameters $\phi_i$. $\theta$ is meta-learned whereas $\phi_i$ is independently learned for each task.

Figure 3.3: Model architecture showing the shared encoder, the shared linear layer (together forming $f_\theta$) and the task-specific output layer $g_{\phi_i}$. The inputs are words $w_1, w_2, ..., w_n$ of a sentence. (a) Bi-GRU encoder with GloVe input. (b) MLP with ELMo input. (c) Entire BERT model as encoder.

3.3.1 Model Architectures

We experiment with three different encoders: (1) a single-layer bidirectional GRU (Cho et al., 2014) with GloVe embeddings (Pennington et al., 2014) as input that are not fine-tuned; (2) ELMo embeddings followed by an MLP; and (3) BERT_BASE (Devlin et al., 2019) that is fine-tuned. The architecture of our three different models – GloVe+GRU, ELMo+MLP and BERT – is shown in Figure 3.3.

3.3.2 Meta-learning Methods

Prototypical Networks

As a recap, suppose $S_c$ denotes the subset of the support set containing examples from class $c \in C$; the prototype $\mu_c$ is:

$$\mu_c = \frac{1}{|S_c|} \sum_{x_i \in S_c} f_\theta(x_i) \tag{3.1}$$

The distribution over classes for a query point is calculated as a softmax over negative Euclidean distance to the class prototypes. We generate the prototypes (one per sense) from the output of the shared block $f_\theta$ for the support examples. Instead of using $g_{\phi_i}$, we obtain the probability distribution for the query examples based on the distance function. Parameters $\theta$ are updated after every episode using the Adam optimizer (Kingma and Ba, 2015):

$$\theta \leftarrow \text{Adam}(\mathcal{L}^q_{T_i}, \theta, \beta) \tag{3.2}$$

where $\mathcal{L}^q_{T_i}$ is the cross-entropy loss on the query set and $\beta$ is the meta learning rate.
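To make the prototype computation (Equation 3.1) and the distance-based softmax concrete, here is a minimal pure-Python sketch. The small vectors stand in for the outputs of the shared block and are made up for illustration; the function names are hypothetical, and the training update of Equation 3.2 is not shown.

```python
import math


def prototypes(support_vecs, support_labels):
    """Average the encoded support examples of each class (Equation 3.1)."""
    sums, counts = {}, {}
    for vec, label in zip(support_vecs, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def class_distribution(query_vec, protos):
    """Softmax over negative Euclidean distances to the class prototypes."""
    neg_dist = {c: -euclidean(query_vec, mu) for c, mu in protos.items()}
    top = max(neg_dist.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(d - top) for c, d in neg_dist.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Toy usage: two senses, three encoded support sentences (made-up vectors).
toy_protos = prototypes([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]],
                        ["sense_a", "sense_a", "sense_b"])
toy_probs = class_distribution([1.0, 1.0], toy_protos)
```

The query point lies much closer to the prototype of the first sense, so almost all of the probability mass is assigned to it.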

Model-Agnostic Meta-Learning (MAML)

MAML is designed specifically for the N-way, K-shot classification setting. The inner-loop update with $m$ gradient steps can generally be represented as:

$$\theta'_i = U(\mathcal{L}^s_{T_i}, \theta, \alpha, m) \tag{3.3}$$

where $U$ is an optimizer such as SGD, $\alpha$ is the inner-loop learning rate and $\mathcal{L}^s_{T_i}$ is the loss for the task computed on $D^{(i)}_{\text{support}}$. Thus, the meta-objective becomes:

$$J(\theta) = \sum_{T_i \sim p(T)} \mathcal{L}^q_{T_i}\left(f_{U(\mathcal{L}^s_{T_i}, \theta, \alpha, m)}\right) \tag{3.4}$$

where the loss $\mathcal{L}^q_{T_i}$ is computed on $D^{(i)}_{\text{query}}$. During the meta-optimization, or outer-loop optimization, the update with the outer-loop learning rate $\beta$ is:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{T_i \sim p(T)} \mathcal{L}^q_{T_i}(f_{\theta'_i}) \tag{3.5}$$

The first-order approximation, FOMAML, does the update as:

$$\theta \leftarrow \theta - \beta \sum_{T_i \sim p(T)} \nabla_{\theta'_i} \mathcal{L}^q_{T_i}(f_{\theta'_i}) \tag{3.6}$$
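As a rough illustration of Equations 3.3 and 3.6, the sketch below runs FOMAML on a toy family of one-dimensional regression tasks. Everything here is an assumption made for illustration: the scalar model f(x) = theta * x, the two tasks with slopes 1 and 3, the hyperparameter values and the function names. The outer loop uses a plain gradient step, and the inner-loop optimizer U is taken to be SGD.

```python
def mse_grad(theta, xs, ys):
    """Gradient of the mean squared error of f(x) = theta * x w.r.t. theta."""
    return sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def inner_loop(theta, xs, ys, alpha, m):
    """Equation 3.3 with U = SGD: m gradient steps on the support set."""
    theta_i = theta
    for _ in range(m):
        theta_i -= alpha * mse_grad(theta_i, xs, ys)
    return theta_i


def fomaml_step(theta, tasks, alpha, beta, m):
    """Equation 3.6: sum the query gradients taken at the adapted parameters."""
    meta_grad = 0.0
    for support, query in tasks:
        theta_i = inner_loop(theta, *support, alpha=alpha, m=m)
        # First-order approximation: differentiate w.r.t. theta_i, not theta.
        meta_grad += mse_grad(theta_i, *query)
    return theta - beta * meta_grad


def make_task(slope):
    """Toy task y = slope * x with fixed support and query inputs."""
    support_x, query_x = [-1.0, 0.5, 2.0], [-2.0, 1.0, 1.5]
    support = (support_x, [slope * x for x in support_x])
    query = (query_x, [slope * x for x in query_x])
    return support, query


toy_tasks = [make_task(1.0), make_task(3.0)]
theta = 0.0
for _ in range(200):
    theta = fomaml_step(theta, toy_tasks, alpha=0.1, beta=0.05, m=3)

# Adapt the meta-learned initialization to the first task.
adapted = inner_loop(theta, *toy_tasks[0][0], alpha=0.1, m=3)
```

In this toy setting theta settles at the midpoint of the two slopes, from which a few inner-loop steps move it most of the way toward either task.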

FOMAML does not generalize outside the N-way, K-shot setting, since it assumes a fixed number of classes across tasks. We therefore extend it with output layer parameters $\phi_i$ that are adapted per task. During the inner loop for each task, the optimization is performed as follows:

$$\theta'_i, \phi'_i \leftarrow \text{SGD}(\mathcal{L}^s_{T_i}, \theta, \phi_i, \alpha, \gamma) \tag{3.7}$$

where $\alpha$ and $\gamma$ are the learning rates for the shared block and output layer respectively. We introduce different learning rates because the output layer is randomly initialized per task and thus needs to learn aggressively, whereas the shared block already has past information and can thus learn more slowly. We refer to $\alpha$ as the learner learning rate and $\gamma$ as the output learning rate. The outer-loop optimization uses Adam:

$$\theta \leftarrow \text{Adam}\left(\sum_i \mathcal{L}^q_{T_i}(\theta'_i, \phi'_i), \beta\right) \tag{3.8}$$

where the gradients of $\mathcal{L}^q_{T_i}$ are computed with respect to $\theta'_i$, $\beta$ is the meta learning rate, and the sum over $i$ is for all tasks in the batch.

ProtoMAML

We construct the prototypes from the output of $f_\theta$ for the support examples. The parameters $\phi_i$ are initialized with their prototypical network equivalent weights:

$$w_c = 2\mu_c \tag{3.9}$$

$$b_c = -\mu_c^\top \mu_c \tag{3.10}$$

The learning then proceeds as in (FO)MAML; the only difference being that $\gamma$ need not be too high owing to the good initialization. Proto(FO)MAML thus supports a varying number of classes per task.
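The choice of weights in Equations 3.9 and 3.10 can be motivated by expanding the squared Euclidean distance between an encoded example and a prototype. The following short derivation is a sketch under the assumption that the squared distance is used; $h = f_\theta(x)$ is shorthand introduced here for the encoded example.

```latex
\begin{aligned}
-\lVert h - \mu_c \rVert^2
  &= -h^\top h + 2\mu_c^\top h - \mu_c^\top \mu_c \\
  &= w_c^\top h + b_c - h^\top h,
  \qquad w_c = 2\mu_c, \quad b_c = -\mu_c^\top \mu_c
\end{aligned}
```

Since $-h^\top h$ is the same for every class, it cancels in the softmax, so a linear output layer with these weights initially yields the same class distribution as a prototypical network with squared Euclidean distance.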

3.3.3 Baseline Methods

We include several baselines in order to assess the relative performance of the meta-learning methods.

Majority-sense baseline This baseline always predicts the most frequent sense in the support set. Hereafter, we refer to it as MajoritySenseBaseline.

Nearest neighbor classifier This model predicts the sense of a query instance as the sense of its nearest neighbor from the support set in terms of cosine distance. We perform nearest neighbor matching with the ELMo embeddings of the words as well as with their BERT outputs but not with GloVe embeddings since they are the same for all senses. We refer to this baseline as NearestNeighbor.
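The two baselines described so far are simple enough to sketch in a few lines. The code below is an illustrative assumption rather than the thesis implementation: the function names are hypothetical, the vectors are made-up stand-ins for ELMo or BERT representations, and all vectors are assumed to be non-zero.

```python
import math
from collections import Counter


def majority_sense(support_labels):
    """Return the most frequent sense in the support set."""
    return Counter(support_labels).most_common(1)[0][0]


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbor(query_vec, support_vecs, support_labels):
    """Return the sense of the support example closest to the query."""
    best = min(range(len(support_vecs)),
               key=lambda i: cosine_distance(query_vec, support_vecs[i]))
    return support_labels[best]


# Toy usage with made-up two-dimensional representations.
toy_labels = ["sense_a", "sense_b", "sense_a"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
majority_prediction = majority_sense(toy_labels)
nn_prediction = nearest_neighbor([0.1, 1.0], toy_vecs, toy_labels)
```

Here the majority baseline ignores the query entirely, whereas the nearest neighbor baseline picks the sense of the support vector pointing in the most similar direction.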

Non-episodic training This is a single model that is trained on all tasks without any distinction between them – it merges support and query sets, and is trained using mini-batching. The output layer is thus not task-dependent and the number of output units is equal to the total number of senses in the dataset. The softmax at the output layer is taken only over the relevant classes within the mini-batch. Instead of $\phi_i$ per task, we now have a single $\phi$. During training, the parameters are updated per mini-batch as:
