Few-shot event detection with matching networks

Submitted in partial fulfillment for the degree of Master of Science

Dennis Craandijk

10541276

Master Information Studies, Data Science

Faculty of Science, University of Amsterdam

Dennis Craandijk

Universiteit van Amsterdam dennis.craandijk@student.uva.nl

ABSTRACT

We propose the convolutional event matching network (CEMN) model for the problem of few-shot event classification and detection. The CEMN model combines a convolutional event detection architecture with novel few-shot learning methods and is trained end-to-end to classify novel event types given only a few examples. During training the model learns a non-linear embedding function which maximizes the relation score between related events, while minimizing the relation score between unrelated events. Once trained, the CEMN model is able to classify events of a novel class by matching each unlabelled instance to the labelled instances with the highest relation score. The experimental results show that the CEMN model outperforms the baseline on the few-shot event classification task.

KEYWORDS

event detection, few-shot learning, matching networks

1 INTRODUCTION

Understanding events occurring in text is an important component in information extraction and language understanding. It is, however, a challenging task since events are described using natural, human-understandable language, which limits the degree to which the data is machine-interpretable [6]. Event detection entails identifying event occurrences and classifying them into specific types. The dominant approaches for event detection follow a supervised learning paradigm [3, 18, 23]. These methods rely on a large set of labeled instances to train models. However, the poor portability of supervised methods and the limited coverage of available event annotations have been limiting factors [10]. Due to the high cost of manual annotation of events, the coverage of training and evaluation datasets is limited. The widely used ACE 2005 corpus [4] only defines 33 different event types. The use of supervised systems therefore comes with a risk of overfitting the model on the small domain covered in the annotated data. This raises the question of whether these methods really address the problem of understanding events in text.

This research tries to overcome these limitations and focuses on ways to minimize the need for labelled data by modelling event detection as a few-shot learning task. Few-shot learning refers to learning techniques which utilize only a few labeled samples to learn new information. In a few-shot classification task a classifier must learn novel classes not seen during training, given only a few labelled examples of the new class [26, 27, 30]. Applying few-shot learning methods in event detection models can reduce the need for labelled data and could lower the effort needed to expand the coverage of the model. Nguyen et al. [22] and Peng et al. [24] argue that a few-shot learning setting is a realistic learning environment for event detection models. After all, giving examples is one of the primary methods humans employ when trying to define a new type or class. For example, when human annotators are asked to annotate event types in a text, the annotation guideline will often contain some example events for each type. It has been shown that humans are able to recognize new classes through few-shot or even one-shot learning, where only a single example is given [5, 15]. Supervised learning methods, however, rely on being iteratively trained on large amounts of training data, which prevents rapid introduction of new classes based on a few examples [25–27]. The availability of only a few examples challenges the standard fine-tuning practice in deep learning [7].

Recently, research in image classification has adopted new approaches to few-shot learning which resulted in significant progress in few-shot image classification tasks [26, 27, 29]. Vinyals et al. [29] proposed matching networks (MN), a novel model architecture and training method designed for few-shot learning tasks. The architecture and training method are designed not to learn a predefined set of classes, but to learn how to match labelled and unlabelled instances belonging to the same class. The authors demonstrate that the resulting model is able to correctly classify images not seen during training, by matching them to only a few labelled example images of the novel class. Matching networks and similar architectures employing these methods have been shown to achieve state-of-the-art performance on few-shot image classification tasks [26, 27, 29].

Motivated by this recent progress, this research models the event detection task as a few-shot learning task and tries to incorporate few-shot learning methods, such as those described by Snell et al. [26] and Vinyals et al. [29]. To the best of our knowledge, these types of networks have not been applied to event detection tasks nor to any other text classification task. This research thus forms the first attempt to apply an MN-inspired architecture to a text classification task. Since the few-shot image classification task used by [26, 27, 29] differs from an event detection task, this comes with some challenges. For instance, the architecture should be designed to handle textual input, which implies finding a suitable text representation. In addition, the model should be able to recognize event as well as non-event instances. Since not every sentence in a text contains an event, the event detection task requires a model to be able to distinguish events from non-events. Motivated by these challenges, the research question for this research is:

What few-shot learning methods can improve the extraction of novel event types in a few-shot event detection task?

The research question will be answered through the following subquestions:

(1) How can matching networks be implemented for a few-shot event detection task?

(2) How should an event be represented to enable few-shot learning?

(3) How can the few example events be leveraged to classify novel event types?

(4) What methods enable the model to discriminate events from non-events?

2 RELATED WORK

2.1 Event detection

Until recently event extraction methods mainly relied on symbolic features [9, 11, 16, 19]. These methods extensively leveraged linguistic analysis to capture the discrete structure of events, focusing on a combination of various properties such as lexicon, syntax and gazetteers. Ahn [1], for example, uses lexical features, such as part-of-speech (POS) tags, syntactic features and external-knowledge features to extract events. Although these approaches achieve high performance, feature-based methods suffer from the problem of selecting suitable features [18]. Chen et al. [3], Nguyen and Grishman [23] and Nguyen et al. [21] use distributed features to train a supervised event extraction model. Their models use representation-based methods, like word embeddings, which are typically fed into convolutional neural networks (CNNs). The CNNs automatically extract features which are used for classification through a fully connected neural layer. The use of word embeddings along with CNNs has enabled such event extraction models to achieve state-of-the-art results, indicating these models are able to capture a latent semantic structure in the distributed representations of events.

Only recently has more research focused on the sparsity of labelled data and the classification of event types not seen during training [10, 22, 24]. Nguyen et al. [22] propose the CNN-2-STAGE model, which uses a two-stage training method to detect event types not seen during training. The authors decompose training into an auxiliary learning phase, where transferable knowledge is learned, followed by a fine-tuning phase where the novel class structure is learned. During the first stage a CNN is trained as a binary event detector, learning only to discriminate between events and non-events. In the second stage the model is fine-tuned on a few examples of a novel event type which was not seen during the first stage. The model is finally evaluated on classification performance of the newly introduced event type. The authors thereby essentially construct a few-shot learning task (see section 2.2), although they do not recognize it as such. The authors show the model is able to learn transferable knowledge during the first stage, which is used to learn knowledge on novel event types in the second stage. CNN-2-STAGE is the current state-of-the-art few-shot event detection model and will therefore be used as the baseline for this research. The approach has some downsides, however. Since the CNN-2-STAGE model is a binary event detector, the model can only be trained to classify one event type. Since the model has to be fine-tuned in the second stage in order to do so, it has to be re-trained for every new event type. Additionally, for the model to achieve decent results it needs 50 to 150 examples for each new event type. These factors limit the portability of the proposed model.

Huang et al. [10] look at the event detection task and model it as a ‘grounding’ problem. The authors map event instances to an existing event ontology. The event ontology holds event structures for each event type. The authors train a CNN to map event instances to the corresponding event ontology in a shared embedding space. The trained CNN acts as a mapping function which is independent of the event types, thus enabling classification of new event types based on the event ontology. This approach can be interpreted as zero-shot classification, since it is not dependent on any example instances. The model only depends on the event ontology and the quality of the mapping function to classify novel event types. This research shows the potential of mapping events in an embedding space but is still dependent on the event ontology as an external feature.

Peng et al. [24] show it is possible to effectively represent an event type based on only a few sample event instances. By using the semantic similarity between different event mentions of the same type, the authors construct an event vector for each type. New event mentions are mapped to the most similar event vector. Additionally, event vectors can be constructed for new event types based on only a small set of example event instances. The event vectors are not learned; they are features constructed from words selected by their POS tag. Although the model can therefore not be seen as a few-shot learner, it does show how unseen event types could potentially be represented with the use of a few sample instances.

2.2 Few-shot learning

Few-shot learning refers to learning techniques which utilize only a few labeled samples to learn new information. In a few-shot classification task a classifier must learn novel classes not seen during training, given only a few labelled examples of the new class [26, 27, 30]. Only a limited amount of text classification research has focused on few-shot learning [10, 30]. Few-shot learning methods are predominantly developed for image classification tasks [14, 25–27, 29]. Few-shot learning tasks are challenging for supervised learning methods, which typically rely on large amounts of training data. Parametric methods, like neural networks, require iterative optimization and risk overfitting if only a small amount of labelled data is available. Contemporary approaches to few-shot learning therefore often decompose training into an auxiliary learning phase, where transferable knowledge is learned, followed by a fine-tuning phase where the novel class is learned [27]. This is also the approach adopted by Nguyen et al. [22], whose research forms the baseline for this research (see section 2.1). Non-parametric alternatives to neural networks, like k-Nearest Neighbor (kNN), do not require iterative training and do not overfit. However, their performance strongly depends on the chosen metric [2], which challenges the performance in few-shot classification tasks.

To alleviate this discrepancy Vinyals et al. [29] proposed matching networks. Matching networks use a combination of parametric and non-parametric methods to learn an end-to-end weighted nearest neighbor classifier suitable for few-shot classification tasks. The architecture of matching networks combines a neural network with a kNN algorithm to learn a weighted kNN classifier. The neural network learns a non-linear mapping of the input into an embedding space. The classifier employs a nearest-neighbor method to classify instances in the embedding space according to a predefined metric. In addition to the novel architecture, the MN training scheme is different from contemporary supervised learning approaches. During training mini-batches are sampled which are called episodes. Each episode is designed to match the test environment by mimicking a few-shot classification task. Each episode contains a subsample of all classes as well as a subsample of their datapoints. The datapoints are divided into labelled sample and unlabelled query datapoints. The model should match all query instances to sample instances of the corresponding class. Employing episode-based training improves generalization since it is more faithful to the test environment [8]. This method allows the model to be learned end-to-end to support few-shot classification.

Snell et al. [26] proposed a variation on matching networks called prototypical networks. Prototypical networks are based on the idea that there exists an embedding in which points cluster around a single prototype representation for each class. Instead of matching query points to sample points through kNN, prototypical networks cluster all sample points of a class into a prototype and use a linear classifier that assigns each query point to the nearest class prototype in the embedding space. The methods described by Vinyals et al. [29] and Snell et al. [26] serve as inspiration for the design of the few-shot event learner developed in this research.

3 METHODOLOGY

3.1 Data

Following previous work on event detection [3, 10, 16, 22–24] this research uses the Automatic Content Extraction (ACE) 2005 corpus [4]. The ACE corpus consists of 599 annotated documents containing textual data from broadcast conversations, newswire and weblogs. The texts contain 4182 event instances annotated for 8 event types, divided into 33 subtypes. For example, an event annotated with the ‘Life: Be-Born’ tag denotes an event of type ‘Life’ and subtype ‘Be-Born’. In the ACE 2005 corpus, all sentences which describe an event are called event mentions. Every event mention contains an event trigger, which is the main word that most clearly expresses an event occurrence. Consider the following sentence:

(1) The Government of China has ruled Tibet since 1951 after dispatching troops to the Himalayan region in 1950.

This sentence includes an event mention of the event type ‘Movement: Transport-Person’ which is triggered by the event trigger ‘dispatching’. For all non-trigger tokens an extra None class is introduced. This class serves as a label for all tokens which are not labelled as an event trigger. As a result, all tokens in the sentence above, except for ‘dispatching’, are of the None class.

Analysis of the dataset shows the frequency of occurrence of event types to be imbalanced. As table 3 in appendix A shows, some event types occur significantly more often than others. The most frequent event subtype occurs 1269 times while the least frequent only occurs twice. This imbalance can cause difficulties for a learning model as well as for constructing representative training, validation and test datasets. This research utilizes the same test set with 40 newswire documents, the same validation set with 30 other documents and the same training set with the remaining 529 documents as previous studies on this dataset [11, 16, 17, 22, 23]. Although this split results in some subtypes being severely underrepresented in the validation and test set (see appendix A), this approach is still adopted for comparison purposes. Besides the imbalance in event types, the dataset also contains an imbalance between event and non-event tokens. Approximately 98% of all tokens in the dataset are non-trigger tokens and are thus of the None class.

3.2 Task definition

The standard event detection task entails identifying event trigger tokens and classifying them correctly. The dominant approach is to feed a model with every token in a sentence [3, 21–23]. For each token the model should classify the token as one of the predefined event types or as None when confronted with a non-trigger token. This research extends the standard event detection task by designing two few-shot learning tasks:

• Few-shot event classification. For the classification task a model should classify event triggers as the correct event type, while not considering non-event tokens. This task most closely resembles the few-shot classification task used in most recent few-shot learning research [8, 26, 29, 30] and is designed to test how well the model can handle the introduction of new classes based on only a few examples.

• Few-shot event detection. This task is similar to the classification task but includes all non-trigger tokens, labelled as the None class. This task most closely resembles the standard event detection task, since the model should be able to discriminate between events and non-events in addition to assigning the correct event type.

For both tasks, one or more event (sub)types are chosen to act as target types; the remaining (sub)types serve as auxiliary types. The model is trained on all auxiliary event types. The class of the target event types stays unknown to the model during training and is only introduced during evaluation through a few sample instances. During training the target instances are not removed from the dataset, but treated as non-trigger tokens. Although this might seem counterintuitive, labelling the target events as None during training seems to be the most realistic scenario [12]. After all, removing the target instances would assume the label of each instance is known before training. In a real-world scenario, the labels of the target instances would most likely not be known, which rules out removing these instances from the training set. The training set could thus contain some instances of the target type which are unlabelled since the class has not yet been introduced. Moreover, if the labels of these instances were already known, there would probably be no need for a few-shot classifier. Note that this design choice essentially introduces false negatives into the training data, making the task more challenging. During testing the target instances retain their correct label and the model is tested on the classification performance of the target type(s).
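The relabelling of target types during training can be illustrated with a minimal sketch. The function name `mask_target_types` and the representation of the data as `(token, label)` pairs are assumptions made for illustration only; they are not taken from the original implementation.

```python
NONE_LABEL = "None"

def mask_target_types(tokens, target_types):
    """Return the training view of the data.

    tokens is a list of (token, label) pairs (an assumed representation).
    Triggers of a target type are kept in the data but relabelled as None,
    so they act as false negatives during training.
    """
    return [(tok, NONE_LABEL if lab in target_types else lab)
            for tok, lab in tokens]
```

At test time the original labels would simply be used unchanged, so only the training view passes through this function.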

Figure 1: Convolutional event matching network

In order to mimic the few-shot learning test setting during training, the model is trained through episode-based training as proposed by Vinyals et al. [29]. The network is trained by showing only a few sample instances per event type, much like how it will be tested when presented with a few examples of a novel event type. In each training iteration, an episode is formed by randomly sampling $C$ classes from the training set along with $K$ labelled instances belonging to each of the $C$ classes to act as the sample set. A sample of the remainder of the instances belonging to the $C$ classes serves as the query set. In every episode the model should match sample and query instances belonging to the same event type. Additionally, when confronted with non-events the model should not make a match with any event sample. When the model does not find a match between an instance and any sample, it should classify the instance as None.
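The episode construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sample_episode`, the `(features, label)` data layout and the `num_queries` parameter (queries drawn per class) are hypothetical, and the handling of non-event tokens in the detection task is left out.

```python
import random
from collections import defaultdict

def sample_episode(instances, num_classes, num_samples, num_queries, rng=random):
    """Build one training episode from (features, label) pairs.

    num_classes is C and num_samples is K in the text; num_queries
    (queries per class) is an assumed parameter the text does not fix.
    """
    by_label = defaultdict(list)
    for features, label in instances:
        by_label[label].append((features, label))

    # Only classes with more than K instances can supply both sets.
    eligible = [lab for lab, items in by_label.items() if len(items) > num_samples]
    classes = rng.sample(eligible, num_classes)

    sample_set, query_set = [], []
    for label in classes:
        items = by_label[label][:]
        rng.shuffle(items)
        sample_set.extend(items[:num_samples])
        query_set.extend(items[num_samples:num_samples + num_queries])
    return sample_set, query_set
```

Because the sample and query sets are disjoint slices of the same shuffled list, no instance can be matched against itself within an episode.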

3.3 Model

For both tasks a novel model is constructed, called the convolutional event matching network (CEMN) model. The CEMN model is the result of combining CNN-2-STAGE, the current state-of-the-art few-shot event detection model (see section 2.1), with novel few-shot learning methods as described by Vinyals et al. [29] and Snell et al. [26] (see section 2.2). CEMN consists of an embedding function $f$, a relation function $g$ and a classifier, as shown in figure 1. The embedding function provides a non-linear mapping which projects the input into an embedding space. Sample events $x_s$ and query events $x_q$ are fed through the embedding function $f$, which produces the embedded event instances $f(x_s)$ and $f(x_q)$. The embedded query and sample events are fed into the relation function $g$, which produces a relation score $r$ between 0 and 1. The relation score represents the categorical similarity between a sample event $x_s$ and a query event $x_q$, where a pair of events of the same type should have a score of 1 and an unrelated pair a score of 0. The relation score between two event instances is described by the following equation.

$$r_{i,j} = g(f(x_i), f(x_j)) \tag{1}$$

The classifier uses the relation scores between sample and query instances to classify the query instances.

The objective of the model is to minimize the mean squared error (MSE) of the relation scores. The loss is expressed by comparing the relation score $r$ to the ground truth, where matched pairs have a score of 1 and mismatched pairs a score of 0. Sung et al. [27] mention how the choice of MSE seems to be somewhat non-standard but nevertheless appropriate for this type of model. Although the problem seems like a classification problem with label space $\{0, 1\}$, it can better be interpreted as a regression problem where the relation scores are being regressed to the ground truth.
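As an illustration of this objective, the sketch below computes the MSE for one episode from a matrix of relation scores. The function name `relation_mse` and the nested-list layout `scores[q][s]` are assumptions for illustration. A non-event query labelled None receives a target of 0 against every event sample, since its label never equals a sample label.

```python
def relation_mse(scores, sample_labels, query_labels):
    """Mean squared error between relation scores and 0/1 match targets.

    scores[q][s] is the relation score between query q and sample s
    (an assumed layout). Matched pairs have target 1, all others target 0.
    """
    total, count = 0.0, 0
    for q, q_label in enumerate(query_labels):
        for s, s_label in enumerate(sample_labels):
            target = 1.0 if q_label == s_label else 0.0
            total += (scores[q][s] - target) ** 2
            count += 1
    return total / count
```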

3.3.1 Embedding function. A convolutional neural network is implemented to serve as the embedding function, as shown in figure 2. This design choice is motivated by research on few-shot learning models which utilize convolutional layers [26, 27, 29]. Although these models are optimized for image classification, CNNs are also suitable for sentence classification [13]. Specifically, CNNs have successfully been applied in supervised event detection methods [3, 23] and, more importantly, in the current state-of-the-art few-shot event detection model [22]. For comparison purposes, this research implements the same convolutional layers as the CNN-2-STAGE model described by Nguyen et al. [22].

Figure 2: Embedding function of the CEMN model

The embedding function first transforms the selected input token, with its context in the sentence, into a feature matrix. Trimming and padding are applied to the context sentence to limit the length to a fixed size. This is necessary for the context window to be fed through the convolutional layers. Let $2w + 1$ be the window size and $x = [x_{-w}, x_{-w+1}, \ldots, x_0, \ldots, x_{w-1}, x_w]$ be an event trigger candidate where the current token $x_0$ is positioned in the middle of the window. Each token $x_i$ represents a word in the sentence. Before being fed into the CNN all tokens are transformed into real-valued vectors. Each vector is obtained by using a continuous representation for the word and for the position of the token. Both representations are stored in the following look-up tables:

• Word embedding table. This table contains embedding vectors for all words. The vectors are initialized by pre-trained word embeddings, to embed the hidden semantic properties of the tokens [20, 28], and are optimized during training.

• Position embedding table. This table holds the embedding vectors for the relative distance $i$ of the token $x_i$ to the current token $x_0$. This enables the model to distinguish the proximity of each token in the sentence to the current token. The position vectors are initialized randomly and optimized during training.

For each token $x_i$, the vectors obtained from the look-up tables are concatenated into a single vector $X_i$. As a result, the original event trigger candidate $x$ is transformed into a feature matrix $X = [X_{-w}, X_{-w+1}, \ldots, X_0, \ldots, X_{w-1}, X_w]$ of size $m \times (2w + 1)$, where $m$ is the sum of the dimensionality of the word representation $m_w$ and of the position representation $m_p$.
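The construction of the feature matrix can be sketched as follows. The names `context_window` and `feature_matrix`, the `<pad>` token, the use of plain dictionaries as look-up tables and the mapping of unknown words to the padding vector are all assumptions made for illustration. The sketch stores one row per token, which is the transpose of the $m \times (2w + 1)$ matrix in the text.

```python
PAD = "<pad>"  # assumed padding token

def context_window(tokens, center, w):
    """Trim/pad a sentence to the 2w+1 tokens centred on tokens[center]."""
    window = []
    for offset in range(-w, w + 1):
        idx = center + offset
        window.append(tokens[idx] if 0 <= idx < len(tokens) else PAD)
    return window

def feature_matrix(tokens, center, w, word_table, position_table):
    """Return 2w+1 rows; each row concatenates a word and a position vector.

    word_table maps words to vectors of length m_w, position_table maps
    relative offsets -w..w to vectors of length m_p (assumed layouts).
    Unknown words fall back to the padding vector, which is an assumption.
    """
    rows = []
    window = context_window(tokens, center, w)
    for offset, token in zip(range(-w, w + 1), window):
        word_vec = word_table.get(token, word_table[PAD])
        rows.append(list(word_vec) + list(position_table[offset]))
    return rows
```

Each row then has length $m = m_w + m_p$, matching the dimensionality described above.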

The matrix representation $X$ is fed through a convolutional layer, followed by a ReLU and a max pooling operation. The convolutional layer contains multiple feature maps $f$ for each of multiple filter widths $k$. The width of a filter corresponds to the number of consecutive words its feature map processes. A filter with width $k$ processes $k$ consecutive words and thus has a dimension of $m \times k$. Figure 2 illustrates the different layers of the embedding function. In contrast to the CNN-2-STAGE model, the convolutional layer is not followed by a softmax layer since the embedding function does not have to perform classification. The convolutional layers only learn a non-linear mapping of the input sentence into the embedding space.
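The convolution, ReLU and max pooling steps can be sketched in plain Python as below. This is a minimal illustration, not the original implementation: the function name `conv_relu_maxpool`, the row-per-token input layout and the presence of a bias term per filter are assumptions.

```python
def conv_relu_maxpool(rows, filters):
    """Embed a feature matrix with one convolution, ReLU and max pooling.

    rows:    list of 2w+1 token vectors, each of length m.
    filters: list of (weights, bias) pairs; weights has k rows of length m,
             so the filter spans k consecutive tokens. The bias term is an
             assumption, the text does not mention one.
    Returns one pooled value per filter.
    """
    embedding = []
    for weights, bias in filters:
        k = len(weights)
        best = 0.0  # ReLU outputs are never below zero
        for start in range(len(rows) - k + 1):
            activation = bias
            for row, w_row in zip(rows[start:start + k], weights):
                activation += sum(x * wt for x, wt in zip(row, w_row))
            best = max(best, max(activation, 0.0))
        embedding.append(best)
    return embedding
```

Since every component of the resulting embedding is the maximum of ReLU outputs, all embedded event instances have non-negative coordinates.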

3.3.2 Relation function. The relation function $g$ expresses the categorical similarity between two embedded event instances $f(x_i)$ and $f(x_j)$ through the relation score $r_{i,j}$, where the relation score between a pair of the same class should be 1 and between a pair of different classes should be 0. The relation function depends on a predefined metric to define the distance between event instances in the embedding space. Although in principle any distance metric is permissible, this research uses cosine distance and squared Euclidean distance since these metrics have been shown to facilitate high performance in few-shot image classification tasks [26, 27, 29]. The relation scores for both metrics are:

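$$\text{cos}: \quad r_{i,j} = 1 - \cos(\theta(x_i, x_j)) \tag{2}$$

where $\theta(x_i, x_j)$ denotes the angle between $x_i$ and $x_j$, and

$$\text{eucl}: \quad r_{i,j} = \frac{1}{1 + d(x_i, x_j)^2} \tag{3}$$

where $d(x_i, x_j)$ denotes the Euclidean distance between $x_i$ and $x_j$.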

Note that the squared Euclidean distance is normalized to range from 0 to 1. The cosine distance already ranges from 0 to 1 and therefore does not have to be normalized as such.
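The Euclidean relation score of equation (3) can be sketched as a small function. The name `euclidean_relation` is hypothetical, and the inputs are assumed to be two already embedded event instances given as equal-length lists of floats.

```python
def euclidean_relation(u, v):
    """Relation score of equation (3): 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return 1.0 / (1.0 + squared_distance)
```

Identical embeddings give a score of exactly 1, and the score approaches 0 as the embeddings move apart.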

For both metrics, the relation score $r_{i,j}$ is designed to be inversely correlated with the distance between $x_i$ and $x_j$ in the embedding space. As a result, instances with a high relation score are in close proximity to each other and vice versa. Since the objective is to increase the relation scores of matched pairs and decrease the relation scores of mismatched pairs, the relation function can be interpreted as facilitating cluster forming in the embedding space. Put differently, by ‘pulling’ together events belonging to the same type, while ‘pushing’ apart events of different types, clusters of event types can be formed in the embedding space. The resulting clusters can be used by the classifier to classify unlabelled query instances.

3.3.3 Classifier. The classifier uses the relation scores between sample and query instances to classify the query instances. Two different approaches are considered.

• K-nearest samples. Similar to a k-Nearest-Neighbor approach, the query instances are classified through a majority vote of the $k$ highest scoring sample instances. If no majority can be formed, the sample instance with the highest relation score is used.

• Prototypes. Query instances are classified as the nearest event prototype. An event prototype is defined as the mean of all its sample instances in the embedding space.

In order to deal with non-event query instances a classification threshold $t$ is introduced.

$$\hat{r}_{i,j} = \max(r_{i,j} - t, 0) \tag{4}$$

The classification threshold ensures the relation score $\hat{r}_{i,j}$ will only be greater than 0 if the relation score $r_{i,j}$ exceeds the threshold $t$. The threshold can be interpreted as a minimum certainty requirement for the relation score. If a query point has no relation with any sample point which exceeds the classification threshold, the classifier will not match the query to any sample and will classify it as None. Since the embedding and relation functions are trained to maximize relation scores between related events and minimize relation scores between unrelated events, non-event query instances should receive low relation scores for all sample instances. If all relation scores $r_{i,j}$ fall below the threshold, the model can correctly classify a non-event query instance as None. The classification threshold $t$ is a model hyperparameter whose effect depends on the dataset and the chosen embedding and relation function. The value of $t$ can be set between 0 and 1; the higher the value, the higher the relation score must be before an instance is positively classified.
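The two classification approaches and the threshold of equation (4) can be sketched as follows. Several details are assumptions made for illustration: the function names, applying the threshold before the $k$ best samples are selected, and treating a tie between the two most frequent labels as "no majority".

```python
from collections import Counter

NONE_LABEL = "None"

def classify_k_nearest(scores, sample_labels, k, t):
    """Label one query from its relation scores to every sample instance."""
    # Equation (4): scores that do not exceed the threshold t become zero.
    kept = [(max(r - t, 0.0), lab) for r, lab in zip(scores, sample_labels)]
    kept = sorted((p for p in kept if p[0] > 0.0),
                  key=lambda p: p[0], reverse=True)
    if not kept:
        return NONE_LABEL  # no sample exceeds the threshold
    votes = Counter(lab for _, lab in kept[:k]).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return kept[0][1]  # tie: fall back to the single best sample
    return votes[0][0]

def prototypes(sample_embeddings, sample_labels):
    """Mean embedding per event type."""
    grouped = {}
    for vec, lab in zip(sample_embeddings, sample_labels):
        grouped.setdefault(lab, []).append(vec)
    return {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
            for lab, vecs in grouped.items()}
```

Under these assumptions the prototype approach reduces to calling `classify_k_nearest` with `k=1`, where the scores are the relation scores between the query and each prototype rather than each individual sample.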

4 EXPERIMENTS

Two experimental settings are designed to evaluate the performance of the CEMN model: an event detection and an event classification setting. For both settings the model will be initialized with the same parameters and compared to the same baseline.

4.1 Event classification

The event classification setting is designed to align with most recent few-shot learning research [8, 26, 29, 30]. For this task all event subtypes are used, with the exception of the subtypes occurring fewer than 25 times in the dataset. In this setting the models are tested on a 5-shot and a 20-shot event classification task. Due to the low occurrence frequency of most event subtypes it is not possible to create a 50-shot or 100-shot classification task. In order to determine which event subtypes will serve as target types and which will serve as auxiliary types, two aspects are taken into consideration. First, the target event types should not contain any semantic overlap with the auxiliary types. Consider the ‘Justice:Convict’ and ‘Justice:Sentence’ subtypes. Since both subtypes belong to the ‘Justice’ event type, instances belonging to these subtypes will most likely contain some semantic overlap. Both subtypes describe similar events and will most likely occur in similar contexts. If one of these subtypes served as target type while the other served as auxiliary type, the model could already learn the structure of ‘Justice’ events through the auxiliary type. When the target type is introduced, one could argue the model already had access to ‘Justice’ events during training, giving it an unfair advantage in a few-shot learning task. To prevent the model from learning from highly related event subtypes, the subtypes should be grouped based on their overall event type. In addition, each group should contain at least five subtypes. To meet these conditions four event groups are constructed: Business & Conflict, Movement & Justice, Contact & Personnel and Life & Transaction (see table 4 in appendix A).

Four experiments are executed, one for each event group. During each experiment, all subtypes belonging to one event group serve as target types while the remaining subtypes serve as auxiliary event types. The models are trained on the auxiliary subtypes and tested on the few-shot classification performance of the target subtypes. In line with previous research [3, 22, 23], F1-scores serve as the performance measure.

4.2 Event detection

The event detection setting is designed to align with previous event detection research [3, 22, 23]. In this setting the models are tested on a 5-shot, 20-shot, 50-shot and 100-shot event detection task. Similar to Nguyen et al. [22], this setting requires the classification of the overall event types and non-events, ignoring the event subtypes.

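| Setting | Model | Classification | Detection |
|---|---|---|---|
| 5-shot | CEMN | 55.9% | 0.7% |
| 5-shot | 2-STAGE | 16.9% | 0.0% |
| 20-shot | CEMN | 62.7% | 8.5% |
| 20-shot | 2-STAGE | 26.2% | 0.0% |
| 50-shot | CEMN | - | 6.3% |
| 50-shot | 2-STAGE | - | 4.3% |
| 100-shot | CEMN | - | 6.5% |
| 100-shot | 2-STAGE | - | 23.2% |

Table 1: Average performances for the event detection and classification tasks on the test set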

Since the ‘Business’ event type does not meet the minimum required event frequency for a 100-shot event detection task it is removed from the dataset, leaving seven event types: Conflict, Contact, Justice, Life, Movement, Personnel and Transaction. In addition, non-trigger tokens are included in the dataset. Seven experiments are executed, one for each event type. The models are trained on the auxiliary types and tested on the few-shot classification performance of the target type. Similar to the event classification task, performance is expressed through the F1 classification scores of the target types. It should be noted that Nguyen et al. [22] use a slightly different setting, as the authors remove all auxiliary event types from the test set. In our opinion this is an unrealistic setting since this would leave the target event type as the only event type in the test set. A model only needs to be able to distinguish events from non-events to perform well in such a setting. This raises the question of whether such a test environment addresses the classification problem. Therefore this approach is not adopted.

4.3 Baseline

For both experiments, the CNN-2-STAGE model by Nguyen et al. [22] serves as the baseline model (see section 2.1). The CNN-2-STAGE model is currently, to the best of our knowledge, the only model specifically designed for few-shot event detection. Since Nguyen et al. [22] could not provide their original code, an implementation was made based on the information provided in their research paper. Since the paper did not provide sufficient information to replicate the original code, some assumptions were necessarily made¹. In addition, Nguyen et al. [22] use a slightly different event detection setting (see section 4.2). Both factors might contribute to differences between the results reported by Nguyen et al. [22] and the results of this research.

Since the CNN-2-STAGE model is a binary event classifier, a model is trained for each target event type. During the experiments each model will indicate if it believes the current instance to be of the type it is trained on. This could result in multiple of these models trying to assign their label to the target instance. In order to resolve such a conflict, the model with the highest classification certainty is chosen as the classifier. The classification certainty is determined by the value of the final neuron of the neural network, the same value the model uses to classify event instances.
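The conflict resolution step can be illustrated with a short sketch. It assumes that the final neuron of each binary model outputs a certainty in the range 0 to 1 and that a model claims an instance when this value exceeds 0.5; both the cut-off and the function name `resolve_label` are assumptions made for illustration and are not taken from Nguyen et al. [22].

```python
def resolve_label(certainties, cutoff=0.5):
    """Pick one label from several binary classifiers.

    certainties maps each event type to the output of the final
    neuron of the binary model trained for that type (assumed to
    lie in [0, 1]). Models above the cut-off claim the instance;
    the most certain claimant wins. If no model claims the
    instance it is treated as a non-event.
    """
    claims = {t: c for t, c in certainties.items() if c > cutoff}
    if not claims:
        return "None"
    return max(claims, key=claims.get)


# Two models claim the instance; the more certain one is chosen.
label = resolve_label({"Conflict": 0.9, "Life": 0.7, "Justice": 0.2})
# No model claims the instance.
no_label = resolve_label({"Conflict": 0.1, "Life": 0.3})
```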

¹ Assumptions were made on, for example, the appropriate learning rate and data pre-processing steps.

Distance metric   1-nearest   3-nearest   5-nearest   Prototypical
Cosine            17.1%       17.2%       16.0%       15.3%
Euclidean         57.7%       53.0%       51.2%       54.9%

Table 2: Average CEMN performance for the event classification task on the validation set

4.4 Parameters and Resources

The performance of the CEMN model depends on the parameters of its three components: the embedding function, the relation function and the classifier. To allow for comparison, the parameters of the convolutional layers in the embedding function are inherited from Nguyen et al. [22]. The fixed window size is set to w = 15. The convolutional filter widths are set to {2, 3, 4, 5}, with 150 filters for each width. The dimension of the position embedding table is set to m_p = 50. The word embedding table is initialized with the pretrained word2vec embeddings from Mikolov et al. [20]. The dimension of the word embedding table is inherited from the pretrained embeddings and thus set to m_w = 300.

For the relation function and the classifier, different parameter settings are explored during both experiments. Both experiments are executed with a cosine-based relation function (cos: r_{i,j}) and a Euclidean-based relation function (eucl: r_{i,j}). Similarly, the experiments are executed with a prototypical classifier and with a 1-, 3- and 5-nearest-sample classifier. For the event classification setting the classification threshold is set to t = 0, since this setting does not include non-events. For the event detection setting the threshold is varied over the range 0 to 1 with step size 0.1. The performance of the different parameter settings is evaluated on the validation set, and the best performing settings are used to compare the CEMN model against the baseline on the test set.

The models are trained using stochastic gradient descent with the AdaDelta update rule [31]. Similar to Vinyals et al. [29], episodes are formed by using 5 query points for 5 different classes and k samples depending on the k-shot test setting.
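A training episode of this kind could be sampled as sketched below. The sketch assumes that the 5 query points are drawn per class and that support and query instances within an episode are disjoint; neither detail is stated explicitly above, and the function name `sample_episode` is hypothetical.

```python
import random


def sample_episode(instances_by_type, k, n_classes=5, n_query=5, rng=random):
    """Sample one k-shot episode.

    instances_by_type maps each auxiliary event type to its list of
    instances. Returns a support set with k samples per class and a
    list of (instance, label) query pairs.
    """
    # Only types with enough instances can take part in an episode.
    eligible = [t for t, xs in instances_by_type.items()
                if len(xs) >= k + n_query]
    classes = rng.sample(eligible, n_classes)
    support, query = {}, []
    for label in classes:
        picked = rng.sample(instances_by_type[label], k + n_query)
        support[label] = picked[:k]
        query.extend((x, label) for x in picked[k:])
    return support, query


# Toy data: six types with 30 placeholder instances each.
data = {f"type{i}": [f"type{i}-inst{j}" for j in range(30)] for i in range(6)}
random.seed(0)
support, query = sample_episode(data, k=5)
```

Sampling support and query instances in a single draw per class guarantees that no query instance also appears in the support set of the same episode.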

5 RESULTS

5.1 Event classification

5.1.1 Parameters. Table 2 shows the average performance of the CEMN model for the 5- and 20-shot event classification tests on the validation set. These results show the Euclidean-based relation score to outperform the cosine-based score on all tests. The model thus seems to work best using a Euclidean distance metric between instances in the embedding space. The performance of the model is much less dependent on the classifier: the k-nearest-sample and prototypical classifiers show similar scores. Based on these results the Euclidean-based 1-nearest-sample CEMN model is chosen to be compared against the baseline.

5.1.2 Baseline comparison. Table 1 shows the performance of the CEMN model compared to the CNN-2-STAGE model for the event classification task. The CEMN model is able to classify the target events without needing to be trained on the few examples given. The CNN-2-STAGE model, however, does depend on fine-tuning on the target events, which results in low performance given only 5 or 20 examples. As a result, the CEMN model clearly outperforms the CNN-2-STAGE model on all event classification tests. Given only 5 sample instances the CEMN model can learn the structure needed to achieve high performance. The CEMN performance per event type is shown in appendix B.

[Figure: F1 (vertical axis) plotted against the classification threshold t (horizontal axis) for the 1-nearest, 3-nearest, 5-nearest and prototypical classifiers]

Figure 3: Average CEMN performance for the event detection task on the validation set

5.2 Event detection

5.2.1 Parameters. Figure 3 shows the average performance of the Euclidean-based CEMN model for the 20-, 50- and 100-shot event detection tests on the validation set. The results show the performance of the model to increase with an increasing threshold until an optimal value is reached, after which performance decreases. This indicates that the classification threshold performs as expected by acting as a minimum certainty requirement. By increasing the threshold, the model will only positively classify an instance if the relation score exceeds the threshold. Unsurprisingly, as the threshold approaches 1 the performance drops, since the minimum certainty requirement becomes too strict. The results show the CEMN model to perform best when using 1-nearest-sample classification in combination with a classification threshold t = 0.6. These parameters are therefore adopted when the model is compared against the baseline.
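The threshold selection described above amounts to a small grid search on the validation set. The sketch below illustrates the procedure on invented data; it reduces the task to a binary decision per instance (target event or not) for brevity, and the names `f1_at_threshold` and `best_threshold` are hypothetical.

```python
def f1_at_threshold(scored, t):
    """F1 when an instance is accepted only if its score exceeds t.

    scored is a list of (relation_score, is_target_event) pairs.
    """
    tp = sum(1 for s, g in scored if s > t and g)
    fp = sum(1 for s, g in scored if s > t and not g)
    fn = sum(1 for s, g in scored if s <= t and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored, steps=10):
    # Candidate thresholds 0.0, 0.1, ..., 1.0 as in section 4.4.
    thresholds = [i / steps for i in range(steps + 1)]
    return max(thresholds, key=lambda t: f1_at_threshold(scored, t))


# Toy validation scores, invented for illustration only.
validation = [(0.9, True), (0.8, True), (0.7, False), (0.65, True),
              (0.5, False), (0.4, False), (0.3, False), (0.2, True)]
chosen = best_threshold(validation)
chosen_f1 = f1_at_threshold(validation, chosen)
```

On this toy data a low threshold accepts many non-events and hurts precision, while a threshold close to 1 rejects nearly everything and hurts recall, which mirrors the rise and fall visible in Figure 3.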

5.2.2 Baseline comparison. Table 1 shows the performance of the CEMN model compared to the CNN-2-STAGE model. Both models fail to achieve any significant performance on the 5-shot detection task. The CEMN model outperforms the CNN-2-STAGE model on the 20-shot and 50-shot event detection tasks. When 20 samples are provided, the CEMN model already has sufficient information to correctly classify a small share of the query instances. The CNN-2-STAGE model needs more samples in order for its performance to increase. As reported by Nguyen et al. [22], the CNN-2-STAGE model needs at least 50 samples to achieve any significant classification performance. The performance of the CEMN model seems less dependent on the number of samples, allowing it to perform comparatively well on small sample sizes. The CEMN performance per event type is shown in appendix B.

5.3 Analysis

The CEMN model shows promising results for the event classification task. The model is able to learn an embedding function that enables it to classify novel event types without needing additional training. Analysis of the incorrectly classified instances gives some insight into the limitations of the model. Consider the following sentence, which contains a Conflict event mention.

(1) A second rocket landed in farmlands and the other hit a house inside the refugee camp , but without causing further casualties, Palestinian security sources said.

The CEMN model classifies the sentence as an event mention of the Movement type. This classification mistake can be understood through the semantic meaning of the trigger token. The trigger token ‘landed’ contains semantic information which can be associated with Movement events, since it could for example indicate that an airplane has ‘landed’. However, since the sentence describes the landing of a rocket, the semantic meaning is that of a Conflict event. The model thus fails to account for the context word that is important for correctly classifying the trigger token. In general, trigger tokens whose meaning depends on the context prove to be difficult for the CEMN model to classify.

Nevertheless, the CEMN model clearly outperforms the CNN-2-STAGE model and achieves high performance even when only 5 example events are available for each type. The performance, however, severely degrades in the event detection task. When confronted with non-event tokens, the model is not able to effectively discriminate events from non-events. The CEMN model often achieves 100% accuracy on a training episode, indicating it can indeed discriminate events from non-events in the training set. However, the knowledge that allows it to classify non-events during training proves to be insufficient to recognize non-events in the test set. Analysis of the performance of the model shows multiple possible causes. First of all, whereas event tokens can be compared based on their semantic similarities, this proves to be much more difficult for non-event tokens. Since all non-event tokens belong to the None class, the semantic variance within this class is much higher than the variance within the event classes. Since all instances belonging to the same event class describe the same event, they share some semantic similarity. Non-event tokens, in contrast, do not share such a semantic similarity. Unsurprisingly, words like ‘attack’, ‘shot’ and ‘bomb’, all trigger tokens for Conflict events, are semantically more similar to each other than non-event tokens like ‘after’, ‘government’ or ‘house’. This makes it difficult for the model to learn the semantic structure of a non-event, causing the performance on the event detection task to be lower than on the event classification task.

In addition to the large semantic variance within the None class, the None class also contains tokens which have a high semantic similarity to event tokens. Consider the following sentences:

(1) A powerful bomb tore through a waiting shed at the Davao City international airport at about 5.15 pm -LRB- 0915 GMT -RRB- while another explosion hit a bus terminal at the city.

(2) Television footage showed medical teams carting away dozens of wounded victims with fully armed troops on guard.

The CEMN model classifies both instances as a Conflict event while they are labelled as None according to the ACE corpus. These examples indicate the difficulty for the CEMN model in correctly classifying non-events. The tokens ‘hit’ and ‘victim’ are words which indirectly indicate a conflict. However, these tokens are not considered events according to the ACE corpus and are thus labelled as non-events. The presence of tokens which have a high semantic similarity to event tokens but are not labelled as such could also be a cause of the low performance on the event detection task.

Figure 4: A t-SNE visualization of the embedding space during the event detection task. The red dots indicate the target events while the other colors indicate the auxiliary events.

Finally, analysis of the dataset shows there exists an ambiguity in the annotation of non-event tokens. Consider this sentence from the annotation guideline:

(1) The Government of China has ruled Tibet since 1951 [..]

One could argue the word ‘ruled’ to be an event trigger for a Political:Govern event type. However, the ACE guidelines do not include any Political events or subevents. Therefore, even if in reality the token denotes a political event, it is not annotated as such. The ACE 2005 corpus is only annotated for the 8 event types described in the guidelines. As a result, the corpus contains a high volume of sentences which in reality contain event tokens but are classified as None. This essentially introduces false negative event triggers into the dataset, making it difficult for a classifier to learn the difference between trigger and non-trigger tokens.

The combination of a high semantic variance within the None class, the presence of event-like tokens and the presence of false negatives makes it difficult for the CEMN model to learn the difference between events and non-events. Figure 4 shows a t-SNE visualization of the embedding space during the event detection task. The figure shows all instances except for the non-event instances. The non-events are excluded from the visualization to show that the model is still able to cluster related event types together in the event detection task. These clusters should enable the model to classify novel event types. However, when the model is confronted with non-event instances, the clusters become less useful, as shown in figure 5. The non-event instances are spread throughout the embedding space and interfere with the clusters of the target type. This makes it difficult for the model to discriminate between the target events and non-events.

Figure 5: A t-SNE visualization of the embedding space during the event detection task. The red dots indicate the target events while the blue dots indicate non-events.

We suggest two approaches, both of which fall outside the scope of this research, to increase the performance of the model on the non-event types. First, the false negative non-events in the ACE 2005 corpus can be removed by annotating more event types. Currently only 8 event types are annotated, resulting in all other events occurring in the corpus being classified as None. By extending the annotations, the CEMN model could gain more knowledge on the difference between events and non-events. Secondly, other models could be employed to separate the events from the non-events before feeding them into the CEMN model for classification. These models could be trained specifically to distinguish events from non-events by using, for example, lexical features such as part-of-speech tags.

6 CONCLUSION

In this paper we introduced the convolutional event matching network model. This few-shot event detection model is designed to overcome limitations posed by the poor portability of supervised methods given the limited coverage of available event annotations. The CEMN architecture is the result of combining CNN-2-STAGE, the current state-of-the-art few-shot event detection model, with novel few-shot learning methods such as episode-based training and a matching network architecture. The CEMN model is trained end-to-end to classify novel event types given only a few examples. A few-shot event detection task and a few-shot event classification task are constructed to test the few-shot learning abilities of the model. The CEMN model yields state-of-the-art performance on the 5- and 20-shot event classification tasks. The classification performance on the event detection task is limited, however. The CEMN model seems unable to learn the general structure of non-events, causing non-event instances to interfere with the classification of all event instances. The high semantic variance within the non-event instances, the presence of event-like tokens and the presence of false negatives are identified as possible causes. For future research we propose two directions which could improve performance on the non-events. The annotations of the ACE corpus could be extended, and other models could be employed to select event instances before feeding them into the CEMN model. In addition, in future work we would like to employ different neural architectures in the embedding function, such as recurrent neural networks. Finally, we would like to verify the effectiveness of the CEMN architecture in other tasks such as relation extraction.

REFERENCES

[1] David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events. Association for Computational Linguistics, 1–8.

[2] Christopher G Atkeson, Andrew W Moore, and Stefan Schaal. 1997. Locally weighted learning for control. In Lazy learning. Springer, 75–113.

[3] Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1. 167–176.

[4] Linguistic Data Consortium. 2005. ACE (automatic content extraction) English annotation guidelines for events. (2005). https://www.ldc.upenn.edu/collaborations/past-projects/ace

[5] Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28, 4 (2006), 594–611.

[6] David Ferrucci and Adam Lally. 2004. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering 10, 3-4 (2004), 327–348.

[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400 (2017).

[8] Stanislav Fort. 2017. Gaussian Prototypical Networks for Few-Shot Learning on Omniglot. arXiv preprint arXiv:1708.02735 (2017).

[9] Yu Hong, Jianfeng Zhang, Bin Ma, Jianmin Yao, Guodong Zhou, and Qiaoming Zhu. 2011. Using cross-entity inference to improve event extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 1127–1136.

[10] Lifu Huang, Heng Ji, Kyunghyun Cho, and Clare R Voss. 2017. Zero-Shot Transfer Learning for Event Extraction. arXiv preprint arXiv:1707.01066 (2017).

[11] Heng Ji and Ralph Grishman. 2008. Refining event extraction through cross-document inference. Proceedings of ACL-08: HLT (2008), 254–262.

[12] Jing Jiang. 2009. Multi-task transfer learning for weakly-supervised relation extraction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2. Association for Computational Linguistics, 1012–1020.

[13] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).

[14] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.

[15] Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. 2011. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 33.

[16] Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 73–82.

[17] Shasha Liao and Ralph Grishman. 2010. Using document level cross-event inference to improve event extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 789–797.

[18] Shulin Liu, Yubo Chen, Kang Liu, Jun Zhao, Zhunchen Luo, and Wei Luo. 2017. Improving Event Detection via Information Sharing Among Related Event Types. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, 122–134.

[19] David McClosky, Mihai Surdeanu, and Christopher D Manning. 2011. Event extraction as dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 1626–1635.

[20] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 746–751.

[21] Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 300–309.

[22] Thien Huu Nguyen, Lisheng Fu, Kyunghyun Cho, and Ralph Grishman. 2016. A two-stage approach for extending event detection to new types via neural networks. In Proceedings of the 1st Workshop on Representation Learning for NLP. 158–165.

[23] Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Vol. 2. 365–371.

[24] Haoruo Peng, Yangqiu Song, and Dan Roth. 2016. Event detection and co-reference with minimal supervision. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 392–402.

[25] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065 (2016).

[26] Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems. 4080–4090.

[27] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. 2017. Learning to Compare: Relation Network for Few-Shot Learning. arXiv preprint arXiv:1711.06025 (2017).

[28] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics. Association for Computational Linguistics, 384–394.

[29] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems. 3630–3638.

[30] Leiming Yan, Yuhui Zheng, and Jie Cao. 2018. Few-shot learning for short text classification. Multimedia Tools and Applications (2018), 1–12.

[31] Matthew D Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).


APPENDIX A

EVENT TYPES

Table 3: Event types in the ACE 2005 corpus

Type          Subtype              Training   Validation   Test
None          None                 66317      15945        15768
Business                           83         1            21
              End-Org              23         0            5
              Start-Org            21         0            14
              Declare-Bankruptcy   27         1            2
              Merge-Org            12         0            0
Conflict                           1104       155          85
              Attack               1045       146          78
              Demonstrate          59         9            7
Contact                            283        32           50
              Meet                 182        29           43
              Phone-Write          101        3            7
Justice                            523        39           55
              Arrest-Jail          66         4            6
              Execute              9          4            2
              Trial-Hearing        90         1            5
              Charge-Indict        84         2            8
              Acquit               4          0            1
              Appeal               23         7            5
              Sentence             75         4            10
              Convict              60         6            6
              Sue                  53         11           4
              Release-Parole       32         0            1
              Fine                 20         0            6
              Extradite            5          0            1
              Pardon               2          0            0
Life                               675        66           36
              Die                  449        53           16
              Injure               107        13           1
              Divorce              18         0            8
              Marry                66         0            8
              Be-Born              35         0            3
Movement                           540        55           47
              Transport            540        55           47
Personnel                          343        39           42
              End-Position         123        23           17
              Start-Position       80         11           10
              Elect                133        5            14
              Nominate             7          0            1
Transaction                        193        44           36
              Transfer-Money       115        39           10
              Transfer-Ownership   78         5            26


Table 4: Event groups for the event classification setting

Justice & Movement       Business & Conflict           Life & Transaction               Contact & Personnel
Justice:Jail             Business:Declare-Bankruptcy   Life:Die                         Contact:Meet
Justice:Hearing          Business:End-Org              Life:Injure                      Contact:Phone-Write
Justice:Charge-Indict    Business:Start-Org            Life:Divorce                     Personnel:Start-Position
Justice:Appeal           Conflict:Demonstrate          Life:Marry                       Personnel:Elect
Justice:Sentence         Conflict:Attack               Life:Be-Born                     Personnel:End-Position
Justice:Convict                                        Transaction:Transfer-Money
Justice:Sue                                            Transaction:Transfer-Ownership
Justice:Release-Parole
Justice:Fine
Movement:Transport


APPENDIX B

CEMN PERFORMANCE

Table 5: CEMN performance on the event classification task

Target type         5-shot              20-shot
                    CEMN     2-STAGE    CEMN     2-STAGE
Justice-Movement    41.8%    7.9%       61.3%    57.4%
Business-Conflict   71.5%    17.1%      71.9%    17.1%
Life-Transaction    62.6%    5.1%       60.3%    7.3%
Contact-Personnel   47.6%    37.4%      57.1%    23.0%
Average             55.9%    16.9%      62.7%    26.2%

Table 6: CEMN performance on the event detection task

Target type    20-shot             50-shot             100-shot
               CEMN     2-STAGE    CEMN     2-STAGE    CEMN     2-STAGE
Movement       5.7%     0.0%       2.0%     0.0%       2.4%     22.6%
Conflict       29.0%    0.0%       27.2%    7.8%       23.2%    21.6%
Justice        0.7%     0.0%       7.1%     3.3%       6.1%     15.4%
Life           15.6%    0.0%       4.9%     0.0%       7.2%     17.0%
Personnel      1.9%     0.0%       0.7%     4.3%       2.8%     31.6%
Contact        2.3%     0.0%       2.0%     9.8%       2.8%     45.3%
Transaction    4.3%     0.0%       0.5%     5.0%       0.7%     8.7%
Average        8.51%    0.00%      6.34%    4.32%      6.47%    23.17%