Unsupervised Domain Adaptation For Event Detection


Mehdi Amiri


Master’s Thesis

To fulfill the requirements for the degree of Master of Arts in Information Science

at the University of Groningen under the supervision of Dr. Andreas van Cranenburgh (University of Groningen) and Dr. T. Caselli (Information Science, University of Groningen)

Mehdi Amiri (s5044030)

January 27, 2023


Abstract

The present study will investigate unsupervised domain adaptation for event detection between the news as the source domain and movie subtitles as the target domain, both in the same language.

Adapting an event detection model to an unlabeled target domain mitigates the cost of manual annotation. Because language is so diverse, it is prohibitively expensive to gather and curate training sets for each separate domain. Moreover, differences in language and writing style between domains can significantly raise the error rate of cutting-edge supervised models, especially in the movie domain, which is highly diverse. In addition, predicting events in the target domain might help to recognize the genre of a movie and to extract its emotional content from the events and their temporal order. It could also help to automate drafting the main narrative and summarizing and classifying the primary concept. By generalizing models from a resource-rich source domain to a different, resource-poor target domain, domain adaptation techniques offer a solution to these issues. We propose a domain adaptation approach that applies further pretraining before fine-tuning on the event detection task. To this end, we build a custom pretraining corpus of movie scripts scraped from the IMSDB¹ repository. Furthermore, we annotate a portion of four movie subtitles for testing purposes using the Richer Event Description (RED) guidelines. The experiments show that the proposed method outperforms the Conditional Random Field (CRF) baseline as well as other competitors such as zero-shot and low-resource domain-adaptive pretraining models.

Keywords: unsupervised domain adaptation, further pretraining, domain-adaptive pretraining, fine-tuning, BERT, transformer

¹ https://imsdb.com/


1 Introduction

The meaning of words is a key question in the philosophy of language and in linguistics. It can be approached from two perspectives: structural linguistics and computational linguistics. Structural linguistics holds that the meaning of a word is related to a set of semantically related words in nearby or distant sentences [1]. [2] demonstrates that the meaning of a word can only be derived from the other words in the context in which it is used. Likewise, context plays a central role in word representation [3]. The idea of deriving meaning in this way comes from the distributional hypothesis [4], which suggests that meaning can be inferred from a large body of text. Deriving meaning from context has become a widespread idea in linguistics. Another aspect of word meaning is that a word may have multiple meanings, so it carries several possible meanings regardless of where it is placed in a context. Choosing among them requires semantic and syntactic interpretation together with inference over the surrounding context [5]. A more precise meaning is therefore obtained from broader knowledge of the setting and domain of the context.

When we extend this discussion to the event detection task, we find that a word which conventionally denotes an event is not an event in every context, and vice versa.

Take the word “building” as an example. It may seem strange to claim that this term is an event only in some situations. The primary meaning of this word is “constructing something”, which is an event. This reading is the typical one. Now imagine that one wants to refer to a structure that has already been built. The sentence might be “The building is beautiful”. The interpretation changes with the syntactic role of the word “building”: here it refers to an existing structure, not to the process of constructing something, so the word is not an event in this context. This interpretation becomes more challenging in a specific domain. Consider the word “fire”, which has several meanings in different contexts: “burning”, “terminating employment”, and “discharging a gun or other weapon” can all be understood from it. The exact meaning depends on the context and domain of the text. If a Natural Language Processing (NLP) system does not account for the domain, its performance drops substantially, so every NLP system needs to account for domain information. The usual solution is to train the model on data from the domain of interest. However, this faces some problems. The first is that gathering labeled training data requires a lot of human effort, which makes the task practically infeasible.

Due to the wide variety of languages, it is extremely expensive to compile and organize training sets for each individual subject. The error rate of cutting-edge supervised models can also be greatly increased by differences in language, literature, and writing styles between domains, particularly in the movie domain, which has a huge variety.

Therefore, adapting models to different domains is a wiser solution to the aforementioned problems. The present study investigates unsupervised domain adaptation for event detection with news as the source domain and movie subtitles as the target domain, both in the same language (English). Adapting an event detection model to an unlabeled target domain reduces the cost of manual annotation. Besides, predicting events in movie subtitles may make it possible to identify the movie genre and to derive movie sentiment from the events and their temporal order. Automatically summarizing and categorizing the central idea and drafting the narrative would also be beneficial. Domain adaptation strategies address these problems by generalizing models from a resource-rich source domain to a resource-poor target domain.


1.1 Research Questions

The main research questions are:

Q1. How can domain adaptation be used for event trigger detection from a news corpus to movie subtitles?

Q1.1 What adaptation technique do we need?

Q1.2 What is the impact of further pretraining material in terms of size?

Q1.3 How similar are the RED corpus and the movie subtitles?

1.2 Thesis Outline

The next chapter reviews related work on supervised, semi-supervised, and unsupervised domain adaptation techniques and on event detection approaches. Section 3 then describes the data used in the proposed model in three categories: RED labeled data, the pretraining corpus, and annotated movie subtitles for testing. The proposed method, baseline, and evaluation methods are discussed in section 4. Section 5 explains the experimental setups and results. Section 6 presents an analysis of the experiments, and finally, section 7 provides a conclusion and suggestions for future work.


2 Related Works

This chapter discusses and analyzes domain adaptation strategies, providing a wide range of methods and associated literature. This might aid in gaining a better knowledge of the concepts and state of current research.

2.1 Supervised domain adaptation

The prerequisite for supervised domain adaptation is the availability of a sizable, annotated corpus of data from both the source domain and the target domain. Supervised domain adaptation methods fall into two categories: instance-based and augmentation-based. Instance-based approaches weight individual observations during training according to their relevance to the target domain, whereas augmentation-based methods transform the initial feature space into a new space that is predictive for both domains.

2.1.1 Augmentation-based Methods

[6] makes domain adaptation “frustratingly easy” by offering a very simple yet effective method. The technique is applied as a preprocessing step and works with any supervised learning algorithm. Given a sizable training set in the source domain and some labeled examples in the target domain, the authors create a source-specific, a target-specific, and a domain-invariant version of each feature. In other words, each feature from the source domain and the target domain is replicated into three versions: the first is source-specific, the second is domain-invariant, and the third is target-specific. For an instance from one domain, the version specific to the other domain is set to zero. After this preprocessing step, the augmented features are fed into a classifier. The logic behind this strategy is illustrated by the feature weights: the learner gives the source-specific version more weight if a feature is significant only in that domain, and gives the domain-invariant version more weight if the feature behaves the same way in both domains.

As a result, the system can learn that “the” is used as a determiner across domains, whereas the feature “monitor as a noun” receives a large weight in the computer (target) domain and the feature “monitor as a verb” receives a large weight in the general (source) domain.

Despite being relatively straightforward, this method produces excellent outcomes for a variety of real-world problems. Using Perl, the implementation just requires 10 lines of code.
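
The feature replication described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation of [6]: features are stored as a sparse dictionary, the function name `augment` and the `general:`/`source:`/`target:` prefixes are hypothetical, and the copy that belongs to the other domain is simply left out, which is equivalent to setting it to zero.

```python
from typing import Dict


def augment(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Replicate each feature into a shared copy and a domain-specific copy.

    The copy for the other domain is omitted, i.e. implicitly zero in this
    sparse representation. Prefix names are illustrative assumptions.
    """
    if domain not in ("source", "target"):
        raise ValueError("domain must be 'source' or 'target'")
    augmented: Dict[str, float] = {}
    for name, value in features.items():
        augmented[f"general:{name}"] = value    # domain-invariant version
        augmented[f"{domain}:{name}"] = value   # source- or target-specific version
    return augmented


# The same raw feature yields different augmented features per domain.
source_example = augment({"word=monitor": 1.0}, "source")
target_example = augment({"word=monitor": 1.0}, "target")
```

Any standard classifier can then be trained on the augmented dictionaries; the classifier itself decides how much weight the shared and the domain-specific copies receive.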

2.1.2 Instance-based Methods

In the literature, the problem addressed by instance-based domain adaptation is sometimes referred to as domain shift. Different domains in this context correspond to different probability distributions p(x, y) over the same pair of feature and label spaces (X, Y), where X is the feature space and Y is the label space. In general, instance-based approaches use data from the source domain to reduce the target risk, that is, the expected loss of a classifier under the target distribution, where the loss function compares the classifier's prediction with the actual label. The target risk is connected to the source distribution through importance weighting: each source sample is weighted by the ratio of the marginal data distributions of the target and source domains. A large weight indicates that the sample has a high probability under the target distribution but a low probability under the source distribution. As a result, the loss increases for some samples and decreases for others [7]. Instance weighting techniques have also been studied for statistical machine translation (SMT). For instance, [8] selects from a huge general-domain parallel corpus the phrases that are most pertinent to the target domain. Source-domain samples are selected based on their perplexity under a language model trained on target-domain data. This method is also known as data selection in the literature. It was also used to improve task-specific word representations in [9]. The authors describe this strategy as curriculum learning, in which the ordering of the training examples read by the model is learned. They improve performance by optimizing the curriculum for training word representations, which are then used as input features in NLP tasks. This strategy is based on the idea that some tasks, such as Part-of-Speech (POS) tagging, prefer vectors trained on curricula that emphasize well-formed sentences, whereas Named Entity Recognition (NER) tasks favor vectors trained on corpora that begin with named entities [9]. Learning the curriculum thus improves performance on downstream tasks compared with random or natural corpus orders.
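
To make the importance weighting idea from this section concrete, the following sketch weights the loss of each source sample by the ratio of target to source probability. It is a toy illustration under stated assumptions: the two marginal distributions are known and discrete (in practice the ratio has to be estimated), and the names `importance_weights`, `weighted_risk`, `p_source`, and `p_target` are hypothetical.

```python
from typing import Dict, List, Sequence


def importance_weights(
    samples: Sequence[str],
    p_source: Dict[str, float],
    p_target: Dict[str, float],
) -> List[float]:
    """Weight each source sample x by p_target(x) / p_source(x)."""
    return [p_target[x] / p_source[x] for x in samples]


def weighted_risk(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Estimate the target risk as the weighted average of source losses."""
    return sum(w * loss for w, loss in zip(weights, losses)) / len(losses)


# Toy marginals (assumed known here): "b" is rare in the source, common in the target.
p_source = {"a": 0.8, "b": 0.2}
p_target = {"a": 0.2, "b": 0.8}

weights = importance_weights(["a", "b"], p_source, p_target)
risk = weighted_risk([1.0, 0.5], weights)
```

The loss on sample "a" is scaled down and the loss on sample "b" is scaled up, which matches the observation above that the loss rises for some samples and falls for others.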

2.2 Semi-Supervised Domain Adaptation

Semi-supervised domain adaptation is a problem setting in which only a limited amount of labeled data is available in the target domain, possibly too little to train a competent classifier. Several semi-supervised algorithms have been developed to address this challenge. They make use of various data resources, including a significant amount of labeled data from a source domain and a large amount of unlabeled data from both the source and target domains.

2.2.1 Embedding-based Methods

Domain-generic word embeddings such as Word2vec [10] and GloVe [11] have shown significant efficacy in transferring prior knowledge to downstream tasks. Typically, these embeddings are trained on a corpus such as Wikipedia or Common Crawl. When they are used as features in supervised learning tasks, they transfer knowledge. These approaches are semi-supervised domain adaptation methods since they benefit from vast volumes of unlabeled data. To adapt a classifier to a target domain, word embeddings are frequently computed from scratch on unlabeled data from the target domain to capture domain-specific semantics and are then used as input features for classification. Producing high-quality domain-specific word embeddings requires a considerable amount of unlabeled data from the target domain. For the low-data setting, [12] introduce a method that obtains domain-adapted word embeddings from a small amount of labeled data from the target domain and the knowledge in the generic word embeddings, with the following steps:

1) Algorithms like GloVe or word2vec are used to generate domain-generic embeddings.

2) Domain-specific embeddings are obtained by running algorithms such as Latent Semantic Analysis (LSA) on the target domain's supervised data.

3) A linear Canonical Correlation Analysis (CCA) [13] or a nonlinear kernel [14] is used to combine the domain-generic and domain-specific embeddings.

The embeddings are then projected along the directions of maximum correlation, and the new domain-adapted embedding is obtained by averaging the projected domain-generic and domain-specific embeddings. The evaluation results on sentiment classification tasks show that domain-adapted embeddings outperform both domain-generic and domain-specific embeddings when used as input features to standard sentence encoding algorithms such as bag-of-words or state-of-the-art ones such as InferSent [15].

Another approach, cross-domain word embeddings, is based on regularization and uses unlabeled data from different domains. As an example, [16] tries to connect the source and target data. Frequent words are marked as pivots, and pivots are constrained to have the same embedding in both domains. The method uses the pivots to predict the surrounding non-pivot words.

2.2.2 Contextualized Embedding-based Methods

Contextualized embedding approaches were introduced to go beyond transferring word embeddings. These approaches pre-train neural networks on huge unlabeled text corpora before fine-tuning the models on downstream tasks. They rely on Language Models (LMs), which are able to capture numerous language-relevant characteristics such as long-term dependencies and hierarchical relationships [17]. As a result, they can learn the complex features of a word while also taking its context into account: the vector representation of a word varies depending on the context in which it appears.

To adapt the language model to a specific task, it is trained on the downstream task by fine-tuning all pre-learned parameters. Such a technique needs few task-specific parameters, so it can easily be applied to practically every NLP task and domain, from question answering to sentiment analysis. This technique may therefore also be seen as domain adaptation, that is, the adaptation of a pre-trained language model to a target task from the target domain.

Contextualized embedding methods can be grouped into the following approaches.

LSTM-Based Approach LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) that is commonly used in natural language processing tasks. In domain adaptation, LSTM-based approaches can be used to adapt a model trained on one domain (e.g. news articles) to a different domain (e.g. scientific papers). This can be done by fine-tuning the LSTM model with a small amount of labeled data from the target domain or, when no labeled target data is available, by using unsupervised techniques such as adversarial training or domain confusion to align the feature representations of the source and target domains. There are two successful LSTM-based contextualized models. The first is Universal Language Model Fine-tuning (ULMFiT) [17], which is built on an LSTM language model. It consists of three stages: general-domain language model pre-training, target task language model fine-tuning, and target task classifier fine-tuning. The technique is universal in that it works across tasks with a single architecture and training procedure, and it requires no custom feature engineering or preprocessing and no additional in-domain documents or labels [17]. Another feature of ULMFiT is that it employs dedicated fine-tuning techniques to retain previously learned knowledge and to avoid catastrophic forgetting. Even with only 100 labeled samples, the authors report that the method avoids overfitting and delivers state-of-the-art results on small datasets. They also report that it outperforms the previous state of the art on a variety of text classification tasks, including sentiment analysis, question classification, and topic classification.

Another method in this category is Embeddings from Language Models (ELMo) [18]. In this approach, word vectors are learned functions of the internal layers of a deep bidirectional language model. For each downstream task, the model learns a linear combination of the vectors stacked above each input word. The bidirectional language model is first run to obtain all of the layer representations for each word. Then, a linear combination of these representations is computed. After that, the pre-trained representations are used as additional features in task-specific architectures. The approach therefore requires task-specific architectures in order to attain high performance. The authors recommend fine-tuning the bidirectional language model on domain-specific data. They report that this improves the performance of a downstream task, so it may be used for domain adaptation.
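
The task-specific linear combination of layer representations used by ELMo can be illustrated as follows. This is a small sketch, assuming the usual formulation in which the per-layer scores are softmax-normalized and the result is scaled by a scalar; the function names `softmax` and `scalar_mix` and the toy vectors are our own.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw layer scores into weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(
    layer_vectors: List[List[float]],
    layer_scores: List[float],
    gamma: float = 1.0,
) -> List[float]:
    """Combine the layer vectors of one word into a single vector."""
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * layer[i] for w, layer in zip(weights, layer_vectors))
        for i in range(dim)
    ]


# Three toy layer representations of one word; equal scores give a plain average.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = scalar_mix(layers, layer_scores=[0.0, 0.0, 0.0])
```

In a real model the layer scores and `gamma` are learned together with the downstream task, so different tasks can emphasize different layers.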


Transformer-Based Approach Later contextualized embedding models use the transformer-based language model for pretraining [19]. Instead of recurrence, this architecture relies on an attention mechanism to model global dependencies between input and output. As a consequence, approaches based on this language model established a new state of the art that outperforms previously published approaches. OpenAI GPT [20], GPT-2 [21], T5 [22], XLNet [23], BERT [24], and its variants RoBERTa [25] and DistilBERT [26] are examples of these approaches.

BERT (Bidirectional Encoder Representations from Transformers) [24] is one of the latest transformer-based models. While previous models like ELMo [18] and GPT [27] use unidirectional language modeling to learn language representations, BERT is designed to pre-train bidirectional representations. This allows modeling the context in both directions, which is crucial for many NLP tasks ranging from sentence-level and question-answering tasks to token-level tasks.

A pre-trained BERT model [28] is then tailored to a specific task. By jointly conditioning on both left and right context in all layers, BERT is designed to pre-train deep bidirectional representations from unlabeled text. As a consequence, the pre-trained BERT model can be fine-tuned with just one extra output layer to produce task-specific models for a variety of purposes, including question answering and language inference, without significant architectural changes. During pre-training, the model is trained on unlabeled data over different pre-training tasks. Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) are the two unsupervised tasks used for BERT pre-training. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are then fine-tuned using labeled data from the downstream task. Although they are initialized with the same pre-trained parameters, each downstream task has its own fine-tuned model (Figure 1).

Figure 1: BERT: Pretraining and Finetuning [28]

The model architecture of BERT is a multi-layer bidirectional transformer encoder. The original English BERT release includes two pre-trained models: BERT BASE, a 12-layer, 768-hidden, 12-head, 110M-parameter network, and BERT LARGE, a 24-layer, 1024-hidden, 16-head, 340M-parameter network. Both models were trained on the BooksCorpus (800M words) and a version of the English Wikipedia (2,500M words). The BERT input representation can unambiguously represent both a single sentence and a pair of sentences in one token sequence. WordPiece embeddings [29] with a vocabulary of 30,000 tokens are used. The first token of every sequence is always a special classification token, [CLS]. For classification tasks, the final hidden state corresponding to this token is used as the aggregate sequence representation. Sentence pairs are packed together into a single sequence. The sentences are distinguished in two ways: first, they are separated by a special token, [SEP]; second, a learned embedding is added to every token to indicate whether it belongs to sentence A or sentence B. The input representation of a given token is built by summing the corresponding token, segment, and position embeddings [28].

Figure 2 shows the input representation.

Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings, and the position embeddings [28].
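
The packing of a sentence pair and the summation of the three embeddings can be sketched as below. The helper names `build_inputs` and `sum_embeddings` are hypothetical, the tokens are assumed to be already split into WordPiece units, and the tiny two-dimensional vectors only stand in for learned embeddings.

```python
from typing import List, Optional, Sequence, Tuple


def build_inputs(
    tokens_a: Sequence[str],
    tokens_b: Optional[Sequence[str]] = None,
) -> Tuple[List[str], List[int], List[int]]:
    """Pack one or two sentences into a single BERT-style sequence.

    Returns the tokens, the segment ids (0 for sentence A, 1 for sentence B)
    and the position ids.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def sum_embeddings(
    token_vec: Sequence[float],
    segment_vec: Sequence[float],
    position_vec: Sequence[float],
) -> List[float]:
    """Input representation of one token: element-wise sum of its three embeddings."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]


tokens, segment_ids, position_ids = build_inputs(["my", "dog"], ["he", "barks"])
vector = sum_embeddings([1.0, 2.0], [0.5, 0.5], [0.0, 1.0])
```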

The benefit of employing WordPiece is that it removes the need for special handling of unknown words. Additionally, it strikes a good balance between the flexibility of single characters and the decoding efficiency of whole words [29]. Position embeddings are required to encode the position of each token in the sequence. Finally, for problems involving sentence pairs, segment embeddings help to distinguish between the two sentences.

Masked Language Modeling trains a deep bidirectional representation by randomly masking a portion of the input tokens and predicting them. As in a standard LM, the final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary. In BERT, 15% of all WordPiece tokens in each sequence are selected at random for prediction. Because the [MASK] token never appears during fine-tuning, masking creates a mismatch between pre-training and fine-tuning. To mitigate this, a selected token is not always replaced by [MASK]: it is replaced by [MASK] 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time.
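
The masking procedure can be sketched as follows, using the 80%/10%/10% replacement rule of the original BERT recipe. It is a simplified illustration: each token is selected independently with probability `mask_prob` instead of choosing exactly 15% of the positions, and the function name `mask_tokens` and the toy vocabulary are assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return the corrupted tokens and the prediction targets (None = not predicted)."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Special tokens are never selected; other tokens with probability mask_prob.
        if token in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = token                  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


example_tokens = ["[CLS]", "the", "dog", "barks", "[SEP]"]
toy_vocab = ["cat", "runs", "a"]
corrupted, targets = mask_tokens(example_tokens, toy_vocab, rng=random.Random(0))
```

Only positions with a non-empty target contribute to the MLM loss; all other positions are ignored.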

In several NLP tasks [30], the relationship between sentences is crucial. To train a model that understands sentence relationships, BERT is pre-trained on a binarized next sentence prediction task whose examples can easily be generated from any monolingual corpus. For each pretraining example, sentences A and B are chosen such that 50% of the time B is the sentence that actually follows A (labeled IsNext), and 50% of the time it is a random sentence from the corpus (labeled NotNext).

However, research has indicated that this strategy is ineffective [25]. It seemed that BERT learned topic similarity rather than the inter-sentence coherence that NSP was supposed to teach.
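
For illustration, the construction of a single next sentence prediction example can be sketched as below. This is a simplification under our own assumptions: the random negative sentence is drawn from the same small list of sentences rather than from a whole corpus, and the function name `make_nsp_example` is hypothetical.

```python
import random
from typing import Sequence, Tuple


def make_nsp_example(
    sentences: Sequence[str],
    index: int,
    rng: random.Random,
) -> Tuple[str, str, str]:
    """Build one (sentence A, sentence B, label) triple.

    `sentences` is an ordered list with at least three items and
    `index` must point to a sentence that has a successor.
    """
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Half of the time B is the sentence that really follows A.
        return sentence_a, sentences[index + 1], "IsNext"
    # Otherwise B is a random sentence that is neither A nor its true successor.
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), "NotNext"


toy_sentences = ["He opened the door.", "The room was dark.", "Rain fell outside.", "A phone rang."]
example = make_nsp_example(toy_sentences, 1, random.Random(0))
```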

Fine-tuning is rather simple since the transformer’s self-attention mechanism enables BERT to model a variety of downstream tasks, whether they involve single texts or text pairs, by swapping in the relevant inputs and outputs. At the input, sentences A and B from pre-training are analogous to any pair of texts, for example the sentence pairs used in paraphrasing. At the output, the [CLS] representation is fed into an output layer for classification tasks such as entailment or sentiment analysis, while the token representations are fed into an output layer for token-level tasks such as sequence tagging and named entity recognition. Figure 3 shows BERT fine-tuning on a number of different tasks.


Figure 3: Illustrations of Fine-tuning BERT on Different Tasks.[28]


Further pretraining a BERT-like model Recent trends in natural language processing advocate the use of large pretrained transformer-based language models [18], [28], [31]. These models frequently attain good performance on a wide range of tasks by being fine-tuned on downstream tasks with a modest quantity of labeled data.

By continuing pretraining on additional data, so-called further pretraining, the model learns better representations from the new data and gains better performance [32]. During further pretraining, the model's weights are updated. The architecture is the same as that of the pre-trained model being trained further, and training uses the two unsupervised tasks (MLM and NSP). Before the data is fed into the model, it has to be prepared and formatted. BERT-like models accept unlabeled data arranged sentence by sentence, with every sentence on its own line. The tokenizer has to be applied to the data to extract its vocabulary.
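
The one-sentence-per-line format can be produced with a small preprocessing step such as the one below. This is only a sketch: the regular-expression sentence splitter is a naive stand-in for a proper sentence segmenter, and the function name `to_sentence_lines` is our own, not part of any pretraining toolkit.

```python
import re
from typing import List


def to_sentence_lines(text: str) -> List[str]:
    """Split raw text into sentences so that each one can be written on its own line."""
    # Collapse line breaks and repeated spaces left over from scraping.
    text = re.sub(r"\s+", " ", text).strip()
    # Naive rule (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if s]


raw = "He ran.  She\nfollowed! Why?"
lines = to_sentence_lines(raw)
corpus_text = "\n".join(lines)  # ready to be written to a pretraining file
```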

2.3 Unsupervised Domain Adaptation

Unsupervised domain adaptation aims to build a domain-invariant representation so that a classifier trained on labeled data from the source domain can be applied to data from the target domain. Domain-invariant representations can be learned by exploiting labeled data from the source domain together with unlabeled data from both the source and target domains.

2.3.1 Pivot-based methods

[33] describe structural correspondence learning (SCL), a technique for adapting linear discriminative models from resource-rich source domains to resource-poor target domains. The core concept is the use of pivot features that are frequent and behave similarly in both the source and target domains. SCL constructs a shared representation by searching for a low-dimensional feature subspace that enables reliable prediction of the presence or absence of pivot features in unlabeled input. The authors report that it outperforms state-of-the-art supervised models while using only unlabeled target data. In addition, a formal framework for examining domain adaptation tasks is developed.

[34] introduce the Pivot Based Language Model (PBLM), a representation learning model that combines pivot-based and neural network (NN) modeling in a structure-aware manner. Specifically, their model processes the text with a sequential NN (LSTM), and its output is a context-dependent representation vector for each input word. In contrast to the majority of earlier representation learning models in domain adaptation, PBLM can naturally feed structure-aware text classifiers such as LSTM and CNN. They investigate the task of cross-domain sentiment classification on 20 domain pairs and demonstrate significant gains over robust baselines.

2.3.2 Domain Adversarial training method

A Domain-specific Adapter-based Adaptation (DAA) framework is proposed by [35] to enhance the adaptability of BERT-based models for event detection across domains. By explicitly representing data from different domains with separate adapter modules in each layer of BERT, DAA introduces a joint representation learning mechanism and a Wasserstein distance-based technique for data selection in adversarial learning. Three adapters are used to create a shared-private representation subspace: one adapter for the source domain representations, one for the target domain representations, and a joint adapter that is used for the final event classification task. To make the joint representation as domain-general as possible, two components are combined: a layer-wise domain adversarial (LDA) component and an adapter-wise domain disentanglement (ADD) component. LDA applies domain adversarial training to the joint adapters. The role of the ADD component is to ensure that the adapters maintain the shared-private relationship. The authors claim that this architecture significantly improves performance on target domains.

Figure 4: Pseudo labeling training framework [36]

2.3.3 Pseudo-labeling techniques

[36] introduced a Pseudo-label Guided unsupervised Adaptation (PGA) method for unsupervised domain adaptation, shown in Figure 4. In this technique, two models with the same architecture are fine-tuned jointly, one on top of a general pre-trained model and one on top of a target-domain pre-trained model. These two fine-tuned models then make predictions on the unlabeled target data. If both models agree on a prediction and its confidence exceeds the threshold, the prediction is regarded as a pseudo-label. Finally, all pseudo-labels are collected to fine-tune the target-domain pre-trained model. Experiments indicate that the technique produces promising results on a variety of downstream NLP tasks.
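
The agreement-and-threshold rule for choosing pseudo-labels can be sketched as follows. This is not the implementation of [36]: the function name `select_pseudo_labels`, the default threshold of 0.9, and the requirement that both confidences reach the threshold are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Prediction = Tuple[str, float]  # (predicted label, confidence)


def select_pseudo_labels(
    preds_general: Sequence[Prediction],
    preds_target: Sequence[Prediction],
    threshold: float = 0.9,
) -> List[Tuple[int, str]]:
    """Keep (example index, label) pairs on which both models agree confidently."""
    selected: List[Tuple[int, str]] = []
    for i, ((label_g, conf_g), (label_t, conf_t)) in enumerate(
        zip(preds_general, preds_target)
    ):
        # Assumption: both models must agree and both must reach the threshold.
        if label_g == label_t and min(conf_g, conf_t) >= threshold:
            selected.append((i, label_g))
    return selected


general_model_preds = [("B-EVENT", 0.97), ("O", 0.95), ("B-EVENT", 0.60)]
target_model_preds = [("B-EVENT", 0.93), ("B-EVENT", 0.91), ("B-EVENT", 0.99)]
pseudo_labels = select_pseudo_labels(general_model_preds, target_model_preds)
```

In this toy example only the first prediction is kept: the second is rejected because the models disagree, and the third because one model is not confident enough.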


3 Data

3.1 Overview

3.1.1 Source Domain

This study takes its source data from the Richer Event Description (RED) corpus [37], whose formalism is one of the most comprehensive forms of event-event relation annotation, applied to over 95 English newswire, discussion forum, and narrative text documents. On a concrete level, the task of RED annotation is to mark which units in the text should be considered to be entities, events, or times in a document, to label those units with features such as modality, and to mark temporal, causal, event-substructural, and coreference relations between them.

3.1.2 Target Domain

The target domain is movie subtitles. Since test data is needed, four movies were chosen for annotation: “Agatha Christie’s Marple: The Body in the Library (2005)”, “Batman v Superman: Dawn of Justice (2016)”, “The Social Network (2010)”, and “Psycho (1960)”.

Each of these movies has already been annotated minute by minute for sentiment analysis, with labels representing a negative, neutral, or positive sentiment based on watching it [38]. We tried to keep the types of movies diverse to balance the data and provide reasonable insight.

3.2 The Pipeline and General Intuitions of Annotation Process

Each document includes several entities, numerous events, and numerous interactions between them. Additionally, the annotation process is an adjudicated task: each document receives two annotations, and any conflicts between them are settled. Following the RED annotation guidelines², a two-stage workflow is used. The first stage is the annotation of events, entities, and the coreference and partial-coreference relations between entities. A third annotator then adjudicates the annotations and resolves any conflicts. The second stage marks the relations between events on top of the adjudicated events, both to indicate when events are identical and to mark links between related events. The pipeline’s structure is intended to increase the reliability of the annotation process. It is crucial for annotators to think about consistency in terms of whether their annotations are understandable and predictable, rather than merely avoiding outright mistakes. Not every relationship noticed in the text can be marked, since document-level annotation requires a thorough grasp of the context, and an annotator's interpretation will frequently differ in subtle ways from that of fellow annotators. The approach is therefore conservative: it aims to capture the rich “meaning” of the text while staying consistent, which is why the guidelines lay out a variety of rules and restrictions, such as rules for finding events and entities; modality and polarity; marking the relationship to document time; event type, aspect, and implicitness; temporal expressions; entity coreference relations; linking events together; avoiding redundant causation annotation; event coreference; aspectual link annotation; and reporting annotation. Labeling EVENT and ENTITY instances is the primary and most important activity in the first stage of annotation. In this study, we consider only events and annotate them in two rounds. Any conflicts are then resolved with the help of the supervisor.

2https://github.com/timjogorman/RicherEventDescription/blob/master/guidelines.md


3.2.1 Event Definition

RED defines an event as any occurrence, action, process, or event state that merits a place on a timeline. Events can take any syntactic realization, including verbs, nominalizations, nouns, and even adjectives. Whether something is an event must be decided without regard to syntax; syntactic factors are taken into account when deciding the extent of the annotated span, but not when deciding whether to treat something as an event at all. At this point the annotator should instead concentrate on the semantics of what is actually happening, determining whether the words under consideration represent a sequence of changes, transitions, or states in the real world.

3.2.2 BIO formatting

BIO formatting is a common scheme for representing named entity recognition (NER) data in a text corpus. The scheme assigns each token in a sentence a tag that indicates whether the token is at the beginning of, inside, or outside a named entity. The tags use the prefixes "B-" and "I-" followed by the entity type, and the bare tag "O". For example, in the sentence "Barack Obama was born in Hawaii.":

• "Barack" is tagged as "B-PER" (beginning of a person named entity)

• "Obama" is tagged as "I-PER" (inside of a person named entity)

• "was" is tagged as "O" (outside of a named entity)

• "born" is tagged as "O"

• "in" is tagged as "O"

• "Hawaii" is tagged as "B-LOC" (beginning of a location named entity)

So the BIO-formatted sentence looks like this: "B-PER I-PER O O O B-LOC".

We use the same formatting for "EVENT" tags, which gives the tags "B-EVENT", "I-EVENT" and "O".
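The BIO scheme for events can be illustrated with a short sketch. It is not the annotation tooling used in this study: the function names, the example sentence, and the choice of event spans are hypothetical and serve only to show how spans map to "B-EVENT", "I-EVENT" and "O" tags and how a multi-token event is counted once.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int]],
                 label: str = "EVENT") -> List[str]:
    """Convert (start, end) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the chunk
    return tags


def count_events(tags: List[str]) -> int:
    """Each chunk starts with exactly one B- tag, so counting B- tags counts events."""
    return sum(1 for tag in tags if tag.startswith("B-"))


# Hypothetical example: one two-token event and one single-token event.
tokens = ["They", "took", "off", "after", "the", "meeting"]
tags = spans_to_bio(tokens, [(1, 3), (5, 6)])
print(list(zip(tokens, tags)))
print(count_events(tags))
```

Counting only the "B-" tags is what makes a multi-token chunk contribute a single event to the totals.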

3.3 Train set for fine tuning

The training set is based on the RED corpus, which was annotated for events using the RED guidelines³. In contrast to traditional semantic role annotation, which concentrates on the relationships between events and the entities participating in them, RED is more concerned with the hierarchical structure of events, times, and participants, as well as with tracking those events and participants across a document. The aim of RED annotation is to find the textual elements in a document that are entities, events, or times. The temporal, causal, event-substructural, and coreference links between these elements are then marked, and the elements are given properties such as modality. With labels for when each event occurs and which events are contained within others, the output of this annotation can be compared to a playbill of the participants and a chronology of the events that took place within a document. Connecting every mention of a constant character, environment, etc. creates a continuous sense of each entity.

From this perspective, the desired outcome is a deep understanding of the text, especially when combined with within-sentence comprehension of semantic roles. To this end, the annotation concentrates on capturing relations that are not frequently detected in semantic-role labeling tasks. We need the guidelines to annotate the movie test data. For this study, the RED train, development, and test data are already available. Table 1 shows the distribution of tokens and events in both the RED corpus and the movie subtitle test set.

3https://github.com/timjogorman/RicherEventDescription/blob/master/guidelines.md

| | RED Train Set | RED Dev Set | RED Test Set | Movie Test Set |
|---|---|---|---|---|
| Number of tokens | 65051 | 7599 | 7325 | 20574 |
| Number of unique tags | 3 | 3 | 2 | 3 |
| Number of unique POS tags | 43 | 40 | 42 | 48 |
| Number of unique dependencies | 45 | 44 | 45 | 44 |
| Total Events | 7621 | 855 | 846 | 2645 |
| Multi-token events | 46 | 8 | 0 | 34 |

Table 1: RED and Movie Subtitle datasets characteristics

Multi-token events are events composed of more than one token; they consist of "B-" and "I-" tags in the corresponding chunk. We count each of them as one event.

Table 3 shows the distribution of part-of-speech (POS) tags for event tokens per dataset. As we can see, most of the events are nouns, verbs, or adjectives. The POS tags are produced by the NLTK POS tagger⁴, and the meanings⁵ of the most prevalent ones are shown in Table 2.

| POS tag | Meaning |
|---|---|
| NN | Noun, singular or mass |
| NNS | Noun, plural |
| NNP | Proper noun, singular |
| VB | Verb, base form |
| VBD | Verb, past tense |
| VBN | Verb, past participle |
| VBG | Verb, gerund or present participle |
| VBP | Verb, non-3rd person singular present |
| VBZ | Verb, 3rd person singular present |
| JJ | Adjective |
| PRP | Personal pronoun |

Table 2: The most prevalent POS tags of event tokens in RED and Movie subtitles

4https://github.com/nltk/nltk/wiki

5https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html


| RED train set: POS tag | count | RED dev set: POS tag | count | RED test set: POS tag | count | Movie subtitle test set: POS tag | count |
|---|---|---|---|---|---|---|---|
| NN | 1963 | NN | 250 | NN | 231 | VB | 841 |
| VB | 1022 | VBD | 94 | VBD | 117 | NN | 338 |
| VBD | 999 | VB | 92 | VB | 100 | VBD | 300 |
| VBN | 726 | VBN | 88 | VBN | 98 | VBP | 298 |
| NNS | 578 | JJ | 59 | NNS | 61 | VBG | 234 |
| NNP | 480 | NNP | 51 | JJ | 60 | VBN | 189 |
| VBG | 459 | NNS | 51 | VBG | 54 | JJ | 154 |
| JJ | 406 | VBZ | 39 | NNP | 50 | VBZ | 95 |
| VBP | 316 | VBP | 28 | VBP | 25 | NNS | 55 |
| VBZ | 245 | PRP | 10 | VBZ | 21 | NNP | 54 |

Table 3: Top 10 RED and Movie subtitles POS tags of event tokens

3.4 Test set for evaluation

Four movies were selected for annotation with diversity in mind: "The Social Network (2010)", "Agatha Christie's Marple: The Body in the Library (2005)", "Batman v Superman: Dawn of Justice (2016)", and "Psycho (1960)". For each of these four movies, 1000 lines were annotated; some lines contain more than one sentence. Table 1 shows the distribution of tokens and events in the annotated movie test set.

3.5 Further Pre-training Dataset

Pretraining models on large unlabeled upstream corpora to optimize self-supervised goals, such as masked language modeling (MLM), is currently the best practice for training predictive models that operate on natural language data[32], [39]. The resulting weights are then used to initialize models that are subsequently trained (finetuned) on the labeled downstream data available for the task at hand.

Compared to models trained exclusively on the downstream task (i.e. from random initialization), large-scale pretrained models can offer considerable performance improvements[18], [24].

To provide such an upstream corpus for pretraining, we scraped 1000 movies from IMSDB⁶, a web service that provides movie scripts. The scraper is a custom Python script that uses the "BeautifulSoup"⁷ library to parse webpages. Excluding the 4 annotated movies, more than 2.5 million unlabeled sentences were gathered from all scripts that were available and downloadable through the IMSDB repository. The cleaning process consisted of the following steps:

• Single sentences that had been broken across multiple lines were joined.

• Special characters (e.g. "..") that separate the words of a single sentence were removed.

• The writer credits and other opening lines that give information about the movie, as well as the character names attached to each line, were removed from the text.

• The title of each movie was extracted for later tracking.

• The lines were split into separate sentences with the NLTK toolkit.

6https://imsdb.com/

7https://pypi.org/project/beautifulsoup4/
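The cleaning steps listed above can be sketched roughly as follows. This is not the script used in this study: the function names, the upper-case heuristic for character cues, the regular expressions, and the sample lines are all assumptions made for illustration, and the naive regex splitter only stands in for the NLTK sentence splitter mentioned in the list.

```python
import re
from typing import List


def drop_speaker_names(lines: List[str]) -> List[str]:
    """Heuristic (assumed): a short, all upper-case line is a character cue."""
    return [ln for ln in lines
            if not (ln.strip().isupper() and len(ln.split()) <= 3)]


def join_broken_lines(lines: List[str]) -> List[str]:
    """Join consecutive non-empty lines; blank lines separate paragraphs."""
    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def collapse_dots(text: str) -> str:
    """Replace runs of two or more dots (and surrounding spaces) with one space."""
    return re.sub(r"\s*\.{2,}\s*", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Naive splitter on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def clean_script(lines: List[str]) -> List[str]:
    sentences = []
    for paragraph in join_broken_lines(drop_speaker_names(lines)):
        sentences.extend(split_sentences(collapse_dots(paragraph)))
    return sentences


# Invented sample lines in a script-like layout.
raw = ["ALICE", "I think I have found", "something .. important.",
       "", "BOB", "What is it? Tell me."]
print(clean_script(raw))
```

A heuristic such as the upper-case test will also drop short shouted lines, so a real pipeline would need a more careful rule for character cues.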

3.6 Textual Analysis on Corpora

Textual analysis gives better insight into the behavior of the models, especially in an unsupervised setting, because it lets us estimate how closely the model outcomes match our expectations. One tool commonly used for such analysis in natural language processing is the Jensen-Shannon divergence (JSD)[40], which measures the similarity between the distributions of two corpora. JSD can also describe the differences between corpora by identifying their most divergent keywords, and several variants of it have been developed[41],[42].

JSD is based on the Kullback-Leibler (KL) divergence, but unlike the KL divergence it is symmetric. The KL divergence originates in information theory, whose main objective is to quantify the amount of information contained in a given set of data. The entropy of a discrete random variable X measures the average amount of information needed to describe it. Because it quantifies the uncertainty of a variable, entropy is the most significant quantity in information theory.

When a discrete random variable X with possible values x_1, ..., x_N has a probability mass function P(x), Shannon defined its entropy H as in equation (1), where b is the base of the logarithm:

$$H = -\sum_{i=1}^{N} P(x_i) \log_b P(x_i) \tag{1}$$

The relative entropy, also known as the Kullback-Leibler divergence (KL divergence), measures how much one distribution differs from another. For discrete probability distributions P(x) and Q(x) defined on the same sample space X, it is given by:

$$D_{KL}(P \,\|\, Q) = \sum_{x \in X} P(x) \ln \frac{P(x)}{Q(x)} \tag{2}$$

Let P(x) and Q(x) be two probability mass functions (i.e. discrete distributions). Then D_KL(P||Q) ≥ 0, with equality D_KL(P||Q) = 0 if and only if P(x) = Q(x) for every x. However, because it is not symmetric and does not satisfy the triangle inequality, the KL divergence is not a true distance metric. Note also that the KL divergence is defined only if, for every x, Q(x) = 0 implies P(x) = 0.

The Jensen-Shannon divergence (JSD)[40] is an alternative measure of the similarity between two probability distributions. It is a symmetric and smoothed variation of the KL divergence, which makes it a suitable basis for a distance measure. With the mixture M = (P + Q)/2, the JS divergence can be expressed as follows:

$$JSD(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M)$$

$$JSD(P \,\|\, Q) = H\!\left(\frac{P + Q}{2}\right) - \frac{H(P) + H(Q)}{2}$$

According to the second formula, the JS divergence equals the entropy of the mixture minus the average of the entropies of the two distributions. The square root of the JSD is frequently used as a true distance metric.
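The equivalence of the two JSD expressions can be checked numerically. The sketch below is an illustration only, written in plain Python with natural logarithms; the function names and the toy distributions are assumptions and do not come from the experiments in this study.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q); only defined when q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jsd_kl_form(p: Sequence[float], q: Sequence[float]) -> float:
    m = [(a + b) / 2 for a, b in zip(p, q)]   # mixture M = (P + Q) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def jsd_entropy_form(p: Sequence[float], q: Sequence[float]) -> float:
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Toy distributions over a three-word vocabulary.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(jsd_kl_form(p, q), jsd_entropy_form(p, q))
print(math.sqrt(jsd_kl_form(p, q)))  # square root, usable as a distance
```

Because the mixture M is positive wherever P or Q is positive, the KL terms inside the JSD are always defined, which is not true of the KL divergence between P and Q directly.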


Although a weighted JSD, which accounts for corpus size when comparing corpora, may seem more appropriate, [43] demonstrates that it is not helpful and may systematically and incorrectly assign higher divergence scores to words that appear frequently in the smaller corpus.

In this study, we therefore use the JSD variant that does not take corpus size into account.


4 Methods

In this section, we discuss the baseline and the proposed methods with which we address the research questions. We first explain the concept of early stopping.

Early Stopping. Early stopping is a regularization method intended to prevent machine learning models from overfitting. Training is interrupted when the model's performance on a held-out validation dataset begins to deteriorate. This keeps the model from overfitting the training set and helps it perform better on other datasets.

Early stopping works by tracking a model's performance on a validation dataset while it is being trained. The validation dataset is held out from training and is used solely to assess how well the model is doing during training. The performance of the model on the validation dataset is monitored as training progresses. When the performance on the validation dataset starts to deteriorate (for example, when the validation loss starts to rise), this is a sign that the model is beginning to overfit the training data. The training process is then terminated, and the final model uses the weights and biases from the point where the validation performance was best. Early stopping is a common and straightforward technique, and it is frequently combined with additional regularization strategies such as dropout, weight decay, and data augmentation. Early stopping need not monitor only the performance on the validation set; other signals, such as changes in the model's parameters over time, can also be tracked. In this study, we apply early stopping on the validation set with a fixed patience threshold, along with weight decay.
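A patience-based stopping rule of the kind described above can be sketched as follows. This is a simplified illustration, not the training loop of this study: the function name and the list of validation losses are hypothetical, and a real loop would also save the model weights whenever the best loss improves.

```python
from typing import List, Tuple


def early_stopping(val_losses: List[float], patience: int = 3) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses.

    Training stops once the loss has failed to improve on the best value
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses: the minimum is at epoch 2, and the
# three following epochs do not improve on it, so training stops at epoch 5.
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
print(early_stopping(losses, patience=3))
```

In this example the later improvement at epoch 6 is never reached, which is the trade-off a small patience value makes in exchange for shorter training.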

4.1 Baseline: Conditional Random Fields(CRF)

A CRF is a framework for building probabilistic models to segment and label sequence data [44]. It is an alternative to Hidden Markov Models (HMMs) and stochastic grammars, which are well-known and widely applied probabilistic models for labeling and segmenting sequences. In computational linguistics and computer science, HMMs and stochastic grammars have been applied to topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation, among other text processing problems [45]. HMMs and stochastic grammars are generative models: they assign a joint probability to paired observation and label sequences, and their parameters are generally trained to maximize the joint likelihood of the training examples. To define a joint probability over observation and label sequences, a generative model needs to enumerate all possible observation sequences, which typically requires a representation in which observations are task-appropriate atomic units, such as words. It is not practical to represent multiple interacting features or long-range dependencies of the observations, because the inference problem for such models becomes intractable.

A conditional model, in contrast, specifies the probabilities of possible label sequences given an observation sequence. It therefore does not spend modeling effort on the observations, which are fixed at test time. Moreover, the conditional probability of the label sequence can depend on arbitrary, non-independent features of the observation sequence without forcing the model to account for the distribution of those dependencies. The chosen features may represent attributes of the same observations at different levels of granularity (for example, words and characters), and the probability of a transition between labels may depend on past and future observations, if they are available.

Compared to hidden Markov models and stochastic grammars, conditional models offer a number of advantages, including the ability to relax the strong independence assumptions made by those models.

A remaining problem with such models is that they may fail to exploit prior knowledge about structure in information extraction tasks. A proper solution requires models that account for whole state sequences at once by allowing some transitions to "vote" more strongly than others, depending on the associated observations. This means that score mass is not conserved: individual transitions can "amplify" or "dampen" the mass they receive. Conditional random fields (CRFs) combine the best features of generative and classification models. Like classification models, they are trained discriminatively and can accommodate many statistically correlated input features. Like generative models, they can trade off decisions at different sequence positions to obtain a globally optimal labeling. [44] demonstrated that CRFs outperformed comparable classification models and HMMs on synthetic data and on a part-of-speech tagging task.

CRF Definition. Let G = (V, E) be a graph such that Y = (Y_v)_{v∈V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G[44].

Here X is a random variable over data sequences to be labeled, and Y is a random variable over the corresponding label sequences. Every component Y_i of Y is assumed to range over a finite label alphabet 𝒴. For instance, X may range over natural language sentences and Y over part-of-speech taggings of those sentences, with 𝒴 the set of possible part-of-speech tags. Although the random variables X and Y are jointly distributed, in a discriminative framework a conditional model p(Y | X) is built from paired observation and label sequences, without explicitly modeling the marginal p(X).
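For sequence labeling, the graph G is commonly taken to be a linear chain over the label positions. As an illustration of the definition above, the conditional distribution of such a linear-chain CRF is usually written in the following form. The notation here is an assumption introduced for this sketch and is not taken from the rest of this thesis: f_k are feature functions, λ_k their weights, n the sequence length, and Z(x) the normalizer.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, x, i) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{i-1}, y'_i, x, i) \right)
```

A feature function that ignores y_{i-1} depends only on the current label and the observations, whereas one that uses both y_{i-1} and y_i scores a label transition. The sum in Z(x) runs over all label sequences y' in 𝒴^n, which is what makes the labeling globally normalized.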

4.2 Proposed method

We follow a two-step process for unsupervised domain adaptation: 1) starting from a pre-trained model, we continue pre-training it with the masked language modeling task on data from the target domain (movie subtitles); 2) we fine-tune the resulting model for the event detection task using labeled data from the source domain. The underlying model is BERT[28], which is built on the transformer architecture.

Domain-adaptive pretraining. To adapt the model to the target domain, we initialize it with the weights of a broadly pre-trained BERT and continue pre-training on an unlabeled collection of movie subtitles.

In this stage, we keep the MLM objective for training and do not need any labeled data. In other words, we pretrain the BERT language model on unlabeled target data, a process known as domain-adaptive pre-training (DAPT) or further pretraining. The resulting model is then fine-tuned on the downstream event detection task in the next step.


Previous studies showed that a second phase of in-domain pretraining (domain-adaptive pretraining) leads to gains under both high- and low-resource settings [32], [30], [39], [46], [47]. The purpose of pretraining with more unlabeled data is to expose the model to as much data as possible. As the number of training steps increases, the model sees different masked tokens in each epoch, given that 15% of tokens are masked during pretraining, so more parameters and weights are updated.

Furthermore, because the training size may affect the performance of the model, we run the DAPT process on three different portions of the target data: 10%, 50%, and 100% of the whole data. Selecting a particular volume of data has two effects: 1) the tokenizer is trained on that volume of data, so the vocabulary may contain different entries, because token frequency is a hyperparameter of the tokenizer trainer and the vocabulary capacity is fixed; and 2) the learned features and attributes, such as syntactic roles and text representations, may change with the size of the data. Together, these effects may influence how well the model performs. Although [48] demonstrates such an impact for the downstream data, our goal is to examine it for the pretraining process.

Finetuning. In this phase, we fine-tune a DAPT model. A linear classifier with dimensions (768, 3) takes the output of the last BERT layer and classifies each token into one of 3 labels: "O", "B-EVENT" and "I-EVENT".

For the classification task, the model is trained on labeled data from the source domain (RED); for the masked language modeling task, it is trained on unlabeled data from the target domain (Section 3).

For comparison, we also fine-tune the standard pre-trained BERT model on the RED corpus. This model is then evaluated on the movie test data in a zero-shot setting: we do not train it on the unlabeled target movie subtitles, but only fine-tune it for the event detection task and test it on the movie subtitles test set.


5 Experiments

In this section, we describe the five experiments that we carried out: CRF, ZSL, DAPT 1 + FT, DAPT 2 + FT, and DAPT 3 + FT.

ZSL stands for zero-shot learning and represents the standard pre-trained BERT model that is just finetuned over the RED corpus. Additionally, each DAPT+FT model is a domain-adaptive pre-trained model which is followed by a fine-tuning process over the source data. All the experiments were done against RED corpus as the source domain and movie subtitles as the target domain.

5.1 Evaluation Metrics

5.1.1 F1-score

All experiments are evaluated with the same metrics. The main metric is the F1-score, which combines the precision and recall of a model. The accuracy metric, by contrast, measures how often the model predicts correctly across the full dataset. Accuracy can only be trusted if the dataset is class-balanced, meaning that each class contains roughly the same number of samples.

Our dataset, however, is severely class-imbalanced, which makes accuracy uninformative. The F1-score is a more suitable measure; it is defined in equation (3).

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$

We evaluate the models on event tokens only rather than on all tokens. This makes the measurement more interpretable and more robust to data imbalance, because tokens that trigger an event are far less frequent in the corpus than other tokens.
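One way to restrict equation (3) to event tokens is sketched below. The exact counting rule used in this study is not spelled out, so the rule here is an assumption: a token counts as a true positive only when the predicted tag is an event tag and matches the gold tag exactly. The function name and the toy tag sequences are hypothetical.

```python
from typing import List, Tuple


def event_token_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over event tags only ("O" is ignored as a class)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != "O" and p == g:
            tp += 1                 # correct event tag
        else:
            if p != "O":
                fp += 1             # predicted an event tag that is wrong
            if g != "O":
                fn += 1             # missed (or mislabeled) a gold event tag
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: 2 correct event tags, 1 spurious event tag, 1 missed event tag.
gold = ["O", "B-EVENT", "I-EVENT", "O", "B-EVENT", "O"]
pred = ["O", "B-EVENT", "O", "B-EVENT", "B-EVENT", "O"]
print(event_token_prf(gold, pred))
```

Note that the three correctly predicted "O" tokens do not affect any of the three scores, which is why this measure is not inflated by the majority class in the way accuracy is.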

5.1.2 Jensen-Shannon Divergence

Besides the F1-score, we compute the Jensen-Shannon divergence [40], which measures how much the RED and movie subtitles corpora diverge from each other in terms of their token distributions. In an unsupervised setting, the models do not have enough knowledge about the movie subtitle domain, whose word types and frequencies might differ completely from those of the RED corpus. JSD helps to measure the differences between the source and the target domain and gives better insight into the influence of the domain shift and into how the model behaves given the vocabulary differences. To compute JSD, we build for each corpus a dictionary of tokens with their corresponding frequencies. After normalizing the frequencies so that they lie between 0 and 1, we compute the JSD with the scipy library⁸, which is based on the formula introduced in Section 3.6.
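The procedure described above, from token counts to a divergence value, can be sketched in plain Python. This is an illustration under assumptions rather than the code used in this study: the function names and the two tiny token lists are invented, and the computation is written out by hand instead of calling scipy. For reference, `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, i.e. the square root of the divergence.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def corpus_distributions(tokens_a: List[str], tokens_b: List[str]
                         ) -> Tuple[List[str], List[float], List[float]]:
    """Relative token frequencies of both corpora over their joint vocabulary."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    p = [counts_a[w] / total_a for w in vocab]
    q = [counts_b[w] / total_b for w in vocab]
    return vocab, p, q


def js_contributions(vocab: List[str], p: List[float], q: List[float]
                     ) -> Dict[str, float]:
    """Per-token terms of the JSD; their sum is the divergence (in nats)."""
    scores = {}
    for word, a, b in zip(vocab, p, q):
        m = (a + b) / 2
        term = 0.0
        if a > 0:
            term += 0.5 * a * math.log(a / m)
        if b > 0:
            term += 0.5 * b * math.log(b / m)
        scores[word] = term
    return scores


# Two invented mini-corpora standing in for the source and target domains.
source = ["the", "vote", "the", "bill"]
target = ["the", "gun", "you", "you"]
vocab, p, q = corpus_distributions(source, target)
scores = js_contributions(vocab, p, q)
divergence = sum(scores.values())
distance = math.sqrt(divergence)
most_divergent = max(scores, key=scores.get)
print(round(distance, 3), most_divergent)
```

Ranking the per-token terms is one simple way to obtain a list of the most distinguishing tokens between two corpora.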

5.2 Baseline: CRF

As the baseline, we use the Conditional Random Field (CRF) implementation from the CRF++ library⁹, which is designed for segmenting and labeling sequential data. The features of the library are as follows:

8https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html

9https://taku910.github.io/crfpp/


• Can redefine feature sets

• Written in C++

• Fast training based on L-BFGS, a quasi-Newton algorithm for large-scale numerical optimization problems

• Less memory usage both in training and testing

• encoding/decoding in a practical time

• Can perform n-best outputs

• Can output marginal probabilities for all candidates

• Available as open-source software

5.2.1 Experimental setup

For the CRF++ library, the training and test files must follow a specific format. Each file contains multiple tokens, and each token has multiple columns. Tokens are extracted with the BERT tokenizer. Each token is written on a single line, with the columns separated by white space. A sequence of tokens forms a sentence, and an empty line marks the boundary between sentences.

The columns carry the following semantics:

" accountability NN csubj B-EVENT "

The first column is the token itself (e.g. accountability). The second column is the POS tag (e.g. NN) associated with the token. The third column is the dependency relation (e.g. csubj) of the token in the sentence. The last column is the label (e.g. B-EVENT). Following BIO formatting, the possible labels are "O", which marks a token that is not an event; "B-EVENT", which marks the first token of an event; and "I-EVENT", which marks any further token in the same chunk as that first token.

In addition to the data, there is a template file that describes which features are used in training and testing. The file contains one template per line. Each template uses the special macro %x[row,col] to refer to a token in the input data (Table 5). Here row specifies the position relative to the currently focused token, whereas col specifies the absolute position of the column. To illustrate, consider the example sentence in Table 4.

The templates come in two types, and the first character of a template specifies its type. Unigram templates start with the character "U". When we define a template "U01:%x[0,1]", CRF++ automatically generates a set of feature functions (func1 ... funcN) like:

func1 = if (output = B-EVENT and feature="U01:NN") return 1 else return 0

func2 = if (output = I-EVENT and feature="U01:NN") return 1 else return 0

func3 = if (output = O and feature="U01:NN") return 1 else return 0

....

funcXX = if (output = B-EVENT and feature="U01:DT") return 1 else return 0

funcXY = if (output = O and feature="U01:DT") return 1 else return 0

| token | POS-tag | dependency | label | |
|---|---|---|---|---|
| The | DT | det | O | |
| accountability | NN | csubj | B-EVENT | |
| in | IN | prep | O | |
| the | DT | det | O | |
| process | NN | pobj | B-EVENT | << suppose we are here |
| is | VBZ | root | O | |
| with | IN | prep | O | |
| the | DT | det | O | |
| governor | NN | pobj | O | |
| · | · | | O | |

Table 4: CRF: sentence format example

| template | expanded feature |
|---|---|
| %x[0,0] | process |
| %x[0,1] | NN |
| %x[-1,0] | the |
| %x[-2,1] | IN |
| %x[1,1] | VBZ |
| %x[-1,0]/%x[-1,1] | the/DT |

Table 5: CRF: feature template examples
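The macro expansion shown in Table 5 can be reproduced with a few lines of Python. This sketch only mimics the %x[row,col] lookup for illustration and is not CRF++ code: the function name `expand` and the "<PAD>" placeholder for positions outside the sentence are assumptions, and the rows are the word tokens of the example sentence in Table 4.

```python
import re
from typing import List, Tuple

# Rows of the example sentence: (token, POS tag, dependency, label).
ROWS: List[Tuple[str, str, str, str]] = [
    ("The", "DT", "det", "O"),
    ("accountability", "NN", "csubj", "B-EVENT"),
    ("in", "IN", "prep", "O"),
    ("the", "DT", "det", "O"),
    ("process", "NN", "pobj", "B-EVENT"),   # the currently focused token
    ("is", "VBZ", "root", "O"),
    ("with", "IN", "prep", "O"),
    ("the", "DT", "det", "O"),
    ("governor", "NN", "pobj", "O"),
]

MACRO = re.compile(r"%x\[\s*(-?\d+)\s*,\s*(\d+)\s*\]")


def expand(template: str, rows: List[Tuple[str, ...]], position: int) -> str:
    """Expand every %x[row,col] macro relative to the token at `position`."""
    def lookup(match: "re.Match[str]") -> str:
        row = position + int(match.group(1))   # row is relative to the focus
        col = int(match.group(2))              # col is an absolute column index
        if 0 <= row < len(rows):
            return rows[row][col]
        return "<PAD>"                         # assumed placeholder, not CRF++'s
    return MACRO.sub(lookup, template)


current = 4  # index of "process"
for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]",
                 "%x[1,1]", "%x[-1,0]/%x[-1,1]"]:
    print(template, "->", expand(template, ROWS, current))
```

Text outside the macros, such as the "/" separator, is copied through unchanged, which is how combined features like the/DT arise.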

These functions are defined over all sentences for each template. A template produces (L * N) feature functions, where L is the number of output classes and N is the number of distinct strings expanded from the template. The other type is the bigram template, which starts with the character "B" and describes bigram features. With this template, the current output label and the previous output label are combined into a bigram. This type of template creates a total of (L * L * N) distinct features, where L is the number of output classes and N is the number of distinct strings expanded from the template. When the number of classes is large, this type of template produces a very large number of distinct features, which makes training and testing inefficient. Note that input features can also be combined within a unigram template; the bigram template, by contrast, combines output labels to model the dependency between adjacent labels. In other words, unigram and bigram features refer to uni/bigrams of output tags.

• unigram: [output tag] × [all possible strings expanded with a macro]

• bigram: [output tag] × [output tag] × [all possible strings expanded with a macro]
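The two counting rules above can be made concrete with a small calculation. The helper names are hypothetical, and the set of distinct strings is taken from the POS column of the word tokens in the Table 4 example rather than from the full training data.

```python
from typing import Iterable


def unigram_feature_count(num_labels: int, expanded: Iterable[str]) -> int:
    """L * N: one feature per (output tag, distinct expanded string) pair."""
    return num_labels * len(set(expanded))


def bigram_feature_count(num_labels: int, expanded: Iterable[str]) -> int:
    """L * L * N: one feature per (previous tag, current tag, string) triple."""
    return num_labels * num_labels * len(set(expanded))


LABELS = ["O", "B-EVENT", "I-EVENT"]                  # L = 3
# Strings expanded from %x[0,1] over the word tokens of the example sentence.
pos_strings = ["DT", "NN", "IN", "DT", "NN", "VBZ", "IN", "DT", "NN"]

print(unigram_feature_count(len(LABELS), pos_strings))   # 3 * 4
print(bigram_feature_count(len(LABELS), pos_strings))    # 3 * 3 * 4
# A bare "B" template expands to a single string, so it yields L * L features.
print(bigram_feature_count(len(LABELS), ["B"]))
```

With only three output labels the quadratic factor stays small, so the bigram template is inexpensive in this setting.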

We keep the default settings of the library. Table 6 shows the predefined templates:


| feature name | template |
|---|---|
| # Unigram | |
| U00 | %x[-2,0] |
| U01 | %x[-1,0] |
| U02 | %x[0,0] |
| U03 | %x[1,0] |
| U04 | %x[2,0] |
| U05 | %x[-1,0]/%x[0,0] |
| U06 | %x[0,0]/%x[1,0] |
| U10 | %x[-2,1] |
| U11 | %x[-1,1] |
| U12 | %x[0,1] |
| U13 | %x[1,1] |
| U14 | %x[2,1] |
| U15 | %x[-2,1]/%x[-1,1] |
| U16 | %x[-1,1]/%x[0,1] |
| U17 | %x[0,1]/%x[1,1] |
| U18 | %x[1,1]/%x[2,1] |
| U20 | %x[-2,1]/%x[-1,1]/%x[0,1] |
| U21 | %x[-1,1]/%x[0,1]/%x[1,1] |
| U22 | %x[0,1]/%x[1,1]/%x[2,1] |
| U23 | %x[0,1] |
| # Bigram | |
| B | |

Table 6: CRF default templates

5.3 Proposed method

In this section, we explain our proposed method for domain adaptation in event detection. It consists of a domain-adaptive pre-training (DAPT) process followed by fine-tuning.

5.3.1 Pretraining Process

We apply domain-specific pretraining to the BERT BASE language model. We follow the originally recommended masking procedure, which randomly selects 15% of the WordPiece tokens for masking. When a token is selected, it is replaced by the [MASK] token 80% of the time and by a random token 10% of the time, and it is left unchanged the remaining 10% of the time. We ran three pretraining experiments. All settings except the data portion are the same across the pretraining experiments. Table 7 shows the common configuration.

As the table shows, each time a sentence is processed during training, 15% of its tokens are masked, and training runs for 50 epochs.
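The 15% selection and the 80/10/10 replacement rule can be sketched in plain Python as follows. This is an illustration of the masking scheme, not the pretraining code of this study: the function name, the toy vocabulary, and the use of `None` for positions without a prediction target are assumptions. In HuggingFace Transformers, `DataCollatorForLanguageModeling` with `mlm_probability=0.15` applies this kind of scheme; whether it was used here is not stated.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens: Sequence[str], vocab: Sequence[str],
                mlm_probability: float = 0.15,
                rng: Optional[random.Random] = None
                ) -> Tuple[List[str], List[Optional[str]]]:
    """Return (model inputs, labels); labels are None where nothing is predicted."""
    rng = rng or random.Random()
    inputs: List[str] = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue                          # token not selected for prediction
        labels[i] = token                     # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN            # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


# Toy data; calling the function again gives a different mask for the same text.
toy_vocab = ["alpha", "beta", "gamma"]
toy_tokens = [f"tok{i % 50}" for i in range(20000)]
masked_inputs, masked_labels = mask_tokens(toy_tokens, toy_vocab,
                                           rng=random.Random(0))
selected = sum(label is not None for label in masked_labels)
print(selected / len(toy_tokens))
```

Because the mask is drawn at call time, running many epochs over the same sentences exposes the model to different masked positions each time.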


| Hyperparameters | DAPTs setting |
|---|---|
| Learning rate | 1e-5 |
| Num of epoch | 50 |
| MLM probability | 15% |
| Vocab size | 50000 |
| Hidden size | 768 |
| Number of hidden layers | 12 |
| Max position embeddings | 512 |
| Tokenizer minimum frequency | 3 |
| Train batch size | 32 |

Table 7: Further Pretraining configuration

The only difference between the further pretraining runs is the size of the data portion. We increase the share of the data used for pretraining from 10% to 50% and then to 100% of the whole data (Table 8). With the 50-epoch configuration fixed, the model is therefore trained on more data in each successive run. We expect that with more data the model learns better representations and updates its weights accordingly, which should lead to better results.

| Models | DAPT 1 | DAPT 2 | DAPT 3 |
|---|---|---|---|
| Data portion | 10% | 50% | 100% |

Table 8: Further Pretraining configuration: Data portion contributions

5.3.2 Fine-tuning Process

During the fine-tuning step, we train the model with the hyper-parameters listed in Table 9. We use early stopping, which we found useful for performance: training stops if the loss increases instead of decreasing. We use a patience of 3, as we found it to work well. Models are implemented with PyTorch [49] and HuggingFace Transformers.

Because the entire work relies on PyTorch, results may not be fully reproducible across library versions, individual commits, or different platforms. Even with identical seeds, results may differ between CPU and GPU executions. Several PyTorch operations use random numbers internally, so calling them repeatedly with the same input arguments may produce different results.

However, as long as torch.manual_seed() is set to a constant at the beginning of the application and all other sources of non-determinism are removed, the same sequence of random numbers is generated each time the program runs in the same environment. The same applies to random number generation in the NumPy library.

The architecture consists of a BERT model with a classifier on top. A linear classifier with dimensions (768, 3) takes the output of the last hidden layer of BERT and produces one of the output labels "O", "B-EVENT" and "I-EVENT". During tokenization, we keep only the first subword token of each input word as the token that represents the word. We use cross-entropy as the loss function and evaluate the model after each epoch.
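Keeping only the first subword of each word is typically realized by giving the remaining subwords a label that the loss ignores. The sketch below illustrates this under assumptions: the function name is hypothetical, the `word_ids` list is written by hand in the style of what HuggingFace fast tokenizers return from `word_ids()` (with `None` for special tokens), and the subword split of "accountability" is invented. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss
LABEL2ID = {"O": 0, "B-EVENT": 1, "I-EVENT": 2}


def align_labels(word_labels: List[str],
                 word_ids: List[Optional[int]]) -> List[int]:
    """Assign each word's label to its first subword and ignore everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)      # special token or later subword
        else:
            aligned.append(LABEL2ID[word_labels[word_id]])
        previous = word_id
    return aligned


# Words and labels from the example sentence in Table 4.
words = ["The", "accountability", "in", "the", "process"]
labels = ["O", "B-EVENT", "O", "O", "B-EVENT"]
# Hypothetical subword layout:
# [CLS] The account ##ability in the process [SEP]
word_ids = [None, 0, 1, 1, 2, 3, 4, None]
print(align_labels(labels, word_ids))
```

With this alignment the classifier is trained and evaluated on exactly one position per word, so the number of scored tokens matches the word-level annotation.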


| hyper-parameters | value |
|---|---|
| num of epochs | 20 |
| batch size | 32 |
| optimizer | AdamW |
| learningRate | 1e-5 |
| epsilon | 1e-8 |
| seed vals | [0,42,80] |
| early stopping patience | 3 |
| model | BERT Base Cased |

Table 9: Fine-tuning hyper parameters

5.4 Results

In this part, we demonstrate the result of the experiments.

5.4.1 Seed effect

The seed value initializes the random number generator, which is used to shuffle the data during training and to randomly initialize the model's parameters. Setting a seed value makes the model's results reproducible, because the same seed always yields the same random numbers. This is especially helpful for designing experiments that can be repeated.

In some circumstances, a different seed value can also lead to better generalization performance. Table 10 shows how the seed value influences the result of the ZSL model on the RED development set.

| Seed value | f1 score on RED |
|---|---|
| 0 | 0.8933 |
| 42 | 0.8929 |
| 80 | 0.8942 |

Table 10: Effect of seed value on the performance

We therefore report the average over the three seed values (0, 42, 80) for all experiments.
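The averaging over seeds can be illustrated with the values from Table 10. The helper name is hypothetical, and reporting the standard deviation alongside the mean is an addition made here for illustration rather than part of the reported results.

```python
from statistics import mean, stdev
from typing import Dict, Tuple


def summarize_seeds(scores_by_seed: Dict[int, float]) -> Tuple[float, float]:
    """Return the mean and sample standard deviation of per-seed scores."""
    values = list(scores_by_seed.values())
    return mean(values), stdev(values)


# F1 scores of the ZSL model on the RED development set (Table 10).
f1_by_seed = {0: 0.8933, 42: 0.8929, 80: 0.8942}
average_f1, spread = summarize_seeds(f1_by_seed)
print(round(average_f1, 4))
print(spread)
```

The small spread across seeds indicates that, for this model and dataset, the choice of seed has only a minor effect on the F1-score.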

5.4.2 Event detection results

This part presents the main results of the project: the performance of the models under each setup. Because the data is imbalanced, we calculate Precision, Recall, and F1 score on event tokens only. We also provide the accuracy of the model on all tokens. Table 11 shows the performance on the RED development and test sets. Note that the test set contains only two classes ("O" and "B-EVENT") and no "I-EVENT" examples. Therefore, we cannot conclude how the model would perform on I-EVENT test examples.
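The following sketch shows one way to compute such scores at the token level. It assumes a micro-average over the two event labels, where a token counts as a true positive only if the predicted label matches the gold event label exactly; the exact averaging used in the experiments is not restated here, so this is an assumption, and the function name and example sequences are hypothetical.

```python
EVENT_LABELS = {"B-EVENT", "I-EVENT"}


def event_scores(gold, pred):
    """Precision, recall and F1 on event tokens, plus accuracy on all tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == p and g in EVENT_LABELS)
    fp = sum(1 for g, p in pairs if p in EVENT_LABELS and g != p)
    fn = sum(1 for g, p in pairs if g in EVENT_LABELS and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Made-up example: one spurious event and one missed event.
gold = ["O", "B-EVENT", "O", "B-EVENT", "I-EVENT", "O"]
pred = ["O", "B-EVENT", "B-EVENT", "O", "I-EVENT", "O"]
print(event_scores(gold, pred))
```

Because the "O" class dominates, accuracy on all tokens stays high even when event tokens are missed, which is why the event-only scores are the more informative ones.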

| Model | Dev precision | Dev recall | Dev F1 score | Test precision | Test recall | Test F1 score |
|---|---|---|---|---|---|---|
| CRF (Baseline) | 0.804 | 0.728 | 0.764 | 0.816 | 0.728 | 0.769 |
| ZSL | 0.888 | 0.899 | 0.893 | 0.881 | 0.897 | 0.889 |

Table 11: Experiment results on the RED corpus (development and test sets). ZSL: a standard BERT model fine-tuned on the RED corpus. We use it later to test the movie subtitles in a zero-shot setup.

In addition, to see how the models perform on the movie subtitles test set, we evaluate each pretrained and fine-tuned model on the annotated data; the results are shown in Table 12. Although these results are reported separately from the RED results, the trained and fine-tuned models are the same, and the whole evaluation was carried out for the RED corpus and the movie subtitles at the same time, with the same model and seed value.

| Model | Precision | Recall | F1 score |
|---|---|---|---|
| CRF | 0.756 | 0.732 | 0.744 |
| ZSL | 0.757 | 0.800 | 0.778 |
| DAPT1 + FT | 0.752 | 0.745 | 0.748 |
| DAPT2 + FT | 0.746 | 0.824 | 0.783 |
| DAPT3 + FT | 0.752 | 0.820 | 0.785 |

Table 12: Experiment results on the movie subtitle test set. DAPT + FT: the domain-adaptive pretrained models, fine-tuned on the RED data.

5.4.3 Jensen Shannon Divergence

The Jensen-Shannon distance between the RED corpus and the movie subtitles annotated corpus is 0.504.

In addition, a per-token divergence score is computed across both corpora with JSD, and the general tokens and event tokens with the highest scores are shown in Figures 5 and 6, respectively. Higher-ranked tokens are more distinguishing. The side of the chart indicates the corpus in which a token is more prevalent: the left side corresponds to the RED corpus and the right side to the movie subtitles.
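The corpus comparison can be sketched as follows. The code computes the per-token contributions to the Jensen-Shannon divergence between two unigram distributions and takes the square root of their sum as the distance. The logarithm base (2 here), the absence of smoothing and the toy corpora are assumptions made for illustration; they are not necessarily the settings behind the reported value of 0.504, and the function names are hypothetical.

```python
import math
from collections import Counter


def distribution(tokens, vocab):
    """Relative frequency of every vocabulary item in a non-empty token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: counts[w] / total for w in vocab}


def js_contributions(tokens_a, tokens_b):
    """Per-token contribution to the Jensen-Shannon divergence (base 2)."""
    vocab = set(tokens_a) | set(tokens_b)
    p = distribution(tokens_a, vocab)
    q = distribution(tokens_b, vocab)
    contributions = {}
    for w in vocab:
        m = 0.5 * (p[w] + q[w])  # mixture distribution
        term = 0.0
        if p[w] > 0:
            term += 0.5 * p[w] * math.log2(p[w] / m)
        if q[w] > 0:
            term += 0.5 * q[w] * math.log2(q[w] / m)
        contributions[w] = term
    return contributions


def js_distance(tokens_a, tokens_b):
    """Jensen-Shannon distance: square root of the divergence."""
    divergence = sum(js_contributions(tokens_a, tokens_b).values())
    return math.sqrt(max(divergence, 0.0))


# Toy corpora, invented for illustration only.
news = "the minister said the talks ended".split()
subtitles = "i said i love you".split()

ranked = sorted(js_contributions(news, subtitles).items(),
                key=lambda item: item[1], reverse=True)
print(round(js_distance(news, subtitles), 3))
print(ranked[:3])  # the most distinguishing tokens
```

With base 2, the distance is 0 for identical distributions and 1 for corpora that share no tokens, so a value near 0.5 indicates a substantial but not total difference in vocabulary use.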


Figure 5: Results of the corpus comparison between the RED corpus and the movie subtitles. The 50 most divergent tokens are shown. Orange (right side): movie subtitles; purple (left side): RED corpus.

Figure 6: Results of the corpus comparison between the RED corpus and the movie subtitles. The 50 most divergent events are shown. Orange (right side): movie subtitles; purple (left side): RED corpus.

