
MSc ARTIFICIAL INTELLIGENCE MASTER THESIS

Emotion-Target Extraction for Criminal & Fraud Investigation

by

MANASA JANARDAN BHAT

12184306
August 17, 2020
48 EC, Nov 2019 - June 2020
Supervisors: Julien ROSSI, Johannes C. SCHOLTES
Assessor: Dr. M. DE RIJKE

Contents

1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Aim and Research Questions
2 Background
2.1 Human Emotions
2.2 Transfer Learning using BERT
2.2.1 Vanilla Transformers
2.2.2 BERT
2.3 Multi-task Learning
3 Related Work
4 Dataset Creation
4.1 Methodology
4.2 Annotation Aggregation
5 Methodology
5.1 Input Data
5.2 Input Encoder
5.3 Sequence-Tagging Models
5.4 Sequence-Tagging Model Variants
5.5 Multi-Task Learning ST Model
6 Empirical Setup
6.1 Datasets
6.2 Training
6.3 Evaluation
7 Results and Analysis
7.1 Sequence-Tagging Experiments
7.2 Multi-task Learning Experiments
7.3 Qualitative Analysis
8 Conclusion
Bibliography
A Appendix
A.1 Baseline Models
A.2 Unified + CNN Experiments


Abstract

Human emotion is one of the most influential factors to consider in Criminal and Fraud Investigation. Criminal motivations, suspicions and decision making can be studied by examining the underlying emotions of the people involved. Consequently, emotion analysis is of particular importance to this domain. Though face-to-face communication allows easier inspection of emotions, analysing emotions in other sources of evidence is non-trivial. Among indirect sources of evidence, emotions are most likely to be observed in textual communication documents like emails, messages and social media interactions. The number of such documents investigators analyse is enormous, making a manual discovery process inefficient and ineffective; automatic emotion analysis becomes essential. Currently, such automation is applied only in domains like customer service and mental health. In this research, we apply emotion analysis to communication documents to advance the Criminal and Fraud Investigation process.

The advancements in Natural Language Processing have led to the development of efficient ways to represent textual data, and of sophisticated deep-learning models that can capture complex semantic relationships intrinsic to human language. We therefore apply BERT, a state-of-the-art deep-learning language representation model, to extract emotions and their targets from communication documents.

There are no existing datasets labeled with both emotions and targets to study this task. Therefore we first create a new dataset based on a Twitter multi-label emotion dataset. We then formulate a sequence-tagging solution by treating the sub-tasks of emotion extraction and target extraction as one unified task. We use a spatial-aware design in the tagging module to perform target extraction at the phrase level. To enforce the correct mapping of emotions and targets, we additionally perform document-level emotion classification over the input sequences. We also conduct an extensive qualitative analysis to understand where automation lacks language understanding.


Acknowledgement

Firstly, I would like to thank my supervisors Julien Rossi and Johannes Scholtes for their guidance and support throughout my thesis study. I am grateful for the constant discussions, feedback and suggestions of Jeroen Smeets and Zoe Gerolemou which helped me notice different perspectives. I am thankful for the resources provided by ZyLAB without which this thesis study wouldn’t have been possible.

Secondly, I would like to thank my friends Ipek and Youri for helping me keep my spirits up with countless conversations, relatable discussions and support. I am also thankful to my friends Santhosh and Samarth for their valuable inputs.

Thirdly, I would like to thank Kishan Parshotam for the endless discussions, for enduring my emotional outbursts, for helping me stay fit and healthy, and for the never-ending support and guidance throughout this research work.

Finally, I would like to thank my parents, brother and sister-in-law for the encouragement and invaluable support during the entire time of my Master studies. I wholeheartedly dedicate this thesis to them.


1 Introduction

1.1 Motivation

Criminal and Fraud Investigation is a field that requires careful inspection of human behaviour and interpersonal relationships. Collecting factual material related to this is vital to any investigation process. However, objective information is often intertwined with a subjective, or emotional, component. Emotion is central to all human actions and relations. Consequently, investigating underlying emotions can help unwrap motivations for crime, direct suspicions, and sometimes even affect criminal decision making [51]. Uncovering human emotions is therefore crucial in any investigation process [11].

The emotional state of people under direct interrogation can be observed through common revelations like facial expressions, speech and voice tones. However, a major part of legal proceedings involves examining indirect sources of information. These may include statements, reports, emails and other textual data. The prominent way of delivering emotions in text is through language. The lack of physical exhibits makes extraction of emotional evidence from textual data extremely difficult. It is therefore valuable to develop effective and efficient techniques for extracting emotions from language.

Usage of language varies across different types of documents. Emotional patterns are more likely to be observed in the language used in non-factual communication documents, in contrast to factual reports [19, 37, 14, 21]. Besides allowing freedom of expression, communication documents are also a window into human relationships. Hence analysing emotions primarily in such documents is of particular importance to the domain of Criminal and Fraud Investigation.

Currently, human emotion has mainly been studied in the contexts of product perception evaluation [8], political policy opinion studies [52], and mental health diagnosis [20]. There has been limited research on emotion analysis focusing on human communication, let alone in the domain of Criminal and Fraud Investigation. Therefore this research work focuses on performing emotion analysis on communication documents, with applications in Criminal and Fraud Investigation.

1.2 Problem Statement

Criminal and Fraud Investigators deal with a large number of communication documents in the form of Electronically Stored Information (ESI). ESI related to communication includes, but is not limited to, emails, messages, databases, reports, social media and websites. The process of discovering relevant details from this vast collection of information is called Electronic Discovery (eDiscovery). When communication is directed on a large scale, many-to-many, eDiscovery of emotions becomes highly complex, making automatic recognition of emotions [6] essential. Our study explores solutions to automate emotion analysis in the eDiscovery process.

Automatic emotion recognition has been studied using two key methods of quantifying emotional content: sentiment analysis and emotion analysis. Sentiment analysis comprises detecting the polarity (positive, negative, neutral) of emotions. Emotion analysis, on the other hand, involves detecting more fine-grained, concrete emotions: anger, sadness, joy and so on. However, detecting only emotion/sentiment is inadequate if it doesn't indicate the recipient or target entity. Hence, recently, attention has shifted to target-based analysis, which involves extracting the targets of opinions along with the sentiment/emotion.

Mark: As I am sure you agree, I thought that CA Attorney General Lockyer's comment last week in the WSJ was one of the most outrageous I have ever seen from a major public official.

target                                   sentiment   emotion
CA Attorney General Lockyer's comment    negative    surprise, anger
you                                      neutral     anticipation

Tab. 1.1.: Email sample from the ENRON dataset illustrating the limitations of extracting only sentiment, compared to emotions and corresponding targets

Due to the volume of electronic data produced and stored, emotion analysis in communication documents cannot be limited to sentiment analysis; polarity cannot provide sufficient insights. Table 1.1 provides an example from the ENRON email dataset [29] illustrating the limitation of sentiment compared to target and emotion entities. Identifying concrete fine-grained emotions and extracting the targets of emotions can expedite a legal investigation. To assist such explorations, we define the task of Emotion-Target Extraction: detecting emotions, and the corresponding targets, in textual communication data.

Due to the sensitive personal data and the confidentiality involved in the domain of Criminal and Fraud Investigation, dataset availability is limited or even non-existent. This gives rise to the problem of finding relevant data to use, with the following additional challenges:

1. Emotions range over a large spectrum, often overlap, and are sensitive to human perception. As a consequence, it is very difficult to collect consistent and clean human labels for this task at a large scale.

2. In communication documents, the definition of a target is speculative, making the target annotation process open to ambiguity.

Therefore, along with providing a solution for emotion-target extraction in communication, this research work also contributes a new dataset to study the task. We also incorporate and analyse techniques to handle the lack of a large amount of task-specific data.

Contributions of this research are as follows:

1. Provides a new dataset for performing Emotion-Target Extraction on text.
2. Benchmarks the task of Emotion-Target Extraction in text.
3. Provides a deep qualitative analysis of the characteristics of the task where automation fails to perform well.

1.3 Aim and Research Questions

Aim:

The aim of this thesis is to perform emotion-target extraction on textual communication data in order to detect emotions and the corresponding targets.

Research Questions

1: How can we solve the task of Emotion-Target Extraction as a Sequence-tagging problem?

Emotions in a document can be extracted globally at the sequence level. However, target extraction is at the word/phrase or token level. Sequence-tagging (or sequence-labeling) has been successfully employed to extract opinion target expressions in multiple studies [5, 65]. Different configurations of the sequence-tagging approach have also been applied in the target-based sentiment analysis task to retrieve both sentiment and targets [39, 67, 34]. Following such works, we try to formulate a solution for Emotion-Target Extraction by abstracting the task as a sequence-tagging problem.

1.1: Can a unified sequence-tagging model outperform a joint-learning model?

In our task, the sub-tasks of emotion extraction and target extraction are highly correlated. Past studies have shown that an integrated solution is more effective than other configurations when sub-tasks are strongly coupled [72, 17, 41, 45]. Therefore a joint-modeling approach is appropriate for the current task. However, in our task an input token retrieved as a target should also be tagged with the corresponding emotions. In such a case, we believe a joint-modeling approach can suffer from inconsistency in learning the tags for each sub-task. We first verify this assumption in our study. To overcome it, we hypothesize that combining the sub-tasks into one unified sequence-tagging task should outperform joint modeling of the two sub-tasks.

1.2: Can a spatial-aware module at the tagging layer improve emotion and target retrieval?

A target in an input sequence is usually a text phrase comprising multiple words/tokens. Hence, the tagging model should be capable of retrieving all the sub-tokens of a target phrase. To achieve this, a desirable property of the tagging model is awareness of the immediate neighbours in the sequence. This way, a token can be evaluated in a phrase-level context to improve retrieval. Therefore we propose using a spatial-aware design in the tagging block.

1.3: Can learning emotions globally across a sequence improve emotion extraction at the token level?

Information necessary for extracting emotions and targets should be captured wholly in the token representation for sequence-tagging. This token-level representation may become a bottleneck to learning context information. Lack of global context can cause emotions retrieved at the token level to be inconsistent with the emotions across the complete input sequence. To overcome this, we propose performing document-level emotion classification in addition to sequence-tagging.

2: Can a multi-task learning approach to Emotion-Target extraction help overcome the lack of data?

Due to the lack of labeled data and the difficulty attached to collecting annotations, it is crucial to explore techniques that utilize existing datasets of closely related tasks. The sub-tasks of emotion extraction and target extraction are independently well-defined tasks with multiple publicly available datasets. It is only logical to try to utilize these datasets to improve the understanding of a deep-learning model. To this end, we propose to abstract these sub-tasks as auxiliary tasks in a multi-task learning setup to improve performance.

2 Background

This chapter discusses the concepts/models that are used in this thesis study.

2.1 Human Emotions

Emotions occur in every relationship in life. Theorist Paul Ekman believes the primary function of emotion is to quickly deal with interpersonal encounters, and that it is crucial to the development and regulation of interpersonal relationships [15]. Even in the light of philosophy, emotions have surfaced as a threat to reason [60]. Hence, human emotions have been deeply studied by a number of psychologists and philosophers.

Fig. 2.1.: Plutchik Wheel of emotions

Emotion is not just one affective state, but a family of related states. Hence there is no single definition or classification of emotions. Given the interdisciplinary nature, emotion categories are constantly updated by new studies. Consequently, the number of human emotions can vary approximately from 4 to 50. However, most studies further encapsulate similar emotions together and provide a list of basic emotions.

Paul Ekman defined six basic emotions (anger, disgust, fear, happiness, sadness, surprise) based on facial expressions, and later expanded the list with emotions not seen through facial appearance. Theorist William James proposed four basic emotions - fear, grief, love, and rage - based on bodily involvement. The most popular classification, however, is provided by Robert Plutchik. He defined emotions through a three-dimensional model based on the similarity of emotional words, as shown in Figure 2.1. The innermost circle defines the most basic emotions; the outermost shows complex emotions. The most widely used set is the 8 emotions defined in the middle circle - joy, anger, trust, sadness, anticipation, disgust, surprise, fear - which is a combination of the basic and complex emotions. We use these 8 emotions in our research. This categorization is also more appropriate for language applications, since its study is based on emotional words.

A topic of focus in research related to emotions is human interpretation. Researchers have been testing how confident we are in judging other people's emotions. Results have demonstrated that how we interpret emotions is influenced greatly by personal experiences [2]. Some research also argues that emotions like fear, anger and sadness are dependent on human perceivers for their existence [3]. A general conclusion from all these studies is that in reality there can be a certain discord between how emotions are expressed and how they are perceived. In language particularly, emotions are more difficult to interpret compared to other mediums. Incorrect usage of language can sometimes result in delivering unintended emotions. Similarly, personal biases can result in inaccurate interpretations by a recipient. Usage of sarcasm and metaphorical language are some ways of disguising intended emotions. All these intricate attributes of language make the understanding of emotions complex.

2.2 Transfer Learning using BERT

Transfer Learning [61] is a methodology used to transfer knowledge from a domain with a large amount of data to a target domain which lacks training data. The field of NLP has adopted this methodology primarily through large-scale language modelling, or pre-training, to capture the underlying semantics and dependencies in language, and use this knowledge to train or fine-tune on target task data. Knowledge transfer is achieved either by feature-representation transfer and/or parameter transfer. The former shares learned language features across tasks, whereas the latter shares some parameters and the prior distribution of hyperparameters. To perform such transfer learning, a number of language-modelling techniques and model architectures were released recently, proving their prominence in a variety of tasks. Among such models, Bidirectional Encoder Representations from Transformers (BERT) [13] is the state-of-the-art language representation model.

BERT is designed to pre-train deep bidirectional representations from unlabeled text. Each word token in BERT is jointly conditioned on both the left and right context of the input sequence, unlike in other, unidirectional models [50, 58]. This bidirectional representation proved its value by achieving state-of-the-art results on multiple tasks with only an additional task-specific output layer. The strength of BERT therefore stems from its pre-training approach, which is described in section 2.2.2.

The architecture of BERT is based on the transformer architecture described below.

2.2.1 Vanilla Transformers

Background

Sequence models like Recurrent Neural Networks [24, 10] follow an encoder-decoder approach. An encoder neural network reads the sequence data and compresses this information into a fixed-length vector. A decoder neural network then uses this vector for predictions (or text generation). The performance of such models decreases rapidly as the length of the input sentence increases [9]. In order to address this issue, particularly in Neural Machine Translation, a mechanism called Attention [62] was introduced, which is the principal idea behind transformers.

Attention encodes an input sequence into a sequence of multiple fixed-length vectors, in contrast to one fixed-length vector. During the decoding process, the decoder learns to choose a relevant subset of these vectors to calculate a context vector to make predictions. This eliminates squeezing the entire sequence into one vector, thereby retaining essential information required for output prediction. Attention can therefore be described as a mapping from a query vector and a key-value vector set to an output vector. Given a query, the output is calculated as the weighted sum of the values. The weight of each value is calculated by a compatibility function of the query with the corresponding key as

$$\mathrm{Attention}(Q, K, V) = F(Q, K) \cdot V$$

where Q is the query vector, K the keys, V the values and F is the compatibility function.

Architecture

Fig. 2.2.: Transformer Architecture

The transformer architecture consists of encoder and decoder modules, as shown in Figure 2.2. The encoder is made up of 6 stacked identical layers. Each of these layers consists of an attention layer and a feed-forward layer that operates on every position of the input sequence.

Similar to the encoder, the decoder has 6 identical layers. Unlike in the encoder, each layer has an additional attention layer over the outputs of the encoder layers.

Additionally, each sub-layer of the two modules incorporates residual connections followed by layer normalization.

The transformer architecture uses a Scaled Dot-Product Attention. Given a query Q and key K of dimension $d_k$, and values V of dimension $d_v$, the attention output is calculated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2.1}$$

Instead of performing a single attention function, the transformer uses multi-head attention, learning multiple linear projections of the queries, keys and values. These multiple attention outputs are concatenated and projected again to obtain the final values.
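For concreteness, the following is a minimal PyTorch sketch of the scaled dot-product attention of Eq. 2.1 (not from the thesis code; names and shapes are illustrative):

# Scaled dot-product attention, Eq. 2.1.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k: (batch, seq, d_k); v: (batch, seq, d_v)
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)   # one weight per query-key pair
    return torch.matmul(weights, v)           # weighted sum of the values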

Input

Transformer models, like language models in general, use learned word embeddings to convert text input into fixed-length vectors. Since the model is non-sequential, positional encodings are added to the input embeddings to maintain sequence information. Given position pos and embedding dimension index i, transformers use the following sine and cosine functions to encode position:

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right) \tag{2.2}$$

The functions are chosen such that the norm of the positional embeddings remains the same for all tokens and the cosine similarity between any two positions is translation-invariant.
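A short sketch of how these sinusoidal encodings (Eq. 2.2) can be computed in PyTorch (the function name and tensor shapes are illustrative assumptions):

# Sinusoidal positional encodings, Eq. 2.2.
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()            # even embedding indices
    angle = pos / (10000 ** (i / d_model))             # (max_len, d_model / 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions use cosine
    return pe                        # added to the input embeddings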

2.2.2 BERT

Architecture

BERT uses the Transformer encoder stack in its architecture. Based on the number of encoder layers, two model configurations are presented: 1) BERT Base, which has 12 encoder layers, and 2) BERT Large, which has 24 encoder layers. BERT takes a sequence of words as input and applies self-attention at each layer. The encoder layers are connected through a feed-forward network. The decoder of the vanilla Transformer architecture, however, is replaced by a single-layer neural network classifier in BERT.


Input Representation

BERT uses WordPiece embeddings [64] with a vocabulary of 30,000 tokens. Input sequences are prepended with a special '[CLS]' token, a placeholder into which the features of the complete sequence are aggregated. Input sequences of two sentences are separated with a '[SEP]' token. The final token representation is the sum of the positional embedding and the token's WordPiece embedding.
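As an illustration of this input format (not part of the thesis), the WordPiece scheme can be reproduced with the HuggingFace transformers tokenizer:

# Tokenizing a sentence into WordPiece sub-tokens with special tokens added.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer.tokenize("Emotion analysis is essential.")
# Rare words are split into sub-tokens prefixed with '##'
ids = tokenizer.encode("Emotion analysis is essential.")
# encode() prepends the [CLS] token id and appends the [SEP] token id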

Training

BERT is trained on two unsupervised tasks.

1. Task 1: Masked LM - A percentage of input tokens, chosen at random, are masked, and the task is to predict these tokens. This task is focused on learning bidirectional representations of the tokens.

2. Task 2: Next Sentence Prediction - For pairs of input sentences, the second sentence is randomly (p = 0.5) replaced with a random sentence from the corpus. The task is the binary classification of predicting whether the second sentence is the actual next sentence.

For the pre-training corpus, the BooksCorpus (800M words)[73] and English Wikipedia (2,500M words) are used.

2.3 Multi-task Learning

Fig. 2.3.: Single Task Learning

Fig. 2.4.: Multi-task Learning Model


Multi-task learning [7] is an approach to improve generalization from the domain-specific knowledge of related tasks. This is achieved by training these tasks in parallel using a shared representation. The main difference from transfer learning is that in MTL the tasks are trained jointly. In traditional MTL frameworks, the deeper layers of the network correspond to the shared representations, while the outermost layers serve as the task-specific layers. A general framework comparing a single-task learning (STL) model and an MTL model is shown in figures 2.3 and 2.4.

In NLP, multiple MTL architectures have been explored [70]: 1) the encoder is shared by all the tasks but the decoders are task-specific, used for machine translation and syntactic parsing - we follow a similar architecture in our MTL model; 2) each task has its own encoder but the decoder is shared by all the tasks, proposed for machine translation and image caption generation; 3) multiple encoders and decoders are shared among tasks, mainly used in machine translation. Though MTL approaches are known to perform well, in some cases STL may be superior to MTL, namely when an outlier task impairs the overall performance.

3 Related Work

This chapter discusses existing literature related to this thesis study. To our knowledge, there has been no research on the extraction of both fine-grained emotions and their targets as a single task. Therefore the literature review focuses on studying the sub-tasks of emotion analysis and target extraction as independent tasks.

Most of the research in the domain of emotion analysis in text has been on sentiment classification, or detecting polarity. However, to benefit a wide range of applications like emotional chat-bots [46], personalized recommendations [12], better customer service and policy studies, the focus has shifted to extracting more fine-grained emotions like anger, joy etc. Existing literature classifies such fine-grained emotion classification into three different types: 1) single-label classification - the input text is assigned to one emotion category like anger or joy [1]; 2) multi-label classification - the input text is assigned to multiple emotion categories like fear and surprise [4, 27]; 3) emotion intensity distribution prediction [71]. Emotions are both difficult to express and to understand in language, and often overlap, leading to multiple interpretations. Due to these characteristics, multi-label classification of emotions is more practical than single-label classification. Therefore this research focuses on multi-label emotion classification.

Traditional approaches to multi-label emotion classification include emotion lexicon based approaches [56], linear classifier based methods [47, 32], constraint-optimization techniques [63] and latent discriminative models [57]. The advancement of deep learning techniques in Natural Language Processing has led to utilizing deep neural networks for this task [68, 16]. The best performing model [4] in the multi-label emotion classification task of the SemEval 2018 Affect in Tweets challenge [44] was a Bi-LSTM with a deep attention mechanism. It used word2vec word embeddings trained on a large collection of 550 million Twitter messages, augmented by a set of word affective features. Another well-performing model [38] also included non-deep-learning features, along with deep-learning features, learned using support vector machine methods in its ensemble. A Binary Neural Network, an end-to-end multi-class classification model that transforms the problem into a binary classification task exploiting deep-learning systems, was proposed in [26]. NVIDIA released a state-of-the-art multi-label emotion classification model based on their research on practical text classification with limited data [55, 27]. They utilize the latest transformer architecture, pre-trained on a large text corpus and later fine-tuned on the emotion classification task. Our work on the sub-task of emotion extraction follows their work.

A relatively new topic of research in emotion analysis is target extraction, studied prominently as Aspect Based Sentiment Analysis or ABSA [54][53]. ABSA goes a step further than sentiment analysis by mapping detected sentiment polarities to a pre-defined set of aspect categories. The difficulty in ABSA is defining the list of aspect categories. Especially for a domain as general as communication, it is nearly impossible to form such a list. Hence Target Based Sentiment Analysis (TBSA), a sub-task of ABSA, is more relevant in most cases. Literature divides TBSA again into two types: 1) Target Sentiment Analysis - given a target entity and an opinion document, detect the right sentiment polarity [59]; 2) Target and Sentiment Extraction - given an input opinion document, extract both the targets and the corresponding polarities [66][34]. The latter is more relevant to our study.

The most prominent and effective approach in the literature for target extraction is sequence-tagging. Some works have demonstrated the usage of BIO tags (B - beginning of the target entity, I - inside a target entity, O - outside of the target) to extract targets, and sentiment tags (POS - positive, NEG - negative, NEU - neutral) to extract sentiment [40]. Different sub-task modelling techniques (like pipeline, joint and collapsed) were explored in [40] using a CRF-based architecture with a number of manually engineered features; the pipeline approach was reported to perform best. Other works have used a joint modelling technique to learn both opinion entities and expressions, to better capture the internal dependencies of the sub-tasks [66]. Recently, an end-to-end target-based sentiment analysis model with a unified tagging scheme was proposed in [34]. Their model consisted of two stacked RNNs and three custom-made components (Boundary Guidance, Sentiment Consistency and Opinion-Enhanced Target Word Detection) to ensure all target tokens are retrieved with a consistent sentiment tag. In TBSA, a target can be associated with only one sentiment, which makes it a multi-class tagging problem. In the case of emotions, however, a target can be attributed multiple emotions, making it a multi-label, multi-class tagging problem. As far as we know, such a complex system has not yet been analysed in the domain of emotion analysis. Our work provides such a multi-label, multi-class sequence-tagging solution to emotion-target extraction.

The domain of emotion analysis has a limited number of datasets available, owing to the difficulty of collecting annotations. Since deep learning techniques require huge amounts of data, various techniques for dealing with the lack of supervised data have lately been studied. Transfer Learning [61] from pre-trained language models has been known to show great success in NLP, either by sharing input representations or by sharing a feature space [35]. Pre-training on a large corpus of data through language modelling has been shown to achieve high performance on small sets of annotated data through fine-tuning [27]. State-of-the-art results on target extraction were reported by transfer learning through the BERT pre-trained model [59]. Multi-task Learning (MTL) [7] is another popular approach to handling the lack of data. A shared attention-based Bi-LSTM was simultaneously trained on a sentiment classification task to improve performance on the primary task of emotion classification in [68]. A similar MTL approach with a message-passing technique, performing ABSA with the auxiliary tasks of document-level sentiment classification and domain classification, was shown to perform well in [22]. Our study follows similar techniques to overcome the lack of task-specific data.

4 Dataset Creation

4.1 Methodology

Base Dataset

An appropriate dataset for emotion-target extraction from communication data requires 1) a collection of correspondence documents, 2) annotations for the emotions in them, and 3) annotations for the targets of the emotions. However, a small survey indicated emotions were extremely difficult to annotate without expert knowledge. We therefore decided to collect annotations only for targets, starting from an existing dataset annotated for emotions. One of the most popular datasets used in the literature for emotion classification is the SemEval 2018 Task 1 E-c Emotion Classification Twitter Dataset [42][43]. Though mostly written in an informal way, this dataset is one of the closest collections of data to reflect the way people communicate their emotions/opinions using language. Therefore, this dataset was chosen as the base for our task. However, we acknowledge this data does not fully reflect our application.

This dataset contains tweets annotated for 11 emotions (Plutchik's 8 emotions plus love, optimism and pessimism) via the crowdsourcing platform CrowdFlower¹. Each data sample was annotated by an average of 7 annotators. Primary emotions were chosen by majority vote. To incorporate subtle instances of other emotions, any label with more than 25% of the votes was also included. If no emotion had above 40% agreement, the tweet was labeled neutral. The dataset contains 6836 train samples, 885 validation samples and 3258 test samples.

We pre-process this collection of tweets and labels and use it for labelling targets in our process.

¹ https://www.welcome.ai/crowdflower

Data Pre-processing

The following pre-processing steps were taken before initiating the labelling process.

1. The train, validation and test sets were combined into one dataset.
2. Samples containing emotions other than Plutchik's 8 emotions were ignored.
3. Only samples with fewer than 3 emotions were considered.
4. For emotion classes with a large number of samples, only tweets with less than 25% misspelled words were selected.

The pre-processing steps resulted in 2500 data samples to be annotated for targets.

Annotation Process

The Label Studio framework² was adopted for annotation. The task was hosted on a server for easy access. The annotation process was carried out with 7 annotators (2-3 annotators for each sample). The annotators were given instructions on how to annotate, to ensure consistency in labelling. An overview of an annotation task is shown in figure 4.1.

Fig. 4.1.: Annotation Task Overview

² https://labelstud.io/


4.2 Annotation Aggregation

A target phrase was accepted primarily if a minimum of two annotators agreed on it. Since the targets are text phrases, there were cases when the annotations were not an exact match. In such cases, the longest text phrase, which included all other sub-string labels, was chosen as the final label. The types of conflicts and their resolution methods are as follows. The principal idea behind the conflict resolution in most cases was to allow for all types of human perception.

1. No label chosen for a specific emotion vs. a label annotation: a third annotator was used to resolve the final labels.

2. Only a subset of labels agreed upon: to incorporate all interpretations, the union of all labels was used as the final labels. However, the agreement score was calculated according to the original agreement.

3. Other disagreements: similar to the previous case, the union of all annotations was used as the final labels. The agreement score was set to 0.

Tweets that were not clear to the annotators and did not reflect the emotions present were discarded.

All the text data was processed to transform the Twitter data into a less domain-specific format. The Ekphrasis³ text processor was used to unpack hashtags, unpack contractions and correct spelling. User mention (@) and hashtag (#) characters were discarded. Contractions were expanded using the Python package contractions⁴ and emojis were expanded using the package emoji⁵.

The process resulted in a dataset of 2016 samples. The kappa score for the dataset is 0.708, which shows a moderate level of agreement. A sample data point is shown in Table 4.1, with each emotion mapped to the corresponding targets along with their start and end indices in the text.

The number of samples per emotion, and the number of samples each emotion co-occurs with in the multi-label setting, is shown in figure 4.2.

³ https://github.com/cbaziotis/ekphrasis
⁴ https://pypi.org/project/contractions/
⁵ https://pypi.org/project/emoji/


{
  "id": "2017-En-20419",
  "text": "hillaryclinton hypocritical considering the millions of dollars you and billclinton took from horrible people and spent on yourselves .",
  "emotions": ["disgust", "anger"],
  "disgust": [["hillaryclinton", 0, 14], ["billclinton", 72, 83]],
  "anger": [["hillaryclinton", 0, 14], ["billclinton", 72, 83]]
}

Tab. 4.1.: Sample data point from the emotion-target dataset

              anger  anticipation  disgust  fear  joy  sadness  surprise  trust
anger           669       4          305     20     6     42        3       0
anticipation      4     231            5     24    41     12        6       5
disgust         305       5          405     14     3     21        5       0
fear             20      24           14    315    12     71        5       2
joy               6      41            3     12   411     19       29      32
sadness          42      12           21     71    19    390        2       0
surprise          3       6            5      5    29      2       72       0
trust             0       5            0      2    32      0        0      73

Fig. 4.2.: Data samples per emotion in the multi-label setup. Each cell indicates the number of samples that have the two emotions corresponding to the row and column. The diagonal shows the number of samples per emotion.

The average number of words in a target phrase across all the samples is 2.8.

5 Methodology

This chapter describes the implementation methodology and the motivation for the architectural choices made.

We first describe what our input data looks like in section 5.1 and the encoder used to represent this data in section 5.2. Next, we describe the two main sequence-tagging approaches we implement as solutions to the emotion-target extraction task, i.e. the joint-learning and unified modeling techniques (section 5.3). We then describe the enhancements proposed to improve the unified model, which include incorporating a spatial-aware design and performing document-level emotion classification (section 5.4). Finally, we describe the multi-task learning setup proposed to overcome the lack of task-specific supervised data (section 5.5).

5.1 Input Data

For target extraction, we define a target tag set comprising two tags {T, NT}, where T represents a target token and NT a non-target token. Similarly, for emotion extraction, the emotion tag set is defined as {{E}, NEU}, where {E} can be any combination of the emotion classes and NEU represents neutral.

Suppose the input text document X is made up of N words $[w_1, w_2, \ldots, w_N] \in \mathbb{R}^N$ and the number of emotion classes is C. Each word token $w_i$ is associated with a target tag $t_i \in \mathbb{R}^1$. The mapping between values and target tags followed in training and prediction is:

$$\mathrm{map}(t_i) = \begin{cases} [0] \longleftrightarrow NT \\ [1] \longleftrightarrow T \end{cases} \quad \forall i \in N$$

The corresponding target tag sequence is therefore $[t_1, t_2, \ldots, t_N] \in \mathbb{R}^N$.

Similarly, each word token $w_i$ is also associated with an emotion tag $e_i \in \mathbb{R}^C$, a binary vector. The mapping between values and emotion tags followed in training and prediction is:

$$\mathrm{map}(e_i) = \begin{cases} \{C_j \mid e_{ij} = 1, \ \forall j \in C\} \longleftrightarrow \{E\} \\ e_{ij} = 0 \ \forall j \in C \longleftrightarrow NEU \end{cases} \quad \forall i \in N$$

The emotion tag sequence is therefore $[e_1, e_2, \ldots, e_N] \in \mathbb{R}^{N \times C}$.
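As an illustration, the two tag sequences can be built from a data point in the format of Tab. 4.1 as in the following sketch (the function and variable names are assumptions):

# Build per-token target tags (0 <-> NT, 1 <-> T) and binary emotion tag
# vectors (an all-zero row <-> NEU) from per-emotion target word indices.
EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]  # C = 8

def build_tags(words, targets_per_emotion):
    # targets_per_emotion: {emotion: iterable of word indices that are targets}
    n, c = len(words), len(EMOTIONS)
    target_tags = [0] * n
    emotion_tags = [[0] * c for _ in range(n)]
    for j, emotion in enumerate(EMOTIONS):
        for i in targets_per_emotion.get(emotion, ()):
            target_tags[i] = 1        # token i is a target
            emotion_tags[i][j] = 1    # ... carrying this emotion
    return target_tags, emotion_tags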

5.2 Input Encoder

The first step in providing a solution to the task at hand is choosing an appropriate encoder. The input sequence X is first encoded to extract features for each token in the word sequence. Each token is represented as a low-dimensional real-valued vector of dimension D. The output of the encoder block is therefore a sequence of token features $F = [f_1, f_2, \ldots, f_N] \in \mathbb{R}^{N \times D}$.

As the input encoder, we adopt a pre-trained BERT encoder, described in section 2.2, in our implementation. The choice of encoder model was based on reviewing existing model architectures that are known to perform well for language understanding and other closely related tasks. A major concern while choosing the right encoder for the current task was also the lack of data.

The superiority of Transformer architectures in dealing with limited data through large-scale language modelling and fine-tuning is evident. Fine-tuned on only thousands of data samples, transformers have shown results in emotion classification comparable to other prominent model architectures like Bi-LSTMs trained on millions of data samples [27]. Built on the transformer architecture, the BERT model was chosen over other popular pre-trained unidirectional models like ELMo [50] and GPT [58] for its powerful bidirectional representations. Each word token in a unidirectional model refers only to the previous word tokens of the sequence in the language modelling process. These restrictions are sub-optimal for sentence-level tasks and can even deteriorate the performance of token-level tasks during fine-tuning [13]. Bidirectional representations provide the tokens with context information from both directions of the input sequence. Such a representation is very valuable for a sequence-tagging problem, where the focus is primarily at the token level.

Other prominent models in NLP, like Recurrent Neural Networks, were set aside due to their drawbacks, such as the amount of data required for training, vanishing gradients [23] and difficulty in optimization [48].

The output of the encoder module is the last-layer hidden states of the BERT model. These hidden states correspond to the features of the individual tokens of the input sequence.

5.3 Sequence-Tagging Models

This section describes the joint-learning ST model and the unified ST model implemented in this research.

Fig. 5.1.: Joint-Learning ST Model

Fig. 5.2.: Unified ST Model

Joint-Learning Sequence-Tagging Model

The joint-learning model jointly learns the target extraction and emotion extraction sub-tasks [36]. It follows the human interpretation of first finding the targets in an input sequence and then associating an emotion with them. Each word token in the input sequence is tagged with the target tags first. The target tag information is then introduced along with the token features, in a pipeline fashion, to further predict the emotion tags. The pipeline approach helps maintain unity between the target prediction and the emotion prediction. However, the approach is not strong enough for the current task and we expect inconsistency in prediction between the two sub-tasks. The architecture is shown in figure 5.1. The implementation details for one token are described in the following; the same applies to the whole input sequence.

Target tagging is achieved by performing multi-class classification over the target tags. The token features $f_i$ from the encoder are projected to the target tag space through a linear layer as

$$y_T^i = f\left(f_i \cdot W_t^T + b_t\right)$$

where $W_t$ and $b_t$ are the layer weights and bias.

The output of the layer is then passed through a softmax function to obtain tag scores for the tags in the target tag set T as

$$p(y_t^i) = \frac{e^{y_t^i}}{\sum_{t \in T} e^{y_t^i}} \quad \forall t \in T.$$

The target extraction loss for the whole input sequence is defined as the word/token-level negative log-likelihood loss

$$L_T = -\frac{1}{N} \sum_{i=1}^{N} t_i \cdot \log\left(p(y_T^i)\right).$$

Emotion tagging, on the other hand, is achieved by performing multi-label multi-class token classification over the emotion classes C. The token features $f_i$ from the encoder are first concatenated with the target tag projection $y_T^i$. The combined features are then projected to the emotion class space through a linear layer as

$$s_i = f_i + y_T^i$$
$$y_C^i = f\left(s_i \cdot W_c^T + b_c\right)$$

where $W_c$ and $b_c$ are the layer weights and bias.

The output projection is then passed through a sigmoid function to obtain class scores as

$$p(y_c^i) = \frac{e^{y_c^i}}{1 + e^{y_c^i}} \quad \forall c \in C.$$

The emotion extraction loss for the input sequence is the token-level binary cross-entropy loss, defined as

$$L_C = -\frac{1}{N} \sum_{i=1}^{N} \left[ e_i \cdot \log\left(p(y_C^i)\right) + (1 - e_i) \cdot \log\left(1 - p(y_C^i)\right) \right].$$

The objective function the model is trained on minimizes both the target extraction loss and the emotion extraction loss:

$$L_{ST} = L_T + L_C.$$

Target tag prediction at each token is performed as

$$t_i = \operatorname*{argmax}_{t \in T} \left(p(y_t^i)\right).$$

Emotion tag prediction is defined as

$$e_c^i = \begin{cases} 1 & \text{if } p(y_c^i) \geq \text{threshold} \\ 0 & \text{if } p(y_c^i) < \text{threshold} \end{cases} \quad \forall c \in C.$$
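A minimal PyTorch sketch of the two heads of this joint-learning model (layer names are illustrative; following the description above, the target projection is concatenated to the token features before emotion tagging):

# Joint-learning tagging heads on top of BERT token features of size d_model.
import torch
import torch.nn as nn

class JointTagger(nn.Module):
    def __init__(self, d_model: int, n_emotions: int):
        super().__init__()
        self.target_head = nn.Linear(d_model, 2)                # tags {NT, T}
        self.emotion_head = nn.Linear(d_model + 2, n_emotions)  # pipeline input

    def forward(self, token_feats):                 # (batch, seq, d_model)
        y_t = self.target_head(token_feats)         # target tag logits
        s = torch.cat([token_feats, y_t], dim=-1)   # feed target info onward
        y_c = self.emotion_head(s)                  # per-token emotion logits
        return y_t, y_c

# Training combines a token-level cross-entropy on y_t (L_T) with a token-level
# binary cross-entropy on y_c (L_C), giving L_ST = L_T + L_C.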

Unified Sequence-Tagging Model

This model architecture corresponds to the proposed unified model to overcome the inconsistency expected in the joint-learning model. The principal idea behind this approach is that when a token is tagged with an {E} tag, this also implies the token is a target token T. Emotion extraction should thereby implicitly perform target extraction too. We expect this integration to overcome the inconsistency. A unified tagging scheme [33] is employed to combine the two tagging sub-tasks into one. Under this scheme, the target tag set and the emotion tag set are combined into one unified tag set as follows:

1. 'NT - NEU' : non-target token, neutral
2. 'T - {E}' : target token, {set of emotions}

With this approach, targets and emotions are extracted together based on the 'T - {E}' tag. The model architecture is shown in figure 5.2.

Suppose U is the set of unified tags. Given the target tag sequence P and the emotion tag sequence Q, we define a unified tag sequence $R = P \cdot Q \in \mathbb{R}^{N \times C}$. The mapping function for the tagging process is transformed as follows:

$$\mathrm{map}(r_i) = \begin{cases} \{C_j \mid r_{ij} = 1, \ \forall j \in C\} \longleftrightarrow T\text{-}\{E\} \\ r_{ij} = 0 \ \forall j \in C \longleftrightarrow NT\text{-}NEU \end{cases} \quad \forall r_i \in R$$

The unified tagging task is achieved through multi-label multi-class token classification over the emotion classes C, the same as described in section 5.3.

The objective function we minimize is therefore only the emotion extraction loss:

$$L_{ST} = L_C$$
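A one-function sketch of this unified tag construction, assuming the product R = P · Q is taken per token:

# Combine target tags P and emotion tags Q into the unified tag matrix R.
import numpy as np

def unify_tags(target_tags, emotion_tags):
    # target_tags: (N,) array of 0/1; emotion_tags: (N, C) binary array
    p = np.asarray(target_tags)[:, None]   # (N, 1)
    q = np.asarray(emotion_tags)           # (N, C)
    return p * q   # an all-zero row corresponds to the 'NT - NEU' tag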

5.4 Sequence-Tagging Model Variants

Fig. 5.3.: CNN ST Model

Fig. 5.4.: ST Model with Global Emotion Classification

CNN Sequence-Tagging Model

This model corresponds to the spatial-aware design proposed to perform phrase-level target extraction. To provide a token in the input sequence with context about its immediate neighbours, we use a convolutional neural network (CNN) [30] layer that gathers token-specific neighbour information before the encoder features are fed to the tagging layer. The model architecture is shown in figure 5.3.

The token features F from the encoder block are passed through a CNN layer to obtain sequence representations $H = [h_1, h_2, \ldots, h_N]$, which contain the context of immediate neighbours learned through convolution operations. The features H are then passed to the tagging layer instead of the raw encoder features. The objective function remains unchanged:

$$L_{CNN} = L_{ST} \tag{5.1}$$
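A minimal sketch of such a spatial-aware CNN layer (the filter count and window size follow the grid search reported in chapter 7; the remaining details are assumptions):

# Convolution over the token dimension to mix in neighbouring-token context.
import torch.nn as nn

class SpatialContext(nn.Module):
    def __init__(self, d_model: int, n_filters: int = 256, window: int = 7):
        super().__init__()
        # 'same'-style padding keeps one output feature vector per token
        self.conv = nn.Conv1d(d_model, n_filters,
                              kernel_size=window, padding=window // 2)

    def forward(self, token_feats):          # (batch, seq, d_model)
        x = token_feats.transpose(1, 2)      # Conv1d expects (batch, channels, seq)
        h = self.conv(x)                     # neighbour-aware features H
        return h.transpose(1, 2)             # back to (batch, seq, n_filters)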

ST Model with Global Emotion Classification

This model corresponds to the approach proposed for maintaining consistency between the emotions present over the whole sequence and the emotions tagged at the token level. We perform document-level emotion classification in addition to sequence-tagging by adding a sequence-wise loss over the emotions. To achieve this, the token features F from the encoder block are aggregated into a sequence feature representation Z through an aggregate function G, and we perform multi-label multi-class sequence classification on Z. The model architecture is shown in figure 5.4. A linear layer projects the sequence features to the emotion class space C, followed by a sigmoid function:

$$Z = G(F)$$
$$y_C = f\left(Z \cdot W_z^T + b_z\right)$$
$$p(y_c) = \frac{e^{y_c}}{1 + e^{y_c}} \quad \forall c \in C$$

We employ the binary cross-entropy loss function

$$L_Z = -\left[c \cdot \log\left(p(y_C)\right) + (1 - c) \cdot \log\left(1 - p(y_C)\right)\right]$$

where $c \in \mathbb{R}^C$ with $c_i = 1$ if $e_{ji} = 1$ for any token $j \in N$. The model is therefore trained on the sequence-tagging loss $L_{ST}$ and the sequence-classification loss $L_Z$:

$$L_G = L_{ST} + L_Z$$
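A sketch of this global classification head, assuming max-pooling as the aggregate function G (the option found best by the grid search in chapter 7):

# Document-level emotion classification over an aggregated sequence feature.
import torch.nn as nn

class GlobalEmotionHead(nn.Module):
    def __init__(self, d_model: int, n_emotions: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_emotions)

    def forward(self, token_feats):        # (batch, seq, d_model)
        z, _ = token_feats.max(dim=1)      # Z = G(F), max-pool over the sequence
        return self.proj(z)                # document-level emotion logits

# L_Z is a binary cross-entropy over these logits, e.g. via
# nn.BCEWithLogitsLoss(), and the model is trained on L_G = L_ST + L_Z.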

CNN + Global Emotion Loss ST Model

This model variant combines the features of the CNN ST model and the ST model with global emotion classification (both in section 5.4) to utilize the benefits of spatial-awareness and emotion consistency respectively. The architecture is shown in figure 5.5.

Fig. 5.5.: CNN + Global Emotion Loss ST Model

Fig. 5.6.: MTL Model

5.5 Multi-Task Learning ST Model

This model configuration is based on the multi-task learning framework proposed to overcome the lack of task-specific data. The sub-tasks of emotion extraction and target extraction, in their independent contexts, are abstracted as auxiliary tasks to assist the primary task of emotion-target extraction. The model architecture is shown in figure 5.6.


The encoder block is shared between the auxiliary tasks and the primary task.

For the auxiliary task of target extraction, we follow the target tagging described in section 5.3, i.e. the model is trained on the objective function

$$L = L_T \tag{5.2}$$

Similarly, the emotion classification auxiliary task follows the sequence classification described in section 5.4. The objective function to minimize is

$$L = L_Z \tag{5.3}$$

The primary task of emotion-target extraction is trained with the sequence-tagging loss, or the loss from the other model variants, i.e.

$$L = L_{ST} \tag{5.4}$$

The primary task and the two auxiliary tasks are jointly trained. The total loss is calculated as the sum of all three losses. The loss of each task can be further weighted to give appropriate importance to each task.
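Conceptually, one MTL training step computes each task's loss on its own batch (through the shared encoder and the task-specific layers) and back-propagates their weighted sum; a minimal sketch, with the weighting scheme as an assumption:

# One joint MTL update. Each entry of `losses` is assumed to be a scalar loss
# tensor computed by forwarding a task batch through the shared encoder and
# that task's head, e.g. {"primary": L_ST, "target": L_T, "emotion": L_Z}.
def mtl_step(losses, weights, optimizer):
    optimizer.zero_grad()
    # Weighted sum of the primary and auxiliary losses; the weights
    # (e.g. {"primary": 1.0, "target": 0.5, "emotion": 0.5}) are illustrative.
    total = sum(weights[name] * loss for name, loss in losses.items())
    total.backward()   # gradients flow into shared and task-specific parameters
    optimizer.step()
    return total.item()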

6 Empirical Setup

6.1 Datasets

Primary Task

Fig. 6.1.: Data sample distribution per emotion in the train and test splits.

Emotion-Target Dataset

This dataset refers to the new dataset created as part of this research. The dataset contains 2016 data samples, out of which 916 samples were set aside as the test set. The remaining 1100 samples were used for training. The distribution of the data samples for each emotion over the train/test splits is shown in figure 6.1.


Auxiliary Task

Twitter - Affect in Tweets

This dataset is the SemEval 2018 Task 1 E-c Emotion Classification Twitter Dataset described in section 4.1. Out of the 11k data samples available, only samples containing Plutchik's emotions were considered. Samples with more than 3 emotions, and those used as part of our emotion-target dataset, were discarded, resulting in 5668 samples for training.

ABSA - Restaurant and Laptop Reviews

These datasets are part of the SemEval 2014 Aspect Based Sentiment Analysis [54] challenge. The datasets were annotated with aspect terms (or targets) and their polarity for laptop and restaurant reviews. We only utilize the target labels in our task. Each sentence of the two datasets was annotated by two annotators, a graduate student (annotator A) and an expert linguist (annotator B), using the annotation tool BRAT¹. This resulted in 3841 and 3845 samples in the restaurant and laptop datasets respectively.

6.2 Training

For the deep-learning implementation, the PyTorch framework [49] was used. For the input encoder, the bert-base-cased configuration was chosen from the HuggingFace transformers package². This particular model was chosen over other models for practicality in terms of model size. All the layers of the pre-trained encoder were updated during training. The input sequence was tokenized based on spaces, prepended with the '[CLS]' token, appended with the '[SEP]' token and padded with '0's up to a length of 64 (based on the maximum sequence length) for batch processing. The features of the '[CLS]' token were not used in the sequence-tagging layers, since they reflect features of the whole sequence. A dropout of 0.3 (value chosen through grid search) was applied to the output of the encoder.

All the task-specific linear layers were initialized using Xavier initialization [18]. The Adam optimizer [28] was used for all experiments. All models were trained (fine-tuned) for 20 epochs with a batch size of 8. For inference, the model performing best on a validation set was used. The threshold for converting the sigmoid probability values into predictions was set to 0.5. All random choices are sampled from a uniform distribution. Additional hyper-parameters like the learning rate and the number of CNN filters are reported with the corresponding experiments.

¹ https://brat.nlplab.org/
² https://github.com/huggingface/transformers
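The setup described above can be summarized in the following sketch (illustrative variable names; the learning rate is experiment-specific and reported in chapter 7):

# Training configuration using PyTorch and the HuggingFace transformers package.
import torch
from transformers import BertModel

encoder = BertModel.from_pretrained("bert-base-cased")       # all layers fine-tuned
dropout = torch.nn.Dropout(p=0.3)                            # applied to encoder outputs
optimizer = torch.optim.Adam(encoder.parameters(), lr=3e-5)  # lr varies per experiment

MAX_LEN = 64        # padded sequence length
BATCH_SIZE = 8
EPOCHS = 20
THRESHOLD = 0.5     # sigmoid probability cut-off for predictions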

6.3 Evaluation

F1 score

For the task of emotion-target extraction, the metrics used are the standard precision, recall and F1 scores. These are based on exact match with the ground truth, which is correctly retrieving the target tokens and mapping them to the right emotions. For evaluating the individual sub-tasks, we calculate the metrics on the predictions of the corresponding sub-task.

Both macro-averaged and micro-averaged F1 scores are calculated during training. However, only micro-averaged F1 scores are reported, since these scores take class imbalance into account. Given true positives TP, false positives FP, false negatives FN and C different classes, the metrics are calculated as follows.

$$\mathrm{Pr}_{micro}(D) = \frac{\sum_{c_i \in C} TPs(c_i)}{\sum_{c_i \in C} TPs(c_i) + FPs(c_i)}$$

$$\mathrm{Rcl}_{micro}(D) = \frac{\sum_{c_i \in C} TPs(c_i)}{\sum_{c_i \in C} TPs(c_i) + FNs(c_i)}$$

$$\mathrm{Pr}_{macro}(D) = \frac{\sum_{c_i \in C} \mathrm{Pr}(D, c_i)}{|C|}$$

$$\mathrm{Rcl}_{macro}(D) = \frac{\sum_{c_i \in C} \mathrm{Rcl}(D, c_i)}{|C|}$$

We use the metrics package from sklearn_crfsuite³ for calculating and reporting the metrics.

³ https://pypi.org/project/sklearn-crfsuite/
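For illustration, equivalent micro- and macro-averaged scores can be computed with scikit-learn on flattened per-token multi-label indicator arrays (a sketch, not the thesis code):

# y_true, y_pred: binary arrays of shape (n_tokens, n_classes)
from sklearn.metrics import f1_score, precision_score, recall_score

def tagging_scores(y_true, y_pred):
    return {
        "precision_micro": precision_score(y_true, y_pred, average="micro"),
        "recall_micro": recall_score(y_true, y_pred, average="micro"),
        "f1_micro": f1_score(y_true, y_pred, average="micro"),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }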

7 Results and Analysis

In this chapter, we present the experiments conducted and discuss the results.

7.1 Sequence-Tagging Experiments

Joint-learning Model vs Unified Model

The aim of this experiment is to evaluate whether unifying the sub-tasks of target extraction and emotion extraction outperforms joint modelling of these tasks. To evaluate this, we compare the joint-learning ST model and the unified ST model, corresponding to the two approaches described in section 5.3. The learning rate was set to 1e-4 for the joint-modeling experiments and to 3e-5 for the unified modeling, based on grid search.

Emotion Extraction   Target Extraction   Emotion-Target Extraction
0.329                0.548               0.327

Tab. 7.1.: F1 scores for the sub-tasks and the primary task of the joint-learning model, showing the inconsistency in tagging.

Firstly, we demonstrate the existence of inconsistency in the joint-modelling technique. To confirm it, we perform emotion-target extraction with the joint-learning ST model. The results of the complete task are then compared with the results of the individual sub-tasks of emotion extraction and target extraction. The F1 metrics for the two sub-tasks and the combined task are shown in table 7.1. The results show that, independently, the performance of the sub-tasks is notably better than the overall task performance, confirming the problem of inconsistency in tagging. It is also evident from the results that the performance of the model suffers mainly from the emotion-extraction component of the task, which is understandable given the complexity of the multi-label token classification task.

Model                       Precision(↑)   Recall(↑)   F1-score(↑)
Joint-Learning (Baseline)   0.387          0.283       0.327
Unified                     0.424          0.294       0.347

Tab. 7.2.: Joint-learning model and unified model results for emotion-target extraction

The results of the unified model proposed to overcome this inconsistency are shown in table 7.2. The F1 score shows that the unified model indeed outperforms the joint-learning model. The significant improvement in the precision of the unified model indicates that the retrieved targets are better classified into the correct emotion classes. The improvement in recall, however, is relatively lower. This indicates that the ability to retrieve output entities is almost similar in both modelling techniques and that the contribution of the unified model is more towards precision.

From the joint-learning model results, we observed that emotion extraction was the more convoluted sub-task. The unified model also deals with this through the unified tagging scheme employed. By retrieving both emotions and targets through emotion extraction alone, the unified model focuses on learning the emotions better, while getting rid of the inconsistency altogether. Besides, having only one objective function to optimize makes model training easier. We therefore conclude that the unified modelling technique is better suited to our task than joint learning.

Unified ST Model Variants

This section first discusses the different experiments conducted to improve the unified ST model, followed by an analysis of the advantages of the added modules.

Utility evaluation of CNN layer

This experiment set is carried out to analyse whether spatial information in the tagging layer can help improve tagging performance. Spatial information is learned using a CNN layer between the encoder block and the sequence-tagger block. The implementation details are described in section 5.4. Based on grid search, we use 256 filters, each focusing on a window of 7 tokens, and a learning rate of 3e-5. The features from the CNN are passed through a ReLU non-linearity and batch normalized [25]. A dropout of 0.3 is applied on the normalized token features before tagging.
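A minimal PyTorch sketch of such a spatial tagging module, under the configuration above; the module name, the hidden size and the number of tags are assumptions for illustration.

import torch
import torch.nn as nn

class SpatialTaggingHead(nn.Module):
    """Sketch of the CNN variant: 256 filters over a window of 7 tokens,
    ReLU, batch normalization and dropout 0.3 before the tag projection."""
    def __init__(self, hidden_size=768, num_filters=256, window=7, num_tags=33):
        super().__init__()  # num_tags=33 is a placeholder assumption
        self.conv = nn.Conv1d(hidden_size, num_filters, kernel_size=window,
                              padding=window // 2)  # preserves sequence length
        self.norm = nn.BatchNorm1d(num_filters)
        self.dropout = nn.Dropout(0.3)
        self.tagger = nn.Linear(num_filters, num_tags)

    def forward(self, token_features):           # (batch, seq_len, hidden)
        x = token_features.transpose(1, 2)       # Conv1d expects (batch, hidden, seq)
        x = self.dropout(self.norm(torch.relu(self.conv(x))))
        return self.tagger(x.transpose(1, 2))    # (batch, seq_len, num_tags)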

Utility evaluation of Global Emotion Classification Loss

The purpose of this set of experiments is to support the hypothesis that learning document-level emotions can improve local token-wise emotion extraction. To test this, we perform sequence-level emotion classification in addition to the sequence-tagging process. The methodology is described in section 5.4. The sequence representation used for sequence-level emotion classification is extracted through an aggregate function. Three options were considered for this: 1) the BERT '[CLS]' token features, 2) max-pooling the token features of the whole input, 3) averaging the token features of the whole input. Grid search revealed max-pooling as the best aggregate function, with a learning rate of 2e-5.
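A sketch of the max-pooling aggregate and a global classification head; the shapes, the number of emotion classes and the loss choice are assumptions.

import torch
import torch.nn as nn

# Hypothetical shapes: token_features is (batch, seq_len, hidden) from
# the encoder; the classifier maps a pooled vector to emotion logits.
def global_emotion_logits(token_features, classifier):
    # Max-pool over the sequence dimension: one vector per document.
    pooled, _ = token_features.max(dim=1)   # (batch, hidden)
    return classifier(pooled)               # (batch, num_emotions)

classifier = nn.Linear(768, 11)  # 11 emotion classes assumed
# A multi-label loss such as nn.BCEWithLogitsLoss() over these logits
# yields the global emotion classification loss.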

CNN + Global Emotion Classification Model

This experiment is carried out to test whether the spatial information from the CNN layer and the global loss from the sequence-level emotion classification of the above two experiments can together benefit the task. The learning rate was set to 3e-5.
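A minimal sketch of how the two objectives could be combined in one training step; since no relative weighting between them is stated here, an unweighted sum is assumed.

import torch

def combined_loss(tagging_loss: torch.Tensor,
                  global_emotion_loss: torch.Tensor) -> torch.Tensor:
    # Unweighted sum of the token-level tagging loss and the
    # document-level emotion classification loss (assumed combination).
    return tagging_loss + global_emotion_loss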

Variant Results Analysis

Model                          Precision(↑)    Recall(↑)    F1-score(↑)
Unified (Baseline)             0.424           0.294        0.347
Unified + CNN                  0.421           0.339        0.375
Unified + global loss          0.470           0.295        0.362
Unified + global loss + CNN    0.455           0.339        0.388

Tab. 7.3.: Results for different model variants of the base unified ST model.

The results of the different experiments carried out to improve the unified model are shown in table 7.3.

The F1 score of Unified + CNN shows that the spatial information added through the CNN layer does improve tagging performance. Retrieving a target entity requires tagging all sub-tokens of a target phrase as target. This is improved through the explicit neighbouring context provided by the convolutional layer, as reflected by the significant increase in the recall of the model.

The other variant, Unified + global loss, which performs sequence-level emotion classification in addition to our task, also improves the unified model, as seen by the increase in the F1 score. It can be observed that the improvement is mainly due to the increase in the precision of the model. Precision increases only when the target tokens in the sequence are tagged with the correct emotions. This observation confirms our hypothesis that learning emotions globally across the sequence helps the model tag the tokens more consistently.

From the results of the two variants discussed above, we derive a general conclusion: reinforcing global context helps improve the precision of the sequence-tagger, whereas providing more token-level context improves the recall. The result of the experiment combining the advantages of the two variants is reported under the name Unified + global loss + CNN in table 7.3. The results show an improvement in both precision and recall of the model, in line with our general conclusion about the behaviour of the two variants. Through the experiments conducted, we present the unified model, enhanced with the CNN module and a global emotion classification loss, as our solution to the emotion-target extraction task at hand.

More experiments related to this are discussed in appendix section A.2.

7.2 Multi-task Learning experiments

These experiments are performed to evaluate the MTL approach. The results show whether the MTL approach helps improve performance by overcoming the lack of task-specific data. The methodology used is described in section 5.5. All experiments are carried out with the best performing unified model configuration, Unified + global loss + CNN.

The sub-tasks, emotion-classification and target-extraction, are considered as two independent auxiliary tasks in this experiment. The corresponding auxiliary task datasets are the Twitter - Affect in Tweets dataset and the ABSA - Restaurant and Laptop Reviews dataset, both described in section 6.1. The primary task is trained on our Emotion-Target dataset.

At every training step, the model is trained either on the primary emotion-target extraction task or on one of the auxiliary tasks, chosen at random (ρ > 0.5); when an auxiliary task is chosen, the specific auxiliary task is also selected at random (ρ > 0.5). The number of steps per epoch was set to 100. The loss of each task is weighted to give more importance to the primary task. To this end, based on grid-search, we set weights of 0.6, 0.3 and 0.1 for the primary task, the emotion-classification auxiliary task and the target-extraction auxiliary task respectively. The learning rate was set to 2e-5 for all MTL experiments based on grid search.
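The schedule above can be sketched as follows; the task names, the batches iterators and model.compute_loss are hypothetical, while the loss weights, step count and sampling thresholds follow the description above.

import random

# Loss weights from the grid-search above; names are illustrative.
TASK_WEIGHTS = {"primary": 0.6, "emotion_clf": 0.3, "target_ext": 0.1}
STEPS_PER_EPOCH = 100

def mtl_epoch(model, batches, optimizer):
    """One MTL epoch: at every step train on the primary task or on a
    randomly chosen auxiliary task, with per-task loss weighting."""
    for _ in range(STEPS_PER_EPOCH):
        if random.random() > 0.5:
            task = "primary"
        else:  # pick one of the two auxiliary tasks at random
            task = "emotion_clf" if random.random() > 0.5 else "target_ext"
        # model.compute_loss is a hypothetical API for the task-specific loss
        loss = TASK_WEIGHTS[task] * model.compute_loss(task, next(batches[task]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()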

Model                                Precision(↑)    Recall(↑)    F1-score(↑)
Unified + global loss + CNN (STL)    0.455           0.339        0.388
Unified MTL Model                    0.373           0.374        0.374

Tab. 7.4.: Results for the best performing single-task learning unified model vs the unified model in the MTL setup.

The results of the experiment are shown in table 7.4. From the F1-score we see that this approach does not outperform single task learning. However, the recall value is significantly higher than in any other configuration. To understand this, we first look at how much each individual auxiliary task contributes to the primary task, which is discussed below.

Auxiliary Task evaluation

The contribution of the emotion classification auxiliary task is evaluated by performing MTL while ignoring the target extraction auxiliary task. Loss weights for the primary and auxiliary task are set to 0.5 and 0.5 based on grid-search. Similarly, the usefulness of the target extraction auxiliary task is evaluated by ignoring the emotion classification auxiliary task. Here, the losses are weighted with values 0.7 and 0.3 for the primary and auxiliary task respectively. The training procedure remains the same.

Auxiliary Task(s) used in MTL                                 Precision    Recall    F1
None (STL Baseline)                                           0.455        0.339     0.388
Emotion Classification and Target Extraction
(ABSA - Laptop and Restaurant Reviews &
Twitter - Affect in Tweets)                                   0.373        0.374     0.374
Emotion Classification (Twitter - Affect in Tweets)           0.452        0.337     0.386
Target Extraction (ABSA - Laptop and Restaurant Reviews)      0.461        0.289     0.355

Tab. 7.5.: Auxiliary Task Utility Evaluation.

Comparing the results of MTL with the emotion classification auxiliary task against STL, we see there is no improvement in the results. The auxiliary task does not seem to add any value to the primary task. The datasets of both the primary task and the emotion classification task originate from the same domain, Twitter. Hence we believe the additional data from the auxiliary task may not have any new information to contribute to the primary task.

The target extraction task, on the other hand, provides a boost in precision, but the recall is very low. We believe this is mainly due to the domain and the type of targets in the Laptop/Restaurant dataset. We plot the distribution over the most common part-of-speech tags1 of the target tokens in both the primary and the auxiliary dataset in figure 7.1. Though most tokens are nouns (NN) in both datasets, the primary dataset has a greater variety of POS tags, such as DT (determiner), IN (preposition) and PRP (personal pronoun). We also show the overlap in vocabulary between the two datasets in figure 7.2. The plot shows a very small number of common terms. The language used in the Laptop/Restaurant dataset is very domain-specific. We believe these factors cause lower retrieval, but higher precision.

Fig. 7.1.: Part-of-speech distribution (probability per POS tag: DT, IN, JJ, NN, NNP, NNS, PRP, VB, VBG) of target tokens in the primary task dataset and the target extraction task dataset. [bar chart not reproduced]

Fig. 7.2.: Vocabulary size (number of tokens) of the primary task and target extraction task datasets, with their common vocabulary, showing that the language used in the two datasets is very different. [bar chart not reproduced]
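The comparison behind figures 7.1 and 7.2 can be reproduced with NLTK's POS tagger; a sketch assuming the target tokens of each dataset are available as plain word lists and the vocabularies as token sets.

from collections import Counter
import nltk  # requires nltk.download('averaged_perceptron_tagger')

def pos_distribution(target_tokens):
    """Probability distribution over Penn Treebank POS tags for a list
    of target tokens, as plotted in figure 7.1."""
    tags = [tag for _, tag in nltk.pos_tag(target_tokens)]
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Vocabulary overlap (figure 7.2), assuming token sets per dataset:
# common_vocab = primary_vocab & target_vocab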

However, we observed a significant increase in the recall value when both auxiliary tasks are used together. We believe the two different domains of the auxiliary tasks have together improved the generalization capabilities of the model. But looking at the decrease

1 https://www.nltk.org/book/ch05.html
