
MSc Artificial Intelligence

Master Thesis

Deep Anomaly Detection for Fraud Investigations

by

Youri Mitchel Ryan van der Zee

11825537

August 21, 2020

48 ECTS
Nov 2019 - August 2020
Supervisors: Dr. Johannes C. Scholtes, Julien Rossi
Assessor: Dr. Evangelos Kanoulas
University of Amsterdam


In collaboration with:

ZyLAB Technologies B.V.

Acknowledgements

Jan Scholtes, Julien Rossi,

Jeroen Smeets, Zoe Gerolemou, Marcel Westerhoud, Lisette de Vries, Manasa Bhat, Ipek Ganiyusufoglu, Narendiran Chembu


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope of Research
  1.3 Contributions
  1.4 Thesis Structure
2 Background
  2.1 Fraudulent Practices
  2.2 Language Representation
    2.2.1 TF-IDF
    2.2.2 Word Embeddings
    2.2.3 Doc2Vec
    2.2.4 Towards Transformer Models
  2.3 BERT
3 Related Work
4 Codeword Detection
  4.1 Method
  4.2 Experimental Setup
    4.2.1 Synthetic Datasets
    4.2.2 Models
    4.2.3 Pre-processing Steps
    4.2.4 Parameter Settings
    4.2.5 Evaluation Metrics
  4.3 Results & Analysis
  4.4 Discussion
5 Deep Anomaly Detection on Documents
  5.1 Method
  5.2 Experimental Setup
    5.2.1 Datasets
    5.2.2 Evaluation Metrics
    5.2.3 Baselines
    5.2.4 Parameter Settings
  5.3 Results & Analysis
  5.4 Discussion
6 Study on a Dutch Fraud Dataset
  6.1 Dataset & Model
  6.2 Codeword Detection
  6.4 Sentiment Analysis & Domain Adaptation
  6.5 Discussion
7 Conclusion
A Mean Matched Score Visualisations for the Anomaly L and ENRON-SB Datasets


Abstract

In modern litigation, legal experts often face an overwhelming number of documents that must be reviewed throughout a matter. Large-scale litigations may require legal teams to produce millions of documents to the opposing parties or regulators. In the majority of legal cases, legal experts do not know beforehand what they are looking for, nor where to find it. Effectively, this means investigators are looking for a needle in the haystack without knowing what the needle looks like.

Current eDiscovery techniques such as predictive coding are mainly applicable when a substantial amount of the relevant documents have been discovered already. In this research, we focus our attention on anomaly detection methods, which are crucial at the start of an investigation to give direction to it. Specifically, we formulate two tasks, sentence-level anomaly detection in the form of codeword detection, and document-level anomaly detection.

Since there are no existing datasets labeled on the use of code words, we construct two synthetic datasets. We experiment with different NLP techniques, including word embeddings and the state-of-the-art deep learning language representation model BERT, to evaluate how the methods perform on this task. We show that the deep contextual representations of BERT are beneficial for detecting contextual outliers in sentences. In addition, we assess the performance of the techniques when tested on data from a domain other than the one they are trained on.

Subsequently, we present our method for deep anomaly detection at a document-level. We construct the datasets required to train the method and benchmark our proposed method against various baseline methods. Next to this, we experiment with the different design choices of this method. We observe that our model outperforms baseline methods on all reported metrics, indicating that our approach for learning deep features is beneficial for anomaly detection on textual communication.

The developed methods are then applied to a real-world Dutch fraud dataset. We conduct an extensive qualitative analysis to understand the limitations of our approach and indicate further research directions based on our findings.


Chapter 1

Introduction

Information Retrieval (IR) is the process of determining the presence or absence of relevant documents satisfying an information need (Van Rijsbergen, 1979). The field has risen to great importance in everyday life and the business world due to the increasing reliance upon information stored electronically. Managing this information has become significantly more challenging for businesses, with the global volume of Electronically Stored Information (ESI) growing rapidly. In modern litigation, legal experts often face an overwhelming number of documents that must be reviewed throughout a matter. Large-scale litigations may require legal teams to produce millions of documents to the opposing parties or regulators. Typically, the same teams must further wade through the sea of documents to find supporting evidence for their arguments (Grossman & Cormack, 2011; Chhatwal et al., 2017). Essentially, these teams need to determine what data are relevant to the case and use those data to construct a narrative of what happened.

Manually reviewing this information can be costly and is often not feasible due to the time pressure placed on the reviewing process. To combat this issue, companies turn to eDiscovery software to process the large volume of data. Electronic Discovery (eDiscovery) is the process of retrieving ESI for review for anticipated or actual litigation (Oard et al., 2010). In a recent case involving a patent dispute with Apple, Samsung collected and processed approximately 3.6TB, or 11,108,653 documents. The cost of processing the evidence over 20 months was said to be more than 13 million US dollars. The analysis of digital evidence in eDiscovery investigations typically focuses on the review of documents and textual communication. Investigators are assisted by eDiscovery tools that are responsible for categorizing whether documents are relevant, referred to as technology-assisted review (TAR). Traditional approaches to review involve keyword or Boolean search, followed by manual review. More modern approaches use machine learning for document classification, referred to as predictive coding in the legal profession (Dale, 2019). In any investigation, the investigators try to answer the following 'Golden' investigation questions (Scholtes & van den Herik, 2019; Scholtes, 2020):

1. Who was involved?
2. What happened?
3. Where did it happen?
4. How was the crime committed?
5. When did the crime take place?
6. With what was the crime committed?
7. Why was the crime committed?

The answers to these questions help investigators form a narrative about a case. The significance of narrative in the domain of eDiscovery has been researched by Chapin et al. (2013), providing evidence that people have a natural tendency to construct narratives to understand complex scenarios. In the field of Natural Language Processing (NLP) and IR, works on Topic Modelling (Tannenbaum et al., 2015; Smeets et al., 2016), Community Detection (Helling et al., 2018), Event Detection (Scholtes & Heinrichs, 2018) and Target-Based Sentiment Analysis (Gerolemou & Scholtes, 2019) reflect this natural tendency to construct a narrative.

1.1 Problem Statement

Investigators in an eDiscovery investigation are typically overwhelmed by the volume of email that needs to be reviewed, ranging from thousands to hundreds of thousands of emails. Often the eDiscovery task for these investigators does not start as a review task but as an early case assessment task that is a form of exploratory search (Marchionini, 2006). Investigators explore the available information to construct a narrative that helps find answers to the golden investigation questions. However, in the majority of legal cases, investigators do not know beforehand what they are looking for, nor where to find it. Effectively, this means investigators are looking for a needle in the haystack without knowing what the needle looks like. As a result, initiating the predictive coding process can be problematic as there are no examples that can drive it (Baron, 2011; Scholtes & van den Herik, 2019). This problem is further exacerbated by the fact that the number of responsive documents is usually low compared to the size of the entire collection. As a result, it may take considerable time to obtain a sufficient number of positive samples.

Once the predictive coding process is successfully initiated, another problem emerges. The language used in these legal offenses is typically very sparse. While this is partly because offenders use code language or deliberately avoid certain words or amounts in their language, it is usually because the activities surrounding a legal offense can take so many different forms. To come back to the needle in a haystack metaphor, once the investigator finds a sample of needles, there is no guarantee that the remaining needles have the same shape(s). The haystack might contain needles of shapes the investigator has never encountered before. This makes it extremely difficult for an investigator to complete the narrative we alluded to earlier, since it is often unclear whether the data has been searched through sufficiently. That is why researchers within an eDiscovery case often formulate different hypotheses that structure and reduce the search space. Although these hypotheses give structure to the research, there is still the risk of missing specific hypotheses. Lastly, the search problem is further intensified because there is no guarantee that there will be any direct evidence. Therefore, a successful investigation will often take various sources of incomplete evidence and assemble them into a coherent structure that either proves or disproves the offense.

The problems described in the last section are very typical for the field of anomaly detection. Anomaly detection refers to the task of identifying documents, or segments of text, that are unusual or different from "normal" text. What exactly makes a text unusual or different, however, depends on many different factors. For example, if we take a collection of news stories about politics with one sports article inserted into it, we want to classify this sports article as an anomaly, because its topic is anomalous compared to the rest of the document collection. If the same collection instead consists mainly of sports articles and one political news article, we hope to identify the political story as abnormal with respect to the rest of the collection because it is a minority occurrence. One can easily approach this problem as a classification task, as there are sufficient training examples available for both classes. Now suppose we have a collection of news articles with one fictional story in it; we would like to identify this fictional story as different because its language is different from the rest of the documents in the collection. In this example, we have no prior knowledge or training data on what it means to be "normal" or what it means to be news or fiction. The examples described above illustrate the spectrum of anomaly detection problems within machine learning. With sufficient prior knowledge and training data, it is possible to train a supervised classifier, but this is not feasible for a large part of anomaly detection problems.



to be a relevant or irrelevant document. For instance, email conversations that express a high level of frustration towards colleagues are typically very interesting to investigate. However, this example is just one of many different patterns that can be relevant for an investigation. Each pattern brings its own intricacies when it comes to detecting it, either in the complexity of the problem or in the availability of training data. Thus, precisely where each problem lies in the anomaly detection spectrum differs considerably.

1.2 Scope of Research

This thesis aims to develop machine learning methods for detecting fraudulent practices in textual communication. Specifically, we are interested in applying anomaly detection methods to assist legal experts in conducting fraud investigations. Since fraud is a broad concept, we focus on detecting patterns related to corruption, one of the three major types of occupational fraud. Occupational frauds are defined by The Association of Certified Fraud Examiners (ACFE) as those in which an employee, manager, officer, or owner of an organization commits fraud to the detriment of that organization. Corruption includes offenses such as conflicts of interest, bribery, and economic extortion.

In collaboration with a domain expert in the field of fraud investigations, we define a list of fraudulent patterns related to corruption. The specific content and motivation behind this list can be found in section 2.1. The list consists of different types of tasks, most of which can be framed as machine learning tasks. In this study, we focus on two of these tasks, which lie in the realm of anomaly detection problems.

What we are working towards is a framework to evaluate a collection of documents from various perspectives. Because every offense is different in motivation, opportunity, and rationalization, we must analyze the documents using various methods, such as emotion analysis to detect frustration in communication, or time series analysis to analyze the meeting behavior of specific individuals. Ultimately, with this collection of machine learning tools, we are looking to gather evidence (i.e., relevant documents) for a particular offense.

Moreover, machine learning techniques can also contribute to the formulation of various hypotheses in the fraud investigation. In the work of Gustavi et al. (2013), the authors present an analysis tool to support the process of generating and evaluating a large set of hypotheses. The tool is to a large extent based on Morphological Analysis and Analysis of Competing Hypotheses, both of which are well explained in the work of Gustavi et al. (2013). The idea behind morphological analysis is to identify a limited number of dimensions, which make up essential parts of a hypothesis, and use them to discretize the solution space. The analysis of competing hypotheses is then used to evaluate evidence in an unbiased manner. However, in their work, both the morphological chart and the list of evidence are specified by the user. Therefore, in practice, one will still have to define the correct hypotheses and collect the relevant evidence before using this method.

The methods we have developed in this research only contribute to part of the problem. We theorize that the collection of various machine learning models can be used to automate the process of defining hypotheses and finding evidence. However, the research and development of this whole framework fall outside the scope of this study. Therefore, we focus our attention in this study on a subset of these tasks that form the basis of this framework and leave the other machine learning tasks as well as the automatic hypotheses formulation for further research.

The first task we chose is identifying the usage of code words in textual communication. We define the use of code words as the act of replacing terms in a document with seemingly innocuous, unrelated ones. In these instances, the offenders seek to fool automatic systems that rely on keyword matching to detect certain offenses. These substitutions are readily detectable by humans since the substitute word does not belong in the context of the sentence.




Essentially, this task boils down to an anomaly detection task at the sentence level, as the code words are contextual outliers with regard to the rest of the sentence. The second task we chose is to perform anomaly detection on a document level. Here, we are primarily interested in identifying textual communication that deviates significantly from the majority of communication. These methods are crucial at the start of an investigation to give direction to it. One significant limitation of this research is the unavailability of labeled datasets for these tasks. Therefore, a significant portion of the research focuses on the creation of functional datasets.

For our first task, detecting code words, we build a synthetic dataset that simulates the use of code words in textual communication. We position this task as a supervised learning task and use a range of machine learning techniques to address it. We create a new synthetic dataset due to the lack of publicly available datasets that originate from the domain of textual communication and contain the use of code words. However, as will be discussed in the Related Work, the application of Neural Networks in the field of eDiscovery is still relatively unexplored territory. Therefore, it is worthwhile to apply these models alongside traditional methods to benchmark performance and assess their strengths and weaknesses.

For our second task, we look at a problem that is too multi-dimensional to be converted into a supervised classification task. Using a well-explored eDiscovery email collection, we look for emails in which the language deviates significantly from the general language in the entire collection. In doing so, we do not search for specific entities, words, or themes in the email, but try to find contextual outliers at the document level.

Finally, we test the methods we developed on a Dutch dataset that originated from a fraud investigation. In this last part of the thesis, we mainly evaluate our methods qualitatively and try to indicate whether the methods we developed are applicable in a fraud investigation. Therefore, this chapter is meant as a preliminary conclusion that needs to be investigated in further research.

Research question

As the research consists of several experiments, each experiment is accompanied by its own research question. These research questions are discussed in the corresponding sections. The main research question of this thesis is: How can we leverage Language Modeling tools to assist legal experts in fraud investigations by detecting anomalies in textual communication?

The experiments’ research questions can be summarized in the following two sub-questions:

1. To what extent can we utilize commonly used NLP techniques to detect the usage of code words in textual communication?

2. How can we leverage the contextual representations of deep neural Language Models for anomaly detection on textual communication?

1.3 Contributions

Listed below are the contributions that resulted from this study:

• We identify a series of fraudulent patterns based on a domain expert’s experience on fraud investigations. We propose a framework that can help legal experts structure a fraud investigation and find relevant evidence.

• We propose methods for identifying two of those patterns using deep neural language models.

– We construct two synthetic datasets for codeword detection. We show that deep neural language models provide contextual information that is vital for detecting these sentence-level anomalies. We show that these models are the only tested method that, to some extent, can handle text from a different domain than it was trained on.

– We propose a method for anomaly detection on a document level inspired by methods from the Computer Vision domain. We perform extensive experiments on different design choices for the method and show our proposed method significantly outperforms baselines on this task.

• We employ the developed techniques on a real-world Dutch fraud case dataset. We analyze how well these trained models generalize to an unseen domain and provide further research directions based on our findings.

1.4 Thesis Structure

This thesis is structured as follows. Chapters 2 and 3 discuss the essential background knowledge relevant for this work and work related to this research. These chapters offer a comprehensive review of the techniques employed in this thesis as well as research in natural language processing, information retrieval and anomaly detection related to our work. Next, in Chapters 4 and 5, we discuss our developed methods as well as the experiments we conducted to evaluate them. Both chapters are structured as follows. First, the method section covers the methods and techniques we have introduced to detect code words in sentences and identify anomalies in documents. This section is followed by the experimental setup in which we lay out all the experiments that were conducted as well as the training environment and evaluation practices. These two chapters are concluded with the results of the experiments and a discussion on the findings. In Chapter 6, we test the developed methods on a Dutch dataset that originated from a fraud investigation. This chapter’s main goal is to apply the developed techniques on a real-world dataset and analyze the findings. The final concluding chapter summarizes the outcomes of the research questions in the thesis.


Chapter 2

Background

In this chapter, we discuss the background knowledge required to grasp the fundamental concepts used in this research. First, we will cover the process of defining fraudulent practices. We discuss how the list of fraudulent patterns, defined by a domain expert, came about and how it relates to the Golden Investigation Questions and the Fraud Triangle. After this, we will discuss a broad range of methods to represent language, from statistical to neural approaches.

2.1 Fraudulent Practices

Detecting fraud is not an easy task and requires a thorough understanding of the nature of the fraud, why it is committed, and how it can be perpetrated and covered up. In the work of Clinard & Cressey (1954), the authors lay down the foundation of fraud theory and explain why trust violators commit fraud. This work of Clinard & Cressey (1954) has been conceptualized as the fraud triangle and is widely used by regulators, professionals, and academics (Kassem & Higson, 2012). The fraud triangle explains the factors that cause someone to commit occupational fraud, and consists of three components that, together, lead to fraudulent behavior:

• Pressure. Pressure forms the motivation behind the crime. There are many motivations, including, but not limited to, inability to pay bills, need to meet earnings, and desire for status symbols.

• Opportunity. An opportunity to commit the act must be present.

• Rationalization. The vast majority of fraudsters are first-time offenders with no criminal past; they do not view themselves as criminals. Consequently, the fraudster must justify the crime in a way that makes it an acceptable or justifiable act.

The fraud triangle provides a useful framework for organizations to analyze their vulnerability to fraud. In virtually all cases, all three elements of the triangle must exist for an individual to commit fraud. If an organization can focus on preventing each factor, it can avoid creating an environment that allows for bad behavior. The three fraud triangle components can be related to the Golden investigation questions we alluded to in the introduction. For example, as pressure forms the motivation behind the crime, it can help answer the question: Why was the crime committed?

Together with an expert in fraud investigations, we created a list of fraudulent patterns inspired by the Fraud Triangle and the golden investigation questions. Table 2.1 shows the constructed list. Most of the questions concern tender fraud, as the dataset provided by the domain expert is a tendering case. However, the questions are defined so that they generalize to other types of fraud. The purpose of compiling this list is twofold.




Fraud Triangle | Category | Question
Pressure | Who | Who is involved in awarding contracts?
Pressure | Who | Who is the leading figure in awarding contracts?
Pressure | Who | Is there a person who uniquely communicates with suppliers?
Pressure | Why | Are there employees with financial problems?
Pressure | Why | Are people frustrated about their work situation?
Pressure | Why | Is someone under pressure from their environment to deliver something?
Opportunity | What | Are there indications that procurement guidelines have not been followed?
Opportunity | What | Are suggestions made for the provision of reciprocal services / making payments?
Opportunity | How | Are people kept out of the loop who are normally involved in tenders?
Opportunity | How | Is there talk about deviating procedures for making tenders?
Opportunity | By Which | Do we see people involved in awarding / tenders using different terms / code language?
Opportunity | By Which | Are there any deviating communication lines concerning quotation procedures?
Rationalization | When | Is there a messy organizational structure (many changes of guard, reorganization)?
Rationalization | When | Are the tasks and responsibilities clear to everyone?
Rationalization | Where | Are accusations made against colleagues or other stakeholders?
Rationalization | Where | What evidence is there of good governance (management, mandate, reporting)?

Table 2.1: List of Fraudulent Patterns.

First, it motivates the anomaly detection tasks we address in this study. Secondly, it serves as the basis for a framework for detecting fraudulent practices. Although we confine ourselves to part of the problem within this study, the list identifies many other machine learning tasks that can be worked on in further research. Eventually, each component could be used to formulate hypotheses or gather evidence for a case, so that each document can be evaluated on multiple dimensions.

2.2 Language Representation

When we want to process documents with machine learning techniques, we must represent them numerically. Perhaps the simplest representation one can think of is a One Hot Encoding. In this approach, each element in a vector corresponds to a unique word in our vocabulary. If a token at a particular index exists in the document, the element is marked as 1; else, it is marked as 0. An extension of this is the Bag of Words approach, where instead of encoding with a boolean value, each element of the vector corresponds to the number of times that specific word occurs in the document. The limitation of these two approaches is that the representation does not consider any semantic relationship between words. These representations do not encode any idea of word meaning or word similarity. A second limitation, along with the inability to capture word semantics, is the huge dimensionality of the vector representation, as the vector size is equal to the size of the vocabulary. Due to the large dimensionality, these vectors often contain a large number of zero values, making it a very sparse representation. This means these representations require more memory and computational resources to work with, making them less efficient than the techniques discussed later.

2.2.1 TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a statistical measure used to determine how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in a document but is offset by the word's frequency in the corpus. We construct a vector as in the BoW approach, but instead of raw counts, we fill it with the TF-IDF score of each word. The TF-IDF score is computed by taking the product of the Term Frequency (TF) and the Inverse Document Frequency (IDF). The equations for calculating the TF-IDF score are:

tf(t, d) = \frac{n_{t,d}}{\sum_k n_{k,d}}  (2.1)

idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}  (2.2)

tfidf(t, d, D) = tf(t, d) \times idf(t, D)  (2.3)

where t is the word we are calculating the TF-IDF score for, d is the document in which t occurs, and D is the collection of documents. Similar to the BoW approach, TF-IDF suffers from the same two issues described earlier: the inability to capture word meaning and the large dimensionality. However, it does provide weights for different words.
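To make the computation concrete, the following is a minimal Python sketch of equations 2.1-2.3. It is illustrative only: the toy corpus is invented, and library implementations such as scikit-learn apply smoothed variants of the IDF term rather than this plain form.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Equation 2.1: relative frequency of the term within the document
    counts = Counter(doc_tokens)
    return counts[term] / sum(counts.values())

def idf(term, corpus_tokens):
    # Equation 2.2: log of total documents over documents containing the term
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tfidf(term, doc_tokens, corpus_tokens):
    # Equation 2.3: product of TF and IDF
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus = [doc.lower().split() for doc in [
    "the contract was awarded to the supplier",
    "the supplier sent a new quotation",
    "payments were made after the tender",
]]
print(tfidf("supplier", corpus[0], corpus))  # only call for terms that occur in the corpus
```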

2.2.2 Word Embeddings

Word embeddings are learned representations where words that have a similar meaning have a similar representation. The underlying theory behind word embeddings comes from the field of Linguistics and is called the Distributional Hypothesis. It states that words that occur in the same contexts tend to have a similar meaning (Harris, 1954). A dense real-valued vector represents each word. It is common to see word embeddings that are 256-dimensional, 512-dimensional, or 1024-dimensional when dealing with extensive vocabularies. Word embedding vectors are important in capturing semantic and syntactic regularities between words in the vector space (Mikolov et al., 2013). Bengio et al. (2003) made the first attempt at learning semi-supervised word embeddings. In their work, the authors present a neural network for the task of language modeling. Simultaneously, the network learns a distributional representation of each word. They introduce the weights of the input to the hidden layer of the neural network as word embeddings. However, the proposed model proved to be impractical as it was computationally expensive. More recently, Mikolov et al. (2013) presented their Word2Vec model. This model had a much simpler architecture than previous work, which led to a significant reduction in computational cost. The implementation of this model facilitated widespread experiments with word embeddings across different areas in NLP. These experiments showed that the use of word embeddings led to an improvement in the majority of downstream NLP tasks (Mikolov et al., 2013; Yazdani & Popescu-Belis, 2013; Baroni et al., 2014). In the following sections, two of the most popular word embedding techniques, Word2Vec and GloVe, are discussed briefly.



Word2Vec

In the original Word2Vec paper, two architectures were proposed to compute word embeddings: Continuous Bag of Words (CBOW) and Skip-Gram. The CBOW model learns word embeddings by predicting the current word based on surrounding words (the context). The Skip-Gram model learns word embeddings by predicting the surrounding words given a current word. Both architectures are illustrated in figure 2.1.

Figure 2.1: Architecture of Word2Vec models: CBOW and Skip-Gram

Context size has a significant effect on what information is encoded in the embeddings. A small context size means the embeddings are more sensitive to the syntactic role of the target word. A large context size means the embeddings are more sensitive to the semantic, topical role of the target word.
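As an illustration, both variants can be trained with the gensim library (an assumed tooling choice, not one prescribed by this thesis); the window argument corresponds to the context size discussed above. A minimal sketch, assuming gensim 4.x:

```python
from gensim.models import Word2Vec  # assumes gensim >= 4.0

# Toy corpus of tokenized sentences; in practice this would be the email collection.
sentences = [
    ["the", "contract", "was", "awarded", "to", "the", "supplier"],
    ["the", "supplier", "sent", "a", "new", "quotation"],
]

# sg=0 trains the CBOW variant, sg=1 the Skip-Gram variant.
# A small window favors syntactic roles, a large window topical similarity.
cbow = Word2Vec(sentences, vector_size=100, window=2, sg=0, min_count=1)
skipgram = Word2Vec(sentences, vector_size=100, window=10, sg=1, min_count=1)

vector = skipgram.wv["supplier"]                        # the learned embedding
similar = skipgram.wv.most_similar("supplier", topn=3)  # nearest neighbours in the space
```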

GloVe

Global Vectors (GloVe) is an unsupervised method for learning word embeddings that aims to directly model word co-occurrences based on their co-occurrence statistics (Pennington et al., 2014). The input to GloVe is a co-occurrence matrix X, such that X_{ij} denotes the number of times word i occurs in the context of word j. Similarly to Word2Vec, the context is window based, meaning that a limited number of surrounding words are assumed to be in context. The objective of GloVe is to minimize the function:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2  (2.4)

where w_i and b_i are the target word vector and bias, \tilde{w}_j and \tilde{b}_j are the context word vector and bias, and the function f lowers the weight of rare and frequent co-occurrences. Schnabel et al. (2015) compared several embedding approaches, including Word2Vec and GloVe, across different intrinsic evaluation benchmarks. The study shows that the two approaches tend to perform similarly in downstream NLP tasks.
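A minimal PyTorch sketch of the objective in equation 2.4, assuming the weighting function f and its commonly reported parameters (x_max = 100, alpha = 0.75) from Pennington et al. (2014); this is illustrative and not the reference GloVe implementation.

```python
import torch

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(X_ij): down-weights rare co-occurrences and caps frequent ones at 1
    return torch.clamp((x / x_max) ** alpha, max=1.0)

def glove_loss(w_i, w_j_ctx, b_i, b_j_ctx, x_ij):
    # Equation 2.4: weighted squared error between the dot product (plus biases)
    # and the log co-occurrence count.
    pred = (w_i * w_j_ctx).sum(dim=-1) + b_i + b_j_ctx
    return (glove_weight(x_ij) * (pred - torch.log(x_ij)) ** 2).sum()

# Toy batch of 8 word/context pairs with 50-dimensional vectors.
w_i = torch.randn(8, 50, requires_grad=True)
w_j_ctx = torch.randn(8, 50, requires_grad=True)
b_i = torch.zeros(8, requires_grad=True)
b_j_ctx = torch.zeros(8, requires_grad=True)
x_ij = torch.randint(1, 200, (8,)).float()

loss = glove_loss(w_i, w_j_ctx, b_i, b_j_ctx, x_ij)
loss.backward()  # gradients flow into both word and context vectors
```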

2.2.3 Doc2Vec

An extension of the Word2Vec approach is Doc2Vec. Where word embedding approaches learn to project words into a latent d-dimensional space, Doc2Vec learns to project documents into a latent d-dimensional space. The goal here is to create a latent representation of a document, regardless of document length. In the work of Le & Mikolov (2014), the authors extend the two existing Word2Vec architectures by concatenating a paragraph vector to the word vectors.



Figure 2.2: Doc2Vec Architectures. (a) Architecture of the Doc2Vec PV-DM model; (b) Architecture of the Doc2Vec PV-DBOW model.

By doing this, rather than just using word-specific information to predict the next word, the model also uses document-specific information. The first model is called the Distributed Memory version of Paragraph Vector (PV-DM), an adaptation of the CBOW model. The authors also propose a second architecture, the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), as an adaptation of the Skip-Gram model. Both Doc2Vec architectures are shown in figures 2.2a and 2.2b. Just as the word vectors are used as word embeddings after training, the paragraph vector is used as the document representation in Doc2Vec. The main purpose of the paragraph vector is to act as a memory that remembers what is missing from the current context, or as the topic of the paragraph. While the word vectors represent the concept of a word, the document vector represents the concept of a document.
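A short gensim sketch (again an assumed tooling choice, assuming gensim 4.x) of training PV-DM and inferring a paragraph vector for an unseen document:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument  # assumes gensim >= 4.0

corpus = [
    ["please", "review", "the", "attached", "tender", "documents"],
    ["lunch", "meeting", "moved", "to", "friday"],
]
# Each document gets a tag, which becomes the index of its paragraph vector.
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)]

# dm=1 selects PV-DM (CBOW-style); dm=0 selects PV-DBOW (Skip-Gram-style).
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, dm=1, epochs=40)

doc_vector = model.dv[0]                                             # learned paragraph vector
new_vector = model.infer_vector(["new", "quotation", "procedure"])   # vector for an unseen document
```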

2.2.4 Towards Transformer Models

In section 2.2.2, we have seen how word embeddings allow for dense word representations. However, one of the drawbacks of the traditional word embedding techniques is that each word has only one representation. This is undesirable in any case where the meaning of a word changes based on context. This is the reason why Peters et al. (2018) proposed the idea of contextualized word embeddings, Embeddings from Language Models (ELMo). Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word its embedding. ELMo uses a bi-directional LSTM (Bi-LSTM) trained on a specific task to learn the contextual word embeddings. In most cases, word embeddings are used in combination with sequence models like a Bi-LSTM for NLP tasks.

Sequence models consist of an encoder and a decoder network. The encoder network is responsible for compressing the sequence data into a fixed-length vector. The decoder network then uses this vector for tasks like classification or text generation. The responsibility of the encoder is to summarize the input sentence into a context vector. If the encoder makes a bad summary, the prediction of the decoder will suffer. However, the performance of these sequence models decreases rapidly as the input sentence’s length increases (Cho et al., 2014). This is called the long-range dependency problem of RNN/LSTMs. To address this issue, a mechanism called Attention (Vaswani et al., 2017) was introduced, which is the fundamental idea behind transformer networks.

The principle idea behind attention is to keep all the relevant information in the input sentences intact while creating the context vector. This prevents the whole sequence from being compressed into a single vector, thus preserving vital information needed for output prediction. Attention encodes the input sequence into multiple vectors, and the decoder learns to select a relevant subset of these vectors to compute the context vector needed to make predictions.
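To make the mechanism concrete, here is a minimal PyTorch sketch of the scaled dot-product attention of Vaswani et al. (2017). It is illustrative only and omits the multi-head projections and masking used in the full Transformer.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # Each input has shape (batch, sequence_length, d_k).
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)  # similarity of every query to every key
    weights = F.softmax(scores, dim=-1)                      # attention distribution over positions
    return weights @ value, weights                          # weighted sum of the value vectors

# Toy example: one sentence of 6 tokens with 64-dimensional representations.
x = torch.randn(1, 6, 64)
context, attn = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
```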

The transformer architecture is illustrated in figure 2.3. The encoder consists of six identical stacked layers, where each layer is composed of an attention layer and a feed-forward layer. The decoder also has six identical layers. However, each layer here has an additional attention layer over the outputs of the encoder layers.

Figure 2.3: Transformer Architecture

2.3 BERT

At the end of 2018, researchers at Google AI Language released a new NLP technique called Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). In the following period, BERT models achieved impressive benchmark performance on most downstream NLP tasks. In the following sections, we look at the functional knowledge needed to understand why BERT is so good at what it does. Because we use BERT's architecture in both our methodologies, it is crucial to understand precisely how BERT works and why it performs so well.

One of the biggest challenges in NLP is the lack of sufficient labeled training data. Generally, there is a massive amount of text data available, but when we want to perform a specific task, we usually create a task-specific dataset. Unfortunately, to perform these tasks properly, large amounts of data are required for deep learning NLP models. It is often not feasible to create this amount of data by manually labeling training examples, since we could quickly require millions of training examples to use deep learning techniques. Researchers have also worked on developing various language modeling techniques for training general-purpose language representation models using the enormous piles of unannotated text on the web. Among such models, BERT is the state-of-the-art language representation model. The power of BERT stems from the effective pre-training approach that learns rich text representations from unlabelled data. In the following sections, we discuss the BERT architecture and the input representation, and the pre-training and fine-tuning approach.



BERT Architecture

The BERT architecture uses the Transformer Encoder stack, as explained in section 2.2.4. The BERT BASE version has 12 stacked encoder layers, and the BERT Large has 24 encoder layers. BERT takes a sequence of words as input and applies self-attention at each layer. The encoder layers are connected through a feed-forward network. A single fully-connected classification layer replaces the decoder from the Transformer architecture.

BERT Vocabulary

BERT uses a WordPiece model (Wu et al., 2016) to construct its vocabulary. The idea behind WordPiece is to break down unknown words into subwords. For example, if the input phrase consists of ten words, including one unknown word, the input tokens could consist of twelve to thirteen tokens instead of the original ten. A subword exists for every character, such that every word will eventually be split up into existing tokens. This is a significant improvement over most word embedding techniques, where every unknown word usually uses the same unknown token for its representation. BERT has a total vocabulary of 30,000 tokens. Input sequences are started with a special [CLS] token, which is used to store the full sequence features.
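The effect of WordPiece tokenization can be inspected with the Hugging Face transformers library (also used later in this thesis); the example sentence is invented and the exact subword split shown in the comment is only indicative.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# An out-of-vocabulary word is split into known subword pieces (continuations are prefixed with ##).
print(tokenizer.tokenize("the kickback was undetectable"))
# e.g. ['the', 'kick', '##back', 'was', 'und', '##ete', '##ctable']

# encode() wraps the sequence in the special [CLS] and [SEP] tokens.
ids = tokenizer.encode("the kickback was undetectable")
print(tokenizer.convert_ids_to_tokens(ids))
```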

BERT Pre-Training

BERT is trained on two pre-training tasks.

1. Masked Language Model (MLM). In this task, a portion of the tokens in each sequence is randomly masked (replaced with the token [MASK]). The model is trained to predict these tokens using all the other tokens of the sequence. The goal of this task is to learn a bi-directional representation for each token.

2. Next Sentence Prediction (NSP). This task is a binary classification task that involves predicting whether the second sentence directly follows the first sentence in the corpus. For this, half of the time the actual next sentence is used as the second sentence, and half of the time a random sentence is taken from the corpus for training. The goal of this task is to familiarize the model with training on multiple sequences.

BERT Fine-Tuning for Sequence Classification

To fine-tune BERT for sequence classification, the pre-trained model is trained on a labeled dataset to predict the class of the given sequence. As the output of the transformer encoder is the hidden state of all tokens in the sequence, the output needs to be pooled before it can be forwarded to a fully-connected layer. The [CLS] token is used for this. The output of this token is often referred to as the pooled output and is forwarded to a fully connected layer for classification.
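A compact sketch of this setup with the Hugging Face transformers library (assuming a recent version); BertForSequenceClassification adds the pooling and classification layer described above on top of the pre-trained encoder. The two example sentences and labels are invented.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

sentences = ["I will be out of the office on friday", "I will be out of the rock on friday"]
labels = torch.tensor([0, 1])  # 0 = normal, 1 = contains a code word

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # classification head on the pooled [CLS] output
outputs.loss.backward()                  # fine-tuning updates all encoder layers
```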


Chapter 3

Related Work

This chapter provides an overview of prior research in several areas that apply to the research presented in this thesis. First, we review the research that compares exhaustive manual review against predictive coding. Secondly, we look at the techniques used for the predictive coding task and contrast those against other fields. Next, we focus our attention on what has been done in terms of prediction tasks on two enterprise email collections. This overview is followed by works from the field of IR in which several tasks related to the construction of a narrative are presented. Lastly, we discuss works on the tasks we are undertaking in this research.

There has been research on the effectiveness of exhaustive manual review versus review using predictive coding. In a study conducted by Grossman & Cormack (2011), they present several cases in which two or more legal teams assessed the same set of documents inconsistently. They raise the question of whether there is a gold standard in legal assessments. In the studies listed, different approaches to resolve the relevancy disagreements are presented. In Roitblat et al. (2010), the disagreements were examined by a senior litigator, who decided which team had made the right decision. In a study by Voorhees (2000), the primary assessor requested production and is therefore considered to be the gold standard. In any case, the studies show that human review in court cases is a non-trivial task with many grey areas and is prone to human errors. Next to the inconsistencies in manual review, the F1-scores of the manual review were lower than those of the review using predictive coding (Grossman & Cormack, 2011). In terms of the algorithms used for predictive coding, a study by Yang et al. (2017) compares the performance of Logistic Regression, Linear SVM, Gradient Boosting, Multi-layered Perceptron, and 1-Nearest Neighbour for the predictive coding task. The study found that Linear SVM outperforms the other methods. They also compared three document representations: Bag-of-Words, Term Frequency, and TF-IDF, and found TF-IDF most effective in combination with a Linear SVM. A recent study by Dale (2019) reports that there is an active debate in the legal community about the pros and cons of different techniques for selecting reasonable seeding sets and whether passive or active learning is more effective. The former involves the random selection of documents marked by human reviewers, and the latter deliberately selects uncertain or presumed-relevant examples. Motivated by the fact that the proportion of similar documents in eDiscovery collections typically ranges between 25% and 50%, Sperling et al. investigate the use of similar document detection in eDiscovery and introduce a novel algorithm to detect similarities.

In terms of analysis of email data, the research on email data is mostly restricted to two publicly available data sets. First, the Enron dataset was acquired by the Federal Energy Regulatory Commission during its investigation following the company's collapse (Klimt & Yang, 2004). It contains data from 158 users, mostly senior management of Enron, organized into folders. The total collection consists of 0.6M emails. The dataset is well explored in literature and is especially relevant for eDiscovery as the data set is a direct result of a fraud investigation. Secondly, the Avocado Research Email Collection is a corpus of emails and attachments, distributed by the Linguistic Data Consortium.


The collection consists of 2 million items, distributed for use in research and development in eDiscovery, social network analysis, and related fields (Oard et al., 2015). An early attempt at email relevance classification on the Enron dataset is presented by VanBuren et al. (2009), in which they try to directly predict relevancy using a small labeled subset of the Enron corpus and a Bayes-based text classifier. Due to insufficient training data, it is hard to assess the quality of their work. Nonetheless, the authors report that their model is significantly better at classifying documents as irrelevant than relevant. Graus et al. (2014) address the task of recipient recommendation, combining the communication graph and the content of the email. They experimented on both the Enron and Avocado data sets and reported that the combination outperforms either the communication graph or the content in isolation. In the work of Alkhereyf & Rambow (2017a), emails from both the Enron and Avocado corpora are annotated and classified into two main categories, "Business" or "Personal". The study implements SVM and Extra-Trees classifiers using social networking and lexical features and, similarly to the work of Graus et al. (2014), reports an increased performance when these features were used in combination.

Next to these prediction tasks, there are several works from the IR and NLP fields related to the construction of a narrative. First, there are the works on semantic search as part of the "Semantic Search for E-Discovery" project on topic detection and clustering, semantic enrichment of user profiles, email recipient recommendation, expert finding, and identity extraction from digital forensic evidence (Ren et al., 2013; Graus et al., 2014; Scholtes & van den Herik, 2019). In these works, some of the tasks mentioned above are related to the Golden investigation questions, which in part laid the foundation for our work. The same need for narrative construction can be found in the work of Sharma et al. (2013), in which they analyze systems to extract semantic elements (who, what, whom, when, where, how) from news articles. The significance of narrative construction in eDiscovery is further argued in the works of Chapin et al. (2013) and Alsufiani (2016) by discussing research from the field of cognitive psychology. In addition, several works address tasks that are directly applicable to finding evidence for the golden investigation questions. In the work of Tannenbaum et al. (2015), the authors propose dynamic topic detection methods related to the What, When, and How. Smeets et al. (2016) utilize text-mining techniques to find relations between groups of people using community detection, and global content changes over time using topic modeling. In this work, the authors are essentially finding evidence for the Who and When questions. The work of Gerolemou & Scholtes (2019) provides a method for Target-Based Sentiment Analysis that can be utilized to retrieve information related to the Why question. Lastly, two network analysis works focus on identifying anomalous nodes in complex networks (Helling et al., 2018) and detecting anomalous events (Scholtes & Heinrichs, 2018), which can be related to the Who and When questions.

While we have looked at works in eDiscovery and IR and the tasks related to the fields, there have been several advancements over the last few years in the field of NLP. Most of these advancements have been discussed in the Background. However, these advancements have not been discussed in the context of our related work. To the best of our knowledge, no research has been published on the use of word embeddings or neural language models on eDiscovery techniques such as predictive coding. The use of such techniques in email classification or analysis is limited to only a handful of studies (Graus et al., 2014; Hyman et al., 2015; Alkhereyf & Rambow, 2017b; Azarbonyad et al., 2019).

To conclude this section, we discuss works related to the tasks we are tackling in this thesis. First, there is the task of detecting codewords in a text. Prior work on this task is limited to only a handful of studies. Jabbari et al. (2008) propose a distributional model of word use and word meaning, which is derived purely from a body of text. This model is then applied to determine whether certain words are used out of context. The authors generate four different test sets, where for each set a word was chosen, and 500 sentences containing the selected word were selected. These sentences are considered as the regular use of the word.


The authors then substitute a second, unrelated word (with the same part-of-speech tag) in 500 other sentences to artificially create examples of word obfuscations. In the work of Fong et al. (2008), the authors take a similar approach by randomly sampling sentences from the Enron dataset and replacing the first noun in each sentence with the noun with the next-highest frequency on the BNC noun frequency list. Two other recent works (Magu et al., 2017; Magu & Luo, 2018) focus on the detection of euphemistic hate speech on social media platforms. Euphemistic hate speech is separate from other forms of implicit hate speech because, in reality, such messages are often direct poisonous attacks as opposed to veiled or context-dependent attacks. They are implicit because they use smart word substitutions in language to prevent detection. The authors leverage word embeddings and network analysis to identify euphemisms. As far as we know, no work has been done on detecting codewords using large pre-trained models like BERT.

In terms of Deep Anomaly Detection works, most works are in the computer vision domain and show promising, state-of-the-art results for image data (Perera & Patel, 2019; Golan & El-Yaniv, 2018; Hendrycks et al., 2018; Ruff et al., 2018; Deecke et al., 2018; Schlegl et al., 2017). Few deep approaches explore sequential data and those that exist focus on anomaly detection on time series data using LSTM networks (Malhotra et al., 2015, 2016; Bontemps et al., 2016). To the best of our knowledge, there exists only one work on representation learning for anomaly detection on text data. In work by Ruff et al. (2019), the authors build upon word embedding models to learn multiple sentence representations that capture multiple semantic contexts via the self-attention mechanism. Notably, the authors have indicated that they experimented with the BERT language model. They did not observe improvements over word embeddings on the considered datasets that would justify the added computational cost. Our work is different from theirs in that we perform additional training procedures to enforce the BERT model to create representations that are useful for anomaly detection.


Chapter 4

Codeword Detection

In this chapter, we experiment with various methods for detecting codewords in textual communication. First, we briefly examine the process of creating two synthetic datasets on which our models can be trained and evaluated; the purpose of this section is to introduce the reader to the problem at hand. This is followed by the experimental setup, in which we first describe our methodology for creating the datasets. Furthermore, the experimental setup contains a description and motivation of the different machine learning techniques we employ for this research and our evaluation metrics. This chapter is concluded by discussing and analyzing the results of our research questions.

4.1 Method

As discussed in our related work section, work on codeword detection is limited to only a handful of works. In these works, there are two methods for constructing a training dataset. First, there is the approach of scraping social media platforms for examples of codewords (Magu et al., 2017; Magu & Luo, 2018). A prominent example of significant usage of codewords can be found in tweets from late 2016. At that time, several users from social media platforms (especially 4chan) started a movement called 'Operation Google', a direct retaliation to Google for announcing the development of automated tools for detecting toxic content. Mainly, the idea of 'Operation Google' was to create code words for communities within the context of hate speech so that such automated systems would not detect them. The movement branched out to several social media platforms, particularly Twitter (Magu et al., 2017). Some examples of these code words can be found in table 4.1. However, when we tried to mimic the methodology of Magu & Luo (2018), we were unable to retrieve a significant amount of these tweets via the Twitter API. This inability is most likely because most of these comments have been removed from the platform over the years. Therefore, as this method is not usable for our research, we decided to focus on the second method, namely creating a synthetic dataset.

In our work, we follow a similar methodology as Fong et al. (2008). However, we make the necessary adjustments to generate significantly more training data than in their work. The complete dataset creation process is described in section 4.2.1. As a starting point, we randomly sample the content of emails from the ENRON dataset. Here, we limit our attention to the substitution of nouns, since they carry the greater part of the content of sentences. Besides, in a real scenario, a large portion of code words will be nouns, because code words often encrypt certain concepts or entities. The result is a sentence in which one word falls out of the context of the sentence. Consider the example email: I will be out of the office on friday. Here, we take the noun office and replace it with a different noun, for example, rock. The result is I will be out of the rock on friday, a sentence in which the word rock is an outlier with respect to the rest of the words. However, it is not difficult to imagine a scenario in which colleagues have a nickname or code word for their work office, such as "the rock". These are precisely the kind of examples our models are supposed to identify.

Codeword | Actual Word
Google | Black
Skype | Jew
Bing | Chinese
Skittle | Muslim
Butterfly | Gay
Fishbucket | Lesbian
Durden | Transsexual
Car Salesman | Liberal
A Leppo | Conservative
Pepe | Alt-Right

Table 4.1: Examples of codewords from Operation Google

We use the methodology described above to generate a dataset with which we can train different machine learning methods. Naturally, we can also generate a test set to test the performance of the employed models. To investigate to what extent the trained models can be used in practice, we also create a second synthetic dataset whose purpose is to test the approaches in a more realistic scenario. In this scenario, users discuss three types of drugs, namely cocaine, marijuana, and heroin. As these users try to hide the fact that they are talking about an illegal drug, they use code words to describe the drugs. We believe that this scenario is more realistic than the first synthetic dataset because, in this scenario, users specifically avoid certain terminology in their communication. To create realistic code words, we use a list of drug code words provided by the U.S. Drug Enforcement Administration (DEA). Consider the following text message: I'm about to buy some cocaine for our party tonight; see you there. We replace the instance of the word cocaine with snow, one of the code words for cocaine, resulting in the following sentence: I'm about to buy some snow for our party tonight; see you there. Ideally, our trained models will identify this sentence as an example of code word usage, since the word snow does not belong in this context. While these Reddit comments may not be a perfect reflection of an eDiscovery case, the dataset sheds some light on the problems the model has when encountering these domain shifts. As we will discuss in chapter 6, corporate email communication often contains company-specific or domain-specific jargon. Thus, the idea behind this dataset is to create a scenario in which users use code words to describe certain concepts, potentially using domain-specific language in the texts.

4.2 Experimental Setup

In this section, we discuss how we design the experiments to answer the following research questions: RQ1) What is the performance of the selected models on our developed codeword detection dataset? RQ2) How well does each method generalize to our synthetic test scenario?

4.2.1 Synthetic Datasets

ENRON Codeword dataset

The ENRON dataset was made public as a result of the prosecution of ENRON personnel. It contains about half a million emails to and from ENRON employees over three and a half years. The authors of the emails never expected them to be made public, so it is an excellent example of informal writing.



The dataset contains emails from a large number of authors, from many backgrounds. As such, it is a good surrogate for the type of messages that could be intercepted in a fraud investigation.

We randomly sample body messages of emails from the dataset and evaluate each sentence individually. We only consider sentences containing between five and twenty words and apply part-of-speech tagging to verify whether the sentence contains a noun. If the sentence is of the desired length and contains a noun, it is added to our pool of candidate sentences. We keep sampling until the candidate pool consists of 60,000 sentences. For half of the candidate sentences, we replace the first noun with a noun taken from the BNC noun list3. To ensure that the models do not overfit on this limited list of nouns, we split the BNC noun list into train, validation, and test parts and sample a substitute word from the appropriate sub-list. We complete the train, validation, and test set with the other half of the candidate sentences, in which no words have been replaced. These sentences serve as negative samples. The training and validation sets consist of 48,000 and 6,000 samples, respectively, with equal samples from each class. The test set also consists of 6,000 samples, but here we introduce a class imbalance to make the setting more like a typical eDiscovery dataset. This set contains 400 positive samples, which means that only 5% of the samples contain a code word.
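The sampling and substitution procedure can be sketched as follows. This is a hypothetical reconstruction: the thesis does not prescribe a tagger, so spaCy is an assumed choice, and the BNC noun list is assumed to be loaded elsewhere.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed part-of-speech tagger

def make_sample(sentence, bnc_nouns, positive):
    """Return (sentence, label) or None if the sentence does not qualify."""
    doc = nlp(sentence)
    words = [t.text for t in doc]
    if not 5 <= len([t for t in doc if not t.is_punct]) <= 20:
        return None                                   # outside the 5-20 word range
    nouns = [t for t in doc if t.pos_ == "NOUN"]
    if not nouns:
        return None                                   # sentence must contain a noun
    if not positive:
        return sentence, 0                            # negative sample: left untouched
    words[nouns[0].i] = random.choice(bnc_nouns)      # replace the first noun with a substitute
    return " ".join(words), 1                         # positive sample: contains a "code word"

print(make_sample("I will be out of the office on friday", ["rock", "garden"], positive=True))
```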

Reddit Drugs Dataset

To create the Reddit Drugs Dataset, we extract user-submitted Reddit comments from the subreddit /r/worldnews. The PushShift API4 is used to gather comments posted before April 2020.

As with the selection of candidate sentences for the ENRON Codeword dataset, we select candidate sentences of between five and twenty words. To make sure a sentence is written in English, we use the spacy language detector5 to detect the language of the comment. The first 600 sentences that meet these criteria are added to our dataset and act as our negative samples.

The same process is then repeated, only this time we use PushShift's search function to find comments that mention one of three pre-selected drugs: cocaine, marijuana, and heroin. The same pre-selection is applied to each sentence of each comment, and if a sentence mentions one of the drugs, it is added to the candidate list. Each candidate sentence is then manipulated by replacing every mention of the three drugs with a code word. The code words chosen for each drug are shown in Table 4.2, and a minimal sketch of this substitution step follows the table. Each modified sentence is added to the dataset as a positive sample. The result is a balanced dataset of 1,200 samples, with 600 samples of each class.

Meaning      Codeword in text
Cocaine      Line
Marijuana    Bush
Heroin       Pure

Table 4.2: The selected codewords for the Reddit Drugs dataset.
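The substitution step itself can be as simple as a case-insensitive whole-word replacement. The sketch below applies the mapping from Table 4.2 with regular expressions; it is an illustrative reconstruction, not necessarily the exact code used.

    import re

    # Mapping taken from Table 4.2
    CODEWORDS = {"cocaine": "line", "marijuana": "bush", "heroin": "pure"}

    def inject_codewords(sentence):
        """Replace every drug mention with its code word (case-insensitive, whole words only)."""
        for drug, codeword in CODEWORDS.items():
            sentence = re.sub(rf"\b{drug}\b", codeword, sentence, flags=re.IGNORECASE)
        return sentence

    print(inject_codewords("Cocaine and heroin are far more addictive than marijuana."))
    # -> "line and pure are far more addictive than bush."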

4.2.2 Models

In this study, we benchmark four commonly used NLP models on the codeword detection task. The four selected methods are Bag of Words (BoW), TF-IDF, a BiLSTM initialized with GloVe embeddings, and BERT.

3 https://www.wordfrequency.info/compare bnc.asp
4 https://github.com/pushshift/api
5


With these selected models, we transition from shallow lexical approaches to dense contextual text representations. BERT was chosen over other contextual methods such as ELMo (Peters et al., 2018) because it is considered state-of-the-art on most downstream NLP tasks. In addition, we prefer BERT over ELMo because its use of subwords allows it to better handle out-of-vocabulary words. We benchmark the performance of each model to determine the importance of contextual representations for the codeword detection task.

4.2.3 Pre-processing Steps

The following pre-processing steps are applied to the text data before creating the BoW and TF-IDF representations. First, we apply the ekphrasis text pre-processing pipeline6, which is used to normalize URLs, emails, numbers, dates, and times.

This pipeline also performs word segmentation on hashtags, unpacks contractions, and applies spell correction for elongated words. Second, we remove symbols and stop words from the text. Finally, we apply lemmatization so that different conjugations of the same word receive the same representation.

For the Bi-LSTM + GloVe implementation, we apply the same ekphrasis text pre-processing pipeline, as described above. No further pre-processing steps are applied.

For BERT, no pre-processing steps are applied.

4.2.4 Parameter Settings

Both the BoW and TF-IDF methods are used in combination with a Logistic Regression classifier. We implement the Logistic Regression classifier using the sklearn Python package7. Both methods consider unigrams, bigrams, and trigrams, and ignore terms that occur fewer than three times.
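As an illustration, the two count-based baselines can be assembled in scikit-learn as sketched below. The variable names train_texts, train_labels, and test_texts are placeholders for the pre-processed data, and min_df=3 is our reading of the minimum-frequency threshold mentioned above.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # TF-IDF variant; swap TfidfVectorizer for CountVectorizer to obtain the BoW baseline.
    tfidf_clf = Pipeline([
        ("vectorizer", TfidfVectorizer(ngram_range=(1, 3), min_df=3)),  # uni-, bi- and trigrams; drop rare terms
        ("classifier", LogisticRegression(max_iter=1000)),
    ])

    tfidf_clf.fit(train_texts, train_labels)      # train_texts / train_labels: placeholder training data
    predictions = tfidf_clf.predict(test_texts)   # test_texts: placeholder test sentences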

For the deep learning models, the PyTorch framework (Paszke et al., 2019) is used. The BiLSTM model uses an embedding layer that is initialized with pre-trained GloVe embeddings. The model is trained with a binary cross-entropy loss until convergence using the Adam optimizer (Kingma & Ba, 2014). The BERT implementation uses the bert-base-uncased configuration from the Hugging Face transformers package8. This configuration was preferred over larger variants because of its size. All encoder layers are updated during training. BERT is fine-tuned using the standard fine-tuning settings, namely training for ten epochs and updating the model parameters using the Adam optimizer with a learning rate of 2e−5 and ε = 1e−8.
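A condensed sketch of the BERT fine-tuning loop with the Hugging Face transformers package is given below. The DataLoader train_loader is a placeholder, and details such as learning-rate scheduling and evaluation are omitted; the settings mirror those stated above (ten epochs, learning rate 2e−5, ε = 1e−8), so treat this as illustrative rather than the exact training script.

    import torch
    from torch.optim import Adam
    from transformers import BertForSequenceClassification, BertTokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)
    optimizer = Adam(model.parameters(), lr=2e-5, eps=1e-8)

    model.train()
    for epoch in range(10):                                   # ten epochs, as described above
        for sentences, labels in train_loader:                # train_loader: placeholder DataLoader of (texts, labels)
            encoded = tokenizer(list(sentences), padding=True, truncation=True, return_tensors="pt")
            encoded = {k: v.to(device) for k, v in encoded.items()}
            labels = labels.to(device)
            outputs = model(**encoded, labels=labels)         # loss is returned when labels are supplied
            loss = outputs[0]
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()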

4.2.5 Evaluation Metrics

For this task, the metrics used are the standard Accuracy, Precision, Recall, and F1 scores. In addition, we assess the results on the unbalanced dataset using macro-averages. Macro-averaging computes the metrics independently for each class and then takes the average, thus treating all classes equally. We select macro-averaging over micro-averaging, in which the contributions of all classes are aggregated to compute the average metric, because we are primarily interested in the performance of the models on the code word class. Lastly, with this interest in mind, we evaluate the models on the class-specific precision and recall of the code word class (C1 Precision & C1 Recall). Our primary interest is to achieve a high recall on this class, because we want to retrieve all samples that may contain a code word. Furthermore, it is also essential to achieve a reasonable precision on the positive class, as the user will otherwise have to navigate a sea of false positives.

6 https://github.com/cbaziotis/ekphrasis
7 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
8 https://github.com/huggingface/transformers
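These metrics can be computed with scikit-learn as sketched below; y_true and y_pred are placeholder arrays of gold and predicted labels, with label 1 denoting the code word class (C1).

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    # y_true, y_pred: placeholder gold and predicted labels (1 = sentence contains a code word)
    accuracy = accuracy_score(y_true, y_pred)
    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    class_p, class_r, class_f1, _ = precision_recall_fscore_support(y_true, y_pred, average=None, labels=[0, 1])

    print(f"Accuracy: {accuracy:.2f}  Macro P/R/F1: {macro_p:.2f}/{macro_r:.2f}/{macro_f1:.2f}")
    print(f"C1 Precision: {class_p[1]:.2f}  C1 Recall: {class_r[1]:.2f}  C1 F1: {class_f1[1]:.2f}")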


4.3

Results & Analysis

RQ1) What is the performance of the selected models on our developed codeword detection dataset?

In RQ1, we examine the performance of the chosen models on both the ENRON Codeword validation and test subsets. Table 4.3 shows the performance of the selected models on the validation set. As discussed, this set is generated using the same methodology as the training set, only drawing the substitute nouns from a different sub-list. We observe that both count-based models perform significantly better than a naive random classifier. This is a somewhat surprising result, because one would not expect these models to capture the context needed for this task. Apparently, for some of the samples, the count-based features provide sufficient information to recognize the use of a code word.

We see the first significant performance increase with the Bi-LSTM model. This increase can be attributed to the use of pre-trained word embeddings. As discussed, word embeddings are learned by predicting words given their surrounding context, a task closely related to ours, since we want to find words that do not belong in a given context. Because word embeddings capture information about which words tend to occur together, they help identify words that do not fit the context of a sentence.

Model      Accuracy   Precision   Recall   F1-Score   C1 Precision   C1 Recall
Random     0.50       0.50        0.50     0.50       0.50           0.50
BoW        0.63       0.64        0.63     0.62       0.67           0.50
TF-IDF     0.63       0.63        0.62     0.62       0.52           0.58
Bi-LSTM    0.80       0.80        0.80     0.80       0.83           0.76
BERT       0.90       0.90        0.90     0.90       0.95           0.84

Table 4.3: Results on the ENRON Codeword Validation Set.

Model      Macro Precision   Macro Recall   C1 Precision   C1 Recall   C1 F1-Score
Random     0.50              0.50           0.05           0.50        0.09
BoW        0.54              0.64           0.12           0.49        0.19
TF-IDF     0.53              0.62           0.09           0.50        0.15
Bi-LSTM    0.60              0.82           0.22           0.79        0.34
BERT       0.79              0.90           0.59           0.84        0.69

Table 4.4: Results on the ENRON Codeword Test Set.

We observe a further significant performance increase from the pre-trained BERT model. As discussed in the background section, BERT performs better on most downstream NLP tasks than previous state-of-the-art models, and for our task this is also the case. This increase can be attributed to two aspects of the model. First, deep neural language models are larger than the other models and contain more parameters, which means more stored information. Second, because of the way BERT is trained, the model captures the context of a sentence very well. Because this context information is vital to our task, the BERT model is best suited to it, as demonstrated by its performance on the validation set.

We observe similar performance on the imbalanced test set, as shown in Table 4.4. The models follow a similar trajectory in terms of macro precision and recall. However, both count-based approaches achieve a precision of around 0.1 on the code word class and random performance in terms of recall on samples from that class. This drop is to be expected, since we now work with an unbalanced dataset. However, within eDiscovery we virtually always work with unbalanced datasets, which makes it extremely important that a method also works well on unbalanced data.


Model      Accuracy   Precision   Recall   F1-Score   C1 Precision   C1 Recall
Random     0.50       0.50        0.50     0.50       0.50           0.50
BoW        0.43       0.43        0.43     0.43       0.43           0.46
TF-IDF     0.41       0.41        0.41     0.41       0.41           0.43
Bi-LSTM    0.50       0.50        0.53     0.51       0.50           0.53
BERT       0.72       0.74        0.74     0.71       0.82           0.56

Table 4.5: Results on the Reddit Drugs Dataset.

We prefer a high recall on the code word class, since no samples containing code words should be missed during a review. Of course, it is also essential that a reasonable precision is achieved; otherwise, the user will mainly see false positives. We observe a significant performance increase from the BERT model on this unbalanced dataset, especially in terms of recall on the code word class. From these results, we conclude that the BERT model significantly outperforms the other models on all selected metrics.

RQ2) How well does each method generalize to our synthetic test scenario?

In RQ2, we examine how well each model generalizes to textual communication from a different domain. Table 4.5 shows the performance of the selected models on the Reddit Drugs dataset. We observe that both count-based approaches achieve worse than random performance on this dataset, and that the Bi-LSTM performs about the same as random. The BERT model shows a significant decrease in performance compared to the ENRON Codeword datasets; however, it still vastly outperforms a naive classifier. Another observation is that the model achieves high precision on the code word class at the expense of recall. This is a somewhat undesirable result because, as discussed earlier, we are more interested in achieving a high recall than a high precision.

As a qualitative analysis, we take the BERT logits output and analyze the ten misclassified samples with the highest logit output from each class. These samples are shown in Table 4.6. We note that some of the false positives are caused by internet slang. For example, the second example contains the word 'thread', a word often used on discussion boards to describe a series of consecutive messages. We theorize that it is unlikely that our model has seen this word in this context before, and that it therefore treats the word as an anomaly in the sentence. Interestingly, the model has classified a sentence containing the verb 'to pirate' as code word usage, which is, to some extent, correct, as 'to pirate' is internet slang for downloading illegally. However, it is not clear for all false positives why the model misclassified them. In particular, the sentence 'really good explanation, thank you for writing that' is clearly a normal sentence that the model should not misclassify.

In terms of the false negatives, we observe that the model mostly misclassifies sentences with specific terms and language related to drug usage. Since the model is pre-trained on mainly formal language and fine-tuned on the ENRON dataset, it is improbable that it has seen texts with such a specific word distribution as these comments about drugs. We observe that most false negatives are due to this domain shift, and further research is required to make our method less sensitive to such shifts. Still, the results in Table 4.5 show that the BERT model can deal with the domain shift to some extent, which demonstrates the robustness of the method.
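A minimal sketch of how such an error analysis can be performed is shown below: for each true class, the misclassified samples are ranked by the model's confidence in the wrong prediction. The tensors logits and labels and the list sentences are placeholders for the test-set outputs; the exact selection procedure used in this work may differ.

    import torch

    # logits: [N, 2] model outputs, labels: [N] gold labels, sentences: list of N test sentences (placeholders)
    probs = torch.softmax(logits, dim=-1)
    preds = probs.argmax(dim=-1)
    for true_class in (0, 1):                                  # 0 = normal, 1 = code word
        wrong_class = 1 - true_class
        misclassified = (preds == wrong_class) & (labels == true_class)
        confidence = probs[:, wrong_class].masked_fill(~misclassified, -1.0)
        top = confidence.topk(min(10, int(misclassified.sum()))).indices
        print(f"Top errors for true class {true_class}:")
        for i in top.tolist():
            print("  ", sentences[i])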


False Positives (normal sentences classified as containing a code word):
1. that's the hatred i am describing.
2. came here to post this very thread.
3. it's a spectrum... cant see any possible fix.
4. i have a bridge you might be interested in buying.
5. wait luckily i already pirate everything.
6. yes because both viruses are exactly the same right
7. that $25k/year worker is on $85k a year in australia.
8. really good explanation, thank you for writing that.
9. stripping out some strokes out of convenience is stripping out the meaning of that word.
10. wow you killed the whole response

False Negatives (code word sentences classified as normal):
1. the main problem with that, is that in colombia people learned that line sells too well.
2. lack of inheritance is definitely the core problem of a mentally ill person shooting pure in the gutter.
3. so you are in favor of decriminalizing fentanyl, pure, methamphetamine?
4. and a huge pure/opioid epidemic in the states..
5. oh, not yet: two kids died the other day, from pure-tainted vapes...
6. my sister lost her husband and daughter to opioids and pure.
7. the guy is clearly talking about people buying line that comes to the us via cartels.
8. please quote where i said sugar=pure, or that sugar addiction can't be shaken.
9. alcohol kills a lot more people than pure.
10. but you get upset when i blame white people for school shootings and pure overdoses.

Table 4.6: Ten misclassified samples with the highest logits output from each class.

4.4

Discussion

To conclude this chapter, we briefly cover the main findings of our experiments.

RQ1) What is the performance of the selected models on our developed codeword detection dataset?

We have seen that BERT outperforms both the count-based approaches and the Bi-LSTM on both the balanced and the imbalanced dataset. We theorize this is due to both the sheer number of parameters the model contains and the rich context representation that is so vital for this task. Especially on the unbalanced dataset, we see a significant drop in precision and recall on the code word class for all models except BERT. From this, we conclude that using BERT can be very beneficial compared to traditional eDiscovery techniques such as TF-IDF or word embedding approaches.

RQ2) How well does each method generalize to our synthetic test scenario?

We observed that for all models except BERT, performance drops significantly when tested on the Reddit Drugs dataset. While a performance decrease is to be expected when models are applied to text from a different domain, both count-based models perform worse than random on this new domain. The Bi-LSTM fares little better, performing on par with a random classifier and showing a similar inability to adjust to the new text domain. We theorize that the performance decrease is due to two reasons. First, a different domain introduces words the model has never seen before, as well as different ways in which words interact with each other. This is why pre-training is so essential for most NLP tasks: the model does not have to learn everything from the ground up. Second, the count-based models and the static word embeddings assign the same representation to every occurrence of a word. In the case of BoW and TF-IDF, words are counted without taking their context into account; for word embeddings, each word has a single embedding vector, so different senses of a word share the same representation.


As a consequence, when a word is used differently in one domain than in another, or carries a different meaning, the model will tend to classify the sentence as one containing a code word. Because the representations generated by BERT are more contextual, this is likely less of an issue for the BERT model.

To conclude this chapter, we observed that BERT outperforms the other models on the codeword detection task, indicating that deep contextual representations are beneficial for detecting contextual outliers in sentences. Secondly, we observed that for a task in which context is vital, performance decreases significantly under a domain shift.
