
Master Thesis

Using a Multi-gate Mixture-of-Experts model

for Multitask Text Classification

by

Ruben van Heusden

11022000

August 21, 2020

48 EC November 2019 - August 2020

Supervisor:

Yangjun Zhang

Assessor:

Dr M. Marx

Faculty of Science


I would like to offer my special thanks to my supervisor Yangjun Zhang, who was instrumental in helping me determine the direction of the research and who was always willing to help when I ran into problems or wanted to brainstorm about ideas. I would also like to thank the Gemeente Amsterdam, and in particular my supervisor Bas de Jong at the Informatie Voorziening department, for his help with the part of the research conducted at the Gemeente. He was always available for discussions and a great help in aligning the research direction of the thesis with the needs of the Gemeente.


With the increased use of digital assistants and customer service chatbots, it has become increasingly important to accurately detect and classify certain aspects of the types of text encountered in those applications, such as the intent or emotion expressed in a message. Many previous approaches to this problem focus either on learning each task separately or on learning multiple tasks at the same time, respectively called single task and multitask learning. Although multitask learning approaches are successful for several tasks, knowing when and how to combine tasks remains a difficult problem: sometimes tasks that turn out to be incompatible are combined, which can harm the overall performance of the model. The approach presented in this thesis uses a specific multitask learning method called the Multi-gate Mixture-of-Experts model. This model extends a classical multitask learning framework of shared and task-specific layers by using gating networks to control the sharing of information between the tasks. Unlike previous approaches, this model allows for explicit sharing between tasks and offers a way of separating the tasks when needed. The model is tested on a subset of the Enron Email Dataset and on the DailyDialog dataset, each containing classification tasks for different aspects of the pieces of text. The results indicate that the Multi-gate Mixture-of-Experts model with an appropriate expert balancing strategy offers increased performance over its non-gated counterpart, indicating that the addition of the gating networks is beneficial for the particular type of datasets used in this research. The addition of contextual embeddings extracted from BERT offers an even greater increase in performance on all tasks tested in this research in comparison to the use of baseline GloVe embeddings.

1 Introduction
1.1 Text Classification
1.2 Multitask Learning
1.3 The Multi-gate Mixture-of-Experts model
1.4 Research Questions
1.5 Scientific Contribution

2 Background
2.1 Neural Networks for Natural Language Processing
2.1.1 Word Embeddings
2.1.2 Recurrent Neural Networks
2.1.3 Convolutional Neural Networks for Natural Language Processing
2.1.4 The Attention Mechanism
2.1.5 Transformers
2.1.6 Bidirectional Encoder Representations from Transformers (BERT)
2.1.6.1 Extracting word embeddings from BERT
2.2 Multitask Learning
2.2.1 Inter-Task vs. Intra-Task Multitask Learning
2.2.2 Parameter Sharing
2.2.3 Mixture-of-Experts

3 Related Work
3.1 Multitask Learning
3.1.1 Cross-Stitch Networks
3.2 Multitask Learning for Text Classification
3.3 Mixture-of-Experts
3.3.1 Gating Network Polarization

4 Method
4.1 Model Architecture
4.2 Datasets
4.2.1 DailyDialog Dataset
4.2.2 UC Berkeley Enron Dataset
4.2.2.1 Dataset Examples
4.2.3 Statistics of Datasets
4.4.2 LSTM and Bidirectional LSTM
4.4.3 CNN
4.4.4 BERT
4.4.5 Transformer
4.4.6 Multitask CNN
4.4.7 Multitask LSTM
4.5 Training Details
4.5.1 Class weighting
4.5.2 Stratified sampling
4.5.3 Model evaluation
4.5.4 Fixed size of training samples
4.5.5 Word Embeddings
4.6 Experimental Setup
4.6.1 Model Parameters
4.6.2 Comparing MMoE performance to baselines
4.6.3 Investigating the training behaviour of the gating networks
4.6.4 Using word embeddings extracted from the BERT model

5 Results
5.1 RQ 1: Vanilla Multi-gate Mixture-of-Experts vs. Baselines
5.2 RQ 2: Results of applying various gating network training schemes
5.3 RQ 3: Multi-gate Mixture-of-Experts Model with BERT embeddings

6 Conclusion
6.1 Vanilla Multi-gate Mixture-of-Experts Model
6.2 Different expert balancing strategies
6.3 BERT embeddings

7 Discussion & Future Work
7.1 Discussion


BERT Bidirectional Encoder Representations from Transformers.

CNN Convolutional Neural Network.

GRU Gated Recurrent Unit.

LSTM Long Short Term Memory.

MMoE Multi-gate Mixture-of-Experts.

MT-DNN Multi-Task Deep Neural Network.

NLP Natural Language Processing.

RNN Recurrent Neural Network.


INTRODUCTION

In this research a multitask learning method called the Multi-gate Mixture-of-Experts model [1] is evaluated on several text classification tasks involving conversational text, such as written dialogue and email messages. The performance of this model is compared to that of several existing single task and multitask models.

1.1 Text Classification

Text classification has long been an integral part of research in Natural Language Processing, where it has been applied to problems such as spam detection in email and sentiment classification of Tweets. Although methods such as TF-IDF and Bag-of-Words approaches have historically proven successful, the success of deep learning has led to a wide variety of methods and techniques from that field being developed and deployed for text classification.

With the rise of digital communication and the use of personal assistants such as Alexa, Google Assistant and Siri, the ability to automatically extract key aspects such as intent or emotion from (relatively) short pieces of text has become increasingly important. This is not only relevant for digital assistants, but also, for example, for email classification in large corporations. These companies receive large numbers of customer emails every day, and the ability to accurately assess the needs of the customers from those emails could be a major improvement to their customer service. The nature of these pieces of text makes this a challenging problem: the emails that arrive at customer service vary widely in length, tone and style, and words are quite frequently misspelled, which adds to the challenge of classification. For both emails and written dialogue, the pieces of text are often of limited length, meaning there is limited context that can be used in classification.

1.2 Multitask Learning

At the same time, there are multiple signals in text that can assist in the prediction of different aspects of the text, such as the emotion contained in a message or the general topic. For this reason, it can be very beneficial to try predicting multiple aspects of text simultaneously, with the possibility of increasing the performance on the individual tasks by using information acquired while learning related tasks.

However, this approach faces its own set of problems, as not all tasks work equally well when combined for multitask learning, and combining them can in practice actually harm the overall performance of the model. Another problem is that there is generally no straightforward way of determining a priori which sets of tasks will work well when learned simultaneously, and which sets of tasks will not [1].

Previous approaches that have attempted to tackle the problem of multitask learning for text classification have made use of methods such as LSTM, GRU and BERT [2, 3, 4], where several tasks were learned simultaneously through some form of parameter sharing, such as the Multi-Task Deep Neural Network (MT-DNN) [5]. Although these approaches succeed in outperforming their single task counterparts on several datasets, they are limited in the sense that the models are not able to explicitly control how the information from the tasks is combined when making a prediction. Therefore, in the case of learning uncorrelated or even conflicting tasks, the performance of the model can be hampered, as the model has limited ability to separate the tasks being learned.

1.3 The Multi-gate Mixture-of-Experts model

The Multi-gate Mixture-of-Experts (MMoE) model [1] alleviates this problem by extending the Mixture-of-Experts architecture [6], introducing a separate gating function for each task that learns how the contribution of each expert should be weighted for that specific task. In theory, this allows the model to dynamically combine the information learned from other tasks on a sample-by-sample basis, with the ability to assign different experts to different tasks depending on the compatibility between the tasks being learned.

1.4 Research Questions

In order to investigate the performance of the Multi-gate Mixture-of-Experts model for text classification and to further investigate the behaviour of its components, several research questions are formulated. The main research questions of this research, each with a short description, are given below.

Research Question 1: Can the usage of a Multi-gate Mixture-of-Experts model lead to improved performance on conversational text classification tasks compared to other single task and multitask learning approaches?

The first step in answering this research question is evaluating the ‘vanilla’ model on the selected datasets, observing the strengths and weaknesses of the model and comparing it to the baseline methods. Based on the results of these initial experiments, several adjustments to the model can be made to further improve its performance.

Research Question 2: How does the training scheme used during the training of a Multi-gate Mixture-of-Experts model affect the performance of the model?

It is known in the literature that the Mixture-of-Experts model (and by extension the Multi-gate Mixture-of-Experts model) can suffer from problems regarding the gating functions during training. Although several techniques exist in the literature to tackle these problems for a Mixture-of-Experts model using a single gating function, it is not known whether these schemes work in the more complex case of multiple gating functions. To answer this research question, multiple techniques proposed in the literature are adapted for the Multi-gate model and evaluated both quantitatively and qualitatively on the datasets used in this research.

Research Question 3: What is the influence of using contextual word embeddings extracted from BERT on the Multi-gate Mixture-of-Experts model?

As the gating functions in the model use the representation of the input sample directly in the calculation of the weight assignments for the experts, the type of representation can be of importance to the model. Using a more accurate and fine-grained representation might result in more accurate gating assignments, leading to increased performance. For this research question, embeddings extracted from the BERT model are used, as these are contextual embeddings that can be more fine-grained than, for example, GloVe embeddings. To answer this research question, the best settings of the model from research questions 1 & 2 are used in combination with embeddings extracted from BERT, and the scores are both qualitatively and quantitatively evaluated and compared against the baseline models and the Multi-gate Mixture-of-Experts model that uses GloVe embeddings.

1.5 Scientific Contribution

This research focuses on evaluating several aspects of the Multi-gate Mixture-of-Experts model, particularly with regard to the gating functions. The main contributions of this research to the field of multitask learning for text classification are listed below.

• The usage of a Multi-gate Mixture-of-Experts model for multitask text classification. In this research a Multi-gate Mixture-of-Experts model is implemented and evaluated on two multitask text classification datasets. To the author’s knowledge, no papers employing the Multi-gate Mixture-of-Experts model for text classification exist. The performance of the Multi-gate Mixture-of-Experts model is compared to that of non-gated multitask methods and single task classification methods, and the advantages and disadvantages of using the Multi-gate Mixture-of-Experts model for text classification are discussed.

• The evaluation of several techniques to reduce gating network polarization in a Multi-gate Mixture-of-Experts model. In this research, various methods used to combat the problem of gating network polarization in single-gate Mixture-of-Experts models are evaluated on a Multi-gate Mixture-of-Experts model and their performances are compared. Although a multitude of papers regarding training schemes for Mixture-of-Experts models exist, papers investigating problems such as gating network polarization for Multi-gate models do not yet exist, to the knowledge of the author.

• The effect of using contextual embeddings extracted from BERT on the performance of the model. One part of this research is combining the Multi-gate Mixture-of-Experts model with contextual embeddings extracted from BERT. Besides reporting the accuracy achieved by the model using these embeddings, the behaviour of the gating networks when using these embeddings is studied, in order to evaluate whether the addition of contextual word embeddings improves the ability of the model to dynamically adapt the expert distribution on a sample-by-sample basis.

The research presented in this thesis is structured in the following manner. Section 2 presents the background knowledge needed to properly understand the research described in this thesis: Section 2.1 introduces core concepts from Natural Language Processing and text classification, and Section 2.2 covers fundamental concepts from the field of multitask learning. Section 3 discusses several related works in the literature that are important for understanding this research and its positioning in the current field. Section 4 describes the method used to answer the research questions posed above. Section 5 presents the main findings of this research, with Section 5.1 discussing the results of applying the baseline models to the datasets, Section 5.2 discussing the various training mechanisms that were evaluated, and Section 5.3 discussing the results obtained by running the Multi-gate Mixture-of-Experts model with word embeddings extracted from BERT. Section 6 presents the conclusions that can be drawn from the conducted experiments. In Section 7 the research method and results are discussed and critically assessed, and finally possible directions for future work are presented.

Part of the research was conducted at the Gemeente Amsterdam for the Erfpacht department, for classifying incoming customer emails. For that part of the research the same models and methods as presented in this thesis were implemented and evaluated, with the major difference that these models were trained on a dataset of Dutch text. Because of privacy concerns regarding the possibly sensitive content of the emails of the department, and because results on the Dutch dataset would be difficult to compare to similar models in another language, the results of these experiments have not been included in this document.


BACKGROUND

This section describes the fundamental techniques and concepts used in this research. Section 2.1 discusses several concepts and models from the field of Natural Language Processing, and text classification in particular, in roughly chronological order. Section 2.2 discusses fundamental concepts and techniques relating to multitask learning, and the Mixture-of-Experts model in particular.

2.1 Neural Networks for Natural Language Processing

Before the widespread use of neural networks in text classification, methods such as the Naive Bayes algorithm and the usage of TF-IDF vectors in combination with Logistic Regression or Support Vector Machines were among the models predominantly used for text classification [7]. Although these relatively simple methods have achieved success on several tasks [8], their simplicity can also limit them, as the assumptions these models make might not hold for more complex tasks. An example is the independence assumption that the Naive Bayes method uses to predict the probability of a sequence of words: it assumes that the words are conditionally independent, so that the probability of a sequence of tokens is the product of the probabilities of the individual tokens. Although this assumption has proven usable in practice, it can pose problems for tasks where it is clearly not met and the dependence is of importance to the task. Moreover, the ways in which the aforementioned algorithms represent words, either as word counts in documents or as more complex TF-IDF vectors, arguably fail to capture the rich semantic context of words present in natural language.

Following the renewed research interest in neural networks, many new methods based on neural networks have attempted to overcome the problems stated above and have pushed the state of the art in various tasks in the field of Natural Language Processing (NLP), such as Machine Translation and Text-to-Speech. In the next sections, various models and concepts regarding neural network models for NLP are discussed.


2.1.1 Word Embeddings

One of the long-standing challenges of applying machine learning techniques to Natural Language Processing has been representing words in a machine-readable format while still maintaining their semantic meaning in natural language. Early approaches tackled this problem by representing each word in a corpus by a one-hot vector whose size is determined by the number of unique words in the corpus (or a more complex vector in the case of TF-IDF). Although this approach is easily interpretable, in practice it fails to capture the semantic meaning of words. A solution presented by Mikolov et al. [9] introduced the concept of word embeddings. The main idea behind word embeddings is that a large part of the meaning of a word can be determined by observing the words around it. This insight is used in the construction of word embeddings by masking words in a sentence and training a neural network to fill in these blanks based on the surrounding words. The representations of these surrounding words are captured in the weight matrix of this network, and these vectors are used to represent the words. The word vectors can then be used as input to other Natural Language Processing tasks as a substitute for the one-hot encoding, and have been shown to often lead to increased performance. As the process of training a word embedding model does not require separate labels, it can easily be trained on a wide variety of texts and languages, and several large corpora of pre-trained word embeddings are available in a multitude of languages.
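As a concrete illustration of how pre-trained word embeddings are used, the sketch below loads a GloVe-style text file (one word followed by its vector per line) and compares two words by cosine similarity. The file path and the example words are assumptions made for illustration, not artefacts of this thesis.

```python
import numpy as np

def load_embeddings(path):
    """Load a GloVe-style text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = load_embeddings("glove.6B.100d.txt")  # hypothetical path to a GloVe file
# Semantically related words tend to have a higher cosine similarity.
print(cosine_similarity(embeddings["email"], embeddings["mail"]))
print(cosine_similarity(embeddings["email"], embeddings["banana"]))
```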

2.1.2 Recurrent Neural Networks

One class of neural networks that is particularly well suited for use within the field of NLP is that of recurrent neural networks. These types of networks are different from the traditional feed-forward neural networks in that they allow for the sequential feeding of inputs to the network and maintain a ‘memory’ of previous inputs. Among the most popular models are architectures such as LSTM [10] and GRU [11] models.

The main concept behind models such as the LSTM and GRU networks is the idea of recurrence. Instead of feeding the whole sequence of inputs to the network simultaneously as one fixed-size input, inputs are fed into the network sequentially, where the moment a token is fed in is considered a timestep. The network activations are passed on to the next timestep, to be used for future predictions, keeping a partial representation of previous timesteps in the memory of the network.

The mathematical representation of this recurrence can be stated as follows: for a given input x_t at timestep t, the output of the network is given by the following equation:

y_t = σ_y(W_y h_t + b_y)    (2.1)

where W_y is a weight matrix, b_y is a bias vector, σ_y is a nonlinear activation function and h_t is the hidden state of the network at timestep t.

The recurrent element of this formula is given by the definition of the hidden state h_t of the network:

h_t = σ_h(W_h x_t + U_h h_{t−1})    (2.2)

where W_h and U_h are weight matrices and σ_h is a non-linear activation function. As can be seen from Formula 2.2, the calculation of h_t involves the multiplication of a weight matrix with the previous hidden state h_{t−1}. At the beginning of a new input sequence the first hidden state h_0 is manually initialized.

At each timestep, the network outputs a prediction based on a combination of the current input token and the previous hidden state of the network. For tasks that do not require an output at every timestep, these intermediate outputs can be ignored and the final output of the network can be used, for example for sentence classification or sentiment classification.
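To make Equations 2.1 and 2.2 concrete, the following sketch runs a single forward pass of this recurrence in plain NumPy. The dimensions, the tanh hidden activation and the linear output activation are illustrative assumptions, not the settings used in this research.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_dim, output_dim = 8, 16, 4

# Randomly initialised parameters of the recurrence (Equations 2.1 and 2.2).
W_h = rng.normal(size=(hidden_dim, embedding_dim))
U_h = rng.normal(size=(hidden_dim, hidden_dim))
W_y = rng.normal(size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)

def rnn_forward(inputs):
    """Run the recurrence over a sequence of word vectors and return all outputs."""
    h = np.zeros(hidden_dim)               # h_0 is manually initialised
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_h @ x_t + U_h @ h)   # Equation 2.2
        outputs.append(W_y @ h + b_y)      # Equation 2.1 (linear output activation)
    return outputs

sentence = rng.normal(size=(5, embedding_dim))  # five word embeddings
final_output = rnn_forward(sentence)[-1]        # used for sentence classification
print(final_output.shape)
```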

Figure 2.1: Conceptual representation of a recurrent neural network. At timestep t the input x_t is fed into the network and an output is produced based on x_t and the hidden state at timestep t − 1.

The ability of these models to handle input sentences of arbitrary length means that they are particularly well suited for Natural Language Processing, and they have been applied successfully in a variety of Natural Language tasks such as text classification and machine translation [12, 13].

Although the architecture described above is well suited for variable length sequences like text, several problems exist. One of these is the occurrence of vanishing or exploding gradients. Vanishing gradients occur when gradients are backpropagated through a long sequence of timesteps, with repeated multiplication of the gradients by small numbers causing them to become too small to be represented as non-zero in computer memory. Exploding gradients are likewise caused by many backpropagation steps, where repeated multiplication by factors larger than 1 can lead the gradient to ‘explode’, i.e. become very large, possibly overflowing its storage in computer memory. Both of these problems can severely hamper the learning ability of the model, as the gradient is no longer able to provide a usable signal for the network to adjust its weights.

There are several changes that can be made to the architecture of the standard Recurrent Neural Network (RNN) to alleviate some of the issues mentioned. One of the models that incorporates mechanisms to deal with exploding and vanishing gradients is the LSTM model, which adds additional ‘gates’ to the RNN architecture to control what information is ‘remembered’ and what is ‘forgotten’ [10].

Another limitation of ‘vanilla’ recurrent models like the RNN, and even of LSTM models, is that the input sequence is represented as a fixed-size vector in the network (the last hidden state of the network after it has processed all input tokens). As sentences can vary greatly in length, this poses problems regarding the memory capacity of the network: in order to accommodate arbitrarily long sentences while still fully capturing long-term dependencies between words, the memory has to be substantial. Furthermore, the recurrence of the network means that the calculations have to be carried out sequentially, preventing the use of parallel computation techniques to accelerate the training of the models.
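Since LSTM models are used as baselines later in this thesis (Section 4.4), a minimal single-task LSTM classifier in that spirit is sketched below in PyTorch. All sizes, the padding index and the use of the last hidden state for classification are illustrative assumptions rather than the exact baseline configuration.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Single-task baseline: embed tokens, run an LSTM, classify the last hidden state."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)       # (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(embedded)          # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n.squeeze(0))     # class logits

model = LSTMClassifier(vocab_size=10_000, embedding_dim=100, hidden_dim=128, num_classes=7)
logits = model(torch.randint(1, 10_000, (32, 20)))  # batch of 32 sentences of 20 tokens
print(logits.shape)                                  # torch.Size([32, 7])
```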

2.1.3 Convolutional Neural Networks for Natural Language Processing

Originally developed for Computer Vision tasks, the Convolutional Neural Network (CNN) [14] has proven to also be applicable to Natural Language Processing. A convolutional neural network differs from a standard neural network in that it uses parameter sharing in its convolutional filters. These filters can be seen as small windows that slide over an input image and compute certain activations; during training, they learn to detect certain low-level features or patterns in the input image. In order to use a Convolutional Network for Natural Language Processing, several adjustments to its architecture have to be made, as described in a 2014 paper by Yoon Kim [15]. Instead of images, the network takes rows of word embeddings as input and computes convolutional filters of several sizes over them (3, 4, and 5 in the original paper). These convolutions are taken over one dimension, meaning each convolutional filter spans the entire embedding dimension of the words. In order to deal with varying sequence lengths, max-over-time pooling is used to ensure that the input to the final linear layer is always of a constant size. In max-over-time pooling, only the maximum activation of each convolutional filter is kept; when using for example 100 filters of size 3, this results in 100 output values. When using filters of multiple sizes, these outputs are concatenated before being fed into the final linear layer used for classification.

Figure 2.3: Visual representation of the CNN for sentence classification, picture taken from the original paper by Yoon Kim

There are several advantages to using a convolutional neural network architecture compared to a recurrent architecture like an RNN or LSTM network. One advantage is that the activations of the filters can be calculated in parallel, which allows for faster training. The network also does not suffer from vanishing gradient problems (in most settings of the network). Because of the relatively small number of parameters and the possibility for parallelization, the convolutional neural network for text is an often used baseline for text classification tasks.
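A compact PyTorch sketch of this architecture is given below. The filter sizes (3, 4, 5) follow the description above, while the number of filters, the vocabulary size and the number of classes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Kim-style CNN: 1D convolutions over word embeddings with max-over-time pooling."""

    def __init__(self, vocab_size, embedding_dim=100, num_filters=100,
                 filter_sizes=(3, 4, 5), num_classes=7):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(embedding_dim, num_filters, kernel_size=k) for k in filter_sizes
        )
        self.classifier = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, embedding_dim, seq_len)
        # Max-over-time pooling keeps only the maximum activation of each filter.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))

model = TextCNN(vocab_size=10_000)
print(model(torch.randint(1, 10_000, (32, 20))).shape)  # torch.Size([32, 7])
```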


2.1.4 The Attention Mechanism

In a paper by Bahdanau et al. [16], the attention mechanism is introduced, with the authors using the mechanism in the context of machine translation. In contrast to models like an RNN or LSTM, the attention mechanism does not represent the sentence as a single fixed-size vector; rather, a representation for each word is calculated by ‘attending’ to certain parts of the input sentence, and this representation is used to make predictions [16]. This approach circumvents some of the issues present in recurrent models, such as the fixed-size representation of an input sequence and problems regarding vanishing gradients.

Given a translation task with input sequence x = (x_1, x_2, . . . , x_n) and output sequence y = (y_1, y_2, . . . , y_n), the output for a target word y_t depends on all the previous outputs y, the decoder hidden state s_t, and a weighted combination of the input vectors, instead of a fixed-size representation of the complete input sequence. In the attention mechanism, each token in the input sequence is represented by a context vector, which in turn consists of weighted annotations of the words in the sentence, computed by an encoder network (a bidirectional LSTM in the case of this example).

For a particular word x_i, where i indicates the timestep, the associated context vector c_i is given by Equation 2.3.

c_i = Σ_{j=1}^{T_x} α_{ij} h_j    (2.3)

Here T_x represents the length of the entire input sequence, h represents the annotation vectors and α_{ij} represents the attention weight of token j with respect to token i. As can be seen from Equation 2.3, the context vector for a particular input x_i is given by a sum of the hidden states of the input sequence, weighted by attention weights α.

α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik})    (2.4)

The attention weights are calculated as a softmax distribution over the alignment scores e_{ij} (see Equation 2.4). These alignment scores are usually calculated with a single-layer feed-forward neural network (Equation 2.5) and are intended to represent how well the input token ‘aligns’ with the output token at timestep i.

e_{ij} = a(s_{i−1}, h_j)    (2.5)

Equation 2.5 shows the calculation of the alignment score e_{ij}, indicating how well the token at position j aligns with the output token at timestep i.

The attention mechanism extends the traditional approach by replacing the single context vector from the encoder with a context vector that is calculated specifically for the output word y_t by averaging over all the words in the input sequence, where the weight of each input word is calculated based on how well the representations of the words ‘align’. In the case of tasks like sentiment classification, where the output is a label or set of labels, the representations are calculated between the input tokens themselves, also known as self-attention. For classification, a special classification token is often prepended to the sentence, and its representation as calculated by the network is used for the classification of the sequence.
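The sketch below computes Equations 2.3-2.5 for a toy example in NumPy, using a small feed-forward alignment function a(·). All dimensions and the random inputs are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T_x = 6, 4                      # annotation size and input length

h = rng.normal(size=(T_x, hidden_dim))      # encoder annotations h_1 .. h_Tx
s_prev = rng.normal(size=hidden_dim)        # previous decoder state s_{i-1}

# Single-layer feed-forward alignment model a(s_{i-1}, h_j) (Equation 2.5).
W_a = rng.normal(size=(hidden_dim, 2 * hidden_dim))
v_a = rng.normal(size=hidden_dim)
e = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_j])) for h_j in h])

# Softmax over the alignment scores gives the attention weights (Equation 2.4).
alpha = np.exp(e) / np.exp(e).sum()

# The context vector is the attention-weighted sum of the annotations (Equation 2.3).
c = alpha @ h
print(alpha.round(2), c.shape)
```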

2.1.5 Transformers

The Transformer is an architecture introduced by Vaswani et al. [17]. This model employs the concept of attention explained earlier as the main building block of the neural network architecture. It adopts the attention mechanism of [16], but does not use an LSTM or other recurrent network for the encoder and decoder networks, instead using only feed-forward layers with multi-head attention and positional encodings. The Transformer does not use a single attention layer, but multiple attention modules, each using a different linear projection. According to the authors, this is done so that the model can use information from different representation sub-spaces.

As the Transformer does not use a recurrent architecture, no inherent positional information is present in the hidden states. Because text has an inherent sequential structure, the model should contain some notion of the word order in the sentence. The model therefore adds positional information in the form of positional encodings. These encodings are calculated by taking the index of a word in the sentence and encoding it using sine and cosine functions of different frequencies, representing the relative position of the word. When the model is run, these positional encodings are added to the input embeddings.
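A small sketch of such sinusoidal positional encodings is given below; it follows the commonly used sine/cosine formulation of Vaswani et al., with the sequence length and model dimension chosen arbitrarily for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                    # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

pe = positional_encoding(seq_len=20, d_model=16)
# The encodings are simply added to the (toy) input embeddings.
embeddings = np.random.default_rng(0).normal(size=(20, 16)) + pe
print(pe.shape, embeddings.shape)
```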

2.1.6 Bidirectional Encoder Representations from Transformers (BERT)

First introduced in a 2018 paper by Devlin et al. [4], BERT is a neural network architecture based on the Transformer architecture described above. The model improves on the original Transformer by using bidirectional Transformers and by using different embeddings to encode the input tokens. At the time of its publication, BERT outperformed all major algorithms on all tasks in the GLUE benchmark [18], which contains a variety of tasks such as sentiment classification, question answering and textual entailment prediction. In the paper, the authors employ BERT in a setting where it is first trained on a large corpus of unlabelled text, after which it can be adapted for a specific task by fine-tuning. As training BERT is very computationally expensive, the authors recommend using the version pre-trained by Google and fine-tuning it for the specific task. BERT is pre-trained on two tasks: Next Sentence Prediction (NSP) and Masked Language Modelling (MLM). The Next Sentence Prediction task requires the model, given two sentences, to predict whether the second sentence follows the first. The Masked Language Modelling task requires the model to fill in the masked words of a sentence with the most probable candidates. By using these two unsupervised learning tasks in the pre-training stage, the model is able to develop a very broad language understanding without the need for large amounts of labelled data.

The BERT model uses a bidirectional attention mechanism, meaning that, in contrast to the original Transformer model, it is also able to capture words to the right of the target word in its representation. Instead of using a set of pre-trained word embeddings, BERT uses WordPiece embeddings [19], which are subword embeddings learned from a large corpus of text. The use of these subword embeddings allows the model to handle out-of-vocabulary (OOV) words better than some other models, because the subword embeddings allow the model to still (partially) represent the word as a combination of known subword embeddings.

2.1.6.1 Extracting word embeddings from BERT

The BERT model outputs a sequence of the same length as the original input example. These outputs can be used for further downstream tasks such as machine translation and sequence tagging, but they can also be used as contextual representations of the input elements. Because BERT uses the attention mechanism discussed previously, the representation of a word depends on the other words in the sentence. For this reason the word embeddings that can be retrieved from BERT are contextual word embeddings, meaning that they are not static for each word: one word can have multiple representations depending on the surrounding words.
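For illustration, the sketch below extracts such contextual token embeddings with the Hugging Face transformers library. The specific checkpoint (bert-base-uncased), the example sentences and the choice to read the last hidden layer are assumptions, not necessarily the configuration used in this thesis.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited money at the bank.", "We sat on the bank of the river."]
with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # last_hidden_state: (1, num_tokens, 768) -- one contextual vector per WordPiece token.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        bank_vector = outputs.last_hidden_state[0, tokens.index("bank")]
        print(sentence, bank_vector[:3])
# The two 'bank' vectors differ because the surrounding context differs.
```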


2.2 Multitask Learning

The field of multitask learning is concerned with learning multiple tasks simultaneously, where some or all parameters are shared among the different tasks, or the parameters of one task's model are influenced by those of another task's model. There are several reasons for the interest in multitask learning. The first is that training several tasks together while sharing the parameters of a single model can yield higher performance on the tasks than training them individually. The second is that using multiple tasks means that tasks for which little training data is available can be combined with tasks with abundant training data, allowing the task with limited samples to possibly benefit from the information contained in the samples of the other task. In the field of multitask learning, a major distinction is made between hard parameter sharing [20] and soft parameter sharing [21], which is explained in Section 2.2.2. Another important distinction is the type of multitask learning: whether it concerns one dataset with multiple tasks, or several datasets with different (or the same) tasks; a more detailed explanation of this distinction is given in Section 2.2.1. In Section 2.2.3 the architecture of the Mixture-of-Experts model is presented.

2.2.1 Inter-Task vs. Intra-Task Multitask Learning

One important distinction in multitask learning is whether one dataset with multiple classification tasks is used, or whether several datasets with their own classification targets are combined. The first variant assumes that there are multiple classification targets available for each data point, e.g. the sentiment and the category of a piece of text in the case of text classification. In this variant the input is a single input sample with multiple outputs, one for each of the different tasks. In the second variant, multiple datasets are combined, where the tasks do not have to be of the same type (for example classification and sequence tagging). In this case, the input to the network is an input sample from each dataset fed to the same shared layer, after which classification is done through the task specific output layers. As mentioned above, the second type can be used in situations where there are multiple datasets with the same or similar tasks, such as category classification for different datasets of Tweets. In the first variant, the usage of multiple tasks can improve the performance of the model over learning each task separately, but it does require a dataset in which these multiple tasks are present, which might not always be available.

2.2.2 Parameter Sharing

Methods in the field of multitask learning can be classified by the way in which they share the parameters of the model between the different tasks. The main distinction is that between hard and soft parameter sharing. In the case of hard parameter sharing, the different tasks share the main part of the architecture, often referred to as the shared bottom layer. This shared bottom layer is then followed by task specific output layers for each task in the learning procedure. The benefit of using a hard parameter sharing architecture is mainly that it reduces the risk of overfitting on the separate tasks [21]. This is because all tasks use the same network for their predictions, meaning that if the network were to overfit on one specific task it would no longer generalize well to the other tasks, which would decrease the network's performance on those tasks, leading to a bigger overall loss. The other variant of multitask learning is soft parameter sharing. In this case there is usually no shared bottom layer; instead each task has its own architecture, and the sharing of parameters can be done in various ways, one being ℓ1 regularization between the different parameter sets of the tasks. Section 3 will discuss several of these approaches in more detail.
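As an illustration of hard parameter sharing, the PyTorch sketch below uses one shared bottom followed by task-specific output heads. The simple bag-of-embeddings encoder, the layer sizes and the two example tasks are assumptions chosen only to keep the example short.

```python
import torch
import torch.nn as nn

class SharedBottomModel(nn.Module):
    """Hard parameter sharing: one shared bottom, one output head per task."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, task_classes):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embedding_dim)   # shared by all tasks
        self.shared = nn.Sequential(nn.Linear(embedding_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, c) for c in task_classes)

    def forward(self, token_ids):
        shared = self.shared(self.embedding(token_ids))
        return [head(shared) for head in self.heads]   # one set of logits per task

# Two hypothetical tasks, e.g. emotion (7 classes) and dialogue act (4 classes).
model = SharedBottomModel(vocab_size=10_000, embedding_dim=100, hidden_dim=64, task_classes=(7, 4))
emotion_logits, act_logits = model(torch.randint(1, 10_000, (32, 20)))
print(emotion_logits.shape, act_logits.shape)
```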


2.2.3 Mixture-of-Experts

The Mixture-of-Experts model introduced by Jacobs et al. [6] is a type of network architecture that combines several sub-architectures, called experts, through a gating function to make a prediction about the input. When a sample is presented to the model, it is fed to both the experts and the gating function. The role of the gating function is, given the input, to calculate the mixture weight for each expert, i.e. how much its output contributes to the final prediction. After the experts have produced their outputs, these are multiplied by the mixing weights and summed to form the final prediction. Unlike a standard feed-forward neural network, this allows for models in which only some parts of the network are active for a given input sample, based on the output of the gating function.

Figure 2.4: The Mixture-of-Experts Architecture

The general structure of the Mixture-of-Experts model can be seen in Figure 2.4. When the model is given an input sample, the sample is fed through all the experts (three in this case) and is simultaneously fed through a gating network g. The gating network outputs a softmax distribution over the experts, depending on how much the output of each expert contributes to correctly classifying the input sample. These activations are then used as mixing weights for the experts, and the weighted predictions of all experts are summed to form the final prediction.

For a specific input sample x the output of the MoE model is given by the following formula:

y = Σ_{i=1}^{N} g_i(x; θ_g) · f_i(x; θ_{f_i})    (2.6)

where N is the number of experts, f_i is the i-th expert, and g is a single-layer gating network that assigns a weight to each expert.

For a specific expert network defined by the index i, the weight assigned to it by the gating function is given by the following equation:

g_i(x) = Softmax(σ(W_g x + b_g))_i    (2.7)

where W_g is the matrix of trainable weights of the gating function g and b_g is a bias vector.

The Mixture-of-Experts model as used by Jacobs et al. [6] is trained using stochastic gradient descent for both the gating function and the expert networks.
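A minimal PyTorch sketch of Equations 2.6 and 2.7 is shown below. The expert architecture (a small feed-forward network) and all dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    """Single-gate Mixture-of-Experts: a softmax gate mixes the expert outputs (Eq. 2.6/2.7)."""

    def __init__(self, input_dim, hidden_dim, output_dim, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, output_dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(input_dim, num_experts)   # single-layer gating function

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                       # (batch, E)
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, out)
        return (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)          # weighted sum

moe = MixtureOfExperts(input_dim=100, hidden_dim=64, output_dim=7)
print(moe(torch.randn(32, 100)).shape)   # torch.Size([32, 7])
```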


RELATED WORK

3.1 Multitask Learning

3.1.1 Cross-Stitch Networks

The Cross-Stitch Network introduced by Misra et al. [22] is a variant of a convolutional neural network used for multitask learning in Computer Vision tasks. Unlike other attempts at using convolutional neural networks for multitask learning, the Cross-Stitch Network does not make a hard distinction between which convolutional layers are shared and which are task-specific. Instead, the architecture uses one network for each separate task and learns a linear combination of contributions for each layer in the convolutional architecture. For example, given two tasks A and B, the mixed activations at location (i, j) of a specific layer are given by the following equation:

[x̃_A^{ij}, x̃_B^{ij}]^T = [[α_AA, α_AB], [α_BA, α_BB]] · [x_A^{ij}, x_B^{ij}]^T

The α parameters are present for each layer in the architecture and are trained together with the rest of the model's parameters through backpropagation. This approach of using a linear combination of layer outputs to explicitly control parameter sharing between tasks bears a resemblance to the Multi-gate Mixture-of-Experts model used in this research, the difference being that the MMoE model does not control sharing for specific layers but rather between the shared and task specific parts of the model.
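A toy sketch of a single cross-stitch unit is given below; the near-identity initialisation of the α matrix and the NumPy activations are illustrative assumptions.

```python
import numpy as np

# Learnable cross-stitch matrix for two tasks A and B (initialised close to identity,
# so each task initially relies mostly on its own activations).
alpha = np.array([[0.9, 0.1],
                  [0.1, 0.9]])

def cross_stitch(x_a, x_b, alpha):
    """Linearly mix the activations of the two task networks at one layer."""
    x_tilde_a = alpha[0, 0] * x_a + alpha[0, 1] * x_b
    x_tilde_b = alpha[1, 0] * x_a + alpha[1, 1] * x_b
    return x_tilde_a, x_tilde_b

x_a = np.random.default_rng(0).normal(size=(8, 8))   # activations of task A's layer
x_b = np.random.default_rng(1).normal(size=(8, 8))   # activations of task B's layer
x_tilde_a, x_tilde_b = cross_stitch(x_a, x_b, alpha)
print(x_tilde_a.shape, x_tilde_b.shape)
```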

3.2 Multitask Learning for Text Classification

In the field of text classification, several methods that use forms of multitask learning have been proposed. One such model is the MT-DNN model introduced by Liu et al. [5]. In their work, the authors propose a hard parameter sharing method that uses the BERT model as the shared bottom layer. The training procedure for their model consists of the pre-training stage used in the original BERT model, combined with a multitask training stage. In this multitask training stage, a batch of data from one of the possible tasks is selected and fed through the network, after which the loss appropriate for that task is calculated and the network is updated using this loss.


In a paper by Liu et al. [2] a multitask learning method for text classification is proposed that makes use of an LSTM and experiments with different weight sharing schemes. They propose three different sharing schemes:

• All tasks share the same LSTM.

• Each task has its own LSTM, but the hidden units are shared through the use of a gating function that determines how the hidden states are mixed.

• Each task has its own LSTM, and a bidirectional LSTM together with a gating function is used to control the information sharing between the different tasks.

All three strategies are tested on several text classification datasets (SST-1, SST-2, IMDB). Their results indicate that the shared model architecture achieves the best performance and is able to outperform several single task model baselines.

3.3 Mixture-of-Experts

In a paper by Shazeer et al. [23] a sparsely gated Mixture-of-Experts model is introduced. This type of model uses the same mechanism as the ‘vanilla’ Mixture-of-Experts model, the only difference being the output of the gating network. In the original model, the output of the gating function is a softmax distribution over the experts. In the sparsely gated Mixture-of-Experts model, however, the output resembles an n-hot vector, where non-zero entries correspond to the expert networks that are active for that prediction and zeros correspond to expert networks that are not used. By outputting this distribution the gating function acts as a selector for the different parts of the network: it keeps only the top k gating values before taking the softmax and sets the rest to −∞, so that they become 0 after applying the softmax operation.

As the sparsity is manually controlled, this has the advantage that only a small part of the network has to be active for any given example. In addition to these sparse gating weights, the paper also adds Gaussian noise in the gating network to help with load balancing. The load balancing addresses the problem of the same expert being chosen for every example because of random initialization: if by chance one expert is chosen significantly more often than the others, this expert is updated more frequently and its predictions become more accurate, in turn leading to this expert being chosen even more often, until it dominates the other experts. Our work is similar in that it also uses the Mixture-of-Experts model as a basis and employs the load balancing strategy introduced in this paper, but it differs in that we use different experts and more than one gating function to assist in multitask learning. The model introduced by Shazeer et al. is tested on tasks including language modelling and machine translation. For the task of language modelling the Billion Word Language Modelling Benchmark dataset is used (Chelba et al., 2013). On this dataset the model achieved state-of-the-art results at the time of publication, with a much larger number of parameters than its competitors, while actually having a lower inference time.

The Multi-gate Mixture-of-Experts model [1] is a variant of the Mixture-of-Experts model suited for multitask learning. It differs from the original Mixture-of-Experts model through its use of multiple gating functions, one for each task. This extension allows the model to select the network activations not only per sample, but also per task. The authors of the paper argue that this is beneficial for multitask learning scenarios, as it allows different combinations of experts to be selected per task, making the model more robust to possible conflicts between tasks, which have been shown to harm the performance of multitask learning models [1].

In a paper by Zhao et al. [24] the Multi-gate Mixture-of-Experts model is used in the setting of a video recommendation system, where it is used to rank a selection of videos. The model is optimised for different, possibly conflicting, objectives, such as user engagement and user satisfaction, by using the MMoE model in conjunction with a Wide-and-Deep network [25]. That work is similar to this research because it also makes use of the Multi-gate Mixture-of-Experts model; however, it applies the technique to multimodal data for a recommender system, whereas this research focuses on classifying textual data for all tasks.

In a paper by Majumder et al. [3], a multitask learning framework is used to simultaneously predict the sentiment and the presence of sarcasm in pieces of text. The model shared between tasks is a GRU, and several choices for parameter sharing and the usage of attention are tried. The approach taken by Majumder et al. is similar to the one used in this thesis, as the tasks are also tackled using multitask learning, with both tasks sharing some parameters of the shared layer, controlled by an attention mechanism. However, their method does not use a gating function, and the input is first passed through a GRU unit before being passed to the other parts of the network, whereas the model used in this thesis uses word embeddings as input to all parts of the model.

A paper by Mareddy et al. [26] has investigated the usage of a Mixture-of-Experts model combined with word embeddings obtained from the BERT model. In the paper, several types of word embeddings, such as Word2Vec, GloVe and BERT embeddings, are experimented with. These features are used as input to a Mixture-of-Experts model consisting of several GcForest modules, and the performance of the model with the different types of embeddings is compared on a dataset containing 50k Amazon product reviews. The dataset contains about 20 different products, each review being labelled as either positive or negative. The results from the paper indicate that for the Amazon dataset the GloVe and Word2Vec embeddings outperformed the BERT embeddings and appeared better at discriminating between certain classes in the dataset. However, the authors argue that the effect varied between categories and that the best type of embedding depends strongly on the dataset. The research by Mareddy et al. has similarities with this research, namely the usage of contextual embeddings retrieved from BERT and the integration of a Mixture-of-Experts model into the architecture. However, the paper does not explore the performance of the model with regard to multitask learning and is more aimed at comparing the performance of different kinds of word embeddings; it therefore does not investigate the effect of the multitask learning component in detail, whereas this research focuses specifically on the aspects of a (Multi-gate) Mixture-of-Experts model in the setting of text classification, such as varying the types of gating and expert networks used.

In a 2018 paper by Xiao et al. [27] a method similar to the one used in this research is proposed, where instead of using several expert networks, a gating function is used within a CNN model to weight the different feature maps based on how well the model is able to learn the tasks simultaneously. Their method achieves better scores than several multitask learning baselines on several NLP datasets. Apart from the use of a single model in the shared bottom layer, a significant difference between the research by Xiao et al. and this research is that this research concerns multitask learning within one dataset, i.e. classifying different aspects of input samples all belonging to the same dataset. In the research of Xiao et al., the multitask learning concerns the simultaneous classification of input samples from different datasets into their respective classes. Although not necessarily different, this means that their model might learn a general representation for classifying the category of inputs, but it does (probably) not learn to combine different aspects of the text within a data point to aid in classification.

3.3.1 Gating Network Polarization

One problem that is known to occur when training Mixture-of-Experts models through stochastic gradient descent is the phenomenon of gating network polarization [24]. In the case of gating network polarization, one expert gets assigned (almost) all of the probability mass of the softmax output of the gating function, causing it to be the only expert being selected and the model to revert to a single network for all or a subset of the tasks. The reason this occurs is that randomness during training can cause a single expert to be selected more than average during the first stages of training. This causes problems, as this expert then gets trained more, improving its performance, in turn causing it to be picked with higher probability, leading to the situation where it is almost always exclusively picked during training [28]. Although this does not always happen, this behaviour is undesirable and should be counteracted if possible.

Compared to the case of a Mixture-of-Experts model with a single gate, the task of ensuring that the gating networks assign the ‘right’ distribution to experts is more complex. The reason for this is that one of the theoretical strengths of the model is its ability to separate conflicting tasks during training if they harm the performance of the overall model. In a model with a single gate and a single learning task, a certain mixture of experts is always desired, and the model resembles an ensemble strategy; if only one expert is given non-zero probability, the model reverts to being a single expert model. In the multitask case, however, the desired mixing ratio also depends on the compatibility between the different tasks. For this reason, artificially forcing the network to mix experts can harm the performance of the model, as the ‘optimal’ distribution might be to use different networks for different tasks in order to separate them.

Several methods have been proposed in the literature to counteract the problem of gating network polarization, such as using dropout on the softmax outputs and balancing the expert assignments by enforcing roughly equal shares. The approach of Shazeer et al. [23] uses a constraint on the importance of experts, forcing each expert to be chosen roughly equally often. Although this approach is effective in the case of single task learning, it is hypothesised that it is less suited for multitask learning with multiple gates, as it might force unrelated tasks to share experts, potentially harming the overall performance of the model.

A technique used by Eigen et al. [28] uses the mean assignment among experts to balance them. When during training one expert is chosen more than the mean assignment by a certain threshold, its activation is set to 0 for the next example. This way the expert activations will all lie around the mean. As this ‘mixed’ distribution might not be the optimal expert distribution, the model is only trained with this constraint during the first n epochs of training, after which the constraint is lifted. In this case the model should still be able to ‘correct’ the mixed distribution and converge to the optimal distribution for the specific task. Mathematically, the constraint mentioned above can be formulated as follows.

Let the total assignment of expert i up to training step T (where each step t corresponds to one input sample) be given by Σ_{t=1}^{T} g(expert_i(x_t)), and let the average (relative) assignment of expert i be given by the following formula:

( Σ_{t=1}^{T} g(expert_i(x_t)) ) / ( Σ_{i=1}^{N} Σ_{t=1}^{T} g(expert_i(x_t)) )    (3.1)

where N is the total number of experts in the model. When expert i is assigned significantly more than the average assignment, its activation gets set to zero for that particular sample. This is represented by the formula below:

assignment_{expert_i} − average > m    (3.2)

By setting the assignment to zero the average assignment of that particular expert is lowered and this leads to more balanced expert assignments over time.
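The sketch below illustrates this balancing constraint (Equations 3.1 and 3.2) on top of a generic gating output. The margin m, the number of experts and the random gate values are assumptions; a real implementation would apply this inside the training loop of the (Multi-gate) Mixture-of-Experts model.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, margin = 4, 0.05
running_assignment = np.zeros(num_experts)   # cumulative gate mass per expert

def balanced_gate(gate_weights):
    """Zero out experts whose relative assignment exceeds the mean by more than the margin."""
    global running_assignment
    running_assignment = running_assignment + gate_weights
    relative = running_assignment / running_assignment.sum()      # Equation 3.1
    over_used = relative - 1.0 / num_experts > margin             # Equation 3.2
    adjusted = np.where(over_used, 0.0, gate_weights)
    if adjusted.sum() > 0:                                        # renormalise remaining experts
        adjusted = adjusted / adjusted.sum()
    return adjusted

for step in range(5):
    raw = rng.dirichlet(np.ones(num_experts))   # stand-in for the softmax gate output
    print(step, balanced_gate(raw).round(2))
```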


METHOD

In this section the method used to answer the research questions posed in the introduction is explained. Section 4.1 describes the details of the MMoE model for text classification. Section 4.2 describes the datasets that were used in this research, Section 4.3 discusses details about the code used, and Section 4.4 describes the baselines to which the performance of the model is compared. Section 4.5 describes several preprocessing steps applied to the datasets and specifics of the training procedure.


4.1 Model Architecture

The structure of the network used in this research is very similar to that of the network used by Ma et al. [1], with a shared bottom layer containing the expert networks and one gating function per task to be learned. A detailed explanation of the operation of the model is given below.

Figure 4.1: General architecture of the Multi-gate Mixture-of-Experts model for Text Classification

In Figure 4.1 the architecture of the MMoE model used in this research is shown. When using the model, the input gets processed in several steps. In the first step the input sentence is tokenized and converted into word embeddings. After this step the sequence of word embeddings is fed into both the expert networks and the gating networks. For each task the corresponding gating network computes the distribution of experts based on the sequence of word embeddings. This distribution is then used to compute a weighted average of the outputs of the experts which is then fed into a final linear layer for classification for the appropriate task. The formal mathematical definition of the operation of the model is given below.

For a given input x and specific task k the output of the network is given by the following formula:

\[ y_k = l_k\!\left( \sum_{i=1}^{N} g_i^{k}(x;\theta_{g_k})\, f_i(x;\theta_{f_i}) \right) \tag{4.1} \]

where $g^k$ is the gating function for task $k$ that takes as input $x$ and outputs a probability distribution over the expert networks $f$, and $N$ is the number of experts used in the model. Here $l_k$ is a task-specific neural network which is fed the weighted contributions from the experts in the shared bottom layer.

If we let $g^k$ denote the network used by the gating function for task $k$, which takes as input $x$ and produces an output of shape $1 \times N$ where $N$ is the number of experts, the output of the gating function is given by the following equation:

\[ p(f_i \mid x) = \mathrm{Softmax}(g^k(x))_i \tag{4.2} \]

In the usual Mixture-of-Experts setting the gating function is often a (small) feed-forward neural network that takes an input of fixed size and outputs mixing weights for each expert in the shared bottom layer. When using the MMoE for Natural Language Processing this can be suboptimal, as text is of varying length and different architectures have been shown to be more effective at learning from text (as described in Section 2). This is important, as the gating network should be capable of picking up on rudimentary information present in the text in order to assign proper mixing weights to the experts. For this reason, a gating network that is suited for natural language processing is used in this research; more details about the type of gating function are given further on in this chapter.

When backpropagating the error through the model, the error of the network w.r.t. a specific task $k$ is fed through the network, where only the weights of the nodes active for classification of that task are adjusted. By utilizing multiple gates, the network can learn to assign different weights to the experts in the shared bottom layer based on the input example. In the case of unrelated tasks, the gating functions can assign almost all probability mass to a single expert per task, so that the network learns to use disjoint experts for the tasks, almost reverting back to using separate networks for each task. This alleviates the issue of multiple unrelated tasks hurting the performance of the model, while a minimal amount of sharing can still be present.
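To illustrate Equations 4.1 and 4.2, a minimal PyTorch sketch of the multi-gate forward pass is given below. The expert and gating networks are deliberately simplified to single linear layers operating on a pooled sentence representation; the actual experiments use NLP-specific architectures for these components, so the sketch only demonstrates the routing mechanism.

```python
import torch
import torch.nn as nn

class MMoESketch(nn.Module):
    """Minimal Multi-gate Mixture-of-Experts routing (Eq. 4.1 and 4.2)."""

    def __init__(self, input_dim, expert_dim, num_experts, task_classes):
        super().__init__()
        # shared bottom layer: the expert networks f_i
        self.experts = nn.ModuleList(
            [nn.Linear(input_dim, expert_dim) for _ in range(num_experts)])
        # one gating network g^k per task, producing mixing weights over the experts
        self.gates = nn.ModuleList(
            [nn.Linear(input_dim, num_experts) for _ in task_classes])
        # task-specific layers l_k
        self.towers = nn.ModuleList(
            [nn.Linear(expert_dim, n_classes) for n_classes in task_classes])

    def forward(self, x):
        # x: (batch, input_dim), e.g. an averaged sequence of word embeddings
        expert_out = torch.stack([f(x) for f in self.experts], dim=1)  # (batch, N, expert_dim)
        task_outputs = []
        for gate, tower in zip(self.gates, self.towers):
            weights = torch.softmax(gate(x), dim=-1)                   # Eq. 4.2
            mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)    # weighted expert average
            task_outputs.append(tower(mixed))                          # Eq. 4.1
        return task_outputs                                            # one logit tensor per task
```

For example, MMoESketch(input_dim=300, expert_dim=128, num_experts=4, task_classes=[4, 7]) would create a model for two tasks with 4 and 7 classes respectively; the dimensions here are placeholders, not the settings used in the experiments.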

4.2 Datasets

For the experiments, two datasets are used to test the model's performance and compare it to that of several other models. Both datasets include multiple categories of labels for each datapoint and concern conversation-like text, making them suitable for investigating the research questions. The datasets used in this research are described below.

4.2.1 DailyDialog Dataset

The DailyDialog dataset, introduced in Li et al.[29], was constructed by crawling websites for non-native English speakers to practice their English by engaging in written conversation with native speakers. Each utterance in the dataset is labelled with Act, Topic and Emotion, with 4, 10 and 7 classes respectively. The complete dataset contains both sides of each conversation. Because this research focuses on extracting the intent of a single person rather than of the conversation as a whole, one side of each conversation is selected.

Class          Number of examples
Inform         21046
Question       18915
Directive      10179
Commissive      3651

Table 4.1: Act Labels for DailyDialog Dataset

Class          Number of examples
Happiness       6321
Surprise         914
Anger            643
Sadness          449
Disgust          174
Fear             115

Table 4.2: Emotion Labels for DailyDialog Dataset

Class                  Number of examples
Relationship           16926
Ordinary Life          15708
Work                    7493
Tourism                 4443
School Life             2355
Attitude & Emotion      2255
Finance                 2213
Health                  1336
Politics                 774
Culture & Education      288

Table 4.3: Topic Labels for DailyDialog Dataset

In Tables 4.1, 4.2 and 4.3 above, the distributions of labels for the different tasks in the DailyDialog dataset are shown. As can be seen from the tables, the labels for the tasks are imbalanced, most severely so for the emotion classification task. In Section 4.5 the method used to compensate for this imbalance is discussed.

4.2.2 UC Berkeley Enron Dataset

UC Berkeley has released a dataset containing a subset of the Enron corpus[30], a corpus constructed from emails between employees, which was released to the public after the fall of the Enron company and an ensuing lawsuit. The dataset from Berkeley has been annotated with the general topic of the email, which can be one of 6 classes, and the emotional tone of the email, which can be any one of 19 classes. Because some of the emotion classes only had 1 or 2 examples, only the top 10 most frequent classes were selected from the dataset. In the original dataset 8 category labels are present; however, 2 of these classes are used to indicate an empty message or corrupted information, and these were therefore removed from the dataset, yielding 6 different classes for the category classification task.


Class                 Number of examples
Secrecy               98
Concern               34
Worry                 23
Gratitude             15
Hope / Anticipation   15
Sarcasm               15
Camaraderie           13
Humor                 11
Friendship            10

Table 4.4: Emotion Labels for Enron Dataset

Class                                  Number of examples
Company Business Strategy              648
Logistic Arrangements                  434
Document editing / checking            148
Employment arrangements                 83
Personal but in professional context    68
Purely Personal                         25

Table 4.5: Category Labels for Enron Dataset

In Tables 4.4 and 4.5 the distribution of the datapoints over the different classes per classification task can be observed. As can be seen, the dataset is unbalanced, especially for the emotion classification task. In Section 4.5 the method used to compensate for this imbalance is discussed.

4.2.2.1 Dataset Examples

Enron Dataset:
- I shall talk to him on Monday. The chances are he is a complete misfit if 2 organizations rejected him.
- We don't have revised versions yet. when we do, we will probably only want to put some (not all) of the documents on the intranet.
- I hope you have seen this. Obviously since you are a director you have a keen interest in this.

DailyDialog Dataset:
- No , that's OK . How much do I owe you ?
- Yes , I have called you three times . What makes you in a daze ?
- how about recycling ? Does that actually help ?

Table 4.6: Example pieces of text from the Enron and DailyDialog datasets

In Table 4.6 some example datapoints for both the Enron and the DailyDialog dataset are shown. As can be seen from the examples, the sentences from the DailyDialog dataset are much shorter and concern very different topics of conversation compared to the Enron dataset.

4.2.3 Statistics of Datasets

Dataset       Num examples   Tasks                 Avg number of words   Train size   Test size
DailyDialog   53,791         Act, Emotion, Topic   13                    37,653       16,138
Enron          1,406         Class, Emotion        327                      984          422

Table 4.7: Statistics of the datasets used in this research

In Table 4.7 several statistics of the datasets used in this research are shown. Several observations can be made from the table, most notably that the datapoints in the DailyDialog dataset are much shorter than those in the Enron dataset. It can also be seen that the total size of the DailyDialog dataset is much larger than that of the Enron dataset. This is important, as the relatively small size of the Enron dataset, combined with its class imbalance, may make the obtained results less generalizable. A difference in performance on the DailyDialog dataset caused by changing parts of the model is therefore stronger evidence than a difference found on the Enron dataset.

4.3 Code

All the models used in this research were implemented in PyTorch. For the pre-trained BERT model and the code for fine-tuning an existing BERT model, the HuggingFace Transformers library² was used. All the code used in this research is publicly available on GitHub.³

4.4 Baselines

In order to evaluate the performance of the developed model and answer the research questions posed in Section 1, the developed model is compared against several existing methods for text classification. To test the possible increase in performance from using a multitask learning method compared to a single task method, the model is compared to both single task models and multitask baselines. To evaluate the influence of multitask learning on the performance of the individual tasks, the performance of the model is compared to LSTM, CNN, Transformer and BERT models trained on only one task.

Through its gating functions, the MMoE model allows per-sample selection of experts to combine knowledge from different tasks. This allows it to (partly) decouple possibly conflicting objectives between tasks, which in theory gives it an advantage over simpler multitask learning methods that do not possess this ability. In essence, the added value of the gates is to be examined. To test this hypothesis, the model is compared to both LSTM and CNN models for multitask learning. These baselines feed the input sequences through the same base model and use two separate linear layers (one per task) that convert the output of the network to the appropriate size for the specific task.

4.4.1 TF-IDF

As one of the more 'simple' methods in this research, TF-IDF is a classic non-neural method that provides a strong baseline against which the more complicated methods can be compared. All the datapoints in the datasets are converted to TF-IDF vectors, after which these vectors are used in a classification method to predict the classes of the text. Experimentation showed that a Linear SVM model yielded the best results, and it is therefore chosen as the baseline classifier in this research.
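A minimal sketch of this baseline using scikit-learn is shown below; the default vectorizer and SVM settings shown here are placeholders rather than the exact hyperparameters used in the experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Pipeline that converts raw text to TF-IDF vectors and classifies them with a
# linear SVM, trained separately for each classification task.
tfidf_svm = make_pipeline(TfidfVectorizer(), LinearSVC())

# Assuming train_texts / train_labels hold the data for one task:
# tfidf_svm.fit(train_texts, train_labels)
# predictions = tfidf_svm.predict(test_texts)
```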

² https://github.com/huggingface/transformers
³ https://github.com/RubenvanHeusden/MasterThesis


4.4.2 LSTM and Bidirectional LSTM

The LSTM model by Hochreiter et al.[10] is used as one of the baseline models in this research. In addition to the standard LSTM model, an extension of this model, the Bidirectional LSTM, is also included as a baseline. Multiple sets of hyperparameters were tested for both the LSTM and the Bidirectional LSTM; the final models use a hidden dimension of 512 neurons, which in the case of the Bidirectional LSTM means 512 neurons for each direction separately. For both LSTM-based models a learning rate of 0.1 was used, with learning rate scheduling to adjust the learning rate during training, and stochastic gradient descent was used to update the network parameters.
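The sketch below shows how such a configuration could look in PyTorch. Only the hidden size of 512, the learning rate of 0.1, the SGD optimizer and the use of a learning rate scheduler come from the description above; the embedding dimension, the number of output classes and the choice of ReduceLROnPlateau as the scheduler are assumptions.

```python
import torch.nn as nn
import torch.optim as optim

embedding_dim, hidden_dim, num_classes = 300, 512, 4   # embedding dim and classes are assumed

# bidirectional=False gives the plain LSTM baseline
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim,
               batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * hidden_dim, num_classes)     # 2x hidden size for the two directions

params = list(lstm.parameters()) + list(classifier.parameters())
optimizer = optim.SGD(params, lr=0.1)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min')
```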

4.4.3 CNN

For this research, the CNN model for text classification as introduced in (Kim, 2014)[15] is used. The implementation of this CNN is converted to PyTorch code and the hyperparameters from the paper are used. For both the Enron and the DailyDialog dataset a learning rate of 0.1 was used, and Adam was used as the network optimizer.

4.4.4 BERT

Because BERT is one of the current state-of-the-art models, it is used as a baseline against which the results of our model are compared, to evaluate whether the multitask method developed in this research can match or surpass the scores of the BERT model on the selected datasets. As mentioned above, the HuggingFace Transformers library is used, which contains several readily available pretrained BERT models. The 'bert-base-uncased' model was used and fine-tuned on the Enron and DailyDialog datasets until convergence. The learning rate of the model was set to 0.00002 and a batch size of 8 was used for both datasets.
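A sketch of the fine-tuning setup with the HuggingFace Transformers library is given below; the number of labels and the single training step shown are illustrative, and the exact training loop used in the experiments may differ.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=4)   # e.g. a 4-class task

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate from the text

# One illustrative training step on a single example
inputs = tokenizer(["I shall talk to him on Monday."], padding=True,
                   truncation=True, return_tensors='pt')
labels = torch.tensor([0])
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```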

4.4.5 Transformer

The Transformer model from Vaswani et al.[17] is also used in this research and is trained on each of the tasks separately. The Transformer model used in this research is based on the PyTorch implementation; it consists of 6 transformer layers, 12 attention heads and a feed-forward layer of size 512. A learning rate of 0.0001 was used for both datasets.

4.4.6 Multitask CNN

The Multitask CNN model used as a baseline in this research is almost identical to its single task counterpart for text classification described above, the only difference being the addition of task-specific layers for the different tasks being learned. At training time, each example is fed through a single CNN model and the output state of the network (the representation to which the softmax would normally be applied) is fed to the task-specific layers, after which the classification error for each task is calculated and the gradient is backpropagated through the network.
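The shared-encoder-with-task-heads pattern described here can be sketched as follows; the encoder module is left abstract and the layer sizes are illustrative.

```python
import torch.nn as nn

class MultitaskHeads(nn.Module):
    """Shared text encoder (e.g. a CNN or LSTM) with one linear head per task."""

    def __init__(self, encoder: nn.Module, encoder_dim: int, task_classes):
        super().__init__()
        self.encoder = encoder                         # shared body, e.g. the Kim (2014) CNN
        self.heads = nn.ModuleList(
            [nn.Linear(encoder_dim, n) for n in task_classes])

    def forward(self, x):
        shared = self.encoder(x)                       # representation before any softmax
        return [head(shared) for head in self.heads]   # one set of logits per task
```

At training time, a cross-entropy loss can be computed per task from these outputs and the (possibly weighted) sum of the losses backpropagated through the shared encoder.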

4.4.7 Multitask LSTM

The multitask learning variant of the LSTM that is used as a baseline in this research is based upon Collobert et al.[31]: each task shares the same LSTM model, the LSTM has a number of stacked layers equal to the number of tasks, and the final hidden state of the model is used for classification. After experimentation with different sizes of the hidden layers in the multitask LSTM, the same settings were chosen as for the LSTM-based models in the single task case, using a hidden layer size of 512 neurons.

4.5 Training Details

4.5.1 Class weighting

For both datasets the class distribution of labels in the classification tasks is not always balanced, especially in the case of the emotion classification task. To compensate for this, class weighting is used to weight the losses of the classes by their relative occurrences in the dataset, meaning the loss for samples of an infrequent class is weighted more heavily than that of frequent classes. For this research, the class weighting implementation of scikit-learn is used, which uses a class weighting scheme based on a paper by King et al.[32].

For a given class c, the weight of this class within dataset D can be calculated using the following formula:

\[ \mathrm{class\_weight}_c = \frac{|D|}{C \cdot \sum_{i=1}^{|D|} \mathbb{1}[y_i = c]} \tag{4.3} \]

where C is the total number of unique classes in the dataset and $y_i$ is the label associated with the i-th data point.
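In practice these weights can be obtained with scikit-learn's compute_class_weight utility, which implements the 'balanced' heuristic of Equation 4.3, and passed to the loss function; the labels below are dummy values for illustration.

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

train_labels = np.array([0, 0, 0, 1, 2])               # dummy labels for one task
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(train_labels), y=train_labels)
# Weight the loss of infrequent classes more heavily during training
criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
```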

4.5.2 Stratified sampling

Furthermore, a stratified sampling approach was used for constructing the train and test sets, so that the distribution of examples in both sets is similar. This is done to prevent cases where labels are present in the test set but not in the training set, which was particularly relevant for the Enron dataset because of its small size.
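With scikit-learn this can be done through the stratify argument of train_test_split; the 70/30 split shown below roughly matches the train/test sizes reported in Table 4.7, and the texts and labels are dummy values for illustration.

```python
from sklearn.model_selection import train_test_split

texts = ["I hope you have seen this.", "No , that's OK .", "We don't have revised versions yet.",
         "Yes , I have called you three times .", "how about recycling ?", "I shall talk to him on Monday."]
labels = [0, 1, 0, 1, 1, 0]   # dummy labels for one task, used for stratification

# stratify=labels keeps the class distribution similar in the train and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42)
```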

4.5.3 Model evaluation

In order to quantitatively score and evaluate the various models, the precision, recall and F1 scores of the models are reported. Because of the class imbalance present in some of the tasks being learned, instead of reporting the raw precision, recall and F1 scores, the weighted precision, recall and F1 scores are calculated, meaning the statistics are computed for each class and weighted by the support of the respective classes.
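With scikit-learn, the weighted scores correspond to the following call; y_true and y_pred are dummy gold and predicted labels for one task.

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 2, 2, 2]   # dummy gold labels
y_pred = [0, 1, 1, 2, 2, 0]   # dummy model predictions

# 'weighted' averages the per-class scores weighted by the support of each class
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted')
```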

4.5.4 Fixed size of training samples

Because the Transformer and BERT models require the input to be of a fixed size, the input samples from all datasets were truncated or padded to a fixed length. As the sentences in the DailyDialog dataset are relatively short, the length of the longest sentence in the training data was selected as the length to which all sentences are padded (this padding is ignored in both models through the attention masks). For the UC Berkeley Enron Dataset, the fixed length of the input sequences was set to 500 words; as the average length of the emails is about 327 words per email, choosing 500 words as the maximum means most emails are expected to be contained in their entirety. This strategy was chosen over taking the length of the longest sentence as the raw dataset contained several sentences that contained images that were not properly added
