
MSc Artificial Intelligence

Master Thesis

Neural Hierarchical Text Classification with Transfer Learning Approaches

by

Kai Liang

12046477

June 11, 2020

48 EC November 2019 - June 2020

Supervisors:

MSc. Shaojie Jiang

Dr. Yury Kashnitsky

Assessor:

Prof. Dr. Christof Monz


Abstract

Text classification has long been a classical and fundamental problem in the field of Natural Language Processing, which is currently dominated by Deep Learning models. However, the task is also becoming more challenging in some applications due to the conflict between the growing number of fine-grained categories and the shortage of labelled data. In these scenarios, the categories typically form a hierarchical structure, which turns the problem into a hierarchical classification task. Existing methods either flatten the hierarchy and directly predict the leaf-level labels, or train one or more classifiers from scratch to exploit the class taxonomy. However, such methods do not work well on small datasets or low-resource languages because deep neural networks usually require lots of data to generalize. In this study, we improve upon a previously proposed attention-based hierarchical classifier by combining it with transfer learning approaches. Our model outperforms the state-of-the-art flat and hierarchical baselines on both internal and benchmark datasets with limited data. Furthermore, our model is more robust and interpretable than flat classifiers, in the sense that errors generated by our model are more likely to remain in the correct subtree of the parent category, and that its predictions can be interpreted at each level.


Acknowledgements

First and foremost, I would like to thank my parents for their continuous support and help. They might not be the greatest people in the world, but they absolutely are the best parents in my eyes. I would also like to thank my two supervisors: Shaojie Jiang and Yury Kashnitsky. Their knowledge, kindness and enthusiasm have made this study a precious experience of my student life. Moreover, I would like to thank UvA and KPN for providing me with this interesting thesis project and the opportunity to meet all the great colleagues at the DSL team.

Lastly, I would like to thank Xinru Liu. Her company and encouragement have allowed me to cope with the stress from study and work with confidence and complete my Master’s.

Once again I want to express my sincere appreciation to the people and institutions above and all those who helped; without them it would not have been possible for me to overcome all the obstacles in the past and finish the thesis on time.

Kai Liang June 2020


Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
  1.1 Rising challenges in text classification
  1.2 Contributions of this work
2 Related Work
  2.1 Hierarchical text classification
    2.1.1 Flat classifiers
    2.1.2 Hierarchical classifiers
  2.2 Transfer learning in NLP
    2.2.1 Common approaches
    2.2.2 Dutch pre-trained language models
  2.3 Towards more interpretable models
3 Method
  3.1 Model architecture
  3.2 Algorithm
    3.2.1 Encoding input sequences
    3.2.2 Computing attention scores
    3.2.3 Calculating sequence representations
    3.2.4 Predicting categories
4 Experiments
  4.1 Research questions
  4.2 Experimental setup
    4.2.1 Datasets
    4.2.2 Baselines
    4.2.3 Hyperparameters
    4.2.4 Evaluation
5 Results
  5.1 Experimental results
  5.2 Preliminary summary
  5.3 Ablation studies
    5.3.1 Effect of score functions
    5.3.2 Effect of attention hops
    5.3.3 Effect of decoder architectures
    5.3.4 Effect of pre-trained encoder models
6 Discussion
  6.1 Answers to research questions
  6.2 Interpretability analysis
    6.2.1 LR model
    6.2.2 Our model
7 Conclusion and future directions
Bibliography
A Category distributions of the ZM dataset
B Equations for ablation studies
  B.1 Score functions
  B.2 Multi-hop attention
  B.3 Decoder variants
    B.3.1 no-attn + MLP
    B.3.2 no-attn + LSTM + MLP
C Additional study
  C.1 Extra fine-tuning of BERTje LM


List of Figures

1.1 (a) The flat classification approach which always predicts the leaf-level categories (the dashed square represents a text classifier and each node represents a different category, same hereinafter). (b) An example of the local hierarchical classification approach which builds a classifier per parent node. (c) The global classification approach which has an end-to-end classifier to predict the whole class hierarchy.

1.2 Search interest on the topic of 'transfer learning' worldwide over the past five years. Data source: Google Trends (https://www.google.com/trends).

2.1 General steps of applying transfer learning approaches for text classification.

3.1 Architecture of our model.

5.1 Training loss and L2 validation weighted-F1 curves of models with BERT(je) LM on the ZM dataset.

5.2 Validation weighted-F1 curves of our model with different score functions on the ZM dataset.

5.3 Validation weighted-F1 curves of our model using different numbers of attention hops on the ZM dataset.

5.4 Validation weighted-F1 curves of our model with different decoder architectures on the ZM dataset.

5.5 Validation weighted-F1 curves of our model with different pre-trained language models as the encoder on the ZM dataset.

6.1 Top features of the LR model for first-level categories on the ZM dataset.

6.2 Attention heatmap at different levels of an example transcript in the ZM dataset. For each token, the background color intensity is proportional to the degree of attention. In this example, more tokens are attended to (strongly) by our model at the first level compared to the second level.

A.1 First-level (L1) category distribution of the KPN business market dataset.

A.2 Second-level (L2) category distribution of the KPN business market dataset. For simplicity, category names are shortened.


List of Tables

2.1 Comparison of several Dutch pre-trained models in terms of pre-training data.

4.1 Main hyperparameter settings of different models.

5.1 Test results of different models on the ZM dataset in terms of micro-F1 (M-F1) and weighted-F1 (W-F1) scores. The results at the first level reflect the robustness of each model. The flat baselines are trained to directly predict the leaf-level (L2) categories and hence only have results in the inferred scenario.

5.2 Test accuracies (a.k.a. micro F1-scores) of different models on the WOS dataset. Results marked with † and ‡ are taken from Sinha et al. (2018) and Kowsari et al. (2017) respectively.

5.3 Test results of our model with different alignment score functions in terms of micro and weighted F1-scores on the ZM dataset. For simplicity, we only report the predicted results at the first level.

5.4 Test results of our model with different numbers of attention hops in terms of micro and weighted F1-scores on the ZM dataset. For simplicity, we only report the predicted results at the first level.

5.5 Test results of our model with different decoder architectures in terms of micro and weighted F1-scores on the ZM dataset. For simplicity, we only report the predicted results at the first level.

5.6 Test results of our model with different pre-trained encoders in terms of micro and weighted F1-scores on the ZM dataset. For simplicity, we only report the predicted results at the first level.

C.1 Test results of models with different degrees of LM fine-tuning on BERTje on the ZM dataset.


1 | Introduction

1.1 Rising challenges in text classification

A tremendous amount of text is generated every day from all aspects of our lives (Kowsari et al., 2017; Allahyari et al., 2017), making automatic text classification an increasingly important task in Natural Language Processing (NLP) with a wide range of research and industrial applications (Joulin et al., 2016; Howard and Ruder, 2018). On the other hand, the task of text classification has become more challenging than ever before (Kowsari et al., 2017; Sinha et al., 2018). One reason is that, as the interest in a certain domain deepens, it is no longer sufficient to merely categorize the text data into a few coarse classes. Consequently, the categories become more numerous and more fine-grained (Kowsari et al., 2017; Sinha et al., 2018), resulting in the hierarchical text classification task. Although a straightforward way to address the problem is to flatten the class taxonomy and treat the task as a typical text classification problem (see Figure 1.1a), doing so can lose vital information and thus make the predictions less robust (Sinha et al., 2018).

Alternatively, one may exploit the category hierarchy and instead tackle the problem in a hierarchical fashion. In general, there are two types of hierarchical classification approaches: local (or 'top-down', see Figure 1.1b) and global (or 'big-bang', see Figure 1.1c) (Silla and Freitas, 2011). A local approach requires training and maintaining multiple classifiers at the same time, with the disadvantage that their number can grow exponentially as the hierarchy expands. A global approach, on the other hand, trains a single classifier to predict the entire hierarchy, which is beneficial both for deployment and maintenance. Within this range, Sinha et al. (2018) proposed a global attention-based classifier which they claimed to be the state-of-the-art for hierarchical classification. However, in their approach, all the network parameters are randomly initialized, which is not suitable for small datasets.

Another prominent challenge in text classification is that, while Deep Learning has recently dominated many NLP tasks (Young et al., 2018) and novel Deep Neural Networks (DNNs) with complex architectures are being proposed continuously, these models are generally trained from scratch, requiring lots of training data and possibly days or even weeks to converge (Howard and Ruder, 2018). Nevertheless, in some real-world scenarios, it is too expensive and time-consuming to have so much data labelled (Ratner et al., 2017), and training a huge model from scratch is not feasible because such training can be computationally and financially heavy (Jouppi et al., 2017). As a result, it has sometimes been difficult for DNNs to achieve significant improvements on limited data over traditional machine learning algorithms (van der Burgh and Verberne, 2019) like support vector machines (SVMs) (Joachims, 1998) and logistic regression (Zhang and Oles, 2001).

As a way to alleviate the small data issue, transfer learning has become popular in recent years (see Figure 1.2) to re-use low-level features of long-trained models in similar tasks where the training data are difficult to collect (Weiss et al., 2016). In other words, we can extract the weights of a large pre-trained model and fine-tune them slightly so that the model also works for the target task with limited labelled data.

Presumably, the most promising pre-trained language models for fine-tuning text classification models include ULMFiT (Howard and Ruder, 2018), BERT (Devlin et al., 2018), ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2019), etc. These large-scale models are similar in the sense that they provide the merit of having a common and expensive pre-training stage, followed by some lighter fine-tuning steps to solve the specific tasks (Delobelle et al., 2020). Furthermore, the language model pre-training and fine-tuning are self-supervised (e.g. predicting the next token in a sequence), meaning that no explicit labels are needed. Therefore, the models can make use of lots of general-purpose and in-domain unlabeled texts to learn the semantic information of the desired language (de Vries et al., 2019), which thereby helps to resolve supervised learning problems like text classification. These approaches thus improve the state-of-the-art performance on many NLP tasks and work much better than training context-free models like word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) on a large corpus and then re-using the trained vectors (i.e. word embeddings) for classification or regression.

Figure 1.1: (a) The flat classification approach which always predicts the leaf-level categories (the dashed square represents a text classifier and each node represents a different category, same hereinafter). (b) An example of the local hierarchical classification approach which builds a classifier per parent node. (c) The global classification approach which has an end-to-end classifier to predict the whole class hierarchy.

Figure 1.2: Search interest on the topic of 'transfer learning' worldwide over the past five years. Data source: Google Trends (https://www.google.com/trends).

However, the success of pre-trained language models on NLP tasks has long been limited to the most widely used languages such as English and Chinese (Devlin et al., 2018). Although there exist multilingual models like Multilingual BERT, which is trained on 104 languages, a monolingual model usually performs better for its language (Martin et al., 2019; Virtanen et al., 2019). For Dutch, several pre-trained monolingual language models have been published only recently, including Dutch-ULMFiT (van der Burgh and Verberne, 2019), BERTje (de Vries et al., 2019) and RobBERT (Delobelle et al., 2020). These models all have great potential for improving the performance on Dutch NLP tasks, but they have so far only been tested on a handful of tasks such as sentiment classification and part-of-speech tagging. Testing and experimenting with them more thoroughly can hence also promote research in the field and make full use of their utility. Moreover, as the largest


telecommunications company in the Netherlands, KPN receives a lot of calls and chats from customers every day. To serve these requests more quickly, automatically classifying such data into hierarchical categories is of great significance. However, labelled data are scarce for these tasks due to privacy concerns (Voigt and Von dem Bussche, 2017) and the required professional knowledge. Dutch pre-trained models are therefore well suited for adoption in our task, among others.

Currently, KPN uses flat logistic regression models in production for solving the hierarchical classification problems on customer and business data (CM & ZM). For commercial applications, on the one hand the classification performance of a model is important as it determines how fast requests can be forwarded to the correct (sub-)departments, while on the other hand interpretability is also important as it determines how reliable the predictions are. In terms of hierarchical classification, this means that the model should provide as much interpretability as possible without sacrificing performance, not only for the leaf level but also for all preceding levels. For researchers and developers, providing a specific interpretation at each level makes it easier to reason about flaws when errors arise: depending on the level at which an error occurs, the interpretation at that specific level can be inspected.

1.2 Contributions of this work

In this work, we address the aforementioned issues of text classification by improving upon a previously proposed attention-based global hierarchical classifier (Sinha et al., 2018) with transfer learning approaches. Through a series of experiments on our model and other state-of-the-art baselines on both internal and benchmark datasets, we demonstrate the superiority and robustness of our method in the field of hierarchical classification. To the best of our knowledge, we are the first to combine hierarchical classification models with transfer learning approaches. Specifically, the contributions of our work are as follows:

1. We propose a neural attention-based global classifier based on the model of Sinha et al. (2018) with transfer learning approaches, which achieves new state-of-the-art performance on the benchmark WOS-46985 dataset (Kowsari et al., 2017).

2. We compare several recently proposed pre-trained Dutch language models on the non-public KPN business market dataset, which provides a better picture of their usefulness in industrial applications.

3. We show from qualitative analysis that our hierarchical model provides better interpretability than the flat approaches as it allows interpretation at each level.


2 | Related Work

2.1 Hierarchical text classification

2.1.1 Flat classifiers

Traditional text classifiers mainly rely on different machine learning algorithms such as naïve Bayes (McCallum et al., 1998), decision trees (Apté et al., 1994), logistic regression (LR) and SVMs. Among these methods, LR is still used as one of the most accurate non-neural baselines (Pranckevičius and Marcinkevičius, 2017). However, a major shortcoming of such models is that they rely heavily on feature engineering and feature selection, for instance TF-IDF (Salton and Buckley, 1988) with the removal of stop words (Lai et al., 2015).

In recent years, with the advent of Deep Learning, DNNs are dominating the field of text classification. While initial models mostly use CNN-based (Kim, 2014) or RNN-based (Hochreiter and Schmidhuber, 1997) structures to learn sequence representations automatically via the recurrent components or some pooling operations, more recent work has incorporated and adapted the attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015) to obtain better and more interpretable results (Yang et al., 2016; Lin et al., 2017). Although these models no longer require manual feature engineering, they are generally task-specific and need to be trained from scratch if not using pre-trained word embeddings, hence demanding lots of training data to generalize (Howard and Ruder, 2018). Besides, traditional word representations such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) lack contextual information (Delobelle et al., 2020), as they only train a single embedding for each word type (e.g. ‘bank’), which therefore limits their accuracy.

2.1.2 Hierarchical classifiers

In terms of hierarchical classifiers, most of the existing work adopts local approaches where multiple classifiers are learnt from top to bottom (Vens et al., 2008; Kowsari et al., 2017). Hence, although these models exploit the external knowledge in the class taxonomy, they normally require a high computational cost for training and maintenance (Sinha et al., 2018). By contrast, methods under the global approach train an end-to-end unified classifier to predict the entire taxonomy. Within this scope, Aly et al. (2019) propose a variant of capsule networks (Hinton et al., 2011) to solve the hierarchical classification task. However, their model is outperformed by simple neural networks such as TextCNN (Kim, 2014) and LSTM (Hochreiter and Schmidhuber, 1997) on the benchmark WOS dataset (Kowsari et al., 2017). On the other hand, Sinha et al. (2018) propose an attention-based global hierarchical classifier which is the current state-of-the-art in hierarchical classification. The main idea of their method is similar to Seq2Seq modelling, where the input word sequence is used to generate the output category sequence. In addition, they revise the attention mechanism to produce dynamic input representations for predicting the labels at different levels. However, a major disadvantage of their model is that all the network parameters are randomly initialized, meaning that it might not generalize well on small datasets.

2.2 Transfer learning in NLP

Assuming a source task $T_S$ and a related target task $T_T$ where $T_S \neq T_T$, transfer learning is the process of improving the performance on $T_T$ by re-using the relevant information of $T_S$ in $T_T$ (Weiss et al., 2016). Transfer learning has proved to work superbly in Computer Vision (CV), where models trained on, for instance, ImageNet (Deng et al., 2009) can be adopted to resolve other visual recognition tasks (Sharif Razavian et al., 2014; Oquab et al., 2014). However, until recent years, transfer learning had not worked as well for NLP as for CV (Mou et al., 2016), and the idea of a general-purpose language model extracting good textual features (i.e. embeddings) transferable to down-stream NLP tasks is still an actively investigated area.

2.2.1 Common approaches

Currently, some of the most widely adopted transfer learning approaches for fine-tuning text classification models are ULMFiT (Howard and Ruder, 2018), BERT (Devlin et al., 2018) and ALBERT (Lan et al., 2019). Although there exist some differences between these approaches which will be discussed later, they share a similar pipeline which mainly contains the following steps: 1) general-purpose language model (LM) pre-training; 2) task-specific language model fine-tuning; and 3) classifier fine-tuning (Howard and Ruder, 2018). For the first two steps, the models are trained on language modelling tasks where no explicit labels are required, thus the models can utilize large unlabeled text collections to learn contextual word representations of the language. Then, the third step only needs a small labelled dataset for fine-tuning the last few layers to perform the classification task.

Figure 2.1: General steps of applying transfer learning approaches for text classification: general-purpose language model pre-training, in-domain language model fine-tuning, and in-domain classifier fine-tuning.
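To make the last step of this pipeline concrete, the following is a minimal sketch of classifier fine-tuning with the HuggingFace Transformers library; the checkpoint name and the number of labels are placeholders, and steps 1-2 (LM pre-training and in-domain LM fine-tuning) are assumed to have been performed already.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint: in practice this would be a general-purpose or
# in-domain fine-tuned LM (steps 1 and 2 of the pipeline).
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=36)

# A single forward pass over one example text; during step 3 the classification
# head (and optionally the encoder) is trained on the small labelled dataset.
inputs = tokenizer("voorbeeld van een klanttranscript", return_tensors="pt",
                   truncation=True)
logits = model(**inputs).logits  # unnormalized scores over the 36 classes
```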

In terms of the technical differences between the aforementioned methods, ULMFiT is trained on the straightforward language modelling task and uses AWD-LSTM (Merity et al., 2017) as the model architecture. On the other hand, BERT uses the transformer architecture (Vaswani et al., 2017), which is fully based on self-attention without recurrences or convolutions, and it is pre-trained on two tasks: masked language modelling (MLM) and next sentence prediction (NSP). The MLM task differs from straightforward language modelling in that the masked tokens can appear in any position of the input. As for NSP, the model needs to predict whether two sentences are contextual or just randomly selected. These two tasks are designed for the BERT model to capture semantic information both on the token level and between sentences. Nevertheless, the NSP task has been shown to be ineffective as it fails to model sentence relationships (Liu et al., 2019). Therefore, ALBERT replaces the NSP task with the sentence order prediction (SOP) task, where the two sentences are either in the correct order or swapped (Lan et al., 2019). As another variant of BERT, RoBERTa (Liu et al., 2019) drops NSP and uses dynamic masking to mask out different tokens across epochs, with a larger batch size.

2.2.2 Dutch pre-trained language models

Table 2.1: Comparison of several Dutch pre-trained models in terms of pre-training data.

| LM | Wikipedia | Books | TwNC¹ | SoNaR-500² | Web News | OSCAR³ | Total Size |
|---|---|---|---|---|---|---|---|
| Dutch-ULMFiT | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | 92M |
| BERTje | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 2.4B |
| RobBERT | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | 6.6B |

Although the above pre-trained language models were initially only available for some of the most common languages such as English and Chinese (Devlin et al., 2018), concurrent work has also released monolingual Dutch versions, including Dutch-ULMFiT (van der Burgh and Verberne, 2019), BERTje (de Vries et al., 2019) and RobBERT (Delobelle et al., 2020). While Dutch-ULMFiT was pre-trained on the direct language modelling task, as the original ULMFiT model, on Dutch Wikipedia, BERTje was pre-trained with the MLM and SOP objectives on a much larger text collection (Table 2.1).

¹ Ordelman et al. (2007). ² Oostdijk et al. (2013). ³ Suárez et al. (2019).


In addition, RobBERT was pre-trained on the largest dataset of the three, with 6.6 billion tokens, using the RoBERTa architecture. Nevertheless, despite its huge data size, RobBERT was pre-trained on the Dutch OSCAR dataset (Suárez et al., 2019) alone, which was obtained by running language recognition on Common Crawl with fastText (Joulin et al., 2016; Grave et al., 2018).

2.3 Towards more interpretable models

Although deep learning models have shown impressive performance on a wide range of tasks, their usefulness for real-world problems has generally been limited by a lack of transparency and interpretability due to their non-linear structures (Samek et al., 2017). With the increasing interest in building explainable AI systems, many previous and recent works have contributed different strategies. Thrun (1995) proposes to extract if-then rules from neural networks; similarly, Craven and Shavlik (1996) propose to induce decision trees that approximate the networks. More recently, Faruqui et al. (2015) propose to use sparser word vectors that better encode the features of lexical semantics. In addition, Lei et al. (2016) and Bastings et al. (2019) extract informative pieces of text (known as rationales) using generative models. As another approach, attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015) encode the input sequence by learning a weighted alignment distribution between the output and each of the input elements, and have shown superior performance on many problems such as image captioning (Xu et al., 2015) and machine translation (Bahdanau et al., 2014).


3 | Method

3.1 Model architecture

Our proposed model, as illustrated in Figure 3.1, consists of three parts: 1) a pre-trained BERT encoder which encodes each token in the input sequence into a corresponding contextual vector representation; 2) an attention layer which helps to generate dynamic sequence representations for the different levels of classification; and 3) multi-layer perceptrons (MLPs) which predict the label at each level based on the sequence representations.

Figure 3.1: Architecture of our model.

Similar to the work of Sinha et al. (2018), our model is essentially a sequence-to-sequence model which solves the hierarchical classification problem in a global manner. Unlike typical Seq2Seq tasks in NLP such as machine translation or question answering (Sutskever et al., 2014), where both input and output sequences are sentences, the output of our model is a sequence of labels and each time-step denotes a different level. Besides, to avoid the difficulty of representing the entire input sequence with a fixed-length vector, we adopt the attention mechanism proposed by Sinha et al. (2018) to allow for dynamic sequence representations across different time-steps of the decoder. Nevertheless, instead of randomly initializing the encoder parameters or using static word embeddings such as word2vec or GloVe, our model exploits contextual token representations from a large-scale pre-trained language model, namely BERT (Devlin et al., 2018; de Vries et al., 2019). Therefore, as we will further elaborate in the next sections, the number of parameters trained from scratch in our model is minimised.

3.2 Algorithm

In general, our algorithm contains four steps to predict hierarchical categories, namely: 1) encoding input sequences; 2) computing attention scores; 3) calculating sequence representations; and 4) predicting categories. While the first step is only performed once for each input, the rest are repeated for each level. An overview of the training procedure of our model is shown in Algorithm 1.

Algorithm 1 Training steps of our model.

Input: token sequence $\{w_1, w_2, \ldots, w_n\}$, label sequence $\{l_1, l_2, \ldots, l_m\}$
1: Initialize decoder parameters (incl. category embeddings)
2: Encode the token sequence (Equation 3.1)
3: for j = 1 to m do
4:     Look up the category embedding of $l_{j-1}$
5:     Compute attention scores (Equations 3.2 and 3.5)
6:     Calculate (dynamic) sequence representations (Equation 3.6)
7:     Predict the category at level j (Equation 3.7)
8:     Compute the cross-entropy loss between the predicted category and $l_j$ (Equation 3.8)
9: end for
10: Update encoder and decoder parameters according to the sum of the cross-entropy losses at all levels (Equation 3.8)
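Read as code, Algorithm 1 amounts to one encoder pass followed by a per-level decoding loop with teacher forcing. The sketch below is a hedged PyTorch rendering of that loop; `encoder`, `category_embedding`, `attention` and `mlps` are hypothetical components standing in for the modules defined in the following subsections, not the exact implementation used in this thesis.

```python
import torch
import torch.nn.functional as F

def training_step(batch_tokens, batch_labels, encoder, category_embedding,
                  attention, mlps, start_id=0):
    """One training step following Algorithm 1 (sketch).

    batch_tokens: dict from a HuggingFace tokenizer (input_ids, attention_mask, ...)
    batch_labels: LongTensor of shape (batch, m) with the category ID per level.
    """
    H = encoder(**batch_tokens).last_hidden_state          # Eq. 3.1: (batch, n, d_h)
    prev = torch.full_like(batch_labels[:, 0], start_id)   # predefined <s> category
    loss = 0.0
    for j in range(batch_labels.size(1)):                  # loop over the m levels
        c_prev = category_embedding(prev)                  # embedding of l_{j-1}
        a_j, s_j = attention(H, c_prev, batch_tokens["attention_mask"])  # Eqs. 3.2-3.6
        logits = mlps[j](s_j)                              # Eq. 3.7 (softmax inside the loss)
        loss = loss + F.cross_entropy(logits, batch_labels[:, j])        # Eq. 3.8
        prev = batch_labels[:, j]                          # teacher forcing during training
    return loss                                            # backpropagated through encoder + decoder
```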

3.2.1 Encoding input sequences

Formally, suppose we have a sequence of $n$ tokens $\{w_1, w_2, \ldots, w_n\}$ and its labels at $m$ levels $\{l_1, l_2, \ldots, l_m\}$¹. A pre-trained BERT model is first used to obtain the bidirectional contextual representations of the tokens. Since BERT$_{\text{BASE}}$ employs 12 transformer blocks in its encoder (Devlin et al., 2018; de Vries et al., 2019), we take the sequence of hidden states at the output of the last layer of the BERT model, $H \in \mathbb{R}^{n \times d_h}$ ($d_h$: LM hidden dimension):

$$H = (h_1^o, h_2^o, \ldots, h_n^o) \tag{3.1}$$

where each $h_i^o \in \mathbb{R}^{d_h}$ and $o$ stands for the last layer.
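As a concrete illustration of Equation 3.1, the last-layer hidden states can be obtained from a pre-trained Dutch BERT model as sketched below; the BERTje checkpoint name on the HuggingFace hub is an assumption and the example sentence is arbitrary.

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "GroNLP/bert-base-dutch-cased"   # assumed hub name for BERTje
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

enc = tokenizer(["goedemiddag, ik heb een vraag over mijn factuur"],
                return_tensors="pt", truncation=True)
with torch.no_grad():
    H = encoder(**enc).last_hidden_state      # Eq. 3.1: shape (1, n, d_h) with d_h = 768
```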

3.2.2 Computing attention scores

As our model learns dynamic sequence representations for the classification at different levels, we do not simply take the final hidden state of the special [CLS] token as the sequence representation. Instead, similar to Sinha et al. (2018), when classifying the category at level $j$, we first convert the category ID at the previous level $l_{j-1}$ into its embedding $c_{j-1} \in \mathbb{R}^{d_c}$ ($d_c$: category embedding dimension), which is randomly initialized and learnable, and then concatenate it with every token representation in $H$²:

$$\hat{H}_j = ([h_1^o, c_{j-1}], [h_2^o, c_{j-1}], \ldots, [h_n^o, c_{j-1}]) \tag{3.2}$$

For the special case when $j = 1$, we feed the decoder with a predefined <s> category which is the same for every input. Then, we use this aggregated representation $\hat{H}_j \in \mathbb{R}^{n \times (d_h + d_c)}$ to compute the attention score for each input token at level $j$ with a randomly initialized and learnable weight matrix $W_{a1} \in \mathbb{R}^{d_a \times (d_h + d_c)}$ and vector $w_{a2} \in \mathbb{R}^{d_a}$ ($d_a$: attention hidden dimension)³:

$$a_j = \text{align}(H, c_{j-1}) \tag{3.3}$$
$$\phantom{a_j} = \text{softmax}\left(\text{score}(H, c_{j-1})\right) \tag{3.4}$$
$$\phantom{a_j} = \text{softmax}\left(w_{a2} \tanh\left(W_{a1} \hat{H}_j^\top\right)\right) \tag{3.5}$$

We use the softmax function to ensure that the attention scores sum up to 1. Note that when training in batches, we need to account for the attention masks (indicating the padding positions) before applying the softmax function in Equation 3.5 to avoid attending to the padding ([PAD]) tokens.

¹ We assume that each $w_i$ and $l_i$ correspond to the token ID and category ID, respectively.

² In Sinha et al. (2018), this concatenation operation is denoted by a dedicated operator symbol. However, to avoid possible confusion with the tensor product, we adopt the $[\,\cdot\,,\cdot\,]$ notation.

³ This is similar to the concat score function introduced in Luong et al. (2015), except that the decoder's hidden state is replaced by the category embedding to compute the attention scores.


3.2.3 Calculating sequence representations

The unique sequence representation $s_j \in \mathbb{R}^{d_h}$ at level $j$ can therefore be obtained through the matrix multiplication between $a_j$ and the encoder's hidden states:

$$s_j = a_j H \tag{3.6}$$

As we repeat this and the previous step for each level, our model produces specific sequence representations at different levels.
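A minimal PyTorch sketch of Equations 3.2-3.6, i.e. the concat-style attention conditioned on the previous-level category embedding and the resulting level-specific sequence representation. The module below is an illustrative re-implementation under the dimensions of Table 4.1, not the exact code used for the experiments.

```python
import torch
import torch.nn as nn

class LevelAttention(nn.Module):
    """Attention over token representations, conditioned on c_{j-1} (Eqs. 3.2-3.6)."""

    def __init__(self, d_h=768, d_c=64, d_a=256):
        super().__init__()
        self.W_a1 = nn.Linear(d_h + d_c, d_a, bias=False)   # W_{a1}
        self.w_a2 = nn.Linear(d_a, 1, bias=False)            # w_{a2}

    def forward(self, H, c_prev, attention_mask):
        # H: (batch, n, d_h); c_prev: (batch, d_c); attention_mask: (batch, n)
        c = c_prev.unsqueeze(1).expand(-1, H.size(1), -1)     # repeat c_{j-1} per token
        H_hat = torch.cat([H, c], dim=-1)                     # Eq. 3.2: [h_i, c_{j-1}]
        scores = self.w_a2(torch.tanh(self.W_a1(H_hat))).squeeze(-1)      # Eq. 3.5, pre-softmax
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))   # skip [PAD] tokens
        a_j = torch.softmax(scores, dim=-1)                   # Eqs. 3.3-3.5: weights sum to 1
        s_j = torch.bmm(a_j.unsqueeze(1), H).squeeze(1)       # Eq. 3.6: s_j = a_j H
        return a_j, s_j
```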

3.2.4 Predicting categories

Finally, the category at level $j$ is predicted by a single-layered MLP with batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014), followed by a softmax over all the classes at level $j$:

$$y_j = \text{softmax}\left(W_s s_j^\top + b_s\right) \tag{3.7}$$

The classification loss $\mathcal{L}_{total}$ is then the summation of the cross-entropy loss over all $m$ levels (Sinha et al., 2018):

$$\mathcal{L}_{total} = \sum_{i=1}^{m} \text{CELoss}(y_i, l_i) \tag{3.8}$$
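A sketch of one such per-level prediction head is given below; the exact ordering of batch normalization, dropout and the linear layer is an assumption, and the softmax of Equation 3.7 is left to the cross-entropy loss, as is conventional in PyTorch.

```python
import torch.nn as nn

def make_level_head(d_h=768, num_classes=36, p_drop=0.3):
    """Single-layered MLP head for one level (Eq. 3.7); layer order is illustrative."""
    return nn.Sequential(
        nn.BatchNorm1d(d_h),
        nn.Dropout(p_drop),
        nn.Linear(d_h, num_classes),   # W_s s_j + b_s; softmax applied inside the loss
    )
```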


4 | Experiments

4.1 Research questions

Our experiments are guided by the following research questions:

For flat classifiers,

1. Does using data augmentation methods (e.g. oversampling) improve the classification performance?
2. Does using neural models without transfer learning approaches improve the classification performance?
3. How does LM fine-tuning improve the classification performance using the pre-trained model BERTje?
4. How does BERT(je) compare to (Dutch-)ULMFiT in terms of the classification performance?

For hierarchical classifiers,

5. Does using the global hierarchical classifier of Sinha et al. (2018) improve the classification performance?
6. Does using our global hierarchical classifier with transfer learning improve the classification performance?

4.2 Experimental setup

4.2.1 Datasets

The KPN business market (ZM) dataset is used for the majority of our experiments. The ZM dataset is a two-level hierarchical dataset in Dutch which contains a total of 8,904 labelled call and chat transcripts between agents and customers. As shown in Appendix A, the dataset has 6 first-level (L1) classes and 36 second-level (L2) classes. We randomly divide the dataset into 75% and 25% for training and testing, which amounts to 6,678 and 2,226 samples respectively. Due to the small size of the ZM dataset, and to be consistent with the current scheme of KPN, the test set is also used for validation. In addition, we ensure that every class is present in the test set for fair assessment.

Moreover, as the ZM dataset contains sensitive information and is thus not publicly available, we also test our model on the benchmark Web of Science (WOS) dataset (Kowsari et al., 2017) to allow reproducibility. The WOS dataset is a two-level hierarchical dataset in English which has 46,985 documents in total, with 7 first-level (L1) classes and 134 second-level (L2) classes. Like Sinha et al. (2018), we randomly split the dataset into 80%, 10% and 10% for training, validation and testing.
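As a sketch of such a random split for the ZM data with scikit-learn, stratifying on the leaf-level labels is one way to keep every class represented in the test set; this is an illustrative assumption rather than the exact splitting procedure used in the thesis.

```python
from sklearn.model_selection import train_test_split

# texts: list of call/chat transcripts; l2_labels: their leaf-level (L2) categories
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, l2_labels, test_size=0.25, random_state=42, stratify=l2_labels)
```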


4.2.2 Baselines

State-of-the-art flat classifiers, including both non-neural and neural models such as LR (Zhang and Oles, 2001), ULMFiT (Howard and Ruder, 2018) and BERT (Devlin et al., 2018), are used for comparison. Specifically, to answer Questions 1 to 4 in section 4.1, we report the results of the following flat baselines. For LR, we train the model either on the original training set or with oversampling of the minority classes. For BERT, we train the model either from scratch on the training set or starting from the pre-trained BERTje model (with or without LM fine-tuning on the training set). For ULMFiT, we fine-tune the LM either on the training set only or also on an extra unlabelled ZM dataset with 37,137 samples. Besides, we compare our model to the state-of-the-art global hierarchical classifier 'hier-class' (Sinha et al., 2018).

4.2.3 Hyperparameters

Table 4.1: Main hyperparameter settings of different models.

| Parameter | LR | BERT | ULMFiT | hier-class | Our model |
|---|---|---|---|---|---|
| Inverse of regularization strength | 1e2 | - | - | - | - |
| Class weights | balanced | - | - | - | - |
| Maximum iterations | 1e3 | - | - | - | - |
| Number of epochs | - | 10 | 15 | 10 | 10 |
| Optimizer | L-BFGS¹ | Adam² | Adam | Adam | Adam |
| LM learning rate (where applicable) | - | 1e-5 | 7e-2 | - | 1e-5 |
| Classifier learning rate | - | 1e-3 | 5e-2 | 1e-3 | 1e-3 |
| Batch size | - | 16 | 32 | 16 | 16 |
| Dropout rate | - | 0.3 | 0.3 | 0.3 | 0.3 |
| Word embedding dimension | - | - | 400 | 256 | - |
| LSTM hidden dimension | - | - | 1150 | 256 | - |
| Category embedding dimension | - | - | - | 64 | 64 |
| Attention hidden dimension | - | - | - | 256 | 256 |
| LM hidden dimension | - | 768 | - | - | 768 |

LR

We use the scikit-learn module (Pedregosa et al., 2011) for the implementation of our LR models, where TF-IDF features over an n-gram range of (1, 3), with Dutch stop words removed, are extracted as features. In addition, we set the main hyperparameters of our LR models as listed in Table 4.1. For the LR model with oversampling, we oversample every minority class with fewer than 1,000 samples in the training set to the minimum of 200 and twice its original size.
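A minimal sketch of such a flat LR baseline, assuming NLTK's Dutch stop-word list (the exact stop-word list used in the thesis is not specified) and the hyperparameters of Table 4.1:

```python
from nltk.corpus import stopwords   # requires nltk.download("stopwords") once
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

lr_baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3),
                              stop_words=stopwords.words("dutch"))),
    ("clf", LogisticRegression(C=1e2, class_weight="balanced",
                               max_iter=1000, solver="lbfgs")),
])

# texts: list of transcripts; labels: leaf-level (L2) categories
# lr_baseline.fit(texts, labels)
```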

BERT

For our BERT models, we apply different learning rates for fine-tuning the LM and the classifier. Specifically, we adopt a higher learning rate for the classifier because the parameters of the classification layers are trained from scratch on our task. In contrast, the learning rate for fine-tuning the LM should be smaller as recommended by Devlin et al. (2018). Besides, following BERT (Devlin et al., 2018) and BERTje (de Vries et al., 2019), we apply WordPiece (Schuster and Nakajima, 2012; Wu et al., 2016) for tokenization, which is effective for handling rare words. The detailed hyperparameter settings of our BERT models are listed in Table 4.1.
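The different learning rates for the LM and the classification head translate directly into PyTorch optimizer parameter groups, as sketched below; `model.bert` and `model.classifier` are hypothetical attribute names used only for illustration.

```python
import torch

optimizer = torch.optim.Adam([
    {"params": model.bert.parameters(), "lr": 1e-5},        # gently fine-tune the pre-trained LM
    {"params": model.classifier.parameters(), "lr": 1e-3},  # train the freshly initialized head faster
])
```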

¹ Nocedal (1980). ² Kingma and Ba (2014).


ULMFiT

We adopt the Fast.ai package (Howard and Gugger, 2020) for fine-tuning the ULMFiT models, where a three-layered AWD-LSTM is used as the LM. Furthermore, we employ techniques such as discriminative fine-tuning, slanted triangular learning rates (STLR) and gradual unfreezing proposed by Howard and Ruder (2018) during our fine-tuning. The main hyperparameter settings of our ULMFiT models are listed in Table 4.1.
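For reference, a heavily hedged sketch of this fine-tuning recipe with the fastai v2 text API; the dataframe and column names are placeholders, the epoch counts are abbreviated, and the learning rates only loosely follow Table 4.1.

```python
from fastai.text.all import *

# Step 2: task-specific LM fine-tuning on the (unlabelled) transcripts.
dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True, valid_pct=0.1)
lm_learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
lm_learn.fit_one_cycle(1, 7e-2)
lm_learn.save_encoder("ft_encoder")

# Step 3: classifier fine-tuning with gradual unfreezing and discriminative LRs.
dls_clf = TextDataLoaders.from_df(df, text_col="text", label_col="label",
                                  text_vocab=dls_lm.vocab)
clf_learn = text_classifier_learner(dls_clf, AWD_LSTM, drop_mult=0.3)
clf_learn.load_encoder("ft_encoder")
clf_learn.fit_one_cycle(1, 5e-2)                  # train only the new head first
clf_learn.freeze_to(-2)                           # gradually unfreeze deeper layers
clf_learn.fit_one_cycle(1, slice(5e-3, 5e-2))     # discriminative fine-tuning
```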

hier-class

Following the implementation of Sinha et al. (2018), for the hier-class model a two-layered Bi-LSTM is employed as the encoder, which is randomly initialized and updated during training. Besides, we use 1 attention hop in the attention mechanism instead of multiple hops because it gives the best validation performance on our ZM dataset. For the other hyperparameters, the values are set as listed in Table 4.1.

Our model

We set the hyperparameters of our model similar to those of BERT and hier-class for a fair comparison, as listed in Table 4.1. In addition, we also use one attention hop in the attention mechanism, as in hier-class, because it performs better than using more attention hops (as shown in subsection 5.3.2).

4.2.4 Evaluation

To evaluate the classification performance, considering that our ZM dataset is very imbalanced and the majority classes are more important, we report on two different metrics: micro-F1 and weighted-F1. In addition, because in our task each instance is assigned to exactly one category (per level), the micro F1-score is also equivalent to the accuracy. For hierarchical models, we employ different mechanisms for training and testing. In the training process, we use teacher-forcing where the true parent label is provided to predict the next-level labels. However, during inference, we present the overall results (Sinha et al., 2018) where the model takes its own prediction as an input to the next level.
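The two metrics can be computed with scikit-learn as sketched below; `y_true` and `y_pred` are the per-level gold and predicted category IDs, and, with exactly one label per instance per level, the micro F1-score coincides with accuracy.

```python
from sklearn.metrics import f1_score

micro_f1 = f1_score(y_true, y_pred, average="micro")        # equals accuracy here
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weights classes by support
```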


5 | Results

5.1 Experimental results

As listed in Table 5.1 and Table 5.2, our model considerably outperforms the state-of-the-art hierarchical classifiers on both the ZM and WOS datasets. Specifically, we improve the performance of hier-class on the ZM dataset by around 3.2% at the first level and around 7.2% at the second level. On the benchmark WOS dataset, we improve the performance of HDLTex (a local baseline) by 0.6% and 5.1%, and hier-class by 1.8% and 4.2%, at the first and second levels. Furthermore, compared with the flat baselines, our model achieves the best performance on the WOS dataset in terms of the leaf-level accuracy and on the ZM dataset in terms of the weighted F1-score. However, we also notice that our model is slightly worse than the best flat baseline, by 0.1%, on the leaf-level micro F1-score on the ZM dataset (Table 5.1). Therefore, we additionally evaluate the robustness of all the models on the ZM dataset in terms of their first-level performance. As indicated in Table 5.1, our model is more robust to errors because 84.5% to 84.6% of the classified data have the correct predicted or inferred parent classes, while both the flat and hierarchical baselines perform worse. In other words, the higher scores of our model at the first level indicate that hierarchically classifying the data is beneficial at the higher levels.

Table 5.1: Test results of different models on the ZM dataset in terms of micro-F1 (M-F1) and weighted-F1 (W-F1) scores. The results at the first level reflect the robustness of each model. The flat baselines are trained to directly predict the leaf-level (L2) categories and hence only have results in the inferred scenario.

| Model | L1 Predicted M-F1 | L1 Predicted W-F1 | L1 Inferred M-F1 | L1 Inferred W-F1 | L2 M-F1 | L2 W-F1 |
|---|---|---|---|---|---|---|
| Flat classifiers | | | | | | |
| LR + TFIDF (w/o oversampling) | - | - | 0.814 | 0.806 | 0.570 | 0.547 |
| LR + TFIDF (oversampling) | - | - | 0.816 | 0.807 | 0.570 | 0.545 |
| BERT (from scratch) | - | - | 0.776 | 0.782 | 0.508 | 0.488 |
| BERTje (w/o LM fine-tune) | - | - | 0.775 | 0.767 | 0.511 | 0.490 |
| BERTje (small LM fine-tune) | - | - | 0.845 | 0.843 | 0.631 | 0.615 |
| Dutch-ULMFiT (small LM fine-tune) | - | - | 0.826 | 0.822 | 0.597 | 0.576 |
| Dutch-ULMFiT (extra LM fine-tune) | - | - | 0.839 | 0.836 | 0.614 | 0.589 |
| Hierarchical classifiers | | | | | | |
| hier-class (Sinha et al., 2018) | 0.819 | 0.813 | 0.817 | 0.811 | 0.558 | 0.543 |
| Our model (hier-class + BERTje) | 0.845 | 0.845 | 0.846 | 0.845 | 0.630 | 0.619 |

Among all our flat baselines, the BERTje model with its LM fine-tuned on the labelled training data performs the best for both metrics¹ (see Table 5.1), followed by the Dutch-ULMFiT model with its LM fine-tuned on both the labelled and extra unlabelled datasets. In contrast, the BERT model trained from scratch on the ZM dataset performs the worst, with micro and weighted F1-scores that are around 6.2% and 5.9% lower than those of the non-neural LR models, respectively. Besides, the BERTje model without LM fine-tuning slightly improves the results over the BERT model trained from scratch, but is still far worse than the LR models. Although it might be surprising that the BERT model trained from scratch performs comparably with the BERTje model without LM fine-tuning on the ZM dataset, we do observe that models using the pre-trained BERTje LM are more stable during training and validation (Figure 5.1), which indicates the benefit of using pre-trained weights.

¹ Initially, we did not experiment with all possible variants of the models, such as 'BERTje (extra LM fine-tune)' and 'Our model (hier-class + extra BERTje)', as the experiments reported in Table 5.1 suffice to answer our research questions. However, for reference, we list the remaining results in Appendix C.


Table 5.2: Test accuracies (a.k.a. micro F1-scores) of different models on the WOS dataset. Results marked with † and ‡ are taken from Sinha et al. (2018) and Kowsari et al. (2017) respectively.

| Model | Level-1 Accuracy | Level-2 Accuracy |
|---|---|---|
| Flat classifiers | | |
| FastText (Joulin et al., 2016)† | - | 0.613 |
| CNN (Lee and Dernoncourt, 2016)‡ | - | 0.705 |
| BiLSTM + MLP + Maxpool† | - | 0.777 |
| BiLSTM + MLP + Meanpool† | - | 0.731 |
| Hierarchical classifiers | | |
| HDLTex (Kowsari et al., 2017)‡ | 0.905 | 0.766 |
| hier-class (Sinha et al., 2018)† | 0.893 | 0.775 |
| Our model (hier-class + BERT) | 0.911 | 0.817 |

In addition, although our model has training loss and validation weighted-F1 curves similar to those of the BERTje model with LM fine-tuning, as we will discuss in chapter 6, our model provides better interpretability than the flat approaches in general while matching or even improving on the classification performance of the best flat baseline.

Figure 5.1: Training loss and L2 validation weighted-F1 curves of models with BERT(je) LM on the ZM dataset (curves shown for BERT from scratch, BERTje without LM fine-tuning, BERTje with small LM fine-tuning, and our model).

5.2 Preliminary summary

With respect to the research questions in section 4.1, our corresponding preliminary findings are as follows:

1. We show from the LR models that using data augmentation methods (e.g. oversampling) does not improve the classification performance on the ZM dataset.

2. We show from the BERT model trained from scratch that using neural models without transfer learning approaches does not improve the classification performance on the ZM dataset.

3. We show from the BERTje models that LM fine-tuning is crucial for improving the classification performance on the ZM dataset.

4. We show from the comparison between the BERTje and Dutch-ULMFiT models that BERT(je) achieves better classification performance than (Dutch-)ULMFiT on the ZM dataset.

5. The original hier-class model does not improve the classification performance on the ZM dataset.

6. Our model outperforms the hierarchical baselines and most of the flat baselines, while performing comparably to or slightly better than the best flat baseline on the ZM dataset.

5.3 Ablation studies

Apart from the above experiments, we conduct extensive ablation studies to better understand the importance of each component of our model. Specifically, we study the effect of 1) adopting different score functions in our attention mechanism; 2) using different numbers of attention hops; 3) using different decoder architectures; and 4) using different pre-trained models as the encoder.

5.3.1 Effect of score functions

As introduced in section 3.2, our model employs the adapted attention mechanism of Sinha et al. (2018), which adopts a score function similar to the concat (or additive²) attention (Luong et al., 2015). In addition, we also compare with the dot (or dot-product) attention and the general (or multiplicative) attention (Luong et al., 2015). The detailed equations can be found in section B.1.

For the dot attention, we use 768-dimensional category embeddings, as it directly computes the dot product between the previous category embedding and the encoder's hidden states without learnable weight matrices. The other hyperparameters of the two variants are kept the same as for our model in Table 4.1.
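For reference, the three alignment score functions of Luong et al. (2015), written here in our notation where the decoder hidden state is replaced by the previous-level category embedding (a sketch of the compared variants; the exact formulations used are given in section B.1):

$$\text{score}(h_i^o, c_{j-1}) = \begin{cases} c_{j-1}^\top h_i^o & \text{(dot)} \\ c_{j-1}^\top W_a h_i^o & \text{(general)} \\ w_{a2}^\top \tanh\left(W_{a1} [h_i^o, c_{j-1}]\right) & \text{(concat)} \end{cases}$$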

Table 5.3: Test results of our model with different alignment score functions in terms of micro and weighted F1-scores on the ZM dataset. For simplicity, we only report the predicted results at the first level.

| Alignment | L1 M-F1 | L1 W-F1 | L2 M-F1 | L2 W-F1 |
|---|---|---|---|---|
| dot | 0.831 | 0.832 | 0.607 | 0.598 |
| general | 0.836 | 0.835 | 0.607 | 0.596 |
| concat | 0.845 | 0.845 | 0.630 | 0.619 |

Figure 5.2: Validation weighted-F1 curves of our model with different score functions on the ZM dataset.

² Referred to as additive attention in Vaswani et al. (2017), and as concat in Luong et al. (2015).


We show the end results in Table 5.3 and Figure 5.2. As listed in Table 5.3, our model with the concat attention achieves the best performance on the ZM dataset, while the dot and general attention perform comparably. Moreover, as revealed in Figure 5.2, our model with the concat attention performs considerably better in the initial stage and converges to its local optimum at around 1,200 iterations for both levels, which is faster than the dot attention (around 2,800 iterations) and the general attention (around 2,100 iterations).

5.3.2 Effect of attention hops

As using a single attention hop at each level might only allow the model to focus on one specific aspect of information (Lin et al., 2017), and customers might mention different topics in a single call or chat, we also explore the effect of the number of attention hops on the classification performance by comparing our model using 1 hop against multiple hops, ranging from 2 to 4, on the ZM dataset. The detailed equations can be found in section B.2.

Table 5.4: Test results of our model with different numbers of attention hops in terms of micro and weighted F1-scores on the ZM dataset. For simplicity, we only report the predicted results at the first level.

| Hop | L1 M-F1 | L1 W-F1 | L2 M-F1 | L2 W-F1 |
|---|---|---|---|---|
| 1-hop | 0.845 | 0.845 | 0.630 | 0.619 |
| 2-hop | 0.839 | 0.833 | 0.621 | 0.606 |
| 3-hop | 0.837 | 0.834 | 0.620 | 0.608 |
| 4-hop | 0.831 | 0.828 | 0.615 | 0.600 |

Figure 5.3: Validation weighted-F1 curves of our model using different numbers of attention hops on the ZM dataset.

We show the results in Table 5.4 and Figure 5.3. As listed in Table 5.4, while the difference is minor, our model performs best using 1-hop attention. In addition, we observe from both Table 5.4 and Figure 5.3 that the test results tend to worsen as the number of attention hops increases, indicating that multi-hop attention might not be suitable for small datasets, or that there might not be multiple topics in the samples of the ZM dataset. Compared with the choice of score function, the effect of adopting multiple attention hops is negligible at both levels.

5.3.3 Effect of decoder architectures

As introduced in chapter 3, the decoder of our model consists of an attention mechanism and some MLP layers. In addition, we compare with the following decoder architectures:


1. no-attn + MLP: We drop the attention mechanism introduced in subsection 3.2.2 and use the average of the encoder's last-layer hidden states as the sequence representation⁵. However, to allow the conditioning of categories across levels, for each level we concatenate the fixed sequence representation with the category embedding at the previous level. The detailed equations can be found in subsection B.3.1.

2. no-attn + LSTM + MLP: We again drop the attention mechanism in subsection 3.2.2 and use a one-layered unidirectional LSTM network as our decoder. However, since the LSTM takes care of the sequence memory itself, we do not perform the concatenation as for the former variant. Besides, we initialize the first hidden state and cell state of the LSTM with the average of the encoder's last-layer hidden states. The detailed equations can be found in subsection B.3.2.

Table 5.5: Test results of our model with different decoder architectures in terms of micro and weighted F1-scores on the ZM dataset. For simplicity, we only report the predicted results at the first level.

| Decoder | L1 M-F1 | L1 W-F1 | L2 M-F1 | L2 W-F1 |
|---|---|---|---|---|
| no-attn + MLP | 0.844 | 0.843 | 0.619 | 0.610 |
| no-attn + LSTM + MLP | 0.841 | 0.841 | 0.623 | 0.610 |
| attn + MLP | 0.845 | 0.845 | 0.630 | 0.619 |

Figure 5.4: Validation weighted-F1 curves of our model with different decoder architectures on the ZM dataset.

We show the results in Table 5.5 and Figure 5.4. As listed in Table 5.5, while the final difference at the first level is minor, our current model with attention performs the best on the ZM dataset. In addition, as demonstrated in Figure 5.4, the three models also differ in their convergence speed. Specifically, the gaps between the three models in terms of the second-level weighted F1-scores are clearly visible in the early epochs. In general, the better results at both levels indicate that our model learns more meaningful and discriminative sequence representations using the attention mechanism.

5.3.4 Effect of pre-trained encoder models

As our model is not designed for a specific type of pre-trained language model, we compare using BERTje and RobBERT (Delobelle et al., 2020) as the encoder of our model on the ZM dataset. Aside from the differences between the two pre-trained models introduced in chapter 2 in terms of pre-training data and training objectives, they also differ in their tokenization techniques. Specifically, while BERTje adopts WordPiece (Schuster and Nakajima, 2012) for tokenization like BERT (Devlin et al., 2018), RobBERT uses Byte Pair Encoding (BPE) (Sennrich et al., 2015) like RoBERTa (Liu et al., 2019). However, since RoBERTa shares a similar model architecture with BERT, we use the same hyperparameters for training the two models (see Table 4.1).

⁵ As recommended by HuggingFace's Transformers (Wolf et al., 2019), it is often better to average or pool the sequence of hidden states over the whole input sequence as the sequence representation.


Table 5.6: Test results of our model with different pre-trained encoders in terms of micro and weighted F1-scores on the ZM dataset. For simplicity, we only report the predicted results at the first level.

| Encoder | L1 M-F1 | L1 W-F1 | L2 M-F1 | L2 W-F1 |
|---|---|---|---|---|
| RobBERT | 0.829 | 0.831 | 0.605 | 0.594 |
| BERTje | 0.845 | 0.845 | 0.630 | 0.619 |

Figure 5.5: Validation weighted-F1 curves of our model with different pre-trained language models as the encoder on the ZM dataset.

We show the end results of the two models in Table 5.6 and Figure 5.5. To our surprise, BERTje considerably outperforms RobBERT as the encoder of our model, with first-level F1-scores that are 1.6% and 1.4% higher and second-level F1-scores that are 2.5% higher (see Table 5.6). As also shown in Figure 5.5, the gaps between the two pre-trained models remain throughout the entire training process in terms of their weighted F1-scores at both levels. In addition, the gap at the second level widens further towards the end of training. The notable difference between the two pre-trained models might indicate that the pre-training data of BERTje are more similar to the ZM dataset, or that BERTje learns the Dutch language better during pre-training.


6 | Discussion

6.1 Answers to research questions

After the analysis of our results in chapter 5, the previously introduced research questions can be answered. Firstly, using data augmentation techniques such as oversampling with our LR model does not appear to be beneficial for improving the classification performance in terms of our evaluation metrics. Likewise, given the size of our ZM dataset, using complex neural architectures (e.g. BERT) without transfer learning approaches does not improve but might instead harm the classification performance, which also confirms the finding of Adhikari et al. (2019). On the other hand, we show that LM fine-tuning is useful and vital for improving the classification performance, and that using the pre-trained BERTje model alone is not sufficient for our task and/or dataset. In other words, our results do not confirm the efficiency of using feature-based approaches with BERT, in contrast to Devlin et al. (2018). Besides, we show that the BERTje model is superior to the Dutch-ULMFiT model (van der Burgh and Verberne, 2019) for our task, since our BERTje model with its LM fine-tuned only on the labelled training data still beats the Dutch-ULMFiT model fine-tuned on both the labelled and unlabelled datasets. To explain this, we suppose that it is partly because BERTje was pre-trained on a much larger text collection than Dutch-ULMFiT (de Vries et al., 2019; van der Burgh and Verberne, 2019).

In terms of the hierarchical classifiers, the original 'hier-class' model (Sinha et al., 2018) does not improve the second-level classification performance over the LR models on the ZM dataset; however, its predictions are shown to be more robust in comparison. By contrast, the flat classifiers which perform worse than the LR baselines at the second level are worse at the first level as well. These observations thus highlight the advantage of exploiting the class taxonomy for the hierarchical classification task. Furthermore, our model, which combines the global attentive hierarchical classifier with transfer learning approaches, is able to improve not only the state-of-the-art performance but also the robustness of classification.

In addition, from the ablation studies we first show that additive attention works best for our model, which confirms the finding of Britz et al. (2017) that for large values of d_k (i.e. the dimension of the encoder's hidden states), additive attention outperforms dot-product attention and multiplicative attention (Vaswani et al., 2017). Besides, we show that for limited data such as the ZM dataset, using multi-hop attention does not improve the classification performance, which we conjecture is due to data sparsity; moreover, without the use of penalization terms, there might be redundant information between different hops of attention. We also show that using the dynamic sequence representations generated by the attention mechanism at different levels helps to improve the performance over other decoder architectures. Lastly, regarding the fact that BERTje works better than RobBERT as the encoder of our model despite its smaller pre-training data, our surmise is that this is either due to the nature of our dataset, or because RobBERT was pre-trained on a corpus obtained via language recognition on web pages, which might therefore contain documents in other languages or low-quality texts.
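To make the compared scoring functions concrete, below is a minimal PyTorch sketch of additive (Bahdanau-style) attention next to (scaled) dot-product attention. The shapes, module names and the bottleneck size d_a are assumptions, not the exact implementation of our model.

```python
# Minimal sketch (not the exact thesis implementation) of the two attention variants
# compared in the ablation. `query` is a level-specific decoder state, `keys` are the
# encoder hidden states.
import torch
import torch.nn as nn

d_k = 768  # dimension of the encoder hidden states (large, per Britz et al., 2017)

class AdditiveAttention(nn.Module):
    def __init__(self, d_k: int, d_a: int = 256):
        super().__init__()
        self.W_q = nn.Linear(d_k, d_a, bias=False)
        self.W_k = nn.Linear(d_k, d_a, bias=False)
        self.v = nn.Linear(d_a, 1, bias=False)

    def forward(self, query, keys):  # query: (B, d_k), keys: (B, T, d_k)
        scores = self.v(torch.tanh(self.W_q(query).unsqueeze(1) + self.W_k(keys))).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)                       # (B, T)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)    # (B, d_k)
        return context, weights

def dot_product_attention(query, keys):
    # Scaled dot-product scores as in Vaswani et al. (2017).
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1) / d_k ** 0.5  # (B, T)
    weights = torch.softmax(scores, dim=-1)
    context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)
    return context, weights
```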

6.2 Interpretability analysis

6.2.1 LR model

In real-world applications, the interpretability of a model is sometimes as important as its performance. Currently, KPN uses LR models in production for the hierarchical classification of the business market data, not only because of their competitive performance on small datasets, but also due to the interpretability of traditional machine learning models. For example, in Figure 6.1¹ we show some of the most contributing words for each first-level category of our LR model without oversampling. However, a drawback of the flat approach to the hierarchical classification task is that interpretability is only available for the level on which the model is trained, while, as shown in Figure 6.2, a model generally needs to focus on different semantic information when predicting categories at different coarse- or fine-grained levels. As a result, the interpretation provided by flat approaches to analyze model predictions or assist real-time inference is limited, which decreases their usefulness in industrial applications.

¹ We visualize using the eli5 package (https://eli5.readthedocs.io/).


[Figure 6.1 content: the ten highest-weighted features per first-level category of the LR model]
Billing: factuur +20.579, facturen +15.378, bedrag +9.379, rekening +8.791, kopie +7.822, euro +7.634, opgezegd +7.536, specificatie +7.061, nota +7.012, afgeschreven +6.941
Contract End: opzeggen +21.515, opgezegd +8.790, opheffing +8.288, beeindigen +7.665, opgeheven +6.782, bevestiging +6.655, opheffen +6.540, opzegging +6.451, beeindigd +6.381, opzeggenagent +5.964
Niet Classificeerbaar: oke +6.103, ken +5.791, yes +5.173, hai +4.129, juli +3.838, nou +3.606, toe +3.358, cancel +3.239, from +3.133, order +3.043
Order Capture & Fulfillment: besteld +12.314, bestelling +10.681, aanvraag +8.501, aangevraagd +7.947, gekregen +7.743, ontvangen +7.699, order +7.114, geleverd +6.317, installatie +5.657, afgeleverd +5.480
Orientation & Sales: aanvragen +8.264, bestellen +7.897, mogelijk +6.486, afsluiten +6.284, wilt +6.004, abonnementen +5.725, rijbewijs +5.476, mogelijkheden +5.070, verlengen +4.722, 4g +4.721
Service: storing +8.955, wijzigen +6.368, wijziging +5.751, probleem +5.727, monteur +5.409, aangepast +5.330, wachtwoord +5.290, ticket +5.172, werkt +4.722, melding +4.671

Figure 6.1: Top features of the LR model for first-level categories on the ZM dataset.
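Such a listing can be reproduced directly from a fitted LR pipeline. The sketch below, with hypothetical toy data and variable names, prints the highest-weighted features per class using scikit-learn; the thesis figure itself was rendered with the eli5 package, which presents essentially the same information.

```python
# Minimal sketch, assuming a TF-IDF + LogisticRegression pipeline similar to the LR
# baseline; the toy documents and labels are hypothetical. Requires scikit-learn >= 1.0
# (use get_feature_names() on older versions).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "ik heb een vraag over mijn factuur",
    "ik wil mijn abonnement opzeggen",
    "er is een storing op mijn internet",
]
labels = ["Billing", "Contract End", "Service"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

feature_names = np.array(vectorizer.get__feature_names_out() if False else vectorizer.get_feature_names_out())
for class_index, class_name in enumerate(clf.classes_):
    coefs = clf.coef_[class_index]          # one coefficient row per class (>2 classes)
    top = np.argsort(coefs)[::-1][:10]      # indices of the ten largest weights
    print(class_name, list(zip(feature_names[top], coefs[top].round(3))))
```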

6.2.2 Our model

[Figure 6.2 content: a WordPiece-tokenized agent-customer chat transcript about cancelling an extra telephone number, shown twice with token-level attention highlighting.]

(a) L1 category: Contract End

(b) L2 category: klant wil opzeggen [...] (Dutch) / customer wants to cancel [...] (translation)

Figure 6.2: Attention heatmaps at different levels for an example transcript in the ZM dataset. For each token, the background colour intensity is proportional to the degree of attention. In this example, more tokens are strongly attended to by our model at the first level than at the second level.

In comparison, since our model generates different attention distributions at different levels of classification, which can be interpreted separately, its results are more interpretable than those of the flat classifiers. As visualized for an example transcript from the ZM dataset in Figure 6.2, our model attends to multiple aspects of the conversation at the first level to get an overview of the content, while at the second level the attention is concentrated on certain keywords such as ‘opzeggen’ and ‘opheffen’, which both mean ‘cancel’ in English. In general, this re-reading effect (Sinha et al., 2018) not only enables our model to focus on different semantic information, matching the granularity and scope of each level, for its own predictions, but also provides insights that better explain the results and support model analysis and error reduction in industrial applications. For instance, if an error occurs at an earlier level, we can refine our model by analyzing the interpretation specific to that level, as opposed to the flat approaches, where only the leaf level can be inspected. Although one might argue that the interpretation at the leaf level suffices to explain the overall results, this mainly rests on the assumption that either the prediction is correct or, when an error is made, it happens at or close to the leaf level. For complex hierarchical classification problems with deep class taxonomies, however, if an error already occurs in the first few levels, it is more reasonable to interpret those levels than the more distant leaf level.
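Heatmaps such as Figure 6.2 can be produced from the per-level attention weights themselves. The sketch below renders one simple HTML heatmap per level; the attention_per_level output is a hypothetical name rather than the exact interface of our implementation, and the token and weight values are illustrative.

```python
# Minimal sketch, assuming the classifier exposes one attention distribution per level
# (`attention_per_level` is a hypothetical name): render a per-level token heatmap as HTML.
def attention_to_html(tokens, weights):
    """Colour each token's background proportionally to its attention weight."""
    max_w = max(weights) or 1.0
    spans = []
    for token, w in zip(tokens, weights):
        alpha = w / max_w  # normalise to [0, 1] for display
        spans.append(f'<span style="background: rgba(255,0,0,{alpha:.2f})">{token}</span>')
    return " ".join(spans)

# tokens: WordPiece tokens of the transcript; attention_per_level[k]: weights at level k+1.
tokens = ["ik", "wil", "het", "nummer", "op", "##zeggen"]
attention_per_level = [
    [0.10, 0.20, 0.10, 0.30, 0.10, 0.20],   # level 1: attention spread over the content
    [0.02, 0.05, 0.02, 0.05, 0.26, 0.60],   # level 2: attention peaked on the keyword
]
for level, weights in enumerate(attention_per_level, start=1):
    with open(f"attention_level_{level}.html", "w", encoding="utf-8") as f:
        f.write(attention_to_html(tokens, weights))
```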


7 | Conclusion and future directions

In this study, we dive into the problem of hierarchical text classification on small datasets by proposing a global hierarchical attention-based classifier with transfer learning approaches. Through the experiments with our model and other state-of-the-art flat and hierarchical classifiers, we demonstrate that the joint use of pre-trained models and fine-tuning methods is beneficial for boosting the performance on small datasets. Besides, we conclude that the use of the class taxonomy in hierarchical classification tasks helps to generate predictions which are more robust and interpretable than those of flat classifiers. Moreover, while using transfer learning approaches and exploiting the class taxonomy are useful on their own, our model, which integrates the two, achieves state-of-the-art performance on both the internal ZM dataset and the benchmark WOS dataset.

However, there are also limitations to this study. For instance, although we show that our model considerably outperforms the non-neural LR baselines on the ZM dataset, which has only 8.9K labelled examples, collecting even that amount of labelled data might still be too expensive for some domains or applications. Therefore, a possible future direction is to test our model more thoroughly on datasets of different scales, or alternatively, to investigate the learning curve of our model on a specific dataset. The goal is to determine how much less data our model needs to achieve a performance similar to that of the non-neural baselines on the original dataset, thereby reducing the manual labelling work.
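The proposed learning-curve analysis could look like the minimal sketch below: train on increasing fractions of the training set and record the weighted F1-score on a fixed validation set. The train_and_evaluate callback and the commented usage are hypothetical placeholders, not part of this study.

```python
# Minimal sketch of the proposed learning-curve analysis; `train_and_evaluate` is a
# hypothetical callback that trains a classifier (our model or the LR baseline) on the
# given subset and returns its weighted F1-score on a fixed validation set.
import random
from typing import Callable, Dict, List, Sequence

def learning_curve(train_examples: List,
                   train_and_evaluate: Callable[[List], float],
                   fractions: Sequence[float] = (0.1, 0.25, 0.5, 0.75, 1.0),
                   seed: int = 42) -> Dict[float, float]:
    examples = list(train_examples)
    random.Random(seed).shuffle(examples)  # fix the subset order for reproducibility
    scores = {}
    for frac in fractions:
        subset = examples[: int(len(examples) * frac)]
        scores[frac] = train_and_evaluate(subset)
    return scores

# Hypothetical usage: smallest fraction at which our model matches the weighted F1 of
# the LR baseline trained on the full ZM training set.
# curve = learning_curve(zm_train, train_and_evaluate=train_our_model)
# needed = min(f for f, score in curve.items() if score >= lr_full_data_f1)
```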

Another limitation is that in this study we have only employed the BERT(je) models (Devlin et al., 2018; de Vries et al., 2019) and RobBERT (Delobelle et al., 2020) as the sequence encoder of our model. However, our model is not tied to a specific type of pre-trained language model. Therefore, another future direction is to experiment with other state-of-the-art transfer learning methods such as ELMo (Peters et al., 2018) and GPT-2 (Radford et al., 2019). We expect that the most suitable pre-trained model will differ depending on the language and the data.

Finally, due to the use of large pre-trained language models, our model achieves superior performance on limited data but requires more inference time and computational resources than the non-neural baselines, which is unfavourable for real-time production. Hence, future work could explore compression techniques such as knowledge distillation (Hinton et al., 2015) for our model, where an example would be replacing BERT (Devlin et al., 2018) with DistilBERT (Sanh et al., 2019) as the encoder.
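For reference, the distillation objective of Hinton et al. (2015) is sketched below: a smaller student is trained to mimic the temperature-softened output distribution of the larger fine-tuned teacher while still fitting the gold labels. The temperature, mixing weight and the random toy logits are illustrative values, not settings used or evaluated in this thesis.

```python
# Minimal sketch of the knowledge-distillation objective (Hinton et al., 2015) that
# could be used to compress our classifier; all values below are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, temperature=2.0, alpha=0.5):
    """Mix the usual cross-entropy with a KL term on temperature-softened distributions."""
    hard_loss = F.cross_entropy(student_logits, targets)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale the soft-target gradients as in Hinton et al. (2015)
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Toy usage with random logits for a batch of 4 examples and 6 first-level classes.
student_logits = torch.randn(4, 6, requires_grad=True)
teacher_logits = torch.randn(4, 6)
targets = torch.tensor([0, 2, 5, 1])
loss = distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
```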
