
Auto-Tagging Articles and Pages Based on Their Content

submitted in partial fulfillment for the degree of master of science

Marta Aliu Carrascosa

13288512

master information studies

data science

faculty of science

university of amsterdam

2021-06-24

Internal Supervisor: Gabriel Bénédict (UvA, IVI, IRLAB)

External Supervisor: Michael Metternich (Bloomreach)

3rd Examiner: Dr. Maarten Marx (UvA, IRLAB, ILPS)


Auto-Tagging Articles and Pages Based on Their Content

Marta Aliu Carrascosa

University of Amsterdam

13288512

marta.aliu.carrascosa@student.uva.nl

ABSTRACT

Finding the right content on a site based on a user's interest can be difficult. Content providers often struggle with bridging the gap to understand user behaviour and provide personalised content. This is frequently resolved by categorizing web content such as articles and blog posts into topical categories. Using machine learning and NLP to classify web content into a set of classes would alleviate the burden of manual classification, which can be very time-consuming and costly. This research studies the performance of BERT family models, which are state-of-the-art Transformer encoder models that allow transfer learning, in multi-class text classification of web content in terms of effectiveness (accuracy, F1-score) and efficiency (effectiveness / computational effort). Specifically, this thesis analyses the results of BERT and DistilBERT and delves into the specifics of their model architectures to fully understand the impact of each hyperparameter in fine-tuning, which fields of the text lead to better results, and whether zero-shot classification is a viable alternative. The findings indicate that these models are both suitable for web content classification, with DistilBERT being more efficient. Partially freezing the pre-trained model increases efficiency with comparable performance, and it is recommended to use the title and the summary of the articles as input. Zero-shot learning, although with worse results than supervised learning, has various benefits and is an area worth exploring. Furthermore, a novel formula is introduced to measure efficiency in an objective manner. This thesis project was conducted as part of an internship with Bloomreach (https://www.bloomreach.com/).

KEYWORDS

Multi-class text classification, BERT, DistilBERT, web content, zero-shot learning, NLP

1 INTRODUCTION

Users tend to search for content within a specific category that they are interested in. Content providers such as Bloomreach find it valuable to gather user behaviour data, including the topics that the users visit more often, in order to understand the users’ interests, recommend similar content from the same categories, and provide a more personalized experience. As such, tagging web content such as articles, blog posts, and news articles with tags or categories allows content creators to launch more tailored and personalized content to site visitors.

Content is generally classified into categories manually, either by the author of the article or by a website administrator. However, this can be very time-consuming and costly, especially if there is a large number of articles to tag retrospectively. This is the reason why classifiers are becoming increasingly popular for automatic categorization, using either the entire document, the title, or a summary of the document.

The task of automatically classifying documents into a defined set of classes is referred to as text classification, which is divided into binary or multi-class based on whether the goal is to classify into one of two or one of three or more classes, respectively. Text classification is one of the main tasks within Natural Language Processing (NLP), and it is generally considered a supervised learning problem. This topic has been researched for many years since being able to categorize textual data can help in understanding, structuring, analysing and extracting information from groups of documents. It can also make other NLP tasks such as search and retrieval more efficient. Hence, multiple text classification models have been proposed and developed. The current state-of-the-art models include BERT [5] and other Transformer-based models.

However, just like with other supervised learning applications, the performance of these algorithms depends on the specific area of interest, the availability and the quality of the training data, the classes to classify into, and the experimental setup. For this reason, several questions arise in relation to multi-class text classification for which there does not seem to be a general consensus, especially when dealing with web content that covers a very broad range of topics and that could be classified into dozens if not hundreds of different categories (depending on how granular the classification needs to be). The main objective of this research is to study how state-of-the-art BERT family models perform in regard to tagging textual web content not just in terms of effectiveness (e.g. accuracy) but also in terms of efficiency as proposed in [20]. In this thesis, efficiency is described as the ability to minimize computational effort whilst keeping effectiveness reduction to a minimum. There are various measures of efficiency such as memory use, model size, training time and inference time; although none of them take into account both effectiveness and effort simultaneously. This thesis aims to tackle this research gap.

1.1 Research Question and Sub-Questions

This thesis will aim to answer the following research question and sub-questions, where performance consists of effectiveness, computational effort and efficiency:

Research question: "To what extent can models based on Transformers and transfer learning be successfully used in supervised learning to classify web content such as articles and blog posts into categories based on their title and / or short description?"

Sub-question 1: "How do different models from the BERT family (BERT and DistilBERT) perform on the selected datasets for the task of multi-class text classification in a supervised learning setting?"

Sub-question 2: "How do the various hyperparameter configurations in the fine-tuning phase of the supervised learning training modify the performance of the model on the selected datasets and what is the importance of each feature?"

Sub-question 3: "How does the performance of the supervised learning model vary when only the title of the articles is used to train and classify them, compared to using both the title and the short description, or only using the short description?"

Sub-question 4: “How does zero-shot learning perform compared to supervised learning for multi-class text classification of articles and blog posts, when using BERT-based models for both approaches?”

The rest of this paper is divided as follows. In Section 2, a review of the relevant work is performed, and any gaps in the current literature are pointed out. Section 3 describes the methodology, which is subdivided into data description, implementation, evaluation metrics, and experimental setup. Section 4 presents the results obtained, while Section 5 consists of a discussion of the results presented. Finally, Section 6 includes the conclusion and future work, where the research questions are addressed and answered, and suggestions are made with regards to potential future directions.

2 RELATED LITERATURE

Extracting topics and categories from documents can be performed in two distinct ways: classification and clustering, which are supervised and unsupervised learning methods, respectively. Although the end goal of both of these approaches is the same, text classification aims to automatically apply a set of seen labels to documents after training the model with labelled data; whereas clustering seeks to group related documents together according to their similarity by using unlabelled data. These distinctions are clearly exemplified by Croft et al. [3]. In this thesis, focus will be placed on supervised learning algorithms due to the great advances in the field of text classification within NLP and the desire to classify documents into a set of pre-defined classes rather than trying to detect similarities between the different documents.

2.1 Word Embeddings in Text Classification

Within neural networks and machine learning, there have been recent advances [11, 12] in the representation of words as low-dimensional dense vectors, called word embeddings, where words that are similar in meaning have similar representation in the vector space. Word embeddings have opened a new range of opportunities in the field of transfer learning and the use of pre-trained models for deep learning in NLP that had not been possible before. This is due to their ability to capture semantic and syntactic information. These pre-trained language representations can be used in deep learning model architectures for multiple downstream tasks such as text classification via feature-based or fine-tuning strategies, removing the need for generating high-dimensional Bag of Words or Tf-Idf sparse vectors [8] from the target corpus. Some of the most popular pre-trained vector models are Word2Vec [11], GloVe [15], and fastText [7]. However, these word representation approaches only generate context-free word embeddings rather than contextualized word embeddings, which means that each word only has one vector even if that word can have multiple meanings depending on its use and context. Instead, in contextualized word embeddings a word can have different vectors based on the context of each sentence, which is a huge improvement.

To generate contextualized word embeddings, several state-of-the-art models have been proposed such as ELMo [16] and ULMFiT [6] that incorporate recurrence in the form of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or Long Short-Term Memory models (LSTMs) to deal with dependencies in a sequential manner. On the other hand, the Transformer [25] is a model that uses self-attention mechanisms and position embeddings to deal with relationships between words without being sequential, which poses many advantages such as allowing parallelized computation and reduced information loss through improved handling of long-term dependencies. The model architecture of the Transformer [25] is composed of a set of encoders and a set of decoders, each of which has a multi-head self-attention layer, a point-wise fully connected layer, and residual connections across layers. The Transformer model's innovative use of the self-attention mechanism revolutionized the field of NLP and has become the basis of many state-of-the-art models.

2.1.1 BERT. The most popular Transformer-based model is arguably BERT [5], which uses a stack of Transformer encoders and produces deep bidirectional representations by looking at the entire sentence that contains the word, both from left and right. Additionally, BERT inserts a classification token [CLS] at the beginning of each input sequence, which is used as the sequence representation for classification tasks, and a token [SEP] at the end of each sequence that is used to separate sequences in sentence pairs. BERT has two stages: pre-training and fine-tuning. In pre-training, the model is trained on a Masked Language Model Task and a Next Sentence Prediction Task using a massive unlabelled English language corpus consisting of BookCorpus [31] and English Wikipedia [5]. The Masked Language Model Task aims to predict words in each sequence that are randomly masked, making the model bidirectional; whereas the Next Sentence Prediction Task helps understand sentence relationships by feeding two sentences as a pair to the model and attempting to predict if the second sentence goes after the first sentence in the original document. During fine-tuning, the model is trained with data labelled for the downstream task. In this way, the pre-trained model can be used for multiple NLP tasks, such as in text classification where the [CLS] token is fed into an output layer, by only fine-tuning for the specific task needed, rather than having to build the model from scratch with only the task-specific data. This is the equivalent of transfer learning from large models pre-trained with ImageNet [4, 29] in Computer Vision. BERT's bidirectional model architecture and two-step training strategy that allows transfer learning in multiple NLP tasks are the reasons why it is incredibly powerful.

Devlin et al. [5] proposed all the parameters in the BERT model to be fine-tuned, which in transfer learning is regarded as unfreezing the entire pre-trained model. On the other hand, freezing the model involves freezing the first few layers of the model so that the parameter weights on those layers are not updated, thus retaining their pre-trained values. However, since the BERT model has 110M parameters, fine-tuning all the parameters might not be suitable as it could be very computationally expensive. The authors did not offer a thorough explanation on why this is the recommended approach, nor did they present the results of different hyperparameter configurations (other than model size). A study on the fine-tuning phase of BERT for text classification was performed by Sun et al. with regards to dealing with long text, layer selection, learning rate, and decay factor to avoid catastrophic forgetting (i.e. when the pre-trained knowledge is lost after training on downstream tasks); on top of researching other fine-tuning methods such as further pre-training and multi-task fine-tuning [21]. In any case, Sun et al. still fine-tuned all the model parameters by unfreezing the entire model, rather than solely unfreezing the top layers and freezing the rest. Additionally, the number of epochs, dropout rate, and optimizer were always kept the same throughout their experiments. Lee et al. [9] analysed the effect of freezing layers during BERT fine-tuning for multiple NLP tasks from the GLUE benchmark, although multi-class topic classification was not one of the tasks evaluated. As such, there is still room for investigation into the impact of the hyperparameters in multi-class text classification using BERT.

2.1.2 DistilBERT. DistilBERT was proposed by Sanh et al. [18] in order to tackle the growing concern that state-of-the-art NLP models have to be extremely large to provide good results, which imposes restrictions on developers and how these models can be used in a production setting. As such, knowledge distillation or teacher-student learning was employed by the authors to generate a compressed version of BERT called DistilBERT (the student) that was trained to replicate the behaviour of the original BERT model (the teacher). This knowledge distillation task was done during the pre-training stage rather than in downstream tasks, which had been the focus of previous distillation work.

The model architecture of DistilBERT is the same as BERT [5] except for a few changes that the authors considered had the largest impact on computation efficiency, both during training and inference. These consisted of removing the token-type embeddings and the pooler layer, which is used for the Next Sentence Prediction Task. In addition, the number of Transformer layers was halved from 12 layers in BERT to 6 layers in DistilBERT, as Sanh et al. [18] found this to be a major factor affecting computation time. Other changes with respect to BERT and how it was pre-trained were taken from RoBERTa [10]: pre-training was done on much bigger batches via gradient accumulation (where gradients from multiple mini-batches are accumulated before updating the model weights), dynamic masking (where masking is generated every time a sequence is fed to the model instead of during data pre-processing) was used instead of static masking, and the Next Sentence Prediction Task was eliminated.

In summary, DistilBERT has 66M parameters, making it 40% smaller than BERT, has been found to have comparable performance to BERT, retaining 97% of BERT's language understanding capabilities, and is 60% faster than BERT [18]. Moreover, fine-tuning DistilBERT for downstream tasks is performed in the same manner as in BERT, which facilitates the comparison between these two model architectures. However, it has to be noted that the hyperparameters used are not listed in the paper [18], only that the same hyperparameters were employed for both BERT and DistilBERT in their experiments and that a hyperparameter search was not carried out for downstream tasks. Additionally, no reference is made to whether freezing any layers during fine-tuning would provide any benefits; instead, the same approach of unfreezing the entire model as done by Devlin et al. [5] is followed. This thesis aims to fill in this research gap.

2.2 Addressing Document Length in BERT-Based Models

The main limitations of BERT-based models are related to the input text length. Firstly, the maximum sequence length in BERT is set at 512 tokens, which is taken from the original Transformer model [25]. Secondly, all input vectors must have the same size. Thus, all of the input sequences have to be either padded with padding tokens [PAD] to a fixed length (if they are shorter than the stipulated length) or truncated to this same length (if they are longer than the defined length). The reason why it is limited to a length of 512 tokens is because the self-attention layers that the Transformer [25], BERT [5], and DistilBERT [18] use have a computational complexity of $O(n^2 \cdot d)$, where $n$ is the sequence length and $d$ is the representation dimensionality (the vector size of each token) [25]; which means that it is quadratic with regards to sequence length. The representation dimensionality is 512 for the Transformer [25] and 768 for BERT [5] and DistilBERT [18]. Vaswani et al. [25] argue that when sequence length is smaller than representation dimensionality, self-attention layers are faster than recurrent layers. Hence, having a stipulated maximum sequence length (512 tokens) that is smaller than the representation dimensionality ensures this.

Several authors have tried to address this issue, either by modifying the self-attention mechanism so that it scales linearly [1], by truncating part of the document [21], or by using hierarchical methods [14, 21]. However, another approach that could be followed to make the model more efficient by reducing the amount of data going into the model would be to only use the title and / or a short summary of the document to train and classify the document. Although short-text classification with BERT has been shown to be satisfactory for other tasks and the impact of sequence length using only titles and abstracts of texts has been investigated [30], major research in this area has not been found and has therefore been identified as a gap in the literature.

2.3 Zero-Shot Learning

Zero-shot learning, although very promising, had not been explored in great detail until recently, especially in NLP. In multi-class text classification, zero-shot learning entails being able to predict the class of a document out of a set of classes that the model has not been trained to classify into. Therefore, the model does not learn to classify into a set of learned classes like in supervised learning but rather learns to generalize and predict whether a document is related to a label or not.

Research into this limited but growing field indicates that supplementing the model with descriptions of the unseen classes and re-formulating the task as a different problem are two approaches that have provided good results. With regards to the latter method, the two most successful approaches were proposed by [17] and [28], who treated zero-shot learning in multi-class text classification as a binary classification problem and a textual entailment problem, respectively. The latter, which is the method employed in this thesis, is also known as Natural Language Inference (NLI) [2]. The benefit of these methods is that they do not pre-define the number of classes nor what these should be during training, which is why they are suitable for zero-shot learning. Despite the fact that this research area is still in its infancy, its generalisation capabilities and initial results with BERT models [28] are incredibly promising.

3 METHODOLOGY

3.1 Data Description

In this section, the three datasets used are described in detail. These consist of two publicly available datasets and a small domain-specific dataset provided by Bloomreach. The two public datasets were chosen for the following reasons: they are in English, are labelled into multiple categories (one per entry), are large datasets with over 100,000 entries each, cover a wide range of topics from popular media to AI and history, the content is fairly recent (they do not consist of documents written decades ago, for instance), and they are therefore similar to the sort of content that Bloomreach would like to classify into categories. Additionally, the rationale for choosing datasets that only contain title and subtitle rather than full articles was that one of the objectives of this thesis is to study whether these fields are enough to accurately classify the articles. This would be advantageous in terms of efficiency as just a small portion of the text would need to be stored and accessed for the text classification task. Furthermore, as explained in Section 2.2, this would address the maximum sequence length limitation found in BERT and other models that utilize self-attention layers.

In all three datasets, the categories that the model is trying to predict were annotated manually by the articles' authors at the moment of publishing the articles. Histograms of the categories per dataset are shown in Appendix B.

Table 1: Main information about the datasets. L = length (number of words).

Dataset   | Categories | Title Avg. L | Title Max. L | Subtitle Avg. L | Subtitle Max. L | Text Avg. L | Text Max. L | Train   | Val    | Test
News      | 40         | 10           | 44           | 23              | 243             | 32          | 245         | 138,548 | 17,319 | 17,318
Medium    | 93         | 8            | 23           | 17              | 34              | 24          | 56          | 92,076  | 11,510 | 11,510
Couchbase | 13         | 8            | 20           | 842             | 3,507           | 850         | 3,517       | 369     | 47     | 46

3.1.1 News Dataset. The main dataset used is the News dataset [13], a publicly available dataset in JSON format that contains 200,853 labelled news article headlines from HuffPost (https://www.huffpost.com/) from 2012 to 2018. The fields in this dataset are: 'category', 'headline', 'short description', 'authors', 'link' (to the post), and 'date'; of which only the first three fields were kept. After dropping the entries containing empty fields and those where the total number of words in the short description was less than 5, the number of available entries was 173,185. There were 41 categories in total, although two of them ('THE WORLDPOST' and 'WORLDPOST') were merged into one, resulting in 40 imbalanced categories. Of these, 'POLITICS' is the most popular category with 28,388 entries, whereas 'ARTS' is the least popular one with only 863 articles. A new field called 'text' was created by concatenating the 'headline' and the 'short description' fields for each entry, which was utilized in those experiments that required both fields as input text to the model.

3.1.2 Medium Dataset. The second dataset used in this thesis is a publicly available dataset of Medium (https://medium.com/) blog post titles and subtitles [24] that has 126,095 entries and is provided in CSV format. The fields that are included in this dataset are: 'category', 'title', 'subtitle' (i.e. summary or short description of the blog post), and 'subtitle_truncated_flag', which indicates whether the subtitle has been truncated or not. The number of entries with truncated subtitles is 43,216. The number of categories is 93, which are heavily imbalanced. The most and least popular categories after cleaning the data are 'culture' with 4,711 entries and 'venture capital' with 36 entries, respectively. The entries containing empty fields or less than 5 words in the subtitle field were dropped, resulting in 115,096 entries. Similarly to the News Dataset, a field named 'text' combining the 'title' and the 'subtitle' fields was added.

3.1.3 Couchbase Dataset. This dataset was extracted from a database provided by Bloomreach and it consists of Couchbase blogs (https://blog.couchbase.com/) about the Couchbase database from 2011 to 2016. The extracted file with the data was in YAML format and the process to put the data into tabulated form was more extensive. Each blog post (i.e. entry) had three versions: published, unpublished, and draft. The published versions, which totalled 1,081 entries, were chosen as these are the ones that have been made public. The data contained multiple fields, of which the following were kept: 'title', 'date', 'tags', 'categories', 'summary', and 'content'. 'Summary' and 'content' were merged since only 99 entries had a summary after cleaning the data. Additionally, just like with the rest of the datasets, a 'text' field was created by combining the 'title' with the merged 'summary' + 'content' field. After dropping entries with empty fields, the number of entries was 462.

This dataset did not only contain one category per blog but rather multiple categories, making it a multi-label multi-class classification problem. In order to use this data in a multi-class text classification setting, the first category listed was selected for each entry, since it could be argued that the first category chosen by the author of the blog post is the one that they consider to be the most representative. This resulted in 13 categories, of which ’Couchbase Server’ and ’Couchbase Mobile’ account for 58% of the data with 151 and 117 entries, respectively. In contrast, ’Data Modeling’ and ’GoLang’ are the least represented categories with only 8 entries each.

3.2 Implementation

The general approach and steps pursued in this thesis are explained in the following subsections.

3.2.1 Data Pre-Processing and Tokenization. Since BERT and DistilBERT are contextual models, certain pre-processing steps that are required for other NLP algorithms, such as removing stopwords, removing punctuation, and stemming, were not necessary. As such, the only operations carried out on the input text consisted of removing web links in all of the datasets and removing commonly used HTML entities and HTML angle brackets in the Couchbase dataset.
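
As an illustration, a minimal cleaning function along these lines could look as follows. The exact regular expressions used in the thesis are not specified, so the patterns below are assumptions:

```python
import re
import html

def clean_text(text: str) -> str:
    """Remove web links, HTML entities and HTML tags from a raw text field (illustrative sketch)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop web links
    text = html.unescape(text)                           # decode HTML entities such as &amp; or &nbsp;
    text = re.sub(r"<[^>]+>", " ", text)                 # drop anything inside angle brackets
    return re.sub(r"\s+", " ", text).strip()             # collapse repeated whitespace

print(clean_text("Read more at https://example.com &amp; see <b>this</b> post"))
```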

The data was split into train, validation and test sets in an 80/10/10 ratio (see Table 1) and in a stratified manner, meaning that each split had the same proportion of entries per class. Since the three datasets were very imbalanced, stratification ensured that all the classes were present both in the training set (so that the model could learn to predict them) and in the validation and test sets (to check whether the model could correctly classify the minority classes just as well as the more popular classes).
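
A stratified 80/10/10 split of this kind can be reproduced with scikit-learn; a small sketch, assuming a dataframe with the 'category' label column described earlier and the fixed seed from Section 3.4:

```python
from sklearn.model_selection import train_test_split

def stratified_split(df, label_col="category", seed=42):
    """Split a dataframe into 80/10/10 train/validation/test sets, stratified by class."""
    train_df, temp_df = train_test_split(
        df, test_size=0.20, stratify=df[label_col], random_state=seed
    )
    val_df, test_df = train_test_split(
        temp_df, test_size=0.50, stratify=temp_df[label_col], random_state=seed
    )
    return train_df, val_df, test_df
```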

HuggingFace's model-specific tokenizers were used to process the data in the same way as the text of the pre-trained models. Both the BERT and the DistilBERT tokenizers carry out the following operations: clean whitespaces, lowercase text, truncate and pad text, add special tokens that the model needs such as [CLS], tokenize into sub-words using WordPiece [19], and convert tokens to IDs. For all three datasets, the input text was truncated or padded to a maximum length of 256 tokens since in both public datasets the number of words when adding the 'title' and the 'subtitle' fields together was under this value, as seen in Table 1. Thus, setting the maximum length to 512 tokens would only increase memory usage and lead to longer computation times without any added benefits.
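
For reference, a sketch of this tokenization step with the HuggingFace tokenizer (shown here for the 'distilbert-base-uncased' checkpoint used later; the example text is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Example 'text' entries (title + subtitle concatenated), as described above.
texts = ["Some headline about politics. A short description of the article."]

# Lowercase, add special tokens ([CLS]/[SEP]), WordPiece-tokenize, convert to IDs,
# and pad or truncate every sequence to 256 tokens.
encodings = tokenizer(texts, padding="max_length", truncation=True,
                      max_length=256, return_tensors="pt")
print(encodings["input_ids"].shape)  # torch.Size([1, 256])
```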

3.2.2 Baselines. For each dataset, the Scikit-learn Dummy Classifier was employed to produce baseline performances to compare the models to. Several strategies were chosen: most frequent, which always predicts the most common class in the training dataset; stratified, which generates random predictions by considering the class distribution of the training dataset; and uniform, which produces predictions at random using a uniform distribution. For each strategy, 100 runs with different seeds were run, after which their accuracy and weighted F1-score were averaged.
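
A minimal version of this baseline computation (a sketch assuming label arrays y_train / y_test; the dummy classifier ignores the features, so placeholders are used):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

def baseline_scores(y_train, y_test, strategy="stratified", n_runs=100):
    """Average accuracy and weighted F1 of a dummy baseline over several seeds."""
    X_train = np.zeros((len(y_train), 1))  # features are ignored by the dummy classifier
    X_test = np.zeros((len(y_test), 1))
    accs, f1s = [], []
    for seed in range(n_runs):
        clf = DummyClassifier(strategy=strategy, random_state=seed)
        clf.fit(X_train, y_train)
        preds = clf.predict(X_test)
        accs.append(accuracy_score(y_test, preds))
        f1s.append(f1_score(y_test, preds, average="weighted"))
    return np.mean(accs), np.mean(f1s)
```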

3.2.3 BERT-Based Models Comparison. In this study, which aimed to answer research sub-question 1, BERT and DistilBERT were fine-tuned with default parameters to compare the performance between these two models for each dataset. The combined 'text' field was used as input text. Two scenarios were considered: when all the layers are unfrozen, and when all the layers of the pre-trained model are frozen, in which case only the layers of the classification head are fine-tuned. The hyperparameters chosen for both models were: epochs = 2, batch size = 16, learning rate = 5e-5, optimizer = Adam, weight decay = 0, learning rate scheduler (decreases the learning rate from the set value to 0) = none, learning rate warm-up (number of warm-up steps to linearly increase the learning rate from 0 to the set value) = none. These hyperparameters were selected following the recommendations of Devlin et al. [5] and Sanh et al. [18]. It has to be noted, however, that the number of epochs used to train the model for the Couchbase dataset was increased to 10, since the lack of enough training data would not allow this model to converge with the recommendation by Devlin et al. [5] of only training for 2 to 4 epochs.
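
A sketch of the fully frozen scenario, loading DistilBERT for sequence classification and training only the classification head with the hyperparameters listed above (the training loop itself is omitted; variable names are illustrative):

```python
import torch
from transformers import AutoModelForSequenceClassification

num_labels = 40  # e.g. the 40 News categories
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_labels
)

# Fully frozen scenario: freeze the entire pre-trained base so that only
# the classification head is fine-tuned.
for param in model.distilbert.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-5, weight_decay=0.0,
)
```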

3.2.4 Study on Fine-Tuning Hyperparameters. This study focused on answering research sub-question 2. In it, the best-performing and most efficient model from the previous exercise (i.e. BERT or DistilBERT) was chosen to study the impact of certain hyperparameters during the fine-tuning phase for multi-class text classification. The hyperparameters picked and their range of values are shown in Table 2. In total, this resulted in 30 different hyperparameter combinations. These parameters were chosen due to their impact not only on the effectiveness but also on the computational effort of the model in terms of training time and memory usage.

When training on a GPU, the batch size is limited by the RAM of the GPU. For the GPU used, listed in Section 3.4, the maximum recommended batch size for the BERT model when using a maximum sequence length of 256 is 16 (https://github.com/google-research/bert#out-of-memory-issues). Since similar benchmarks were not given for DistilBERT, the BERT recommendations were followed. The number of possible epochs was chosen as per Devlin et al. [5]. With regards to the number of layers to freeze in the pre-trained base model, the actual number depends on whether the model architecture used is BERT or DistilBERT, since BERT has 12 Transformer encoder blocks (or layers) whereas DistilBERT has 6 Transformer encoder layers. As such, in BERT the following number of layers would be frozen: 0, embeddings layer, 4, 8, and 12. For DistilBERT this would be: 0, embeddings layer, 2, 4, and 6 frozen layers. The embeddings layer is the first layer of the model and precedes the Transformer encoder blocks. This layer contains the word embeddings, the position embeddings, and the segment embeddings. The latter are only found in BERT and are also called token-type embeddings.

Table 2: Range of hyperparameter values to be studied. The frozen layers field indicates the number of layers of the pre-trained base model to freeze.

Hyperparameter            | Values
Batch Size                | 16, 32
Epochs                    | 2, 3, 4
Frozen Layers (or Blocks) | 0%, Embeddings, 33%, 66%, 100%

The rest of the hyperparameters were kept the same as in Section 3.2.3, except for the learning rate decay, where a scheduler was used to linearly decrease the learning rate from the set value to 0. Furthermore, just like in the previous section, the input text consisted of the 'text' field created earlier.
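
To illustrate the frozen-layers hyperparameter, a possible way to freeze the embeddings layer plus the first k Transformer blocks of a DistilBERT sequence-classification model (the helper name is illustrative; for BERT the corresponding attribute would be the encoder layers of model.bert):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=93  # e.g. the 93 Medium categories
)

def freeze_distilbert_layers(model, num_blocks, freeze_embeddings=True):
    """Freeze the embeddings layer and the first `num_blocks` of DistilBERT's 6 encoder blocks."""
    if freeze_embeddings:
        for param in model.distilbert.embeddings.parameters():
            param.requires_grad = False
    for block in model.distilbert.transformer.layer[:num_blocks]:
        for param in block.parameters():
            param.requires_grad = False

# 66% frozen corresponds to 4 of DistilBERT's 6 Transformer encoder blocks.
freeze_distilbert_layers(model, num_blocks=4)
```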

A formula that takes into account both training time and accuracy (or F1-score) to find the optimal hyperparameters was not found in the literature, since the majority of the research focuses on achieving the highest accuracy without considering efficiency. As such, an equation to select the model with the optimal hyperparameter configuration was derived. This equation, presented in Section 3.3, maximizes accuracy or F1-score (depending on the metric selected) whilst minimizing training time.

Due to the limited size of the Couchbase dataset, it was not used for this study as it would not have produced reliable results.

3.2.5 Text Field Analysis. Research sub-question 3 was the focal point of this analysis, in which the model was trained and evaluated using either the title, the subtitle (also referred to as short description or summary, depending on the dataset), or both. The hyperparameters used were those identified as the optimal hyperparameters in the study carried out in Section 3.2.4 and shown in Section 4.3.

3.2.6 Zero-Shot Learning. The aim of this section was to answer research sub-question 4. To do so, the test set of each dataset was passed through a BERT-based model for sequence classification, indicated in Section 3.4, to predict the class of each entry. The pre-trained models that were used for this study, listed in Section 3.4, had been pre-trained on the Multi-NLI (MNLI) dataset [26], achieving accuracy scores of 82.0% and 84.0% for DistilBERT and MobileBERT [22], respectively (https://discuss.huggingface.co/t/poor-performance-in-zero-shot-learning-when-using-the-model-typeform-distilbert-base-uncased-mnli/6374). DistilBERT was chosen as a pre-trained model in order to compare zero-shot classification against supervised learning classification with the same model architecture. On the other hand, MobileBERT [22] was selected as it is a compressed BERT model that has 25M parameters and yields better results in the MNLI textual entailment task than DistilBERT. Having two BERT-based models for this task allows for a more objective analysis of zero-shot learning versus supervised learning classification that does not uniquely depend on a single model architecture.

This approach followed the methodology described by Yin et al. [28], where zero-shot learning in multi-class text classification is treated as an NLI task for fully-unseen classes. As proposed by Yin et al. [28], for multi-class text classification the process consists of passing each text together with each possible class as a premise / hypothesis pair into a model trained for textual entailment tasks and predicting whether there is entailment or not, after which the logit for entailment is taken as the logit for that possible class being valid. Finally, the class that has the highest probability out of all the classes given to the model as hypotheses is the class predicted for that text entry. As such, fine-tuning with the data from the selected datasets was not required for this task due to the nature of zero-shot learning. This approach using DistilBERT is shown in Figure 1. Appendix C contains the rest of the schematics for both supervised learning and zero-shot learning.

Figure 1: Schematic of zero-shot classification as a textual entailment task using DistilBERT. This is the process for one article. The 3 classes of the MNLI model are: C = contradiction, N = neutral, E = entailment.
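
The HuggingFace zero-shot classification pipeline implements this entailment-based formulation; a minimal example with the DistilBERT MNLI checkpoint used in this thesis (the example text and candidate labels are only an illustrative subset of a dataset's categories):

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="typeform/distilbert-base-uncased-mnli",
)

text = "The government announced a new tax reform ahead of the elections."
candidate_labels = ["POLITICS", "SPORTS", "TRAVEL", "ARTS"]

# Each (text, label) pair is scored as a premise / hypothesis entailment problem;
# the label with the highest entailment probability is the predicted class.
result = classifier(text, candidate_labels)
print(result["labels"][0], result["scores"][0])
```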

The results obtained from this exercise were compared with those obtained from the previous supervised learning experiments by ensuring that the data in the test dataset was split in the same way as in the supervised learning scenario through the use of the same random seed. Similarly to Section 3.2.5, the impact of using either the title, the subtitle or both fields as input text was also evaluated in the context of zero-shot learning.

3.3 Evaluation Metrics

In terms of evaluation, the overall performance of the models was evaluated not just in terms of accuracy, F1-score, precision, recall, and confusion matrices but also in terms of computational effort via training time, inference time, and GPU memory usage. These metrics were selected to provide a comprehensive framework where both effectiveness and effort are taken into account to evaluate performance, on top of the derived efficiency metric.

Accuracy and weighted F1-score were calculated globally for all the classes. The weighted-average F1-score was selected instead of the micro or macro F1-scores because all of the datasets are very imbalanced, which weighted-averaging takes into account by calculating the F1-score for each class independently and then weighting each score by the number of true instances for each class when averaging them. Additionally, precision, recall, and F1-score were calculated for each class individually. These metrics are described in more detail in Appendix A.
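
These global and per-class metrics correspond directly to the scikit-learn implementations; a short sketch with illustrative label arrays:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_recall_fscore_support, confusion_matrix)

y_true = [0, 1, 2, 2, 1, 0]  # illustrative gold and predicted class indices
y_pred = [0, 2, 2, 2, 1, 1]

acc = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weights each class by its support

# Per-class precision, recall and F1 (average=None returns one value per class).
precision, recall, f1_per_class, support = precision_recall_fscore_support(
    y_true, y_pred, average=None
)
cm = confusion_matrix(y_true, y_pred)
```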

In this thesis, a metric of efficiency is proposed. This metric is used to select the hyperparameters that offer an optimal balance between performance metric (effectiveness) and training time (effort). Equation 1 displays the metric, where the performance measure can be either accuracy or F1-score (for the validation or test set) expressed as a percentage and the training time is in seconds. This formula is chosen as it is not restricted to only one specific performance measure. The performance measure is squared to ensure that the model achieves high results and it is divided by the training time to penalize performance based on how long it takes to train the model. The logarithm of the training time is taken to reduce the effect of this variable on the efficiency metric, as otherwise it would give high scores to models that are extremely fast but that have poor effectiveness. The set of hyperparameters that gives the highest value for this equation represents the optimal and most efficient model, providing an objective approach to measure efficiency.

$$\text{Efficiency Metric} = \frac{(\text{Performance Measure})^2}{\ln(\text{Training Time})} \tag{1}$$
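
A direct implementation of Equation 1, as a small sketch; a quick check against model (6) in Table 3 (weighted test F1 = 68.65%, training time = 31 min 6 s = 1,866 s) reproduces the reported value of about 625.76:

```python
import math

def efficiency_metric(performance_pct: float, training_time_s: float) -> float:
    """Equation 1: squared performance (accuracy or F1, in %) over ln(training time in seconds)."""
    return performance_pct ** 2 / math.log(training_time_s)

# Model (6) in Table 3: weighted test F1 = 68.65%, training time = 31 min 6 s.
print(round(efficiency_metric(68.65, 31 * 60 + 6), 2))  # ~625.76
```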

3.4 Experimental Setup

All of the experiments were run on the Lisa Compute Cluster system maintained by SURFsara [23], which runs on the Linux operating system. Specifically, they were run on the GPU-shared partition of Lisa, which assigns 1 NVIDIA GeForce GTX 1080Ti GPU with 11GB GDDR5X, 3 CPU cores from 1.7GHz six-core Intel Xeon Bronze 3104 processors, and 64 GB UPI 10.4 GT/s of memory per job.

With regards to software, Python 3.8.2, CUDA 11.0.2, cuDNN 8.0.3.33, NCCL 2.7.8, and Anaconda3 2020.02 were the main tools used. Additionally, the following Python libraries were used: PyTorch 1.8.1 (with torchvision, torchaudio and cudatoolkit 10.2), transformers 4.5.1, pandas, NumPy, Matplotlib, seaborn, nbconvert (to convert Jupyter Notebook .ipynb files to Python .py files), Scikit-learn, os, time, datetime, statistics, SciPy, random, re, yaml, and json.

The HuggingFace transformers library [27] was used to load the pre-trained models and the tokenizers for each of the models. The pre-trained models used were 'bert-base-uncased' for BERT [5], 'distilbert-base-uncased' for DistilBERT [18], and 'typeform/distilbert-base-uncased-mnli' and 'typeform/mobilebert-uncased-mnli' for zero-shot classification with DistilBERT and MobileBERT [22], respectively. All the models used were uncased, meaning that cased and uncased words are deemed the same since the pre-processed input text is always made lowercase, resulting in a smaller vocabulary than if the cased models were used. The transformers models BertForSequenceClassification and DistilBertForSequenceClassification were used for the supervised learning multi-class classification task, whereas DistilBertForSequenceClassification and MobileBertForSequenceClassification were used for the zero-shot learning task. These models consist of pre-assembled PyTorch models with the pre-trained model as base and a classification head on top for text classification tasks. It has to be noted, however, that the classification heads are not exactly the same for the BERT and the DistilBERT versions: the BERT version has dropout followed by a fully connected layer; whereas the DistilBERT version has a fully connected layer, a ReLU activation function, dropout, and a fully connected layer. This is because the base DistilBERT model does not have a pooler layer while BERT does, as discussed in Section 2.1.2, so it needs to be added for classification tasks. The MobileBERT classification head is exactly the same as in the BERT version.

In the supervised learning models, the loss function used is cross-entropy loss (which incorporates a Softmax activation function), which the model aims to minimize through stochastic gradient descent. Finally, the Argmax function is applied to the raw output logits to obtain the predicted class with the highest probability.
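
In code, one training and one prediction step of this kind look roughly as follows. This is a sketch, not the exact training loop used in the thesis; the HuggingFace sequence-classification models return the cross-entropy loss internally when labels are passed:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

batch = tokenizer(["A headline about politics."], padding="max_length",
                  truncation=True, max_length=256, return_tensors="pt")
labels = torch.tensor([0])

# Training step: with labels supplied, the model returns the cross-entropy loss
# (the Softmax is folded into the loss computation).
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Prediction step: Argmax over the output logits gives the most probable class.
model.eval()
with torch.no_grad():
    logits = model(**batch).logits
predicted_class = logits.argmax(dim=-1)
```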

Specific seeds were set when splitting the data, tokenizing, and running the models (both in CPU and GPU) so that randomization was prevented as much as possible. Additionally, PyTorch was configured to avoid using nondeterministic algorithms, which could lead to different results even on the same machine. This allows different model runs to be compared under similar conditions and guarantees reproducibility of results. The seed used was 42 unless stated otherwise.
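
The kind of seeding and determinism configuration described here typically looks like the following sketch; the exact calls used in the thesis are not listed, so treat these as assumptions:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Seed Python, NumPy and PyTorch (CPU and GPU) and avoid nondeterministic cuDNN kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```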

4 RESULTS

4.1 Baselines

The baseline performance results can be seen in model (0) in Tables 3, 4 and 5 for each dataset. These correspond to the highest baseline results. For all three datasets, the highest accuracy results are given by the most frequent method, whereas the highest F1-scores are obtained with the stratified approach.

4.2 BERT-Based Models Comparison

The comparison results can be seen in models (1) to (5) in Tables 3, 4 and 5 for each dataset. The differences in accuracy and F1-score between freezing the entire base BERT model including the pooler layer (model (1)) and freezing all the layers of the base model except for the pooler layer (model (2)) are significant, ranging between 17.39% and 34.10% for both metrics. Training time, however, does not increase considerably. Comparing fully frozen BERT to DistilBERT excluding the pooler layer (models (2) and (4)), BERT performs better than DistilBERT for all three datasets, with the highest absolute differences in scores being 1.05% in F1-score for the News dataset, 1.30% in F1-score for the Medium dataset, and 13.04% in accuracy in the Couchbase dataset. Additionally, GPU memory usage is 1.1x higher for BERT. Comparing fully unfrozen BERT to DistilBERT (models (3) and (5)), BERT outperforms DistilBERT for the News dataset by a maximum of 0.40% in accuracy, DistilBERT performs slightly better than BERT by 0.49% in F1-score for the Medium dataset, BERT surpasses DistilBERT for the Couchbase dataset by a maximum of 6.97% in F1-score, and GPU RAM usage is 1.7x higher for BERT. In all instances, training time is halved when using DistilBERT instead of BERT. As such, for both fully frozen and fully unfrozen models the performances of the two pre-trained models are comparable, especially for the first two datasets, with DistilBERT being more efficient in terms of training time and GPU memory use. This is why DistilBERT is used in the next experiments.

Table 3: News dataset results, split by study (Sections 3.2.2 to 3.2.6). Only the best model according to Equation 1 was selected from Section 3.2.4. Dist = DistilBERT, Mobi = MobileBERT, Class = supervised classification, Zero = zero-shot classification. The subscript values indicate the number of frozen layers in the base model. All-P indicates that all the layers except for the pooler layer were frozen. T states that only the title was used, S indicates that only the subtitle was used. Inf. Time is the inference time. The efficiency metric (Equation 1) uses the performance measure in parentheses, i.e. the weighted test F1-score.

ID   | Model           | Trainable Params. | Test Acc (%) | Test F1 (%) | Train Time (hh:mm:ss) | Inf. Time (hh:mm:ss) | GPU RAM Use (MiB) | Eff. Metric (Test F1)
(0)  | Baseline        | -                 | 16.39        | 06.05       | -                     | -                    | -                 | -
(1)  | BERTClass_All   | 30,760            | 33.16        | 21.65       | 00:40:56              | 00:02:08             | 1,449             | 60.06
(2)  | BERTClass_All-P | 621,352           | 59.18        | 55.75       | 00:41:39              | 00:02:10             | 1,459             | 397.25
(3)  | BERTClass_0     | 109,513,000       | 70.01        | 69.02       | 01:51:06              | 00:02:11             | 6,923             | 541.05
(4)  | DistClass_All   | 621,352           | 58.56        | 54.70       | 00:20:15              | 00:01:03             | 1,315             | 421.27
(5)  | DistClass_0     | 66,984,232        | 69.61        | 68.93       | 00:56:19              | 00:01:03             | 4,065             | 584.80
(6)  | DistClass_4     | 14,797,096        | 69.67        | 68.65       | 00:31:06              | 00:01:03             | 2,043             | 625.76
(7)  | DistClass-T_4   | 14,797,096        | 64.01        | 62.47       | 00:31:18              | 00:01:04             | 2,043             | 517.68
(8)  | DistClass-S_4   | 14,797,096        | 57.03        | 54.70       | 00:30:46              | 00:01:03             | 2,043             | 397.89
(9)  | MobiZero        | -                 | 16.08        | 18.61       | -                     | 00:52:45             | 1,235             | -
(10) | MobiZero-T      | -                 | 13.65        | 15.89       | -                     | 00:50:54             | 1,235             | -
(11) | MobiZero-S      | -                 | 13.02        | 15.41       | -                     | 00:51:52             | 1,235             | -

Table 4: Medium dataset results, as in Table 3.

ID   | Model           | Trainable Params. | Test Acc (%) | Test F1 (%) | Train Time (hh:mm:ss) | Inf. Time (hh:mm:ss) | GPU RAM Use (MiB) | Eff. Metric (Test F1)
(0)  | Baseline        | -                 | 04.09        | 01.94       | -                     | -                    | -                 | -
(1)  | BERTClass_All   | 71,517            | 12.60        | 06.89       | 00:26:59              | 00:01:24             | 1,449             | 6.43
(2)  | BERTClass_All-P | 662,109           | 39.76        | 36.14       | 00:27:43              | 00:01:27             | 1,461             | 176.07
(3)  | BERTClass_0     | 109,553,757       | 53.25        | 52.15       | 01:13:51              | 00:01:26             | 6,923             | 323.91
(4)  | DistClass_All   | 662,109           | 39.06        | 34.84       | 00:13:27              | 00:00:42             | 1,315             | 181.32
(5)  | DistClass_0     | 67,024,989        | 53.59        | 52.64       | 00:37:28              | 00:00:42             | 4,065             | 358.98
(6)  | DistClass_4     | 14,837,853        | 53.71        | 52.49       | 00:20:24              | 00:00:42             | 2,045             | 387.56
(7)  | DistClass-T_4   | 14,837,853        | 46.28        | 45.09       | 00:20:29              | 00:00:42             | 2,045             | 285.81
(8)  | DistClass-S_4   | 14,837,853        | 40.56        | 39.16       | 00:20:50              | 00:00:43             | 2,045             | 215.05
(9)  | MobiZero        | -                 | 22.92        | 24.00       | -                     | 01:17:56             | 1,737             | -
(10) | MobiZero-T      | -                 | 18.02        | 19.83       | -                     | 01:16:29             | 1,737             | -
(11) | MobiZero-S      | -                 | 16.86        | 18.44       | -                     | 01:16:09             | 1,737             | -

Table 5: Couchbase dataset results, as in Table 3. The supervised learning models were run for 10 epochs.

ID   | Model           | Trainable Params. | Test Acc (%) | Test F1 (%) | Train Time (hh:mm:ss) | Inf. Time (hh:mm:ss) | GPU RAM Use (MiB) | Eff. Metric (Test F1)
(0)  | Baseline        | -                 | 32.61        | 18.29       | -                     | -                    | -                 | -
(1)  | BERTClass_All   | 9,997             | 34.78        | 20.26       | 00:00:32              | 00:00:00             | 1,463             | 118.43
(2)  | BERTClass_All-P | 600,589           | 52.17        | 38.68       | 00:00:33              | 00:00:00             | 1,465             | 427.88
(3)  | BERTClass_0     | 109,492,237       | 76.09        | 74.07       | 00:01:28              | 00:00:00             | 7,031             | 1,225.37
(4)  | DistClass_All   | 600,589           | 39.13        | 28.89       | 00:00:16              | 00:00:00             | 1,321             | 301.13
(5)  | DistClass_0     | 66,963,469        | 71.74        | 67.10       | 00:00:45              | 00:00:00             | 4,119             | 1,182.62
(6)  | DistClass_4     | 14,776,333        | 65.22        | 58.23       | 00:00:25              | 00:00:00             | 2,063             | 1,053.41
(7)  | DistClass-T_4   | 14,776,333        | 69.57        | 62.11       | 00:00:25              | 00:00:00             | 2,063             | 1,198.45
(8)  | DistClass-S_4   | 14,776,333        | 60.87        | 53.69       | 00:00:25              | 00:00:00             | 2,063             | 895.63
(9)  | MobiZero        | -                 | 47.83        | 48.13       | -                     | 00:00:07             | 981               | -
(10) | MobiZero-T      | -                 | 50.00        | 52.39       | -                     | 00:00:03             | 981               | -
(11) | MobiZero-S      | -                 | 28.26        | 32.06       | -                     | 00:00:07             | 981               | -

4.3 Study on Fine-Tuning Hyperparameters

Now that the model has been chosen, the study of hyperparameters is refined for that model. Figures 2 and 3 display the weighted test F1-scores against training time using DistilBERT for the News and the Medium datasets, respectively. In these, the set of hyperparameters chosen for the study is also shown. The rest of the graphs for this study, such as test accuracy against training time, can be found in Appendix D.

For both datasets, the distribution of performance against training time with regards to each hyperparameter combination is almost identical. Freezing all of the layers, regardless of the number of epochs and the batch size, results in a big drop in performance. Other than that, partially freezing the model does not have a big impact on accuracy and F1-score compared to fully unfreezing the model, yet training time and GPU RAM use decrease considerably. Freezing only the embeddings layer yields results more similar to not freezing the model, both in terms of performance metrics and training time, than any other set of frozen layers. Increasing the batch size whilst keeping the other parameters constant leads to a slightly shorter training time, an increase in GPU memory use, and a drop in performance; although effectiveness does not decrease if only the embeddings layer or none of the layers are frozen and the models are trained for at least 3 epochs. Increasing the number of epochs increases training time linearly, although GPU memory use does not change and performance does not improve considerably. In fact, when none of the layers are frozen or when only the embeddings layer is frozen, test F1-score and accuracy start decreasing with each subsequent epoch after 2 epochs.

Figure 2: Test F1-score (weighted) against training time for the News dataset with DistilBERT. The continuous curve indicates the optimal hyperparameters based on the efficiency metric (Equation 1).

Figure 3: Test F1-score (weighted) against training time for the Medium dataset with DistilBERT. The continuous curve indicates the optimal hyperparameters based on the efficiency metric (Equation 1).

The efficiency metric proposed (Equation 1) is applied to select the optimal hyperparameters for the two public datasets. This equation is also plotted in Figures 2 and 3 by multiplying it by a scalar, where the optimal hyperparameter configurations are those closest to this curve. As such, the optimal hyperparameters for both the News and the Medium datasets with DistilBERT are: 4 frozen layers, 2 epochs, and a batch size of 16. Although a full study was not conducted for the Couchbase dataset, the model was trained on the Couchbase dataset with the top 3 optimal hyperparameter configurations found from the study with the other two datasets (none, 2, and 4 frozen layers; batch size of 16; and 10 epochs) and the optimal hyperparameters for this dataset were selected using Equation 1. This resulted in the same parameters as for the other datasets except for the number of epochs, which was increased to 10 as described in Section 3.2.3. The results of the selected model for each dataset are shown in model (6) in Tables 3, 4 and 5. The rest of the results can be seen in Appendices E, F and G.

4.4 Text Field Analysis

In this section, the influence of varying the input text on the metrics is evaluated. The results of this section can be seen in models (7) for title, (8) for subtitle, and (6) for both fields in Tables 3, 4 and 5. These three models had the same hyperparameters, listed in Section 4.3, with the only difference being the input text. This comparison was also conducted with other hyperparameters, which are displayed in Appendices E, F and G. The behaviour for both the News and the Medium dataset is very similar: in both cases, using both fields results in the highest test accuracy and F1-score values, followed by using the title only. Using only the subtitle yields the lowest scores. For the Couchbase dataset, however, the highest results are obtained when only the title is used, followed by using both fields and lastly by only using the subtitle. With other hyperparameters, this order does not change for the two public datasets but it varies considerably for the Couchbase dataset, as shown in Appendix G.

Training time and GPU RAM use are the same regardless of the input text since the maximum input length is kept at 256 tokens. Computational effort could be reduced when only using the title by setting a shorter maximum input length, since the maximum title length in all three datasets is 44 words (Table 1).

4.5 Zero-Shot Learning

In this section, zero-shot learning for text classification is studied. The key results of this study are shown in models (9), (10) and (11) in Tables 3, 4 and 5. The rest of the results are displayed in Appendices E, F and G. These key results were selected based on accuracy and weighted F1-score, as there is not a training stage and inference time is only up to 1.2x longer for MobileBERT compared to DistilBERT, whilst GPU RAM usage is 1.2x higher for DistilBERT. The zero-shot learning results are quite poor in comparison with the supervised learning results in terms of accuracy and F1-score, in particular for the two public datasets. In fact, the zero-shot models for the News dataset have a worse accuracy than the most frequent baseline accuracy. This is also the case for the MobiZero-S model for the Couchbase dataset, which has a lower accuracy than the baseline accuracy obtained through the most frequent approach.

Inference time is much longer for zero-shot classification. Comparing the same model architecture (DistClass versus DistZero), inference time increases by a factor equal to the number of classes the model is given to classify into. This is because, as a reformulated NLI task, each entry has to be passed to the model as many times as there are possible classes.

With regards to the impact of only using the title, the subtitle or both fields, the order from best-performing to worst-performing per dataset is the same as in Section 4.4 for supervised learning.

5 DISCUSSION

In this section, the results and findings obtained are discussed. Overall, the accuracy obtained for the News dataset was in line with the results reported by other users on this dataset (https://www.kaggle.com/rmisra/news-category-dataset). It was also expected that the large number of classes in the Medium dataset would lead to lower performance, as the baseline results for this dataset are also lower than for the other datasets. The Couchbase dataset was not expected to perform particularly well due to the lack of data. Training for more epochs seemed to alleviate this issue, although the results are generally more variable.

DistilBERT was expected to be more efficient than BERT whilst achieving comparable performance on downstream tasks, as that is the main claim made by Sanh et al. [18]. This was indeed the case for all datasets, both for fully frozen models (except for the pooler layer) and for fully unfrozen models. The only exception was the fully frozen DistilBERT model for the Couchbase dataset, where the difference in absolute performance was over 10% lower than with the BERT version. This big difference in results between the two architectures could be due to the small size of the dataset, which leads to less robust and reliable results, especially since in the other two larger datasets the performances were comparable as expected. BERT gave the best overall results for the Couchbase dataset whilst also performing better across all classes, indicating that BERT might be more appropriate for small datasets than DistilBERT; although more work should be carried out to verify this statement. In all instances, DistilBERT was found to be 50% faster than BERT, both in training time and inference time. Sanh et al. [18] reported DistilBERT to be 60% faster than BERT in inference time, although their experiments were conducted solely on CPU.

In terms of the hyperparameters studied, larger batch sizes were expected to reduce training time, increase memory use, and lower accuracy; more epochs were expected to increase training time but lead to better results; and freezing layers of the base model was predicted to decrease training time and memory usage whilst also decreasing the performance of the model. The results obtained backed the first claim about batch sizes. However, raising the number of epochs did not improve accuracy or F1-score; the scores stayed the same or even started decreasing, with training time increasing linearly as expected. Freezing some or all of the layers reduced the training time and the memory usage considerably as expected, although the fully frozen model performed very poorly whilst partially freezing the model was demonstrated to be efficient and perform well. In fact, the efficiency metric derived pointed to freezing 4 out of the 6 Transformer encoder blocks in the DistilBERT model as the optimal value.

Table 6: Percentage of entries that have their category in the title, in the subtitle or in the text, for the full dataset and the test set. For the Couchbase dataset, these fields were truncated to a length of 256 first.

Dataset   | Category in Title (Full) | Category in Title (Test) | Category in Subtitle (Full) | Category in Subtitle (Test) | Category in Text (Full) | Category in Text (Test)
News      | 03.43%                   | 03.47%                   | 03.71%                      | 03.85%                      | 05.78%                  | 05.83%
Medium    | 10.14%                   | 10.50%                   | 09.75%                      | 09.70%                      | 16.14%                  | 16.26%
Couchbase | 34.63%                   | 43.48%                   | 53.68%                      | 50.00%                      | 55.63%                  | 54.35%

With regards to the text field analysis, it was expected that using both the title and the subtitle would lead to better results, followed by the subtitle and lastly by the title, since the title is the shortest field. Using both fields was indeed better for the two public datasets. However, using only the title resulted in the highest performance for the Couchbase dataset and the second best results for the other datasets. This indicates that the title is a highly informative field for text classification of web content that is also much shorter than the subtitle, meaning that if only this field were used, the maximum sequence length could be further reduced to decrease memory usage and computation time.
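As a sketch of what such a reduction could look like, the snippet below tokenizes a title field with a shorter maximum sequence length; the length of 64 tokens and the sample title are illustrative choices, not the values used in the experiments.

```python
# Sketch: tokenize titles only, with a reduced maximum sequence length.
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

titles = ["Five ways to speed up your website"]  # illustrative article title
encodings = tokenizer(
    titles,
    truncation=True,
    padding="max_length",
    max_length=64,  # illustrative; titles rarely need more tokens
    return_tensors="pt",
)
print(encodings["input_ids"].shape)  # torch.Size([1, 64])
```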

Table 7: Percentage of entries predicted correctly for the test set of each dataset, split by whether they have their category in the input text (title, subtitle or combined field, depending on the model) or not. For the Couchbase dataset, these fields were truncated to a length of 256 first. Cat. = Category.

                News Dataset           Medium Dataset         Couchbase Dataset
Model           Has Cat.    No Cat.    Has Cat.    No Cat.    Has Cat.    No Cat.
DistClass-T4     02.81%     61.20%      08.71%     37.57%      34.78%     34.78%
MobiZero-T       02.60%     11.05%      07.85%     10.17%      32.61%     17.39%
DistClass-S4     02.89%     54.14%      07.54%     33.01%      32.61%     28.26%
MobiZero-S       03.14%     09.89%      07.30%     09.57%      10.87%     17.39%
DistClass4       04.62%     65.05%      12.87%     40.84%      36.96%     28.26%
MobiZero         04.52%     11.55%      12.00%     10.92%      28.26%     19.57%

The expectation with regards to zero-shot learning was that having more classes to classify into would lead to worse performance. However, the models for the Medium dataset outperformed those for the News dataset whilst having over double the number of classes. Two reasons have been identified that could explain these results. Firstly, having the category in the input text correlates with higher performance in this task. As shown in Table 6, the Couchbase dataset has a high percentage of entries with the category in the title, subtitle and text fields, whereas only between 3.47% and 5.83% of the entries in the News dataset do. Table 7 displays how many of these were predicted correctly, both for supervised learning and zero-shot learning, showing that most of the entries with the category in the input text were classified properly. For the Couchbase dataset, as mentioned earlier, using only the title yields better performance than using the other fields. In many instances, the subtitle and the text fields of this dataset contain other categories in the text due to how interrelated the topics of these categories are, which might confuse the model and lead to selecting the incorrect class.

Secondly, using abstract classes leads to poorer performance in this task than in supervised learning classification. This is because the model is not trained to learn the classes in the context of the dataset, and thus relies on the general meaning of the words learned through pre-training and fine-tuning on the MNLI dataset. This is the case for the 'IMPACT' category in the News dataset. As shown in Appendix H, the broad meaning of this word caused the model to misclassify multiple entries as this category. For instance, in the test set there are only 303 entries with 'IMPACT' as their class, yet 3,281 were given this class by the MobiZero model, of which only 93 were correctly classified. Several other common misclassifications, such as confusing 'PARENTS' with 'PARENTING', were found in both tasks. Therefore, zero-shot classification could be useful to identify those classes that are not clear without context, which could also lead to a better user experience.
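For reference, a minimal version of this zero-shot NLI setup with the Hugging Face pipeline is sketched below. The pipeline's default MNLI-fine-tuned checkpoint is used here, which is not necessarily the DistilBERT or MobileBERT model evaluated in this thesis, and the labels and input text are illustrative only.

```python
# Minimal zero-shot classification sketch using an MNLI-based pipeline.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

article = "Simple bedtime routines that help toddlers fall asleep faster."
candidate_labels = ["POLITICS", "SPORTS", "PARENTING", "IMPACT"]

result = classifier(article, candidate_labels)
print(result["labels"][0], result["scores"][0])  # top category and its score
```

Because the candidate labels are passed at inference time, categories can be added or removed without retraining, which is the flexibility discussed above.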

6 CONCLUSION AND FUTURE WORK

In this thesis, web content was classified into categories using multi-class text classification and zero-shot learning with state-of-the-art NLP models, and various experiments were performed. The outcomes of the research questions are:

Research question: Supervised learning models based on Transformers were able to successfully classify web content using their titles and/or short descriptions only, especially when enough training data was available and the number of categories was not excessive, specifically fewer than 50 categories.


Sub-question 1: BERT and DistilBERT had comparable performances both fully frozen and fully unfrozen, albeit DistilBERT was proven to be more efficient by having reduced training time, inference time, and GPU memory use. Training and inference were 50% faster with DistilBERT on a single GPU. For the smaller dataset, however, BERT outperformed DistilBERT by a larger margin.

Sub-question 2: Out of the three parameters studied, partially freezing the base model layers was found to be key to making the models more efficient whilst barely jeopardizing accuracy or F1-score. Batch size had a minimal impact on both effectiveness and training time, whereas raising the number of epochs surprisingly did not lead to better results for large datasets. Additionally, an efficiency metric was derived to objectively determine the optimal hyperparameters based on performance measures and training time, which is one of the main contributions of this thesis.

Sub-question 3: Using both the title and the subtitle as input text gave the best performance overall, followed by using the title. The subtitle was not found to be greatly informative on its own, leading to poorer performances in supervised learning and zero-shot learning. The title, although very short, proved to yield much better results for the Couchbase dataset than the other fields.

Sub-question 4: Compared to supervised learning, zero-shot classification as an NLI task with BERT-based models performed poorly, in particular with datasets containing very broad or abstract categories. However, it is an interesting approach that might be useful when training data is not available or when the flexibility of being able to dynamically add or remove categories is desired. Additionally, this could be combined with supervised learning classification to maximize the benefits of both approaches.

With regards to future work, the hyperparameter study could be expanded by adding learning rate as another parameter to investigate, since the entire study was done with the same learning rate. Another suggestion would be to perform a similar investigation of freezing layers for other state-of-the-art models such as RoBERTa, with multiple GPUs, or for other NLP tasks (including binary classification and multi-label classification) to determine whether the efficiency metric derived would also be applicable. Finally, another future direction would be to first fine-tune the model for zero-shot classification on a separate topic classification dataset, turning it into a few-shot learning problem. There are also larger models fine-tuned on the MNLI dataset that could be used for this task to obtain more accurate results, although inference time would be expected to increase massively.

REFERENCES

[1] Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020).
[2] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326 (2015).
[3] W Bruce Croft, Donald Metzler, and Trevor Strohman. 2010. Search engines: Information retrieval in practice. Addison-Wesley Reading.
[4] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248–255. https://doi.org/10.1109/CVPR.2009.5206848
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018).
[7] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016).
[8] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From Word Embeddings to Document Distances. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37 (ICML'15). JMLR.org, 957–966.
[9] Jaejun Lee, Raphael Tang, and Jimmy Lin. 2019. What would Elsa do? Freezing layers during transformer fine-tuning. arXiv preprint arXiv:1911.03090 (2019).
[10] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[11] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546 (2013).
[13] Rishabh Misra. 2018. News Category Dataset. https://doi.org/10.13140/RG.2.2.20331.18729
[14] Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. 2019. Hierarchical transformers for long document classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 838–844.
[15] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[16] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).
[17] Pushpankar Kumar Pushp and Muktabh Mayank Srivastava. 2017. Train once, test anywhere: Zero-shot learning for text classification. arXiv preprint arXiv:1712.05972 (2017).
[18] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
[19] Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5149–5152.
[20] Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. 2019. Green AI. arXiv preprint arXiv:1907.10597 (2019).
[21] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics. Springer, 194–206.
[22] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984 (2020).
[23] SURFsara. [n.d.]. Lisa Compute Cluster. https://userinfo.surfsara.nl/systems/lisa/
[24] Turbo. 2019. Dataset of 125,000 Medium Blog Post Titles and Subtitles (with Categories). https://github.com/turbo/medium125k.
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
[26] Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (New Orleans, Louisiana). Association for Computational Linguistics, 1112–1122. http://aclweb.org/anthology/N18-1101
[27] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
[28] Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. arXiv preprint arXiv:1909.00161 (2019).
[29] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792 (2014).
[30] Shanshan Yu, Su Jindian, and Da Luo. 2019. Improving BERT-Based Text Classification With Auxiliary Sentence and Domain Knowledge. IEEE Access 7 (11 2019), 176600–176612. https://doi.org/10.1109/ACCESS.2019.2953990
[31] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision. 19–27.


A EVALUATION METRICS

Precision indicates the number of correctly classified instances of a class out of all the predicted instances for that class. Recall, on the other hand, is the number of correctly classified instances of a class out of the number of true instances belonging to that class. They are calculated as shown in Equations 2 and 3, where TP is True Positive (instances that are classified as positive and are actually positive), FP is False Positive (instances that are classified as positive but are actually negative), and FN is False Negative (instances that are classified as negative but are actually positive). The F1-score is defined as the harmonic mean of precision and recall, as seen in Equation 4.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]

\[ \text{F1-score} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]
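As a quick sanity check, these metrics can be computed with scikit-learn; the weighted average shown corresponds to the weighted F1-score reported in the result figures, and the toy labels are purely illustrative.

```python
# Toy example of computing accuracy, precision, recall and weighted F1-score.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["TECH", "SPORTS", "TECH", "IMPACT", "PARENTING"]
y_pred = ["TECH", "TECH", "TECH", "IMPACT", "PARENTING"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```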

B CATEGORY DISTRIBUTION PER DATASET

Figure 4: Entries per category - News dataset.

Figure 5: Entries per category - Medium dataset.

Figure 6: Entries per category - Couchbase dataset.


C MODEL ARCHITECTURES - SCHEMATICS

Figure 7: Schematic of multi-class text classification (supervised learning) using BERT. This is the process for one article.

Figure 8: Schematic of multi-class text classification (supervised learning) using DistilBERT. This is the process for one article.


Figure 9: Schematic of zero-shot classification as a textual entailment task using DistilBERT. This is the process for one article. The 3 classes of the MNLI model are: C = contradiction, N = neutral, E = entailment.

Figure 10: Schematic of zero-shot classification as a textual entailment task using MobileBERT. This is the process for one article. The 3 classes of the MNLI model are: C = contradiction, N = neutral, E = entailment.


D STUDY ON FINE-TUNING HYPERPARAMETERS - GRAPHS

Figure 11: Test accuracy against training time for the News dataset with DistilBERT. The continuous curve indicates the optimal hyperparameters based on the efficiency metric (Equation 1).

Figure 12: Validation accuracy against training time for the News dataset with DistilBERT. The continuous curve indicates the optimal hyperparameters based on the efficiency metric (Equation 1).

Figure 13: Training accuracy against training time for the News dataset with DistilBERT.

Figure 14: Training accuracy against epochs for the News dataset with DistilBERT.

Figure 15: Training F1-score (weighted) against epochs for the News dataset with DistilBERT.

Figure 16: Validation accuracy against epochs for the News dataset with DistilBERT.


Figure 17: Validation F1-score (weighted) against epochs for the News dataset with DistilBERT.

Figure 18: Test accuracy against training time for the Medium dataset with DistilBERT. The continuous curve indicates the optimal hyperparameters based on the efficiency metric (Equation 1).

Figure 19: Validation accuracy against training time for the Medium dataset with DistilBERT. The continuous curve indicates the optimal hyperparameters based on the efficiency metric (Equation 1).

Figure 20: Training accuracy against training time for the Medium dataset with DistilBERT.

Figure 21: Training accuracy against epochs for the Medium dataset with DistilBERT.

Figure 22: Training F1-score (weighted) against epochs for the Medium dataset with DistilBERT.


Figure 23: Validation accuracy against epochs for the Medium dataset with DistilBERT.

Figure 24: Validation F1-score (weighted) against epochs for the Medium dataset with DistilBERT.
