
MASTER'S THESIS

UNIVERSITY OF GRONINGEN
DEPARTMENT OF ARTIFICIAL INTELLIGENCE

Quality Prediction of Scientific Documents Using Textual and Visual Content

Author:

Thomas Anton van Dongen

First Supervisor:

Prof. dr. L.R.B. Schomaker

Second Supervisor:

Dr. G.E. Maillette de Buy Wenniger

March 22, 2021

A B S T R A C T

Scholarly document prediction is an upcoming task which concerns automatic prediction of important features related to scientific documents, such as the quality, which is the focus of this thesis. With the increasing number of submissions to scientific journals and venues, it is becoming harder for publishers to keep up with the demand for adequate reviewers. Scholarly document quality prediction (SDQP) can help by automatically selecting papers that are likely to be of high and low quality. The task of SDQP is ambitious, requiring processing of very long documents. In this thesis, a combination of methods is proposed which aims to improve results on two sub-tasks related to SDQP, namely accept/reject prediction and citation prediction. All models learn solely from the textual and visual content of documents. A textual model called SChuBERT is proposed which uses a chunking method to extract embeddings from long documents using a pre-trained BERT model. In addition, a visual model called INCEPTIONGU is proposed, which is a modified version of an existing model. INCEPTIONGU uses gradual unfreezing during the training process to improve performance. These two models are combined in a model called SChuBERTJOINT which uses joint embeddings to further improve results. For the SChuBERTJOINT model, extensive experiments are performed to find the optimal concatenation method. Furthermore, experiments are performed to assess whether training in a multi-task learning setting can improve results by using both accept/reject and citation information. The accept/reject prediction task models are evaluated on the PeerRead dataset, while for the citation prediction task a new dataset called ACL-BiblioMetry is proposed that contains citation information for a large number of scientific documents. The results show that the SChuBERTJOINT model significantly improves performance on both tasks when compared to previous baselines. This shows how contextualized word embeddings from BERT and regularization techniques such as gradual unfreezing can boost performance for the SDQP task. The choice of concatenation method is important for the performance of the joint models. Multi-task learning does not improve results.

Keywords: scholarly document processing, accept/reject prediction, number of citations prediction, joint models, multi-modal learning, multi-task learning, BERT


A C K N O W L E D G E M E N T S

The day after I submitted my proposal for this thesis, the university shut down due to the rising number of Covid-19 cases in the Netherlands. It was not possible to write at the university facilities, and most of my meetings were online. In short, the circumstances during which I worked on my thesis were special.

However, while the circumstances were special, they did not feel detrimental to my thesis due to the excellent supervision that I got during the project. I would like to thank Lambert for introducing me to this subject and for his refreshing ideas during our meetings. I would like to thank Gideon for being the best possible supervisor I could have had during these special times. Gideon has supported me during my entire thesis and regularly helped in areas where I got stuck or ran out of ideas. We published two papers together, which I am also very grateful for as I am sure this opportunity would not have been given to me by every supervisor. I learned a lot of new skills from him, both in terms of programming as well as academic research and writing. Hopefully, Gideon and I can work together again sometime in the future.

I am also very grateful to my parents, friends and girlfriend who have always supported me, especially in these last few months where I was very busy with my thesis and my job and some distraction was greatly appreciated. Lastly, I would like to thank the staff of the Peregrine HPC cluster, who helped me several times when I encountered problems during my experiments which I performed on the cluster.


C O N T E N T S

Abstract i

Acknowledgements ii

1 I N T R O D U C T I O N 2

2 T H E O R E T I C A L B A C K G R O U N D 4

2.1 Machine Learning . . . 4

2.1.1 Supervised Learning . . . 4

2.1.2 Unsupervised Learning . . . 4

2.1.3 Transfer Learning . . . 5

2.1.4 Multi-task Learning . . . 5

2.2 Artificial Neural Networks . . . 7

2.2.1 Perceptrons . . . 8

2.2.2 Activation Functions . . . 9

2.2.3 Loss Functions . . . 11

2.2.4 Optimization . . . 12

2.2.5 Convolutional Neural Networks . . . 13

2.2.6 Recurrent Neural Networks . . . 15

2.3 Natural Language Processing . . . 17

2.3.1 Word Representations . . . 17

2.3.2 Encoder/Decoder . . . 18

2.3.3 Attention . . . 19

2.3.4 Transformers . . . 21

2.3.5 BERT . . . 22

2.4 Related Work . . . 25

3 M E T H O D S 27

3.1 Datasets . . . 27

3.1.1 PeerRead Dataset . . . 27

3.1.2 ACL-BiblioMetry Dataset . . . 28

3.1.3 AAPR Dataset . . . 28

3.2 Citation Labels . . . 28

3.3 Evaluation Metrics . . . 29

3.3.1 Accept/reject Prediction . . . 29

3.3.2 Citation Prediction . . . 29

3.4 Textual . . . 30

3.4.1 Textual Baseline Models . . . 30

3.4.2 SChuBERT . . . 30

3.5 Visual . . . 32

3.5.1 Building the dataset . . . 32

3.5.2 Visual Baseline Models . . . 33

3.5.3 INCEPTIONGU . . . 34


3.5.4 Gradual Unfreezing . . . 34

3.6 Joint . . . 35

3.6.1 Joint Baseline Models . . . 35

3.6.2 SChuBERTJOINT . . . 35

3.7 Multi-task Learning . . . 36

3.8 Hyper-parameters and Training Techniques . . . 37

3.9 Experiments . . . 38

3.9.1 Accept/Reject and Citation Prediction . . . 38

3.9.2 Sequence Length and Overlap . . . 39

3.9.3 Concatenation Method . . . 39

3.9.4 Multi-task Learning . . . 39

4 R E S U L T S 41

4.1 Textual results . . . 41

4.1.1 Accept/reject prediction . . . 41

4.1.2 Citation prediction . . . 41

4.1.3 Sequence length and overlap . . . 42

4.2 Visual results . . . 43

4.2.1 Accept/reject prediction . . . 43

4.2.2 Citation prediction . . . 43

4.3 Joint results . . . 44

4.3.1 Accept/reject prediction . . . 44

4.3.2 Citation prediction . . . 44

4.3.3 Concatenation method . . . 44

4.4 Multi-task learning . . . 45

5 D I S C U S S I O N A N D C O N C L U S I O N 46

5.1 Discussion . . . 46

5.1.1 Accept/reject Prediction . . . 46

5.1.2 Citation Prediction . . . 46

5.1.3 Adapting the Transformer for Long Documents . . . 47

5.1.4 Leveraging Longer Documents and Larger Datasets . . . 47

5.1.5 Effects of Joint Models . . . 48

5.1.6 Concatenation Method . . . 48

5.1.7 Effects of Multi-task Learning . . . 49

5.2 Conclusion . . . 50

5.2.1 Contributions to AI . . . 50

5.2.2 Future Research . . . 51

A A P P E N D I X 60

1 I N T R O D U C T I O N

In recent years, the number of published scientific papers has continued to increase. According to the STM report [42], the number of English scientific papers published each year is now around 3 million and continues to increase by 4% per year. The number of academic journals is also increasing by 5% per year. However, while an increasing number of scientific papers can be seen as a good thing, it also has several downsides. Since every paper needs to be peer-reviewed, it is hard to keep up with the need for adequate reviewers. Consequently, research journals and conferences have to expand the number of reviewers or the same reviewers have to review more papers. This can both lead to a decrease in quality for peer reviews as well as cases where good papers are neglected. As mentioned in [20], it can also lead to reviewers resorting to heuristics, which in turn leads to papers which present ideas that do not resemble more established works being neglected. In the most extreme cases, it can cause journals to stop accepting submissions for a period of time. In 2018, the Review of Higher Education posted a message stating that they were no longer accepting submissions for a period of time due to a two-year backlog of scientific papers which had not been reviewed yet, showing how problematic this trend is, especially for top-tier venues and journals which receive the most submissions.

A solution to this problem is automatic prediction of the quality of scientific papers. By predicting the quality of papers, reviewers do not have to review each paper but can instead focus on papers which are predicted as high quality and neglect papers which are predicted as low quality. Scientific quality prediction is a novel task in the field of artificial intelligence (AI). While there are many indicators of "quality", there is no straightforward way to indicate the quality of a paper, making the task both challenging and interesting. Intuitively, a paper can be defined as high quality when it presents a novel idea, is written in a clear and correct manner and is formatted correctly. While this definition seems simple, these indicators are highly subjective. For example, the norm for writing style and formatting can differ per scientific field. For this reason, most previous research instead focuses on scientific impact, which can be more objectively defined by measures such as the number of citations a paper receives, or whether the paper was accepted at a top-tier venue.

In most of the current research, scientific impact is predicted based on meta-features, such as the current citations or the authors' h-index ([14], [84], [2], [9]). However, while meta-features can be used to accurately predict several measures of impact, we believe that quality depends on more than just meta-features. While using meta-features might lead to good results, it is important to consider two things. Firstly, some meta-features are not yet available when a paper is published. For example, a method which predicts citations based on previous citations only works for papers which already have citations. Secondly, quality should not depend on meta-features such as the h-index of the authors, but rather on the content of the document. For this reason, in this thesis, predictions are made purely based on the textual and visual content of papers. The task is thus referred to as "Scientific document quality prediction", where quality is defined based on two available types of data which are easily accessible for a large number of documents: the number of citations, and whether a paper was accepted at a top-tier venue.


To illustrate how a "good", or "high quality", paper is defined, we take the most cited Google Scholar paper of 2020, which is "Adam: A Method for Stochastic Optimization" [46]. The paper was also accepted at ICLR, which is arguably the most prestigious conference for AI. Based on these two measures, this should be a "good", or "high quality", paper. When looking at this paper, without knowledge of the field, it can already be concluded that the graphs and tables are clear, the paper is divided into clear sections, and the authors provide proof of their method. When reading the paper, the method is clearly explained and several advantages of the proposed algorithm are given. The results show that the algorithm works better than previous algorithms. These are all indicators of quality. To illustrate how a "bad", or "low quality", paper is defined, we take the paper "App2Check and Tweet2Check: machine learning-based tools for Sentiment Analysis of Apps Reviews and Tweets", of which a page is shown in the Appendix. The paper was not accepted for any venue and does not have any citations. Formatting-wise, the graphs are almost unreadable and the tables are very cluttered. When reading the text, the authors remain very vague about their methods, stating that "It is not possible to give more details on the engine due to non-disclosure restrictions.". These two examples should provide some idea of how the quality of papers can be linked to the number of citations and whether the paper was accepted at a (top-tier) venue.

In this thesis, the goal is to answer the following research question: Can the quality of scientific documents be predicted by only using the textual and visual content of the documents? To answer this question, the following sub-questions are answered:

• How can the Transformer be adapted to process long documents?

• How can we leverage longer documents and larger datasets for improving scientific document quality prediction?

• Can joint models improve performance for predicting scientific document quality?

• Which joint embedding concatenation method is best for the scientific document quality prediction task?

• Can multi-task learning improve performance on the scientific document quality prediction task?

The structure of this thesis is defined as follows. In chapter 2, the theoretical background is described. This includes sections on machine learning, neural networks and natural language processing, which are relevant for this thesis. The related work relevant for this thesis is described in section 2.4. In chapter 3, the methods and experiments are described. This includes a description of the used datasets, as well as detailed descriptions of all the used baselines and proposed models. In chapter 4, the results for the described experiments are shown. Finally, in chapter 5, the results are interpreted with regard to the research questions in the discussion and a conclusion is given. In addition, the contributions of this thesis to the field of AI are given and possibilities for future research are described.

2 T H E O R E T I C A L B A C K G R O U N D

In this chapter, the theoretical background is described for the techniques used in this thesis. First, a brief general overview is given for the basic concepts of machine learning in section 2.1. After this, a detailed explanation is given for the underlying techniques used in artificial neural networks in section 2.2, on which most of the models used in this thesis are based. An extensive explanation is given for natural language processing techniques in section 2.3, as these are the most relevant part for this thesis. Lastly, the related work is described in section 2.4.

2.1 M A C H I N E L E A R N I N G

Machine learning is a sub-field of AI which focuses on computer algorithms that learn from experience. [57] define the learning process as follows: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." Broadly speaking, machine learning algorithms can be categorized into three categories: supervised learning, unsupervised learning and reinforcement learning. Of these three, supervised and unsupervised learning are relevant for this thesis. Furthermore, two other types of machine learning are relevant for this thesis: transfer learning and multi-task learning.

2.1.1 Supervised Learning

Supervised learning is a sub-field of machine learning in which the learning task is to map an input x to an output y. x can be anything such as an image, a text or a list of numbers. y is a label which is usually created by a human annotator or supervisor, hence the name "supervised" learning. There are many forms of supervised learning of which classification and regression are the most relevant for this thesis.

In classification tasks, the goal is for the algorithm to determine which of a finite number of categories an input belongs to. Classification has been widely adopted for many applications such as object recognition (where the input is an image and the output is a description of what object is displayed in the image) and movie text classification (where the input is a text and the output is the genre of the movie).

In regression tasks, the goal for the algorithm is to predict a continuous (numerical) output given an input. The task is similar to classification in the sense that both can have the same input; however, the output is different. An example of regression is predicting the price of a house over multiple years, which is a continuous output.

2.1.2 Unsupervised Learning

Unsupervised learning is a sub-field of machine learning where, unlike in supervised learning, the task does not require a label. The goal is to learn and extract useful patterns from the input x. An example of this is clustering, where the algorithm attempts to divide the data into meaningful clusters based on similarities between examples. Another is Principal Component Analysis (PCA), where the goal is to learn a representation of the data with a lower dimensionality. Recently, unsupervised learning has been applied for numerous natural language processing tasks such as distributed word representation learning, which is explained in detail in section 2.3.1.

2.1.3 Transfer Learning

Transfer learning attempts to improve performance on a target task by leveraging learned information from a source task. Assuming that two tasks are sufficiently related, knowledge from the source task can be used to optimize the target task. This is usually done in two steps: first, the transfer learning model is pre-trained. This can be done in an unsupervised manner, which is one of the main benefits as this allows for very large amounts of data to be used during training. After this, the pre-trained model is used to improve performance on a new task. This can be done in multiple ways. In this thesis, the terminology from [52] is used for the description of different transfer learning methods. In general, three sets of weights are relevant in a transfer learning setting:

• θs: the weights of the shared layers in a pre-trained model.

• θo: the weights of the output layers in a pre-trained model.

• θn: the randomly initialized weights for the new task layers to be optimized in a transfer learning setting.

The goal in any transfer learning setting is to learn θn by using information from θs. There are three main approaches for this:

• Feature Extraction: the weights of the pre-trained model (θs and θo) are frozen and the features of an intermediate layer are used in a downstream model to learn θn. This approach is computationally cheap as the pre-trained model, which is generally a complex architecture, does not have to be optimized. Furthermore, extracted features can be saved and used in different settings without having to extract them multiple times.

• Fine-tuning: the weights of θs are unfrozen and optimized together with θn, while θo is left frozen. This approach is computationally more expensive, but can lead to better results when the features from the pre-trained model do not generalise well for the new task.

• Multi-task Learning: also known as joint training, the weights of θs, θo and θn are all optimized simultaneously. This approach is computationally expensive but can lead to good results when the amount of data for θn is very small, in which case it can be beneficial to interleave samples for θn with samples used in optimizing θs and θo. This approach is explained in detail in the next subsection; a minimal code sketch contrasting feature extraction and fine-tuning follows this list.
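To make the distinction between these settings concrete, the following is a minimal PyTorch sketch (not the implementation used in this thesis) contrasting feature extraction and fine-tuning. The ResNet-18 backbone and the two-class output head are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the thesis code): feature extraction vs. fine-tuning.
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(pretrained=True)          # provides the pre-trained weights (theta_s, theta_o)

# Feature extraction: freeze all pre-trained weights and train only the new head (theta_n).
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, 2)  # new, randomly initialized task layer

# Fine-tuning: additionally unfreeze the shared layers so theta_s is optimized with theta_n.
for param in backbone.parameters():
    param.requires_grad = True
```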

2.1.4 Multi-task Learning

Multi-task learning (MTL) is related to transfer learning, but instead of using a source task to improve performance on a target task, multiple related tasks are learned at the same time. Both transfer learning and multi-task learning are becoming more popular due to their similarities to human-like learning, where learning is not a straightforward mapping from x to y but a simultaneous process in which many tasks are learned at the same time and knowledge from one domain is used to learn a task in a new related domain. For example, many languages share or borrow words and grammatical rules from each other. By learning to translate from one language to multiple languages simultaneously, a model can learn to extract these common rules and generalize better. Machine translation is also one of the fields where multi-task learning has shown great results [27].

From a machine learning perspective, MTL introduces an inductive bias to the learning process which helps generalisation. By introducing more tasks to a model, the model is forced to optimize its weights for all tasks, which reduces the chance of overfitting on one task. In general, MTL for deep learning involves any network where multiple losses are optimized. As defined in [15], "MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks". As described in [68], there are two common methods of performing MTL in a deep learning setting, namely hard and soft parameter sharing. In hard parameter sharing, which is the most common approach, some of the early hidden layers share all weights between all tasks, while some of the later layers are task-specific, as shown in figure 1.

Figure 1: Example of hard parameter sharing in a multi-task learning setting. Image from [68].

In soft parameter sharing, a separate model is used for each task and the weights are not directly shared. However, the distances between the weights in all layers are regularized to ensure that they remain similar to some extent. An example of this method is shown in figure 2.


Figure 2: Example of soft parameter sharing in a multi-task learning setting. Image from [68].

There are also multiple ways to present the data to the model. In joint training, a single dataset is used with multiple labels for each input. For example, a dataset of animal images might contain a set of labels, each indicating whether a certain animal is present in the input. Since animals share certain characteristics (eyes, ears etc.) the model might benefit from a MTL setting. In alternate training, multiple datasets are used with related tasks. In this case, the assumption is that the input data and the tasks are similar. For example, a dataset of English to French translation and a dataset of English to German translation might be suitable for this training strategy since there will be similarities between the tasks that the model has to learn. In both joint training and alternate training, it is important to have somewhat balanced datasets. For joint training, this means that all labels should be present somewhat equally, while in the case of alternate training, this means that the datasets should be of approximately equal size. If this is not the case, the model could become biased towards the label which occurs more.
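As an illustration of hard parameter sharing, the following is a minimal PyTorch sketch (not the architecture used in this thesis): a shared trunk feeds two task-specific heads, loosely mirroring an accept/reject and a citation task. All layer sizes are illustrative.

```python
# Minimal sketch of hard parameter sharing (illustrative sizes, not the thesis architecture).
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=64):
        super().__init__()
        # Shared early layers: their weights are updated by the losses of all tasks.
        self.shared = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # Task-specific heads.
        self.accept_head = nn.Linear(hidden_dim, 2)    # e.g. accept/reject logits
        self.citation_head = nn.Linear(hidden_dim, 1)  # e.g. a citation score

    def forward(self, x):
        h = self.shared(x)
        return self.accept_head(h), self.citation_head(h)

model = HardSharingMTL()
accept_logits, citation_pred = model(torch.randn(4, 128))
# During training, the (possibly weighted) sum of both task losses would be backpropagated.
```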

2.2 A R T I F I C I A L N E U R A L N E T W O R K S

The Artificial Neural Network (ANN), or simply Neural Network (NN), is one of the most important concepts of modern deep learning. The ANN is loosely based on neurons in the human brain which are responsible for processing sensory inputs. An extension of the ANN, named the deep neural network (DNN), uses multiple layers. In the following sections, a brief description is given of how ANNs are able to learn. Furthermore, a description is given for two relevant DNN architectures, namely the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN).


2.2.1 Perceptrons

The perceptron, which was first proposed in [67], is the simplest implementation of a NN. The perceptron is a single layer neural network which produces a single output. Its goal is to learn a linear separation for the (binary) input samples. The function learned by a perceptron can be described as equation 1.

f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{if } w \cdot x + b \leq 0 \end{cases}    (1)

Here, x is a vector of one or multiple input values, w is a vector of weights of equal dimensionality and b the bias, which is a weight that is independent of the input values. The bias b can be compared to the intercept in a linear equation. By adding a constant to the function, the perceptron can move its decision boundary up or down, since the linear separation would otherwise always have to pass through the origin. In its simplest form, a perceptron is trained by sending the inputs x through the network. This is called a forward pass, or forward propagation. The weights are then updated using the function described in equation 2.

w_i = w_i + \eta (y_i - \hat{y}_i) x_i    (2)

Here, w_i represents a weight at position i, y_i the target value, ŷ_i the predicted value, x_i the input value and η the learning rate, which is a positive number between 0 and 1.
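The following is a minimal NumPy sketch of equations 1 and 2 on a toy, linearly separable AND problem; the data and learning rate are illustrative assumptions.

```python
# Minimal sketch of the perceptron update rule (equation 2) on a toy AND problem.
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # inputs
y = np.array([0, 0, 0, 1])                                      # AND labels (linearly separable)
w = np.zeros(2)   # weights
b = 0.0           # bias
eta = 0.1         # learning rate

for epoch in range(10):
    for x_i, y_i in zip(X, y):
        y_hat = 1 if np.dot(w, x_i) + b > 0 else 0   # forward pass, equation 1
        w = w + eta * (y_i - y_hat) * x_i            # weight update, equation 2
        b = b + eta * (y_i - y_hat)                  # bias update
```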

The practical capabilities of the perceptron are quite limited, as the perceptron is only able to learn functions with a linear decision boundary, as shown in [59]. Notably, the authors show that the perceptron is not able to learn a separation for the simple XOR function, as it is not possible to separate this function using one linear decision boundary. For this reason, the Multi-layer Perceptron (MLP) was introduced. The MLP can solve non-linearly-separable problems. This is done by adding one or more hidden layers, hence the name multi-layer. Each hidden layer consists of a number of perceptrons, or neurons. An example of a simple MLP is shown in figure 3. Here, a single hidden layer is used and one output node. All the layers in an MLP are fully connected, meaning that each input is connected to each weight in the hidden layer, and each weight in hidden layer n to each weight in hidden layer n+1. Unlike the perceptron, the equation described in equation 2 is not sufficient for learning the weights in a network with hidden units, as noted in [69]. To effectively optimize the weights in an MLP, a method called backward propagation, or backpropagation, is used, which is explained in detail in section 2.2.4.


Figure 3: Example of a multi-layer perceptron with one hidden layer and a single output node.

2.2.2 Activation Functions

Activation functions are the last part of a neural network layer. After calculating the weighted sum of the input and adding a bias, the activation function decides whether the value should be activated, or "fired". The activation function is needed for two reasons: firstly, since the weights in a NN are multiplied by the input values, which is a linear operation, it would be impossible to learn a non-linear function. The activation function introduces a non-linear operation, which solves this. Secondly, since the output of the NN before the activation function can be anything from −∞ to ∞, a non-linear activation function ensures that the final output is bound to a certain range of values, depending on which activation function is used, which can in turn help the learning process. A large number of activation functions exist, of which only the ones relevant for this thesis are described. Plots of the various activation functions are shown side by side in figure 4.

• Linear activation: The simplest form of activation is the linear activation function, which is described in equation 3. This function has an output ranging from −∞ to ∞. Since the derivative of the function is a constant, it adds no non-linearity that hidden layers could exploit during back-propagation. However, it is suitable in the output layer for tasks with continuous outputs or regression tasks which can take any value between −∞ and ∞, such as citation prediction.

f(x) = x    (3)

• Sigmoid activation: The sigmoid, or logistic activation, is an activation function for which the outputs range from 0 to 1. The sigmoid function is described in equation 4. Since a sigmoid activation has an output between 0 and 1, it is suitable for the last layer in binary classification tasks where the goal is to compute the probability that an input belongs to a certain class.

\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}    (4)


• Tanh activation: The tanh activation is a commonly used activation function for which the outputs range from -1 to 1. The tanh is similar to the sigmoid, except that it is centered around zero. The tanh function is described in equation 5.

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (5)

• Softmax activation: The softmax activation function, described in equation 6, is commonly used in multi-class classification tasks since it exponentiates the output for each class and divides by the sum over all classes, creating a probability distribution.

\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}    (6)

• ReLU activation: The ReLU (or Rectified Linear Activation) function, described in equation 7, resembles a linear activation, but it becomes 0 when x is less than 0, making it non-linear. ReLU is arguably the most used activation function due to the fact that it is computationally very efficient when compared to other activation functions while still being a non-linear function. Furthermore, ReLU converges faster than the traditionally used sigmoid and tanh, as shown in [48]. For these reasons, ReLU is used as the activation for most layers in the models proposed in this thesis. A small numerical sketch of these activation functions follows this list.

\mathrm{ReLU}(x) = \max(0, x)    (7)
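Below is a minimal NumPy sketch of the activation functions in equations 3 to 7; the input values are illustrative.

```python
# Minimal sketch of the activation functions in equations 3-7.
import numpy as np

def linear(x):   return x                          # equation 3
def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))   # equation 4
def tanh(x):     return np.tanh(x)                 # equation 5
def softmax(x):                                    # equation 6
    e = np.exp(x - np.max(x))                      # shift for numerical stability
    return e / e.sum()
def relu(x):     return np.maximum(0.0, x)         # equation 7

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (linear, sigmoid, tanh, softmax, relu):
    print(f.__name__, f(x))
```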


Figure 4: Plots of the linear, ReLU, sigmoid and tanh activations.

2.2.3 Loss Functions

The goal of a loss function is to measure the difference between the predicted label ŷ and the target label y. Since the loss function output is the value which is passed back through the network using backpropagation and used to update its weights, it is important to use an appropriate loss function for the task. For the purpose of this research, two categories of loss functions are important, namely classification losses and regression losses. A classification loss deals with a finite number of output classes, while a regression loss deals with a continuous output.


For classification, one loss is relevant, namely cross entropy (CE). CE loss is based on the concept of entropy, which measures the expected amount of information of a variable. CE loss calculates the difference between two probability distributions, making it suitable as a loss function for models which use the softmax activation function as the last activation function of the network. The CE loss function is described in equation 8. Here, y_i is the target label, ŷ_i the predicted label and n the number of examples.

\mathrm{CE} = -\frac{\sum_{i}^{n} y_i \log(\hat{y}_i)}{n}    (8)

For regression, two losses are relevant, namely mean squared error (MSE) and mean absolute error (MAE) loss. MSE loss is calculated as the average of the squared difference between target labels and predicted labels. The square operator means that predicted labels which lie very far from the target labels are penalized heavily. This can cause outliers to have a very large effect on the calculated loss. This makes sense for problems where a quadratic rather than linear increase of the penalty for errors is suitable, but for a problem where being far off is not much worse than being off by a little, MAE is more appropriate. The MSE loss function is described in equation 9.

\mathrm{MSE} = \frac{\sum_{i}^{n} (y_i - \hat{y}_i)^2}{n}    (9)

MAE loss is calculated as the average of the absolute difference between target labels and predicted labels. Since MAE does not use the square operator, it is more robust to outliers. The MAE loss function is shown in equation 10. Note that the absolute difference between y_i and ŷ_i is taken since otherwise the loss can be negative. In MSE, this is solved by squaring the difference.

\mathrm{MAE} = \frac{\sum_{i}^{n} |y_i - \hat{y}_i|}{n}    (10)
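A minimal NumPy sketch of equations 8, 9 and 10 follows; the example labels and predictions are illustrative.

```python
# Minimal sketch of the cross entropy (8), MSE (9) and MAE (10) losses.
import numpy as np

def cross_entropy(y, y_hat):            # equation 8: y one-hot targets, y_hat softmax outputs
    return -np.sum(y * np.log(y_hat)) / len(y)

def mse(y, y_hat):                      # equation 9
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):                      # equation 10
    return np.mean(np.abs(y - y_hat))

y_cls = np.array([[1, 0], [0, 1]])              # one-hot class targets
y_hat_cls = np.array([[0.8, 0.2], [0.3, 0.7]])  # predicted probabilities
y_reg = np.array([2.0, 10.0, 4.0])              # regression targets
y_hat_reg = np.array([2.5, 7.0, 4.0])           # regression predictions

print(cross_entropy(y_cls, y_hat_cls), mse(y_reg, y_hat_reg), mae(y_reg, y_hat_reg))
```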

2.2.4 Optimization

To effectively update the weights of a neural network, the backpropagation algorithm is used, which is a special type of reverse automatic differentiation. During backpropagation, a technique called gradient descent is used, where the gradient of the loss with respect to the network weights is computed by using the chain rule. Mathematically speaking, a weight w_i is updated as shown in equation 11.

w_i = w_i + \Delta w_i, \quad \text{where } \Delta w_i = -\eta \frac{\partial E}{\partial w_i}    (11)

Δw_i is thus defined as the partial derivative of some loss or error E with respect to some weight w_i, multiplied by the learning rate η. This can be seen as a function which determines a weight w_i that minimizes the error E. Since the gradient measures in which direction E increases the most, the negative is taken so that the algorithm moves in a direction where E decreases. Using backpropagation, gradient descent thus moves through the network, updating all weights in the direction where E is minimized. This method is performed until the loss function converges to a global or local minimum.
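As a concrete illustration of equation 11, the following is a minimal NumPy sketch of gradient descent on a single weight of a linear model with an MSE error; the data and learning rate are illustrative.

```python
# Minimal sketch of the gradient descent update in equation 11 for a single weight.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # targets of the underlying function y = 2x
w = 0.0
eta = 0.05                      # learning rate

for step in range(50):
    y_hat = w * x
    grad = np.mean(2 * (y_hat - y) * x)   # dE/dw for the MSE error E
    w = w - eta * grad                    # w := w + delta_w, with delta_w = -eta * dE/dw

print(w)  # converges towards 2.0
```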


Since it is computationally expensive to consider the entire dataset when performing a single update to a weight, a variant of gradient descent called stochastic gradient descent (SGD) is often used. In SGD, weights are updated based on each single sample. Since weights are updated more frequently, SGD causes the model to converge faster. It also requires less memory, since the loss value of just one sample has to be stored instead of the loss values of the entire dataset for a single update. However, because each SGD step is based on only a single sample of the dataset, the updates can be very noisy, which in turn can lead the gradient in the wrong direction and cause high variance. To counter the high variance in SGD, a technique called momentum is applied, which helps the model move faster in the correct direction. This is done by considering not only the current gradient for finding the correct direction, but also the gradients of previous steps.

Another variant of GD is called mini-batch gradient descent and can be viewed as a combination of GD and SGD. In mini-batch gradient descent, a subset of the entire dataset is taken (a mini-batch) and used to update weights. Iteratively, the entire dataset is used to optimize the weights. In practice, this is the most used technique since it provides benefits from both techniques and minimizes the disadvantages.

A number of different optimizers have been proposed which extend upon the idea of gradient descent. Adam (Adaptive Moment Estimation) [46] is a variant which, in addition to the global learning rate, uses an exponentially decaying average of past gradients and an exponentially decaying average of past squared gradients to compute an adaptive learning rate. In this thesis, Adam is used as the optimization method for all models since it is able to achieve state-of-the-art results, as shown in [46].

2.2.5 Convolutional Neural Networks

Convolutional Neural Networks (CNN) [50] are a special type of neural network which excel at learning from high-dimensional data. When the dimensionality of an input is large, in the case of e.g. an image, an MLP would have to optimize an extremely large number of weights. The CNN drastically reduces the dimensionality of the input by moving a sliding window over the input. The current input of the sliding window is convolved with a trainable matrix of the same size as the sliding window, which is called the kernel. The output is then summed, leading to a single output for each step of the sliding window. The step size of the sliding window is called the stride. Furthermore, a padding method is used to deal with edges where there is not a full input for the sliding window. An example is shown in figure 5. Here, the green matrix is the input, in which the yellow matrix is the current sliding window patch. This is multiplied with the kernel, which results in a single output in the output matrix. In this example, only the four weights of the kernel have to be optimized instead of sixteen weights, which would be the case when an MLP is used. Furthermore, the weights of the kernel, which are shared for the entire input, cause the CNN to be translation invariant. This means that features, or objects, can be recognized regardless of their position in the input.


Figure 5: Example of a convolution of a single image patch and a kernel. The highlighted 2×2 input patch [[1, 0], [0, 1]] is convolved with the kernel [[3, 4], [1, 3]], giving 1·3 + 0·4 + 0·1 + 1·3 = 6 in the output matrix.

In addition to convolutional layers, CNNs often use pooling layers to reduce the dimensionality even further. A pooling layer moves a sliding window over the input just like a convolutional layer. However, instead of multiplying the input with a trainable kernel, the pooling layer reduces the input by applying a pooling method, such as max-pooling. In the case of max-pooling, the highest number in the input window is taken as the output.
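The following is a minimal NumPy sketch of a single-channel convolution (stride 1, no padding) followed by 2×2 max-pooling, reusing the input and kernel values from figure 5; it is an illustration, not an efficient implementation.

```python
# Minimal sketch of a convolution and max-pooling, using the values from figure 5.
import numpy as np

image = np.array([[1, 0, 0, 1],
                  [0, 1, 1, 1],
                  [0, 1, 0, 1],
                  [1, 1, 0, 0]])
kernel = np.array([[3, 4],
                   [1, 3]])

def conv2d(img, k):
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)  # element-wise product, then sum
    return out

def max_pool2d(img, size=2):
    rows, cols = img.shape[0] // size, img.shape[1] // size
    return np.array([[img[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
                      for j in range(cols)] for i in range(rows)])

feature_map = conv2d(image, kernel)   # feature_map[0, 0] == 6, as in figure 5
pooled = max_pool2d(feature_map)      # 2x2 max-pooling (the incomplete border of the 3x3 map is dropped)
```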

While simple CNNs provide a solution to the dimensionality problem of MLPs to some extent, it is still hard to learn from large datasets with high-resolution images. For this reason, numerous network architectures have been proposed which can be classified as "very deep convolutional networks". AlexNet [48] was the first network which proved the power of such networks by winning the ImageNet [25] challenge of 2012. After this, networks such as VGG [76] and ResNet [35] improved performance even further. In this thesis, the Inception V3 [79] network is used for the visual model, since [74] showed the effectiveness of the network for the accept/reject prediction task and the (fairly similar) Wikipedia document quality prediction task.

The Inception network consists of a number of Inception modules, or blocks. An example of such a block, which was first used in Inception V1 [78], is shown in figure 6. The main innovation of Inception V2 and V3 is the use of smarter blocks, but they largely resemble the block shown in figure 6. As can be seen, multiple convolutional layers are used on the same level, instead of sequentially. This makes the network wider, rather than deeper, which helps both with computational complexity as well as overfitting. Furthermore, by performing convolutions with different kernel sizes simultaneously, the network is able to learn salient parts of images with various sizes. Since visual renderings of scholarly documents are large inputs, it is important to use a network which is computationally efficient. Furthermore, the ability to learn salient parts of various sizes is attractive since important visual features in scholarly documents, such as tables and graphs, can vary widely in size.


Figure 6: An Inception block as used in the Inception network. Image from [78].

2.2.6 Recurrent Neural Networks

Unlike feed-forward neural networks such as the MLP and CNN, the Recurrent Neural Network (RNN) [69] learns from sequential data. While it is possible to pass all data samples in a sequence through a feed-forward neural network one by one, in many situations there is a logical order to the data, for example words in a sentence or time series data. In this case, an RNN is more appropriate. The input x to an RNN is presented as [x_1, x_2, ..., x_t], with t representing the time step. Instead of simply passing a single input through the network, all elements of x are sent through the RNN as a sequence. The hidden states then consider both the current as well as previous inputs. There are many variations of RNNs, but for this research the most relevant form consists of an RNN with hidden-to-hidden recurrent connections where the input sequence is mapped to a single output. A simple form of such an RNN is shown in figure 7. Since updates in an RNN depend on the last hidden state, forward propagation is different from the standard MLP. Starting at initial hidden state h^(0), the equations shown in equation 12 are applied at each time step (equations from [33]).

a^{(t)} = b + W h^{(t-1)} + U x^{(t)}
h^{(t)} = \tanh(a^{(t)})
o^{(t)} = c + V h^{(t)}
\hat{y}^{(t)} = \mathrm{softmax}(o^{(t)})    (12)

Here, x^(t) represents the input at time step t. The hidden state h^(t) at time step t is calculated based on the bias b, a weight matrix W which represents hidden-to-hidden connections and a weight matrix U which represents input-to-hidden connections. The hidden state is then computed by applying an activation function (tanh in equation 12) to this sum. The output of the network o^(t) at time step t is computed based on a bias c and a weight matrix V which represents hidden-to-output connections, multiplied by the hidden state. Finally, the output label ŷ^(t) at time step t is computed by applying an appropriate activation to o^(t), such as the softmax in the case of multi-class classification. After the error is calculated based on some loss function, the error is sent back through the network using backpropagation. However, this is where simple RNNs such as the one shown in figure 7 run into problems. As discussed in section 2.2.4, during backpropagation, the gradient is calculated based on the gradient of the layer before it. If the update to the last layer was small (< 1), the update to the current layer becomes even smaller. Similarly, when the update is large (> 1), the gradient becomes larger. Since RNNs are composed of a number of hidden layers which scale with the input length, this becomes problematic for long inputs. After continuously multiplying the gradient, one of two things can happen: either the gradient becomes extremely small, in which case the weights are no longer updated; this is called a vanishing gradient. Alternatively, if the gradient becomes extremely large, the weight updates become so large that any previously learned information is destroyed; this is called an exploding gradient. Both of these phenomena make it hard for RNNs to learn long-term dependencies.
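A minimal NumPy sketch of the forward pass in equation 12 is given below, using random weights and a random input sequence; a real RNN would learn W, U, V, b and c with backpropagation.

```python
# Minimal sketch of the RNN forward pass in equation 12 (random weights and inputs).
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

input_dim, hidden_dim, output_dim = 4, 8, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights
U = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
V = rng.normal(size=(output_dim, hidden_dim))   # hidden-to-output weights
b = np.zeros(hidden_dim)
c = np.zeros(output_dim)

x_seq = rng.normal(size=(5, input_dim))   # a sequence of 5 time steps
h = np.zeros(hidden_dim)                  # initial hidden state h^(0)
for x_t in x_seq:
    a = b + W @ h + U @ x_t               # a^(t)
    h = np.tanh(a)                        # h^(t)

o = c + V @ h                             # o^(t), single output after the last step
y_hat = softmax(o)                        # yhat^(t)
```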

Figure 7: Example of an RNN with a single output. Image from [3].

To combat the vanishing gradient problem, a network called the long short-term memory network (LSTM) [38] was proposed. In addition to a hidden state h, the LSTM uses a memory cell c which controls what information is passed on to h. This is done by using three gates: a forget gate f, an input gate i and an output gate o. f determines whether information from the previous cell should be kept or removed, i determines what part of the newly computed information should be used in c and o determines how much of c should be used to compute the next hidden state. Another network called the gated recurrent unit (GRU) [19] is similar to the LSTM, but instead of three gates, the GRU only uses two gates: an update gate u and a reset gate r. u determines how much previous information should be passed to the next hidden state and r determines how much of the past information can be forgotten. An example of an LSTM and a GRU cell is shown in figure 8. Performance-wise, the LSTM and GRU are similar, as shown in [21]. However, since a GRU has fewer trainable parameters, it is faster to train and less prone to overfitting. For this reason, the GRU is used in this thesis.
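As a usage illustration, the following is a minimal PyTorch sketch of a GRU that maps a sequence to a single continuous output, in the spirit of the recurrent models used later in this thesis; all dimensions are illustrative.

```python
# Minimal sketch of a GRU mapping a sequence to a single output (illustrative dimensions).
import torch
import torch.nn as nn

gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)                  # single continuous output, e.g. a score

x = torch.randn(8, 20, 16)               # batch of 8 sequences, 20 time steps, 16 features
outputs, h_n = gru(x)                    # h_n: final hidden state, shape (1, 8, 32)
prediction = head(h_n.squeeze(0))        # one prediction per sequence
```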


Figure 8: An LSTM and a GRU cell. Yellow operators represent sigmoid functions and blue operators represent tanh functions. Image adapted from [28].

2.3 N A T U R A L L A N G U A G E P R O C E S S I N G

Natural Language Processing (NLP) is a sub-field of AI which concerns the processing and understanding of natural language. In this section, all parts of NLP which are relevant for this thesis are explained. First, a general overview of several older word representation algorithms is given. Then, the concepts of the encoder/decoder and attention are explained, which are both important for the Transformer. After this, a detailed description of BERT is given.

2.3.1 Word Representations

Representing natural language in a format that is usable for computers is a big challenge for NLP. Since all types of neural networks work with numerical inputs which are multiplied by weight matrices, it follows that natural language needs to be transformed to the numeric domain before it can be used as input for a neural network. Many methods have been proposed for representing language. The simplest approach is to simply one-hot encode all words in the used corpus. In this approach, every word is represented by a one-dimensional vector of size V, where V is the size of the vocabulary. For example, the sentence "I like trains" can be represented as [I = 100, Like = 010, Trains = 001]. While the implementation of this approach is very simple, it is rarely used for large training corpora. This is because the matrices become extremely sparse when V becomes large. The encodings also disregard any contextual and semantic information that the words might hold.

Term frequency-inverse document frequency (TF-IDF) is a more elegant solution for representing language. It accounts for the number of occurrences of certain words in a corpus. Some words such as 'the' and 'a' occur very frequently but are usually not very informative. In TF-IDF, commonly used words across all documents in the corpus receive lower weights, while words that frequently occur only in the current document receive higher weights. The TF-IDF for a word i in a document j is computed using equation 13. The first term is the term frequency of word i in document j. The second term is the log of the total number of documents N, divided by the number of documents df_i containing i. These two are multiplied to calculate the TF-IDF.

\mathrm{TF\text{-}IDF}(i, j) = tf_{i,j} \cdot \log\frac{N}{df_i}    (13)
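A minimal NumPy sketch of equation 13 on a toy corpus is given below; the documents are illustrative, and the term frequency is taken as a normalized count, which is one common variant.

```python
# Minimal sketch of TF-IDF (equation 13) on a toy corpus.
import numpy as np

corpus = [["the", "cat", "sat"],
          ["the", "dog", "sat"],
          ["the", "cat", "ran"]]
N = len(corpus)  # total number of documents

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)           # term frequency tf_{i,j} (normalized count)
    df = sum(1 for d in corpus if word in d)  # document frequency df_i
    return tf * np.log(N / df)

print(tf_idf("the", corpus[0]))   # 0.0: 'the' appears in every document
print(tf_idf("cat", corpus[0]))   # > 0: 'cat' is informative for this document
```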

While this approach creates a better representation than simple one-hot encoding, it does not consider contextual and semantic information. For this reason, several models based on the distributional hypothesis [70] have been proposed which have been very successful for NLP applications. The most basic interpretation of the distributional hypothesis is that words which are similar in meaning occur in similar contexts. The first word embedding model based on this hypothesis which used neural networks was Word2Vec [55]. Word2Vec is trained using a large corpus of words and turns this into a vector space.

The goal is to position words that appear in similar contexts close to each other in the vector space. The learned vector space has some interesting properties. A famous example from [56] shows that the vectors for "King - Man + Woman" result in a vector which is close to "Queen". This shows how the model is automatically able to learn male/female relationships.

One issue with Word2Vec is that it ignores the fact that some words co-occur more often than others for a certain context. In Word2Vec, this simply leads to more training examples for words that appear more often, which can lead to noisy representations. To address this issue, a method called Global Vectors for Word Representation (GloVe) [60] was proposed, in which the probability of words occurring in certain contexts is taken into account. As an example, the authors of [60] take the co-occurrence probabilities of the words ice and steam. Ice co-occurs more frequently with a word such as solid, while steam co-occurs more frequently with a word such as gas. Both ice and steam co-occur frequently with water and infrequently with fashion. Instead of the raw probabilities, such as P(k|ice) and P(k|steam), where k is a word such as solid, gas, water or fashion, the ratio P(k|ice)/P(k|steam) is taken. This cancels out noise from words such as water and fashion, and gives the model more power to discriminate between relevant words. In this thesis, pre-trained GloVe embeddings are used for all the textual baseline models.

While distributional representations such as Word2Vec and GloVe have been very successful, there is still one problem: the word embeddings are static, meaning they do not take the context into account for a new sentence. Consider the following two sentences: 'I like to lie on my bed' and 'I lie to myself all the time.' Obviously, the word 'lie' has a different meaning in these two sentences. However, models like Word2Vec and GloVe have only one embedding for the word, based on the context it appeared in most in the training data. For this reason, contextualized word embeddings were introduced, leading to some major advancements in the field of NLP. All the proposed models in this thesis use contextualized word embeddings. To understand how contextualized word embeddings work, two techniques are important: the encoder/decoder and attention.

2.3.2 Encoder/Decoder

Initially developed for machine translation tasks, the encoder/decoder model, as proposed in [77], is able to map variable-length sequences to other variable-length sequences. The idea is simple: the encoder takes a variable-length sequence as input and maps it to a fixed-length vector. The decoder then maps this vector to a variable-length target sequence. While conceptually simple, the innovation here is that the intermediate layer(s) between the encoder and decoder can work with a fixed-length input and output, while the input and output can be of variable lengths. The encoder and decoder can be any trainable network, though usually a form of RNN is used.

There is one issue with encoder-decoders, which relates to the previously mentioned issues of RNNs: the model has a hard time dealing with very long sequences. For a very long sequence, the encoder has to encode all relevant information into a single vector, from which the decoder is supposed to generate a meaningful output. This is a hard condition for the network, and intuitively, it does not make sense for very long sentences. Take for example a translation task. When translating a long sequence, the start of the input sequence is likely to correspond more to the start of the output sequence than the end. One proposed solution to this problem is attention.

2.3.3 Attention

First introduced in [8], attention mechanisms have been very important in both visual and textual tasks. As the name suggests, the goal of attention is for the model to focus on important parts of the input. From a human perspective, this makes sense: when looking at an image or text, humans focus on relevant parts. As mentioned before, attention mechanisms are used to overcome the limitations of the encoder/decoder model with regards to long inputs. The basic idea is to only use a part of the input to predict a given output.

Mathematically speaking, the attention mechanism, as proposed in [8], works as follows: for a target word at position i, the hidden state of the decoder is defined as a function of the last hidden state s_{i-1}, the target word y_{i-1} and a context vector c_i, in the form s_i = f(s_{i-1}, y_{i-1}, c_i). The context vector is what separates this from a normal RNN hidden state update. For each target word in the output, a context vector c_i is created. This context vector depends on a sequence of annotations (h_1, ..., h_{T_x}), in which every annotation h_j corresponds to a word in the input sequence. Each annotation is encoded with information of the entire input sequence, but h_j and words surrounding h_j are emphasized. An annotation h_j for word x_j is obtained by concatenating the forward and backward hidden states of the bi-directional RNN used in the paper. Thus, h_j can be seen as a summary of the entire input sequence. As noted before, RNNs tend to represent recent inputs better, something which is noted as one of their drawbacks. However, in this case this quality actually has the benefit of emphasizing the current word and its surroundings more than distant words. After the annotation vector is created, the context vector c_i is created, which is a sum of the hidden states h in the input sequence, multiplied by weights a_{ij} which define how well the input and output words being considered align. c_i is computed as shown in equation 14, and a_{ij} is computed for each annotation h_j according to equation 15.

c_i = \sum_{j=1}^{T_x} a_{ij} h_j    (14)

a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}    (15)

e_{ij} is defined as e_{ij} = a(s_{i-1}, h_j). It is the output of an alignment function a which describes the alignment between an encoder annotation h_j and the last decoder hidden state s_{i-1} in the form of a score. Multiple functions have been proposed to define e_{ij}. In the case of [8], a single-layer feed-forward neural network is used: a(s_{i-1}, h_j) is defined as v_a^T \tanh(W_a s_{i-1} + U_a h_j), where v_a is a learned scaling vector and W_a and U_a are the learned weight matrices corresponding to the layers in the used feed-forward neural network. This type of alignment function is also referred to as additive attention. In [53], three more types of alignment functions are proposed, namely location-based, general and dot-product attention. For this thesis, only the latter is relevant, which is defined as s_i^T h_i. To illustrate what attention looks like for a sentence, the annotation weights a_{ij} can be visualized to show the correlation between input and output words, as seen in figure 9.
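To make equations 14 and 15 concrete, the following is a minimal NumPy sketch that scores each annotation with the additive alignment function, normalizes the scores with a softmax and builds the context vector; all weights and dimensions are random, illustrative stand-ins.

```python
# Minimal sketch of additive attention: scores e_ij, weights a_ij (eq. 15), context c_i (eq. 14).
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T_x = 4, 6
h = rng.normal(size=(T_x, hidden_dim))        # encoder annotations h_1 ... h_Tx
s_prev = rng.normal(size=hidden_dim)          # previous decoder hidden state s_{i-1}

W_a = rng.normal(size=(hidden_dim, hidden_dim))
U_a = rng.normal(size=(hidden_dim, hidden_dim))
v_a = rng.normal(size=hidden_dim)

e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in h])  # alignment scores e_ij
a = np.exp(e) / np.exp(e).sum()                                       # weights a_ij, equation 15
c = (a[:, None] * h).sum(axis=0)                                      # context vector c_i, equation 14
```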

Figure 9: Annotation weights matrix for an English source and French target sentence. Each pixel shows the weight a_{ij} for a target word at position i and a source word at position j. Image from [8].

In addition to the attention as proposed in [8], a number of different attention mechanisms have been proposed. [85] proposed two variants, namely soft and hard attention. In their approach they use attention for image captioning, but the underlying mechanisms can be used for any task where attention is relevant.

Soft attention largely resembles the attention mechanism as proposed in [8], since the alignment weights are learned and then placed over the entire sequence. Hard attention works by randomly selecting a subset of the input to attend to at one time. In the case of [85], a patch of the input image is selected. This makes computation faster, but the trade-off is that the model is non-differentiable, making it harder to train. [53] proposed two other variants, namely global and local attention. Global attention resembles the attention as proposed in [8] and the soft attention as proposed in [85], but with a simpler architecture. The proposed local attention is a combination of soft and hard attention. The main issue that is being addressed in local attention is that hard attention is non-differentiable. Just like in hard attention, a subset of the input is selected. However, a window is taken around the selected input, making the model differentiable in the selected window. In practice, this makes the model differentiable almost everywhere.

[18] proposed a very important form of attention, namely self-attention, also known as intra-attention. Instead of relating an input to an output sequence, self-attention focuses only on a single sequence. Essentially, it describes the relations between the words in the sequence by relating each word in the sequence to the other words in the sequence. For example, take the sentence "The man drank coffee because he was tired". Self-attention allows an algorithm to associate "man" with "he", which is important information for understanding language. Mathematically speaking, self-attention can be described using the same score function as described for normal attention, except that the source and target sequence are the same.

2.3.4 Transformers

Combining the concept of the encoder-decoder and attention leads to a model architecture proposed in 2017: the Transformer [81]. Unlike most previous encoder-decoder based models, the Transformer does not use any kind of RNN or convolution, instead relying entirely on attention to compute representations for the input and output. The main issue that the Transformer addresses is the inability of RNN-based approaches to parallelize because of their sequential nature, leading to high computational complexity. A few other solutions to this problem have been proposed based on CNNs, namely ByteNet [44] and ConvS2S [31]. By replacing RNN layers with CNN layers, these approaches are able to parallelize, making them much more efficient. However, in these models the number of operations required to relate a position from the input and output grows with the distance between the positions. This means that it is difficult for these models to learn long-term dependencies when the positions in the input and output are distant. In the Transformer, the number of operations required is reduced to a constant. Furthermore, while both ConvS2S and ByteNet are substantially faster than RNNs, they are still relatively expensive when compared to the Transformer. Table 1 shows a comparison of the complexity per layer, minimum number of sequential operations and maximum path lengths for different layer types. Here, n is the sequence length (usually somewhere between 64 and 256), d is the representation dimension (usually somewhere between 128 and 512), k is the kernel size of convolutions for convolutional layers and r the size of the neighborhood in restricted self-attention layers. As can be seen, self-attention layers are much more efficient than convolutional layers.

Layer Type                    Complexity per Layer   Sequential Operations   Maximum Path Length
Self-Attention                O(n^2 · d)             O(1)                    O(1)
Recurrent                     O(n · d^2)             O(n)                    O(n)
Convolutional                 O(k · n · d^2)         O(1)                    O(log_k(n))
Self-Attention (restricted)   O(r · n · d)           O(1)                    O(n/r)

Table 1: Complexity per layer, minimum number of sequential operations and maximum path lengths for different layer types. Table retrieved from [81].

Extending upon the concept of self-attention, a new form of attention called multi-head attention is proposed. Firstly, the authors generalise the concept of the attention function by defining it as a function which maps a query and a set of key-value pairs to an output. These formulations are based on retrieval systems, for which the operations are comparable to the operations for computing attention. For example, when a search query is entered in Google, the system maps the query against a set of keys (titles of websites, descriptions of the term, etc.) and then presents the results in the form of a list of websites (the values). The formula used for computing attention then becomes the one described in equation 16.

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{16}
\]

This can be interpreted roughly the same as the formula described in equation 14, but for a single attention value instead of the complete context vector: the value V corresponds to h_j and softmax(QK^T / √d_k) corresponds to a_ij. Note that a different score function is used, namely the scaled dot-product, which is a variant of the normal dot-product score function. The authors argue that for larger values of d_k, the dot products grow very large, which in turn pushes the softmax function into regions where the gradients are extremely small. For this reason, the dot product is scaled by dividing it by √d_k.
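To make equation 16 concrete, the following is a minimal NumPy sketch of scaled dot-product attention (an illustration, not code used in this thesis); in practice Q, K and V are obtained by multiplying the token representations with learned projection matrices.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    # Alignment scores between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted average of the value vectors.
    return weights @ V

# Self-attention over a toy "sequence" of 4 tokens with d_k = d_v = 8:
# Q, K and V are all derived from the same sequence.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
output = scaled_dot_product_attention(X, X, X)
print(output.shape)  # (4, 8)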

The authors use scaled dot-product attention as the main component of multi-head attention. Multi-head attention essentially ensures that the model captures all information from the input and not just a single relation. This is done by using multiple sets of Q, K, V matrices. Each set is randomly initialized, and attention is computed separately for each head. The outputs of the heads are then concatenated, resulting in a much richer representation of the input sequence. A visualisation of scaled dot-product attention and multi-head attention is shown in figure 10. The authors show that the model outperforms other state-of-the-art models and is computationally less expensive. Since its invention, many models have been based on the Transformer.
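Multi-head attention is available as a building block in common deep learning libraries; the snippet below is a usage example with PyTorch's nn.MultiheadAttention (an assumption about tooling, not the implementation used in this thesis).

import torch
import torch.nn as nn

# A batch of 2 sequences, each with 16 tokens and a representation size of 512.
x = torch.randn(2, 16, 512)

# 8 heads, each internally attending over 512 / 8 = 64 dimensions.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# Self-attention: query, key and value all come from the same sequence.
output, attn_weights = mha(x, x, x)
print(output.shape)        # torch.Size([2, 16, 512])
print(attn_weights.shape)  # torch.Size([2, 16, 16]), averaged over the heads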

Figure 10: Scaled dot-product attention and multi-head attention. Figure from [81].

2.3.5 BERT

BERT (Bidirectional Encoder Representations from Transformers) [26] is a language model (LM) proposed by Google in 2018, based on the Transformer model. It is one of the biggest breakthroughs in NLP of the last decade. The goal of statistical language modelling is defined as follows in [13]: "to learn the joint probability function of sequences of words in a language". This is usually done by learning to predict the next word in a sentence. While models such as ELMo [61] and GPT [64] showed the benefits of pre-training a model with the language modelling task, they are both trained in a unidirectional manner. This is problematic for many tasks. For example, in question answering systems, it is very important to use both the context to the left and the context to the right of a word. However, it is not trivial to make these models bidirectional using the traditional language modelling technique: by applying bidirectional training to a traditional LM, lower layers would leak information to later layers, allowing words to see themselves in later layers. The main innovation of BERT is the use of a novel pre-training task, namely Masked Language Modelling (MLM), to achieve bidirectional training.

In MLM, instead of predicting the next token based on previous tokens, a subset of the tokens is masked (15% in the case of BERT). The masked tokens are replaced with a [MASK] token, and the goal is to predict these masked tokens. One downside of this approach is that the [MASK] token does not appear during fine-tuning. For this reason, only 80% of the selected subset is replaced with the [MASK] token, while 10% is replaced with a random token and 10% is left unchanged. It is shown that this strategy improves fine-tuning performance when compared to masking 100% of the subset. In addition to the MLM objective, BERT has another pre-training objective, namely Next Sentence Prediction (NSP). The motivation behind using the NSP task is that the relationship between sentences is not directly captured with MLM. However, it is important to capture this relationship for tasks such as question answering and natural language inference. The NSP task is implemented by randomly taking a pair of sentences A and B: in 50% of the cases, B follows A, and in the other cases, a random sentence from the training set is taken as B. The goal of the model is to predict whether B is the actual sentence following A.
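As an illustration of the 80/10/10 masking strategy described above, the following is a minimal sketch in Python (not the actual BERT pre-processing code); the token IDs, vocabulary size and mask ID are hypothetical placeholders.

import random

MASK_ID = 103        # hypothetical id of the [MASK] token
VOCAB_SIZE = 30522   # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (masked input, labels) following the 80/10/10 MLM strategy."""
    inputs, labels = list(token_ids), []
    for i, token in enumerate(token_ids):
        if random.random() < mask_prob:
            labels.append(token)            # the model must predict the original token
            r = random.random()
            if r < 0.8:                     # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                   # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: leave the token unchanged
        else:
            labels.append(-100)             # ignored position (a common convention)
    return inputs, labels

masked, labels = mask_tokens([2023, 2003, 2019, 2742, 6251, 102])
print(masked, labels)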

BERT has a specific input format and corresponding tokenization method. To ensure that rare words are captured and to reduce the size of the vocabulary, BERT uses the WordPiece tokenization method as proposed in [83]. With this method, words that do not appear in the vocabulary are split into smaller sub-words. For example, take the word "talk", which is fairly common and is likely in the vocabulary. If the word "talker" does not appear in the vocabulary but is encountered in the input, WordPiece tokenization splits it into the smallest recognizable "wordpieces", e.g. "_talk", "e", "r" (where "_" denotes the start of a word). This way, all variants of a word are recognized with a relatively small vocabulary. In an input sequence, the first token is always [CLS] (classification). This token captures the hidden state used to represent the sequence for classification tasks. Furthermore, a [SEP] (separator) token is placed between all sentences in a sequence.

The authors show that BERT is able to obtain state-of-the-art results on eleven NLP tasks. Since BERT can be used both for fine-tuning and for feature extraction, a comparison is shown for the two methods, where the embeddings obtained with feature extraction are used in a BiLSTM. The results show that the performance is comparable for these two methods, which is one of the reasons why feature extraction is used as the method in this thesis.

The base model that is proposed in the paper is called BERT-base, which uses 12 layers (or Transformer blocks), a hidden size of 768 and 12 self-attention heads. This is also the model that is used in this thesis. While the authors of [26] perform exhaustive experiments showing the strengths of BERT when compared to previous models, they do not perform any experiments showing which linguistic patterns BERT attends to. In [23], a method is proposed to analyze the attention mechanisms of BERT.
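Because this thesis uses BERT-base as a feature extractor rather than fine-tuning it, a minimal sketch of extracting contextual embeddings is shown below, assuming the HuggingFace transformers implementation of BERT-base (the thesis itself uses a different deep learning framework and pipeline).

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # feature extraction: BERT's weights are not updated

text = "Scholarly documents are often much longer than 512 tokens."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token from the final Transformer layer.
token_embeddings = outputs.last_hidden_state   # shape: (1, seq_len, 768)
# The hidden state of the [CLS] token is commonly used as a sequence representation.
cls_embedding = token_embeddings[:, 0, :]      # shape: (1, 768)
print(cls_embedding.shape)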

Interestingly, they find that each attention head specializes in capturing one form of syntax by looking at a certain pattern. In figure 11, some examples of these patterns are shown. Several heads attend to linguistic features such as "objects attending to their verb" and "determiners attending to their noun". These patterns provide one possible explanation as to why BERT is able to perform so well.

Figure 11: Examples of different attention heads in BERT. Figure taken from [23].

A number of improvements and adaptations have been proposed since the original publication of the BERT model. Some of these focus on reducing the number of parameters and thus the computational complexity of BERT, such as ALBERT [49] and DistilBERT [71], while others are pre-trained on specific datasets, such as BioBERT [51] (for biomedical language representation) and SciBERT [10] (for scientific documents). While the latter would be a promising model to experiment with for the task of scientific document quality prediction, its implementation in the deep learning framework used for this thesis is not straightforward, and experimentation with other BERT models is left for future research.

While BERT provides many benefits, the drawbacks of the Transformer architecture are also present in pre-trained BERT models. Most notable is the time complexity, which is quadratic with respect to the input length due to the self-attention layers. While this is not an issue for short documents, it is very problematic for long-document tasks such as scientific document processing. In this thesis, a chunking method is used to overcome this issue. However, recent work has shown promising solutions to the quadratic scaling issue of Transformers. The Reformer [47] is a model which drastically reduces the memory requirements of the Transformer's self-attention layers by replacing the global self-attention (which is the cause of the quadratic time complexity in the Transformer) with an approximation called locality-sensitive hashing (LSH) self-attention, as proposed in [4]. Effectively, this results in a time complexity of O(L log L) with respect to the input length, allowing the model to learn from much longer sequences.
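The chunking approach mentioned above can be sketched roughly as follows; the chunk length, the absence of overlap and the downstream pooling are illustrative assumptions rather than the exact procedure used in this thesis.

def chunk_token_ids(token_ids, chunk_len=510):
    """Split a long token sequence into consecutive chunks that, together with the
    [CLS] and [SEP] tokens, fit within BERT's 512-token limit."""
    return [token_ids[i:i + chunk_len] for i in range(0, len(token_ids), chunk_len)]

# Hypothetical usage: each chunk is embedded separately with a frozen BERT model,
# and the resulting chunk embeddings form the input sequence for a downstream
# model (e.g. a GRU) that produces a document-level prediction.
# chunks = chunk_token_ids(tokenizer.encode(document_text, add_special_tokens=False))
# chunk_embeddings = [embed_chunk_with_bert(c) for c in chunks]  # embed_chunk_with_bert is hypothetical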

The Longformer [11] is a more recent model which improves upon the Reformer's concept of reducing global attention. The authors propose a model whose attention scales linearly with respect to the input length. The main innovation is the use of a sliding window over the attention matrix, for which a custom CUDA kernel is implemented, since the operation relies on banded matrix multiplication, which is not available in current deep learning libraries.
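To illustrate why a sliding window reduces the cost, the sketch below builds a dense banded attention mask in NumPy; this is only a conceptual illustration, since the Longformer avoids materializing the full matrix by using its custom banded-multiplication kernel.

import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask in which position i may only attend to positions j with |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(seq_len=8, window=2)
print(mask.astype(int))
# Each row contains at most 2 * window + 1 ones, so the number of attended
# positions grows linearly with the sequence length instead of quadratically.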
