Academic year: 2021

Share "A comparison between contextual (BERT, XLNeT) and non-contextual embeddings (Word2Vec, Fasttext) for the prediction of Tesla?s stock price using tweets and Elon Musk statements"

Copied!
144
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst


A COMPARISON BETWEEN CONTEXTUAL (BERT, XLNET) AND NON-CONTEXTUAL EMBEDDINGS (WORD2VEC, FASTTEXT) FOR THE PREDICTION OF TESLA'S STOCK PRICE USING TWEETS AND ELON MUSK STATEMENTS

Word count: 24994

Jonas De Vos
Student number: 01502076

Tibo Dewispelaere
Student number: 01503851

Supervisor: Prof. Dr. Dirk Van den Poel

Master's Dissertation submitted to obtain the degree of:
Master in Business Engineering: Data Analytics


Confidentiality agreement

I declare that the content of this Master’s Dissertation may be consulted and/or reproduced, provided that the source is referenced.

Name student: De Vos, Jonas

Signature:

Name student: Dewispelaere, Tibo


Acknowledgement

Writing this master dissertation was both a challenging and educational experience. It allowed us to graduate as business engineers and wrap up this fascinating period. We want to thank Ghent University for the quality education it has provided us.

We would like to express our gratitude to our promotor, Dirk Van den Poel, professor at Ghent University, for his continued support throughout this thesis. The subject of this dissertation is largely inspired by his passion for state-of-the-art technologies. He pulled us out of our comfort zone to explore what the fast-moving, fascinating world of NLP has to offer. We would also like to thank Arno Liseune, doctoral student at Ghent University, for the interesting insights and helpful remarks.

A special thanks to Thomas Demeester, professor of natural language processing at Ghent University, for putting us on track. Also a special mention for Joni Dambre, professor of deep learning at Ghent University, who taught us the fundamentals of a field that will only grow in the future.

Additionally, a word of thanks to our parents for supporting us both emotionally and financially throughout the past years and the process of writing this dissertation.

Last but not least, a thank you to Elon Musk for tweeting so passionately, and sometimes stupidly.


Abstract

Literature concerning the prediction of stock prices using Twitter mostly relies on extracting the sentiment from tweets, which is then used as a feature in the prediction task. However, this intermediate step is both time consuming and expensive, as tweets have to be manually annotated. Over the last few years, NLP has made huge progress by leveraging transfer learning along with the ability to better represent words as vectors called embeddings. Two generations of embeddings can be identified, namely non-contextual and contextual embeddings. Especially contextual embeddings generated by BERT or XLNeT are achieving state-of-the-art (SOTA) results on many NLP applications. In this light, an approach is studied which directly links these embeddings to the stock prices instead of first extracting the sentiment. Moreover, the different embedding generations are tested along with their ability to leverage transfer learning.

Results show that contextual embeddings (BERT and XLNeT) do not outperform the non-contextual ones (Word2Vec and fastText), as the structure of tweets does not fit the most distinctive features of contextual language models. The most striking observation is that leveraging transfer learning leads to significantly worse results than training on one's own corpus, provided that this corpus is sufficiently large and that the corpus from which the pre-trained embedding was derived does not sufficiently resemble the one at hand. Lastly, the traditional sentiment approach is significantly better than the direct embedding approach. However, if one has neither the time nor the resources to manually annotate, the direct embedding approach will yield decent results as well.


Contents

1 Introduction . . . 1
2 Literature . . . 5
2.1 Text representation . . . 5
2.1.1 One-Hot Encoding . . . 6
2.1.2 Bag-of-Words . . . 7
2.1.3 N-grams . . . 7
2.1.4 TF-IDF . . . 7
2.1.5 Word Embeddings . . . 8
2.2 Language models . . . 9
2.2.1 Evolution of embedding models and how they led to BERT . . . 11
2.2.2 BERT . . . 17
2.2.3 XLNeT . . . 25
2.3 Pre-processing . . . 27
2.3.1 Problem statement . . . 27
2.3.2 Pre-processing literature . . . 28
2.3.3 Reflection . . . 29
2.4 Sentiment analysis . . . 29
2.4.1 Unsupervised sentiment analysis . . . 29
2.4.2 Supervised sentiment analysis . . . 30
2.4.3 Reflection . . . 30
2.5 Forecast horizon . . . 30
2.5.1 Problem statement . . . 30
2.5.2 Forecast horizon literature . . . 32
2.6.2 Multilayer Perceptron (MLP) . . . 34
2.6.3 Convolutional Neural Networks (CNNs) . . . 36
2.6.4 Long Short-Term Memory (LSTM) . . . 37
2.6.5 CNN-LSTM . . . 38
3 Approach . . . 39
3.1 Data gathering . . . 41
3.1.1 Stock data . . . 41
3.1.2 Twitter data . . . 44
3.1.3 Data matching . . . 45
3.2 Training, validation and test set . . . 46
3.2.1 Temporal dependency . . . 46
3.2.2 Nested cross-validation . . . 46
3.3 Model evaluation . . . 48
3.3.1 Metrics . . . 48
3.3.2 Model comparison . . . 50
3.4 Set-up . . . 52
3.4.1 Forecast horizon . . . 52
3.4.2 Time window . . . 54
3.4.3 Intervals . . . 55
3.5 Tweet embeddings . . . 61
3.5.1 Overview . . . 61
3.5.2 Tweet pre-processing . . . 62
3.5.3 Embeddings . . . 65
3.6 Sentiment model . . . 71
3.6.1 Sentiment prediction . . . 71
3.6.2 Sentiment as feature . . . 80
3.7 Base model . . . 82
3.8 Combined models . . . 83
3.8.1 Single models . . . 83
3.8.2 Multi-input model . . . 83
3.9.2 Base model . . . 86
3.9.3 Sentiment model . . . 86
3.9.4 Combined models . . . 87
4 Results . . . 89
4.1 Base model . . . 89
4.2 Embedding models . . . 91
4.2.1 General performance . . . 91
4.2.2 Model comparison . . . 92
4.3 Sentiment models . . . 98
4.3.1 General performance . . . 98
4.3.2 Model comparison . . . 98
4.4 Model comparison . . . 100
4.4.1 Base & embedding models comparison . . . 100
4.4.2 Base & sentiment models comparison . . . 100
4.4.3 Embedding & sentiment models comparison . . . 101
4.5 Model combinations . . . 104
5 Conclusion . . . 106
5.1 Hypotheses and findings . . . 106
5.2 Feedback to literature . . . 108
6 Limitations and future research . . . 109
6.1 Limitations . . . 109
6.1.1 Practical limitations . . . 109
6.1.2 COVID19 . . . 110
6.2 Future research . . . 110
Appendix . . . 121
6.3 Intervals . . . 121
6.4 Results . . . 125
6.4.1 Embedding models . . . 125


List of Figures

1.1 Structure dissertation . . . 4
2.1 Archaic Representations . . . 6
2.2 A neural language model [Bengio et al., 2003] . . . 9
2.3 The general architectures of CBOW and Skip-gram. Source: [Suleiman et al., 2017] . . . 12
2.4 The 3 stages of ULMFiT [Howard and Ruder, 2018] . . . 15
2.5 Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses a concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the 3, only BERT representations are jointly conditioned on both left and right context in all layers . . . 16
2.6 One encoder block [Vaswani et al., 2017] . . . 18
2.7 Multi-head scaled dot-product attention mechanism [Vaswani et al., 2017] . . . 19
2.8 [CLS]-token usage . . . 21
2.9 CoNLL-2003 Named Entity Recognition results. Hyperparameters were selected using the Dev set. The reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters [Devlin et al., 2018] . . . 22
2.10 Bias in training bidirectionally in traditional language modeling [Devlin et al., 2018] . . . 23
2.11 Next sentence prediction task . . . 24
2.12 Fine-tuning BERT [Devlin et al., 2018] . . . 24
2.13 The General Language Understanding Evaluation benchmark (GLUE) results. It is a collection of datasets used for analyzing NLP models relative to one another. The collection consists of 8 "difficult and diverse" tasks designed to test a model's language understanding [Devlin et al., 2018] . . . 25
2.14 Set-up . . . 32


2.17 CNN filter . . . 37

2.18 LSTM architecture . . . 38

3.1 Overview approach . . . 40

3.2 Tesla’s trading volumes during a typical trading day . . . 41

3.3 Tesla’s stock price during modeled period . . . 42

3.4 1 hour stock returns over time . . . 43

3.5 1 hour stock return histogram . . . 44

3.6 Available data . . . 45

3.7 Validation set approach . . . 46

3.8 Time series nested cross-validation . . . 48

3.9 Example best epoch: Trained W2V Fold 1 Run 2 . . . 49

3.10 Example Intervals . . . 56

3.11 1 Minute Histogram . . . 57

3.12 1 Minute Interval . . . 58

3.13 5 Minute Histogram . . . 58

3.14 5 Minute Interval . . . 59

3.15 Number of words in a tweet . . . 67

3.16 Number of BERT-tokens in a tweet . . . 68

3.17 Numerical input BERT for feature extraction. . . 68

3.18 Standard vocabulary vs augmented vocabulary . . . 69

3.19 A pre-training instance in the Tesla corpus. . . 70

3.20 Test results pre-training stage. . . 71

3.21 Validation loss per epoch per fold. . . 73

3.22 Class distribution labeled tweets + financial phrasebank . . . 74

3.23 Class distribution labeled tweets only . . . 74

3.24 Confusion matrix: Worst run fastText Fold 1 Run 2 . . . 76

3.25 Confusion matrix: Best run fastText Fold 3 Run 3 . . . 76

3.26 Confusion matrix: Worst run BERT Fold 1 Run 1 . . . 77

3.27 Confusion matrix: Best run BERT Fold 3 Run 1 . . . 78


3.31 Sentiment over time generated by BERT . . . 80

3.32 Example: sentiment as feature . . . 81

3.33 Example: open price as feature . . . 82

3.34 Example: all historical price features . . . 82

3.35 1D-CNN handling textual data. Source: [Kim, 2014](edited) . . . 84

3.36 Architecture of pre-trained BERT fold 3 . . . 86

3.37 Architecture of fastText sentiment fold 3 . . . 87

3.38 Architecture of multi-input model . . . 88

4.1 Confusion matrix: Worst run base model Fold 3 Run 3 . . . 90

4.2 Confusion matrix: Best run base model Fold 3 Run 1 . . . 90

4.3 Embedding models: AUC boxplot . . . 93

4.4 Test results pre-training stage from scratch . . . 95

4.5 Confusion matrix: BERT from Scratch Fold 2 run 3 . . . 96

4.6 Confusion matrix: BERT from Scratch Fold 3 run 1 . . . 96

4.7 Sentiment models: AUC boxplot . . . 100

4.8 Confusion matrix: Worst run fastText sentiment Fold 1 Run 1 . . . 103

4.9 Confusion matrix: Best run fastText sentiment Fold 2 Run 2 . . . 103

4.10 Confusion matrix: Worst run trained fastText Fold 3 Run 3 . . . 103

4.11 Confusion matrix: Best run trained fastText Fold 1 Run 3 . . . 103

6.1 1 Hour Interval . . . 121
6.2 1 Hour Histogram . . . 122
6.3 30 Minute Interval . . . 122
6.4 30 Minute Histogram . . . 123
6.5 10 Minute Interval . . . 124
6.6 10 Minute Histogram . . . 124


List of Tables

3.1 Used metrics . . . 50

3.2 Hypothesis testing . . . 51

3.3 Labels over a 1-day period . . . 53

3.4 Available labels . . . 54

3.5 Results comparison time windows . . . 55

3.6 Summary statistics: intervals . . . 57

3.7 Results comparison intervals . . . 59

3.8 Overview compared embedding models and their embeddings . . . 61

3.9 Abbreviations embedding models . . . 63

3.10 Specific pre-process steps . . . 64

3.11 BERT’s batch size in function of the sequence length on a Titan X GPU. . . 67

3.12 Sentiment accuracy results: fastText with and without financial phrasebank . . . 75

3.13 Sentiment accuracy results: BERT . . . 77

3.14 Hyper-parameters values . . . 85

4.1 Results base model . . . 91

4.2 Embedding models: performance per fold . . . 92

4.3 Embedding models: performance metric averaged per model . . . 92

4.4 Performance FinTweetBERT . . . 96

4.5 Average recall & specificity: non-contextual vs contextual models . . . 97

4.6 Average recall & specificity: trained models vs pre-trained models . . . 97

4.7 Sentiment models: performance per fold . . . 98

4.8 Sentiment models: performance metric averaged per model . . . 99

4.9 Average AUC per model and fold (sentiment models) . . . 99


4.11 fastText sentiment & trained fastText comparison . . . 102

4.12 Nemenyi post-hoc test: p-values . . . 102

4.13 Results model combinations . . . 104

6.1 Results: Pre-trained W2V . . . 125

6.2 Results: Trained W2V . . . 125

6.3 Results: Trained W2V 5 minute intervals . . . 125

6.4 Results: Trained W2V 3 hour time window . . . 126

6.5 Results: Pre-trained fastText . . . 126

6.6 Results: Trained fastText . . . 126

6.7 Results: BERT Vanilla Base . . . 127

6.8 Results: BERT Vanilla Large . . . 127

6.9 Results: BERT Pre-trained Base . . . 127

6.10 Results: XLNeT . . . 128

6.11 Results: BERT Scratch . . . 128

6.12 Results: Base model . . . 128

6.13 Results: fastText Sentiment model . . . 129

6.14 Results: BERT Sentiment model . . . 129

6.15 Results: Base-embedding model . . . 129

6.16 Results: Sentiment-base-embedding model . . . 130

6.17 Results: Sentiment-base model . . . 130

6.18 Results: Sentiment-embedding model . . . 130


Chapter 1

Introduction

Research conducted by Beobank in 2017 shows that 3 out of 5 Belgians do not possess any form of investment vehicle [Web2]. Year after year, the total amount of money on savings accounts grows, even though the interest rate is zero. Investing, on the other hand, is a way to bypass the inflation trap. However, Belgians in particular consider this to be very risky and, if exercised by non-experts, doomed to fail. What most people do not know, however, is that only 10 per cent of equity trading (in the US) is done by traditional traders, i.e. operating based on gut feeling and experience, according to a news report from JPMorgan [Web5]. The other 90 per cent is done by quantitative investing based on computer algorithms. For instance, the investment company Renaissance Technologies, owned by one of the world's most famous investors, Jim Simons, got a return of 1.2 billion dollars on a single trade which was recommended by one of its models [Web18]. The hedge fund raised its stake to 3.9 million shares before the stock price doubled in just two months. The company at hand was Tesla, managed by the famous Elon Musk. In 2019, its stock price tripled. Clearly, there must be some predictability in the sudden surge of Tesla, as demonstrated by Renaissance Technologies, making it a really interesting company to investigate and the main subject of this dissertation.

Thanks to the growing computing possibilities, it is possible to use all sorts of information as input into a model: historical prices, quarterly results, financial statements, financial texts, etc. Investigating all these sources would lead to a dissertation too large in scope, so a source has to be picked which is both informative and interesting. One promising source is the study of tweets. In an ideal, transparent world, Twitter should incorporate all the news about a specific company along with potential buyers' or stockholders' thoughts and moods. Moreover, all this information is present on a single page, making it a very attractive source of information that could yield those extra returns on the stock market. Hence, tweets concerning Tesla are used along with some scraped tweets of Elon Musk, which makes it even more interesting as Musk has quite a reputation on Twitter. Evidently, stock price prediction using tweets has been the subject of research a few times already. The most famous research was done by [Bollen et al., 2011], who predicted the stock price using moods of the tweeting public. He was one of the first to extract the sentiment of tweets and link it as a feature to the stock price, which is now the conventional way to predict stock prices using tweets.

To be able to assign a sentiment score to each tweet, a thorough understanding of language is required. This is covered in the field of natural language processing (NLP). Not too long ago, NLP was considered too flawed and was hence ignored by many machine learning researchers [Web10]. Luckily, over the past years NLP has made enormous progress through the creation of embeddings, language modeling and the adoption of transfer learning. Moreover, TechCrunch, a leading technology news outlet, claims that NLP has made more progress in the past three years than any other field in machine learning [Web10]. It began in 2013 with the creation of word embeddings generated by Word2Vec, followed by Facebook's fastText in 2016. These embedding models are non-contextual, i.e. they are unable to take context into account for a given word in a sentence; each word has exactly one word embedding no matter the context. This changed when ELMo was proposed in 2018, as this model was able to generate different word embeddings for the same words in different contexts. What followed was a flood of models coming out every month: BERT, GPT-2, TransformerXL, ALBERT, XLNet, DistilBERT and others. These models elevated the state of the art, one release at a time. Big companies also tapped into this field, for instance Microsoft, which introduced a model of 17 billion parameters in February 2020, named Turing-NLG.

The CEO of one of the largest companies in the world publicly tweeting about NLP created even more awareness. NLP, a field that many considered obscure a few years ago, was finally in the spotlight as a technology that companies could leverage. In the literature review, a large part is spent explaining the advancements NLP made throughout the years. 4 models play a major role in this dissertation, namely Word2Vec, fastText, BERT and XLNeT. The transformation NLP went through would not have been possible without deep learning, which is also becoming better and more user-friendly every year. One big advantage of deep neural networks is their ability to handle high-dimensional data, making them an adequate method to use, as embeddings also have this attribute.

Because of the advancements in both NLP and deep learning, a new approach is proposed in this dissertation, referred to as the 'direct embedding approach' or 'embedding model': the embeddings are used directly as features for stock price prediction, without sentiment as an intermediary. This removes a cumbersome step, since sentiment prediction for stock price prediction usually requires manually labeling the corpus. Furthermore, the annotator should have a background in finance, as the tweets can get technical. Hence, this step is very expensive and time consuming. As far as known, this comparison has not been the subject of research yet, especially since contextual embeddings are included in the research as well. Afterwards, the approach is compared to the more 'traditional sentiment approach' or 'sentiment model'.

In the literature, it has been shown plenty of times that contextual embeddings outperform their non-contextual counterparts on many NLP tasks, including sentiment prediction. Therefore, a second part of this dissertation investigates the performance differences between non-contextual and contextual embeddings on the stock price prediction task. The question is whether the conclusions drawn in NLP can be extrapolated to the field of stock price prediction, two completely different tasks. In total, 8 embeddings are compared, derived from Word2Vec, fastText, BERT and XLNeT. Some of these embeddings are derived from the same embedding generating model; however, a distinction is made between pre-trained embeddings that benefit from transfer learning and embeddings that were generated by training on the $TSLA Twitter corpus. Transfer learning applied to models means that they have been pre-trained on a large corpus not equal to the one used for the task at hand. Finally, embeddings and sentiment are combined as features, to evaluate whether they can complement each other and hence increase the performance of stock price prediction.

To summarize, the following hypotheses are evaluated:

• Sentiment analysis as an intermediary step for stock price prediction can be discarded; instead, embeddings can be used directly as features.

• Contextual embeddings outperform non-contextual embeddings when using them directly as features in stock price prediction.


• Embeddings generated by models using transfer learning outperform embeddings generated by training on the Twitter corpus when using them directly as features in stock price prediction.

• Combining sentiment and embeddings as features for stock price prediction outperforms the models using only 1 of the 2.

This master dissertation starts off with a literature review (chapter 2), laying the foundation for the approach chosen and discussed in the next part (chapter 3). The results are discussed in chapter 4. The conclusion is drawn in chapter 5, accompanied by a small reference back to the literature review to put things into perspective. Finally, this dissertation is concluded in chapter 6, where the encountered limitations and suggestions for future research are given. A visual overview is depicted in figure 1.1.

Chapter 2

Literature

In this chapter, a literature overview is given, laying the foundation for the approach followed in the remainder of this dissertation. This overview starts with a brief discussion of the different ways to numerically represent a piece of text (section 2.1). One such way is by means of word embeddings; these embeddings are explored much more thoroughly in the next part, starting with the traditional non-contextual embeddings like W2V and fastText, and how they eventually evolved into contextual embeddings such as BERT and XLNeT (section 2.2).

This is followed by a section on how to effectively pre-process the tweets before turning them into embeddings (section 2.3). Afterwards, sentiment analysis is discussed, as it is important to execute the traditional sentiment approach correctly in order to make a fair comparison with the direct embedding approach (section 2.4). Section 2.5 is less focused on the tweets and considers the ideal forecast horizon for the Tesla stock price prediction task. Given all the knowledge gathered in the previous sections, the final section discusses the machine learning models used for this problem (section 2.6).

2.1 Text representation

In 1950, Alan Turing developed a test to determine whether a machine can demonstrate human intelligence. The test states that a human evaluator has to judge natural language conversations between a human and a machine which is designed to generate human-like responses. If the evaluator cannot reliably distinguish the machine from the human, the machine is said to have passed the test. Hence, the machine must be able to comprehend and construct sentences as if it were a normal human being. All this is covered in the field of Natural Language Processing (NLP). 70 years later, no algorithm has been able to pass the test yet, and it will probably still take some time until one does, but NLP has come a long way since [Web7]. Text representations in the NLP domain have progressed from sparse vectors of one-hot encoded words to condensed representations with the ability to link words hundreds of positions away.

In the following, these text representations are briefly discussed along with their pros and cons. This section solely describes how text is fed into the algorithm (i.e. what goes into the model). How one arrives at the text representation is covered in the next section.

A distinction is made between archaic methods and word embeddings. Archaic methods, in contrast to word embeddings, do not have any notion of either syntactic or semantic similarity between words and/or sentences. The reader should not conclude that archaic methods must be avoided at all times; they are still relevant in certain contexts. However, in the light of this dissertation, the word embedding approach yields better results.

Figure 2.1: Archaic Representations

2.1.1 One-Hot Encoding

One-hot encoding is a term borrowed from digital electronics that defines a binary method for mapping categorical data or words into numeric vectors for feasible use as input by mathematical algorithms and machine learning methods [Web3]. It is one of the easiest ways to convert text data into operable input for NLP applications. A word can be represented by simply setting its vector position to 1, while keeping the other positions marked 0: the word 'rains' in the sentence 'It rains today' is represented as the one-hot encoded vector [0,1,0]. The reader can imagine the inefficiency of this method: the Oxford English Dictionary contains 171,476 words, and in modern programming languages a digit is represented as one byte. Assuming the Twitter data set used contains 1 million tweets with an average of 7 words, this results in a storage need of 1.2 terabytes just to store this input.
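As a small illustration, the snippet below is a minimal sketch (no libraries assumed) of one-hot encoding the example sentence against its own 3-word vocabulary.

```python
# Minimal sketch of one-hot encoding the example sentence "It rains today"
# against its own 3-word vocabulary, as described above.
vocabulary = ["it", "rains", "today"]

def one_hot(word):
    return [1 if w == word.lower() else 0 for w in vocabulary]

print(one_hot("rains"))                                # [0, 1, 0]
print([one_hot(w) for w in "It rains today".split()])  # full sentence
```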

2.1.2 Bag-of-Words

Just 4 years after Turing contemplated his famous test, [Harris, 1954] introduced the well-known Bag-of-Words model. It consists of 2 parts: first of all, a vocabulary of all known words in the corpus, and secondly a measure of the presence of known words in each document. E.g. each tweet (document) in a corpus of 500 tweets with 10,000 different words is represented as a vector with a dimension of 10,000. Each column then represents the number of occurrences of a particular word within the document. The BoW-approach also learns from similar words occurring across documents. Since Twitter is a microblogging platform and the algorithm cannot recognize synonyms, the probability of having important words in common declines. Additionally, all words occurring in the text are in 'the bag'. Hence, the resulting matrix is large and sparse, leading to an increase in computing time.
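A short sketch of this representation, assuming scikit-learn is available; the tweets below are invented examples.

```python
# Bag-of-Words representation with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["Tesla stock is surging today",
          "Elon Musk tweeted about Tesla again"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(tweets)        # sparse document-term matrix
print(vectorizer.get_feature_names_out())     # the vocabulary ("the bag")
print(bow.toarray())                          # word counts per document
```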

2.1.3 N-grams

An N-gram model is a Bag-of-Words model that takes order into account. N-grams are all combinations of adjacent words or letters of length N that can be found in the source text. For example, the sentence 'It rains today' can be represented by 2 bigrams: [It, rains] and [rains, today]. The longer the N-gram, the more context there is to work with. An optimal N does not exist: it depends on the application it is used for. Apart from the order, the same drawbacks are present as in the BoW-approach [Web17].
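The same CountVectorizer used above can produce this representation; a minimal sketch for the bigram (N=2) example:

```python
# Bigram (N=2) representation of the example sentence.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(2, 2))
vectorizer.fit(["It rains today"])
print(vectorizer.get_feature_names_out())   # ['it rains' 'rains today']
```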

2.1.4 TF-IDF

The term weighting function known as IDF was proposed by [Jones, 1972], and has since been extremely widely used, usually as part of a TF-IDF function. Term frequency-inverse document frequency (TF-IDF) is a statistical measure used to evaluate how important a word is to a document in a corpus. The weight can be calculated by multiplying the term frequency with the inverse document frequency.

$$TF_{i,j} = \frac{n_{i,j}}{\sum_j n_{i,j}} \qquad (2.1)$$
$$IDF_i = \log \frac{N}{df_i} \qquad (2.2)$$


Where $TF_{i,j}$ is the number of times the word i occurs in document j, normalized by the total number of words in that document, and $IDF_i$ indicates the logarithmically scaled inverse fraction of the documents that contain the word (N is the total number of documents and $df_i$ the number of documents containing word i). Despite having the same flaws as the Bag-of-Words model, it has a certain property the BoW-approach lacks. The inverse document frequency penalizes frequent words that occur a lot across the entire text. Hence, less frequent words containing more informative content are given more weight and certain stop words are discarded.
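A small sketch computing these weights exactly as in equations 2.1 and 2.2 on an invented toy corpus; library implementations such as scikit-learn's TfidfVectorizer use slightly smoothed variants.

```python
# TF-IDF computed directly from equations 2.1 and 2.2.
import math
from collections import Counter

docs = [["tesla", "stock", "is", "surging"],
        ["tesla", "earnings", "call", "is", "today"],
        ["the", "weather", "is", "nice"]]

N = len(docs)
df = Counter(word for doc in docs for word in set(doc))   # document frequency

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)    # equation 2.1
    idf = math.log(N / df[word])       # equation 2.2
    return tf * idf

print(tf_idf("tesla", docs[0]))   # informative word -> positive weight
print(tf_idf("is", docs[0]))      # occurs in every document -> weight 0
```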

2.1.5 Word Embeddings

Word embeddings were introduced by [Bengio et al., 2003]. This kind of word representation relies on the idea of distributed representations for symbols by [Rumelhart et al., 1986]. Figure 2.2 shows how one obtains word embeddings. The model takes words from a vocabulary as input and embeds them as vectors into a lower-dimensional space, which it then fine-tunes through back-propagation [Web23]. After this process, the word embeddings can be obtained by extracting the weights of the hidden layer. Word embeddings introduce a more concise and efficient way of representing words which captures a large number of precise syntactic and semantic word relationships [Mikolov et al., 2013] as they, in contrast to the archaic text representations, pass through a neural network and learn about the adjacent words. That way, embeddings of words or sentences that have a similar meaning also have a high cosine similarity, since embeddings are nothing more than a vector representation.
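A minimal sketch of that similarity measure; the three-dimensional vectors below are made up purely for illustration, while real embeddings have hundreds of dimensions.

```python
# Cosine similarity between (hypothetical) word embeddings.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

buy  = np.array([0.9, 0.1, 0.3])   # invented embedding of "buy"
sell = np.array([0.8, 0.2, 0.4])   # invented embedding of "sell"
moon = np.array([0.1, 0.9, 0.2])   # invented embedding of "moon"

print(cosine_similarity(buy, sell))  # related words -> high similarity
print(cosine_similarity(buy, moon))  # unrelated words -> lower similarity
```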

(Static) word embeddings generated by Word2Vec or fastText were for many years the go-to methods to represent text in any machine learning pipeline. However, these word embeddings are static: each word has a single vector, regardless of context [Mikolov et al., 2013, Pennington et al., 2014]. This poses several problems, most notably that all senses of a polysemous word have to share the same representation: the word 'apple' has the same static embedding in the sentences "I want to eat an apple." and "I have an Apple iPhone.". More recent work, namely language models such as ELMo [Peters et al., 2018] and BERT [Devlin et al., 2018], has successfully created contextualized word representations, word vectors that are sensitive to the context in which they appear. Replacing static embeddings with contextualized representations has yielded significant improvements on a diverse array of NLP tasks [Ethayarajh et al., 2019].


Figure 2.2: A neural language model [Bengio et al., 2003]

2.2 Language models

It should be clear from the previous section that word embeddings yield the best numerical representation of the Tesla Twitter corpus. Since the goal is to explore a new approach using the tweets as embeddings, it is essential to understand the models that generate those embeddings.

With these models, one has the option to train from scratch (i.e. the model has no initial knowledge) or to use a pre-trained version of the model (it is also possible to further pre-train a model, hence using already acquired knowledge before training starts). The power of the latter option was first demonstrated by [Collobert and Weston, 2008]. Now, pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch [Turian et al., 2010]. Pre-training models can be seen as a form of transfer learning. Transfer learning focuses on storing knowledge gained while solving one problem and applying it to a related problem [West et al., 2007]. Although widely used in the field of computer vision, e.g. ImageNet [Deng et al., 2009], transfer learning in the field of NLP was not that common. Luckily, the success of transfer learning in the computer vision discipline induced research by NLP practitioners into the possibilities, making it widely used today. Furthermore, what made transfer learning in NLP even more popular was language modeling. Language modeling can be seen as the ideal source task and a counterpart of ImageNet [Deng et al., 2009] for NLP: it captures many facets of a language relevant for downstream tasks, such as long-term dependencies [Linzen et al., 2016], hierarchical relations [Gulordava et al., 2018], and sentiment [Radford et al., 2017].

In language modeling, one calculates the joint probability distribution over word sequences [Web24]. E.g. given the sequence of tokens "Tesla, stock, price, is, surging", the language model calculates the probability $Pr(\text{surging} \mid \text{Tesla, stock, price, is})$ that the word 'surging' is the following word. This is a semi-supervised task, as the labels are generated from the corpus itself. The possibilities are therefore endless considering the vast amount of text data present on the internet. That is why the combination of transfer learning and language modeling works so well: a model can gain lots of general language knowledge by pre-training on a corpus that requires no expensive annotation. Typically, large companies like Google or Microsoft do the expensive pre-training part (weeks of training) as they have the most resources.
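As a toy illustration of that objective, the sketch below estimates such a probability from raw counts on an invented corpus, using only the directly preceding word (a bigram approximation of the full history); real language models estimate it with neural networks.

```python
# Toy count-based estimate of P(next word | previous word).
from collections import Counter, defaultdict

corpus = ["tesla stock price is surging",
          "tesla stock price is falling",
          "tesla stock price is surging again"]

counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def prob(next_word, prev_word):
    total = sum(counts[prev_word].values())
    return counts[prev_word][next_word] / total if total else 0.0

print(prob("surging", "is"))   # 2/3 in this toy corpus
```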

Variants of the language modeling approach are widely used today. BERT uses masked language modeling, XLNeT uses permutation language modeling. Word2Vec and fastText, however, use CBOW and Skip-gram. Especially CBOW looks a lot like a language model. In fact, there are different opinions on whether Word2Vec and fastText are language models. Some say they are, others say they are not, as traditional language modeling only takes previous words into account. Another argument is that CBOW and Skip-gram do not learn the order of words, while language models do. On the other hand, these 2 arguments also contradict BERT's and XLNet's training algorithms, yet they are still called language models (albeit variants). To facilitate reading, and as there is no real consensus in the literature, the W2V and fastText models are also called language models here.

This section provides the reader with a clear and non-technical overview of the latest advancements in NLP research. Only highly necessary information, which is vital to understand this dissertation, is discussed. The models are discussed in chronological order. A common thread in this section is how each model builds on the flaws of earlier released models to further strengthen the learning signal. Keep in mind that the NLP field is much broader than this and can get a lot more technical in terms of mathematics.

Firstly, some important models are examined along with their most significant contributions to the literature. This is followed by a thorough study of the famous BERT model, ending with one of the most state-of-the-art models: XLNeT.

2.2.1 Evolution of embedding models and how they led to BERT

BERT plays an important role in this dissertation. It is therefore essential to understand how the model works and, equally important, 2 questions must be answered: why does the model behave the way it does and why does it have these particular properties? This section provides an answer to these 2 questions by elaborating on previous embedding models which had a significant influence on BERT.

2.2.1.1 Word2Vec

The term word embeddings was originally coined by [Bengio et al., 2003]. However, it was [Mikolov et al., 2013] who really brought word embeddings to the fore through the creation of Word2Vec, a toolkit enabling the training and use of pre-trained embeddings.

There are 2 types of Word2Vec models, namely Skip-gram and Continuous Bag-of-Words. These algorithms learn word representations that maximize the probability of a word given other contextual words (CBOW), or of a word occurring in the context of a target word (Skip-gram). The Continuous Bag-of-Words model is trained to predict a target word given the nearby words. For instance, given the phrase "The sun shines today", if one feeds a CBOW model the list of words ["The", "shines", "today"], one would expect the word "sun" as its output. The same training is performed using all other sentences of the input corpus. In general, given a context, a predicted word $\hat{w}_t$ is inferred as:

$$\hat{w}_t = P(w_t \mid \{w_{t-\epsilon}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+\epsilon}\}) \qquad (2.3)$$

Where $\{\epsilon \geq t \geq -\epsilon;\ t \neq 0\}$, in which t is the word index and $\epsilon$ represents the half window size (how many words before and after the word $w_t$ are used to feed the model). In the Skip-gram,

the objective is the reverse. A word serves as input to the model that tries to infer its “context”, that is, the words that are commonly found together. Consider again the phrase “The sun

shines today", if one feeds a Skip-gram model the word "sun", the words ["The", "shines", "today"] can be expected as its output. More formally, given a word $w_t$, a predicted context $\hat{C}$ is inferred as:

$$\hat{C} = P(\{w_{t-\epsilon}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+\epsilon}\} \mid w_t) \qquad (2.4)$$

Where again $\{\epsilon \geq t \geq -\epsilon;\ t \neq 0\}$, in which t is the word index and $\epsilon$ represents the half window size (how many words before and after the word $w_t$ are used to feed the model).

Figure 2.3: The General architectures of CBOW and Skip-gram. Source:[Suleiman et al., 2017]

[Mikolov et al., 2013] state that Skip-gram works well with a small amount of training data, while CBOW is several times faster to train than Skip-gram and attains slightly better accuracy. Since the Tesla corpus is large enough, the CBOW algorithm is chosen.

As mentioned in section 2.1.5, [Mikolov et al., 2013] found that semantic and syntactic patterns can be reproduced using vector arithmetic. For instance, taking the algebraic sum of the vector ("Belgium") and the vector ("capital") produces a result which is closest to the vector ("Brussels") in the model. Syntactic relations are also present as, for example, vectors of different tenses of a verb are mapped close to each other in vector space, i.e. they have a high cosine similarity.
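A minimal gensim sketch of both points, assuming gensim 4.x is installed; `tweets` is a hypothetical, pre-tokenized corpus standing in for the Tesla tweets.

```python
# Training CBOW embeddings (sg=0) and probing vector arithmetic with gensim.
from gensim.models import Word2Vec

tweets = [["tesla", "stock", "is", "surging"],
          ["elon", "musk", "tweeted", "about", "tesla"]]

model = Word2Vec(tweets, vector_size=100, window=5, sg=0, min_count=1)
vector = model.wv["tesla"]          # one static vector per in-vocabulary word

# With a sufficiently large corpus, analogies can be probed like this:
# model.wv.most_similar(positive=["belgium", "capital"], topn=1)
```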

Although being superior to the archaic methods, there still are some major shortcomings to the static embeddings Word2Vec produces. They ignore the morphology of words by assigning a distinct vector to each word. Words that are not present in the training data have no embedding. Especially in the light of this dissertation, this can be harmful: misspelled and slang words are common in Twitter data and induce Out-Of-Vocabulary (OOV) words, to which consequently no embedding is assigned. Another problem is the embeddings being static, as mentioned above.

The Word2Vec approach has been generalized to capture broader level representations, such as sentence embeddings using encoder-decoder models [Zhu et al., 2015, Logeswaran and Lee, 2018] or paragraph embeddings [Le and Mikolov, 2014]. Encoder-decoder models are introduced in section 2.2.1.5.

2.2.1.2 fastText

To counter the Out-Of-Vocabulary problem occurring with Word2Vec, [Bojanowski et al., 2017] published fastText. fastText is based on the Skip-gram model with the modification that each word is represented as a bag of character n-grams. An individual word is then represented by the sum of its character n-grams. Hence, words that would normally be Out-Of-Vocabulary in the W2V model are decomposed and have an embedding in the fastText model. BERT uses a similar way of handling the OOV problem: WordPiece [Wu et al., 2016]. Instead of summing all character n-grams back together to get 1 embedding for 1 word, WordPiece splits the word into sub-words and assigns each sub-word an embedding without summing them back together.
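A minimal gensim sketch (gensim 4.x assumed, toy corpus) of how fastText still produces a vector for a misspelled, out-of-vocabulary word from its character n-grams, where Word2Vec would have none.

```python
# fastText handles OOV words through character n-grams.
from gensim.models import FastText

tweets = [["tesla", "stock", "is", "surging"],
          ["elon", "musk", "tweeted", "about", "tesla"]]

model = FastText(tweets, vector_size=100, window=5, min_count=1, sg=1)  # skip-gram
print("teslla" in model.wv.key_to_index)   # False: never seen during training
print(model.wv["teslla"].shape)            # (100,): built from character n-grams
```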

2.2.1.3 ELMo

ELMo and its predecessor [Peters et al., 2018] were the first to take word embeddings in a different direction. They recognized the lack of context in general word embeddings and how word meaning varies across linguistic contexts (i.e. the need to model polysemy). The solution was a deep recurrent network, like an LSTM, where the hidden states are computed based on the previous hidden states, making the learned word embeddings contextual (W2V and fastText only have 1 hidden layer, hence static embeddings). ELMo uses vectors across layers from a bidirectional LSTM. This needs some further explanation, as BERT also leverages this approach, namely training bidirectionally and using intermediate layer representations.

Training bidirectionally means that the word being predicted uses information from both directions: the words before and after the predicted word. ELMo succeeds in this by combining both the forward and the backward LM. However, these 2 language models are not trained simultaneously. Given a sequence of N tokens $(t_1, t_2, \ldots, t_N)$, a forward LM computes the probability of token $t_k$ taking the history $(t_1, t_2, \ldots, t_{k-1})$ into account [Peters et al., 2018]. Hence, a word is passed through all L layers of a forward LSTM. At each position k (e.g. k=1 indicates the first word in the sentence), each LSTM layer j outputs a context-dependent vector $\overrightarrow{h}_{k,j}$, representing the vector of word k after passing through layer j. A backward LM is similar to the forward one, except that it runs over the sequence in reverse: predicting a word $t_k$ taking the 'future' $(t_{k+1}, t_{k+2}, \ldots, t_N)$ into account results in the vector $\overleftarrow{h}_{k,j}$ [Peters et al., 2018]. The 2 resulting vectors are then combined into one vector: $h^{LM}_{k,j} = [\overrightarrow{h}_{k,j}; \overleftarrow{h}_{k,j}]$.

The embedding of the token $t_k$ is then acquired by using the intermediate layer representations. Moreover, ELMo learns a linear combination of the hidden state vectors, which improves performance over just using the top LSTM layer. Using intrinsic evaluations, [Peters et al., 2018] showed that the higher-level LSTM states capture context-dependent aspects of words, while the lower-level states model aspects of syntax.

2.2.1.4 ULMFiT

Although inductive transfer learning was gaining increasing attention in the NLP domain in 2013, there still was a large gap between the possibilities of transfer learning in the field of NLP and in computer vision [Howard and Ruder, 2018]. In 2018, the prevailing approach was to pre-train embeddings and use these as input features in other tasks (this is the proposed direct embedding approach, more on that later). In computer vision this method is known as hypercolumns [Hariharan et al., 2015] and it is used by [Peters et al., 2018] in the ELMo model. In computer vision, hypercolumns have been nearly entirely superseded by end-to-end fine-tuning [Long et al., 2015]. Although fine-tuning approaches were successful for transfer between similar tasks (question answering, sequence classification, . . . ), methods for successfully fine-tuning unrelated tasks from a baseline transfer model were yet to be invented. Being able to fine-tune a universal model solves the overfitting problem: if a domain-specific dataset, which is typically small, has to be built from scratch, overfitting occurs: the model corresponds too closely to the specific dataset and does not generalize well to unseen data. In May 2018, Universal Language Model Fine-tuning (ULMFiT) [Howard and Ruder, 2018] was introduced, which pre-trains a language model (LM) on a large general-domain corpus and fine-tunes it on the target task using novel techniques. The method is universal in the sense that it meets these practical criteria [Howard and Ruder, 2018]:

1. It works across tasks varying in document size, number, and label type.

2. It uses a single architecture and training process.

3. It requires no custom feature engineering or pre-processing.
4. It does not require additional in-domain documents or labels.

Figure 2.4: The 3 stages of ULMFiT [Howard and Ruder, 2018]

ULMFiT consists of 3 stages:

1. LM pre-training stage, where the model is trained on a general-domain corpus (e.g. Wikipedia) to capture general features of the language.

2. LM fine-tuning stage, which is necessary to enhance the general language captured in the first phase with language from a specific domain.

3. Classifier fine-tuning stage, which adds a block on top of the architecture. This block varies across specific fine-tuning tasks (e.g. for sequence classification tasks, like sentiment analysis, a linear layer is added on top).

An important mechanism that can occur is catastrophic forgetting. It is the tendency to completely and abruptly forget previously learned information upon learning new information [McCloskey and Cohen, 1989]. Hence, fine-tuning the language model too much towards a specific domain can result in the model 'forgetting' general language aspects. To counteract the problem of losing the general information gained in stage one, the authors of ULMFiT propose the use of 3 counter mechanisms. 2 of them are used in the sentiment prediction phase (section 3.6.1.1), namely discriminative fine-tuning and gradual unfreezing. The first allows tuning of layers with different learning rates, as different layers capture different types of information [Yosinski et al., 2014]; the latter is a technique to gradually unfreeze the model starting from the last layer, rather than fine-tuning all layers at once, which risks catastrophic forgetting. The interested reader can learn more about them in the paper of [McCloskey and Cohen, 1989].
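A minimal PyTorch sketch of these two counter mechanisms, not the authors' code; `encoder_layers` is a hypothetical stand-in for the stacked layers of a pre-trained language model, and the decay factor of 2.6 follows the ULMFiT paper.

```python
# Discriminative fine-tuning and gradual unfreezing, sketched in PyTorch.
import torch
import torch.nn as nn

encoder_layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])  # stand-in

# Discriminative fine-tuning: lower layers get smaller learning rates.
base_lr, decay = 2e-5, 2.6
param_groups = [{"params": layer.parameters(), "lr": base_lr / (decay ** depth)}
                for depth, layer in enumerate(reversed(encoder_layers))]
optimizer = torch.optim.AdamW(param_groups)

# Gradual unfreezing: only the top n layers receive gradient updates,
# unfreezing one additional layer per epoch.
def unfreeze_top(n):
    for i, layer in enumerate(encoder_layers):
        trainable = i >= len(encoder_layers) - n
        for p in layer.parameters():
            p.requires_grad = trainable

unfreeze_top(1)   # epoch 1: only the top layer is updated
```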

2.2.1.5 OpenAI GPT

ULMFiT closed the gap between computer vision and NLP by introducing an effective transfer learning method based on an LSTM neural network. Although it gained state-of-the-art performance on various tasks, [Radford et al., 2018] felt it could be better. In 2017, transformers (section 2.2.2.1) emerged onto the scene, beating LSTM architectures. [Radford et al., 2018] swapped the LSTM for Transformer decoders, resulting, yet again, in top results on different tasks.

Figure 2.5: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses a concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the 3, only BERT representations are jointly conditioned on both left and right context in all layers.

The OpenAI GPT model introduced a fine-tunable pre-trained model based on the decoder part of the Transformer (section 2.2.2.1). But something went missing in the transition from LSTMs to decoders. ELMo's language model was bidirectional, but the OpenAI GPT decoder is trained as a traditional language model:

$$Pr(x_i) = Pr(x_i \mid x_{<i}) \qquad (2.5)$$

Traditional language modeling could not cope with the problem that training bidirectionally would result in each word indirectly seeing itself and thus it is not a real prediction. Remember



that ELMo was able to train bidirectionally as it trains the 2 tasks separately (section 2.2.1.3). BERT resolved this problem by training a masked language model where 15% of the words in the corpus are masked (section 2.2.2.3).

2.2.2 BERT

Word2Vec's word embeddings, fastText's character-based tokenizer, ELMo's contextualized word embeddings, ULMFiT's transfer learning method and OpenAI GPT's Transformer-based transfer model all led to the development of BERT [Devlin et al., 2018]. BERT smartly picked the most promising distinctive features from each previously described model, while improving on them where possible. BERT stands for 'Bidirectional Encoder Representations from Transformers'. The meaning of these words becomes apparent in the following subsections.

In the following, BERT's architecture, word representations, pre-training phase and fine-tuning phase are discussed.

2.2.2.1 Architecture

For a long time, the dominant sequence transduction models were based on complex recurrent neural networks. Transduction refers to the conversion of an input into another form (the output), while sequential refers to the way the input data is structured. The best example is converting a sentence in a certain language to exactly the same sentence in another language. The traditional RNN/LSTM based approaches had one major flaw: their poor performance on long sequences due to the inability to parallelize the network. In the breakthrough paper of [Vaswani et al., 2017], a much simpler network architecture was introduced, called a 'transformer', which dispenses with recurrent and convolutional networks entirely and is based solely on attention mechanisms. The transformer, built upon attention technologies without using an RNN structure, highlights the fact that attention mechanisms alone, without recurrent sequential processing, are powerful enough to achieve the performance of RNNs [Web15].

Transformers typically consist of an encoder and a decoder block. Decoders, however, are not used in BERT's architecture and are not discussed in this dissertation; one can read about them in the paper of [Vaswani et al., 2017]. However, a basic understanding of an encoder is needed to understand the principles of BERT. Hence, the working of an encoder is discussed along with the attention mechanism.


Figure 2.6: One encoder block [Vaswani et al., 2017]

An encoder converts the original input sequence into its latent representation in the form of hidden state vectors. In the case of sentence translation, the input is a sentence with the embedding for each word in the sentence and its relative position (positional encoding). This input is passed through a multi-head attention layer that enables the model to learn which words refer to one another and their different meanings in different contexts. This layer is explained below. It outputs a new adapted word embedding of the same size, taking its context into account. After normalizing this embedding, it is fed into a feed-forward neural network with 2 linear layers and an activation function.

Attention is a mechanism that takes 2 sentences and turns them into a matrix where the words of one sentence form the rows and the words of the other sentence form the columns. Key to transformers is the concept of self-attention, where the rows and columns come from the same sentence. Self-attention helps the model learn which words in and across sentences refer to one another. Consider the following sentence: 'The animal didn't cross the street. It was too tired.' To which word does 'it' refer? Self-attention allows the model to associate 'it' with 'animal'. How far models are able to go back linking words depends on the model at hand. BERT, for example, is able to relate words that are up to 512 tokens apart. This, however, comes at a cost: computing resources.

It works by learning 3 new embeddings of the same size for each individual word embedding of the input sentence during the training phase. These 3 embeddings are called query $Q_i$, key $K_i$ and value $V_i$. Each word embedding $X_i$ wants to know its value with respect to each word embedding $X_j$. Therefore, $X_i$ queries $X_j$ and $X_j$ responds by providing its key. Mathematically, this means performing the dot product of $Q_i$ and $K_j$, which results in a single value as both embeddings have the same dimension. Next, each value of the dot product is divided by the square root of the key matrix dimension. Finally, each resulting value goes through a softmax function to ensure the sum is equal to one, which results in $S_{i,j}$ [Web1].

This value is then used to calculate $S_{i,j} \cdot V_j = V'_{i,j}$, where $V'_{i,j}$ is the new embedding that is adjusted for the importance of word i relative to word j. To find the final embedding for word i, the following sum must be taken: $\sum_{j=1}^{n} V'_{i,j} = V'_i$, where n is equal to the predefined hyper-parameter sequence length minus one (the word itself). More on this later.

Since words have different meanings in different contexts, a word can have multiple $V'_i$ embeddings. For each set of queries, keys and values, the self-attention steps as explained above are performed. For h different sets, h different final embeddings $V'_i$ are created for each word $X_i$. These are then concatenated and multiplied with another learned matrix Z, whose sole purpose is reducing the concatenated embedding to an embedding of the original size. This way, the original embedding of word i is transformed in such a way that it considers multiple contexts and addresses the issue of static word embeddings. This last process is called multi-head attention [Web20].

Figure 2.7: Multi-head scaled dot-product attention mechanism. [Vaswani et al., 2017]
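To make the Q/K/V mechanics above concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; the random projection matrices stand in for the weights that would be learned during training, so this is an illustration rather than BERT's actual implementation.

```python
# Single-head scaled dot-product self-attention in NumPy.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # dot products, scaled by sqrt(d_k)
    S = softmax(scores, axis=-1)              # rows sum to one: S[i, j]
    return S @ V                              # context-adjusted embeddings V'_i

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 tokens, embedding size 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 8)
```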

BERT, however, does not limit itself to only one encoder. 2 types of BERT architectures are mostly used: BERT Base and BERT Large. BERT Base consists of 12 encoder blocks stacked on top of one another, while its larger alternative has 24 encoder blocks [Devlin et al., 2018].


2.2.2.2 Word representation

Doing feature extraction with BERT Base results in an embedding of shape (1, 768) per word. BERT Large outputs embeddings of shape (1, 1024). Some words, however, have 2 or even more embeddings. If a word is not present in the vocabulary (Out-Of-Vocabulary), the word is split into sub-words until all of them fit in the dictionary. This tokenization process is called WordPiece [Wu et al., 2016]. Note that WordPiece is different from, yet inspired by, fastText's character-based tokenization method [Bojanowski et al., 2017], as it splits into sub-words rather than splitting on characters. What is also appealing is the fact that no lemmatization or stemming is needed, as the word 'buying' is converted to 'buy' and 'ing'. This needs to be remembered when pre-processing the tweets.

When initializing a BERT model, a vocabulary must be passed to it. The standard vocabulary file consists of 30,522 tokens; compared to the English dictionary (171,476 words), this seems rather small. However, the vocab file of BERT only consists of common words and sub-words, greatly decreasing the number of possible combinations needed. If one takes into account the [CLS], [MASK], [SEP], [UNKNOWN] and 1000 [UNUSED] tokens, this number even drops to 29,518.

[MASK]-tokens are used for the masked language modeling objective present in pre-training; more on that later. [UNUSED]-tokens are used to specify additional domain-specific words according to the task at hand. Adding tokens like '$tsla', 'elon' and 'musk' should enhance the quality of the tweet embeddings as, for example, the token 'elon' will not be split into 'el' and 'on' by the BERT tokenizer and will converge better and faster to the true word embedding capturing the relevant information [Web14]. One should, however, only specify additional words in the vocab if the model is pre-trained on one's own corpus, as all the [UNUSED]-tokens have an initial random embedding. While pre-training, the embedding is altered through the encoders to fit the semantic meaning of the word. If one uses BERT Vanilla (BERT without further pre-training or fine-tuning) and specifies additional tokens, it could happen that a token like $tsla has an embedding that in feature space is close to, for instance, the token tomato.
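A minimal sketch of this with the Hugging Face transformers library, assuming 'bert-base-uncased' is the BERT Base checkpoint meant here; note that `add_tokens` appends new vocabulary entries rather than reusing the reserved [UNUSED] slots, but the effect is similar in that the new embeddings start out random and only become meaningful after further pre-training on the own corpus.

```python
# WordPiece tokenization and registering domain-specific tokens.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("$tsla is surging after elon musk tweeted"))
# without extra tokens, rare strings like '$tsla' are split into sub-word pieces

num_added = tokenizer.add_tokens(["$tsla", "elon", "musk"])   # domain tokens
model = BertModel.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))   # new rows start as random vectors
```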

A [SEP]-token is used to split sentences appearing in the same document. In the pre-training stage, this token is used for the next sentence prediction task (section 2.2.2.3). An [UNKNOWN]-token is used for words that cannot be subdivided. Consequently, this token does not hold any information. For example, BERT's tokenizer assigns each emoji an [UNKNOWN]-token. To avoid this, a methodology is composed so that BERT is able to use the information captured in an emoji (see section 3.5.2.2).

Figure 2.8: [CLS]-token usage

Lastly, the [CLS]-token stands for classification and is used in sentence-level classification. This is especially relevant when fine-tuning to a downstream classification task, as a sequence ultimately needs to be reduced to a single vector. When fine-tuning, the [CLS]-token learns the representation of the adjacent sentence; as a consequence, the [CLS]-token is inserted in front of the sentence. Although it is a good sentence representation when fine-tuning classification tasks like sentiment analysis and entailment, it is not good at representing the sentence when no additional classification layer is placed upon the encoder blocks (feature extraction) [Devlin et al., 2018].

In this dissertation, fine-tuning is only applied for the traditional sentiment approach. Hence, the question arises how one should represent a tweet, knowing that the 'sentence capturing token' is only useful when fine-tuning. In fact, for a sentence of 4 words like the one in figure 2.8, there are 66 possible embeddings ([4 word tokens + 1 CLS token + 1 SEP token] x [11 hidden layers]), each of size 768 in BERT Base. In this dissertation, the possible number of embeddings even reaches 792, as padding needs to be taken into account (section 3.5.3.3).


Luckily, there has been some research concerning the optimal layers to choose from [Devlin et al., 2018, Araci, 2019]. In figure 2.9, the feature-based approach section indicates that concatenating the last 4 hidden layers yields the best results. A concatenation of 4 layers would result in an embedding, if all tokens are averaged, of size 3072 (768 x 4). However, taking the memory limits of Google Colab into account, the decision was made to take the second-to-last hidden layer, which is also the layer preferred by the makers of the model (see https://github.com/google-research/bert/issues/196). Why the second-to-last and not the last layer, one may ask? It appears that the last hidden layer is too close to the target functions (i.e. masked language model and next sentence prediction) used in pre-training and is therefore too biased towards those targets.

Figure 2.9: CoNLL-2003 Named Entity Recognition results. Hyperparameters were selected using the Dev set. The reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters [Devlin et al., 2018]
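A minimal sketch (assuming the Hugging Face transformers API and the 'bert-base-uncased' checkpoint) of extracting a tweet embedding from the second-to-last hidden layer, averaged over tokens, as described above.

```python
# Feature extraction from BERT's second-to-last hidden layer.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("Tesla stock is surging", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states         # embedding layer + 12 encoder layers
second_to_last = hidden_states[-2]            # shape (1, seq_len, 768)
tweet_embedding = second_to_last.mean(dim=1)  # average over tokens -> (1, 768)
print(tweet_embedding.shape)
```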

2.2.2.3 Pre-training

BERT has been pre-trained as a semi-supervised learning problem on the entire English Wikipedia (2.5 billion words) and the BooksCorpus (800 million words). 2 techniques were put into use. The first technique is called Masked Language Modeling (MLM), i.e. 15% of the words are replaced with a [MASK] token. This allows for training bidirectionally without words seeing themselves. The problem becomes apparent in figure 2.10. In a unidirectional context, only the previous tokens '<s>' and 'open' are seen when predicting



token ’a’. It is easy to see that in a bidirectional context, without masking, the token ‘a’ would already be seen by the model, making the prediction task trivial.

Figure 2.10: Bias in training bidirectionally in traditional language modeling [Devlin et al., 2018]

Why 15%, one may ask? It appears that the developers of BERT did not tune this hyperparameter, as pre-training was too expensive to repeat; hence, an informed choice had to be made. Too little masking leads to an even more expensive training task, while too much masking leaves less context to learn from. Jacob Devlin states: “We didn’t try a lot of ablation on this. Those numbers are just what made sense to me and the only thing that I tried. It’s possible that other values will work better (or more likely, the system isn’t very sensitive to the exact hyperparameters)”.

There is, however, a problem with MLM: the [MASK] token never appears during fine-tuning, which creates a mismatch between pre-training and fine-tuning. To mitigate this, the selected words are not always replaced with the actual [MASK] token. 10% of the time the selected word is replaced by a random word (e.g. ‘went to the store’ becomes ‘went to the walking’), and another 10% of the time the selected word is left unchanged (e.g. ‘went to the store’ stays ‘went to the store’). Hence, only 80% of the time are the selected words replaced by a [MASK] token [Devlin et al., 2018]. Keep in mind that there are more problems with the MLM approach; more on that in section 2.2.3.
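A minimal sketch of this 80/10/10 scheme (illustrative only, not BERT's actual implementation) could look as follows:

import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    # Select roughly 15% of the tokens for prediction and apply the 80/10/10 rule.
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < select_prob:
            labels.append(tok)                       # this position enters the MLM loss
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random word
            else:
                inputs.append(tok)                   # 10%: keep the original word
        else:
            inputs.append(tok)
            labels.append(None)                      # ignored by the MLM loss
    return inputs, labels

print(mask_tokens("went to the store".split(), vocab=["walking", "car", "factory"]))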

The second way BERT is pre-trained is by means of Next Sentence Prediction (NSP), an effective way to learn relationships between sentences. In this process, the model receives pairs of sentences where, for 50% of the pairs, the second sentence actually follows the first,


Figure 2.11: Next sentence prediction task

while for the other 50% the second sentence is chosen randomly from the corpus. The goal is to predict whether the second sentence is subsequent to the first or not.

Masked Language Modeling and Next Sentence Prediction are trained simultaneously, with the goal of minimizing their combined loss. The BERT models were trained on a 4x4 or an 8x8 TPU slice for 4 straight days, resulting in BERT Base and BERT Large respectively.
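A minimal sketch of how such sentence pairs could be constructed (illustrative only; the example documents are invented) is given below:

import random

def make_nsp_pair(docs):
    # docs: list of documents, each a list of sentences.
    doc = random.choice([d for d in docs if len(d) > 1])
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], 1                  # label 1: IsNext
    random_sent = random.choice(random.choice(docs))  # may coincidentally be the true next sentence
    return sent_a, random_sent, 0                     # label 0: NotNext

docs = [["tesla released its earnings.", "the stock jumped after hours."],
        ["elon musk tweeted again.", "analysts were not amused."]]
print(make_nsp_pair(docs))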

2.2.2.4 Fine-tuning

Figure 2.12: Fine-tuning BERT [Devlin et al., 2018]

Before BERT, there were two existing strategies for applying pre-trained language representations to downstream tasks: feature-based (ELMo: section 2.2.1.3) and fine-tuning (OpenAI GPT: section 2.2.1.5). The developers of BERT claimed that the techniques of the time restricted the power of the pre-trained representations, especially in the fine-tuning approaches (the reader should know by now what these limitations were and how BERT solved them). Such restrictions are sub-optimal for sentence-level tasks and can be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering [Devlin et al., 2018].


Fine-tuning BERT offers some appealing advantages:

• The hidden states already encode a lot of aspects of language (if transfer learning is applied). As a result, it takes far less time to train a fine-tuned model. It is recommended to fine-tune for 2 to 4 epochs with a low learning rate to lower the chance of Catastrophic Forgetting (section 2.2.1.4). This recommendation is empirically tested later on (section 3.6.1.1); a minimal fine-tuning sketch is given right after this list.

• Far less data is needed, as the model already possesses a lot of language knowledge.

• It achieves state-of-the-art results on 11 NLP tasks: 8 GLUE tasks (figure 2.13), SQuAD v2.0, CoNLL and SWAG [Devlin et al., 2018].
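For the supervised sentiment classification in this dissertation, such a fine-tuning run could be set up roughly as follows with the Hugging Face Trainer API. This is a minimal sketch: the number of labels, the hyperparameter values and the datasets train_dataset / eval_dataset are illustrative assumptions.

from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

args = TrainingArguments(
    output_dir="finetuned-bert-sentiment",
    num_train_epochs=3,                  # 2 to 4 epochs, as recommended above
    learning_rate=2e-5,                  # low learning rate against catastrophic forgetting
    per_device_train_batch_size=32,
)

# train_dataset and eval_dataset are assumed to be tokenized datasets of labelled tweets.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()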

Figure 2.13: The General Language Understanding Evaluation benchmark (GLUE) results. It is a collection of datasets used for analyzing NLP models relative to one another. The collection consists of 8 “difficult and diverse” tasks designed to test a model’s language understanding [Devlin et al., 2018]

2.2.3 XLNeT

The adoption of the transformer (with its attention mechanism) and the creation of BERT sparked a lot of research in the field of NLP. They were introduced in 2017 and 2018 respectively, which is relatively old considering the progress NLP has made over the past years [Web10]. Hence, by 2020, BERT had already been superseded a few times. This does not mean that the attention given to BERT in this dissertation is unjustified: all models that came after BERT build on the principles it introduced, so it is essential to first understand those principles if one wants to grasp the newer models.


One of those newer models is XLNeT [Yang et al., 2019]. At the end of 2019 it was considered the best NLP model around (not taking Microsoft's 17-billion-parameter language model into account, as it is not publicly available). Moreover, a famous NLP researcher stated the following:

”The king is dead, Long live the king.” ∼ Sebastian Ruder, DeepMind

This quote cannot be taken lightly. XLNeT not only achieved better results on 20 NLP tasks, it also changed the whole concept of masked language modeling (section 2.2.2.3), which was precisely the innovative solution to the bidirectionality problem that BERT received its credit for.

The authors of XLNeT felt there were two major drawbacks in BERT's pre-training process, most notably in the MLM approach [Web26]. Firstly, artificial tokens like [MASK] are absent in the fine-tuning stage, leading to a pre-train-fine-tune discrepancy [Yang et al., 2019]. Hence, one might ask whether the model really learns to produce meaningful representations for non-masked tokens, as it can simply copy non-masked tokens to the output. Secondly, BERT assumes that the predicted tokens are independent of one another, which is usually not the case. This needs some clarification. Suppose a pre-training instance:

’Tesla can manufacture additional [MASK] in their new [MASK]’.

This can be filled in as:

’Tesla can manufacture additional cars in their new factory’.

but the sentence

’Tesla can manufacture additional cars in their new cybertruck’.

is not valid. The masked tokens are predicted in parallel, meaning that BERT does not learn to handle dependencies between its own predictions. If it could, the model would infer that, once the first mask is filled with ’cars’, ’cybertruck’ is no longer a plausible prediction for the second mask. This indicates that the pre-training stage could be made more efficient, leading to a stronger learning signal.

These problems can be avoided by means of permutation language modeling. Like a traditional language model, it is trained to predict a token given the preceding context, but instead of following the natural sequence order, it predicts tokens in a randomly permuted order.


Essentially, this is the traditional language modeling task while still allowing for bidirectional context (as in MLM) by means of randomly permuting the factorization order of the tokens. E.g. consider the previously mentioned pre-training instance with 9 words: a set of 9! permutations can be made from that sentence, where, for each of these 9! orderings, the probability of a token $x_i$ given its preceding tokens $x_{<i}$ (in that permuted order) can be computed [Web4].
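Formally, [Yang et al., 2019] write this permutation language modeling objective as an expectation over all factorization orders, with $\mathcal{Z}_T$ the set of all permutations of a length-$T$ sequence and $\mathbf{z}$ one sampled permutation:

\[
\max_{\theta}\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}\left[\, \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
\]

Since the parameters $\theta$ are shared across all factorization orders, every token learns, in expectation, to attend to every other token in the sequence, while each individual prediction remains a proper autoregressive language modeling step.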

Now, how does one still take the order of the words into account? To answer this, a thing or two needs to be mentioned about TransformerXL [Dai et al., 2019]. Basically, it is a transformer that can model longer-term dependencies than classic RNNs and traditional transformers. The idea is to bring the recurrence of RNNs into transformers by caching (freezing) the hidden states of the previous segment (documents in the corpus are split into several segments). It relates words in different segments through a mechanism called relative positional encoding. This way, it is possible to model dependencies between tokens that are more than 512 positions apart (512 being the sequence length limit of the BERT model). XLNeT leverages this relative positional encoding to keep track of the natural order of the words, and it takes the TransformerXL architecture as the backbone of the model. Hence its name.

To summarize, XLNeT has two key advantages over BERT: its ability to learn dependencies beyond BERT's 512-token limit and its new, improved language modeling approach, which induces a stronger language learning signal.

2.3 Pre-processing

2.3.1 Problem statement

In order to fully exploit the capabilities of language models, the text that is fed into them should be pre-processed first. This is especially true when the text under consideration stems from tweets, as these are characterized by informal language, the use of emoji and spelling mistakes. There are numerous standard methods for tweet pre-processing when dealing with a sentiment prediction problem. However, the methods applied to enhance sentiment prediction also improve embeddings in general, and therefore many of these techniques are applied in this dissertation for both the direct embedding approach and the traditional sentiment approach. Extensive research has been done into which methods do and do not work; some of these works are briefly discussed below.


2.3.2 Pre-processing literature

[Bao et al., 2014] identify the following methods as the most effective:

• Replacing negation. E.g. ’Won’t’ becomes ’will not’.

• Repeated letters normalization. E.g. ’Cooool’ becomes ’Cool’.

Some methods even have a negative impact:

• Stemming. I.e. reducing inflected words to their stem, base or root form. E.g. ’Studies’ becomes ’studi’.

• Lemmatization. I.e. converting inflected words to their root form. E.g. ’Studies’ becomes ’study’.

According to [Jianqiang, 2015], the following two methods yield the biggest increase in sentiment classification accuracy:

• Replacing negation.

• Expanding acronyms.

Some methods hardly affect performance:

• URL removal.

• Stopword removal.

• Number removal.

However, they emphasize the importance of selecting the appropriate pre-processing methods depending on the type of model being used.

As for [Angiani et al., 2016], some useful techniques are:

• Reducing emoticon categories to only 2 classes, i.e. positive emoticons and negative emoticons.

• Replacing negation. • Stemming.

• Stopword removal.


2.3.3 Reflection

There are some inconsistencies in the recommendations of the aforementioned researchers, which only reinforces the idea that sentiment prediction is not an exact science. Therefore, the type of embedding model and the text data to be processed should be taken into account before applying any of these methods. In this dissertation, many of these techniques are applied, along with some others not mentioned in the literature because they are very case specific. For an overview, consult section 3.5.2.
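By way of illustration, a few of the recurring techniques (URL removal, negation replacement and repeated-letter normalization) could be implemented as follows. This is a simplified sketch with an invented example tweet, not the full pre-processing pipeline of section 3.5.2.

import re

def preprocess_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)      # URL removal
    text = re.sub(r"won't", "will not", text)     # replacing negation (special case)
    text = re.sub(r"n't\b", " not", text)         # replacing negation: "doesn't" -> "does not"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # repeated letters: "cooool" -> "cool"
    return text.strip()

print(preprocess_tweet("Tesla is coooool!! Won't sell https://t.co/xyz"))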

2.4 Sentiment analysis

In the second part of this dissertation, i.e. the traditional sentiment approach, sentiment is extracted from the tweets and used as a feature for the prediction of Tesla's stock. To do so, sentiment analysis is applied, a subfield of NLP that analyzes texts and attempts to classify them according to their emotional polarity or valence. One can approach sentiment analysis in a supervised manner (which makes it a classification problem) or in an unsupervised manner.

2.4.1 Unsupervised sentiment analysis

Unsupervised sentiment analysis is characterized by the absence of labels: there is no need for labeled data because the model is not trained on annotated examples. One common approach is lexicon-based sentiment analysis. Within lexicon-based sentiment analysis multiple approaches are possible as well, but the general idea is as follows. In a lexicon (dictionary), each word is mapped to an emotional valence score. E.g. on a scale from -7 to 7, the word ’bad’ may have a score of -4.9 while the word ’excellent’ may have a score of 6.3. To predict the sentiment of a text, each word of the text is looked up in the dictionary to obtain its sentiment value. In the next step, either the average sentiment of all words in the text or the percentage of positive/neutral/negative words is calculated, which represents the overall sentiment of the text [Web17]. The lexicon can contain unigrams, i.e. single words, or bigrams/n-grams, i.e. multiple words; e.g. a bigram could be ’not good’ (see section 2.1.3). There are numerous packages available in Python and other programming languages which provide a much more advanced and context-specific approach than the one described. One such example is VADER, a lexicon- and rule-based sentiment tool specifically


attuned to sentiments expressed in social media [Web22].
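A minimal usage sketch of VADER (via the vaderSentiment Python package) is shown below; the example tweets are illustrative.

# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Returns a dict with 'neg', 'neu', 'pos' proportions and a 'compound' score in [-1, 1].
print(analyzer.polarity_scores("Tesla is awesome !!"))
print(analyzer.polarity_scores("Buying Tesla call options now!"))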

2.4.2 Supervised sentiment analysis

Another way to perform sentiment analysis is the supervised approach. Since it is supervised, it does require labels. This is often a problem because most text corpora do not come with a sentiment score for each document. However, when sufficient labels are available, it very often outperforms unsupervised sentiment analysis. For supervised sentiment analysis many machine learning models can be used, e.g. random forests, ANNs, SVMs, ..., depending on the input data. The text can be processed into the appropriate input using many of the techniques described above, e.g. bag-of-words, TF-IDF, word embeddings, ....
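As an illustration, a simple supervised setup could combine TF-IDF features with a standard classifier using scikit-learn. This is a minimal sketch: tweets and labels are assumed to be the manually annotated tweet texts and their sentiment classes.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tweets: list of (pre-processed) tweet strings; labels: their annotated sentiment classes.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tweets, labels)
print(clf.predict(["Buying Tesla call options now!"]))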

2.4.3 Reflection

The scraped Tesla tweets used in this analysis do not automatically come with a sentiment score. Therefore, it might seem appropriate to use an unsupervised sentiment approach like VADER, as it is well suited for social media content. However, VADER is well suited for general sentiment (i.e. happy, not happy) but not for financial sentiment, as is the case with tweets scraped on $TSLA. VADER may very well be able to capture the positive sentiment of the following tweet: ”Tesla is awesome !!”. However, the following tweet is positive as well: ”Buying Tesla call options now!”, and unfortunately VADER is most likely not able to understand this financial language. Therefore, the decision is made to label a sample of tweets manually and apply the supervised sentiment approach.

2.5 Forecast horizon

2.5.1 Problem statement

Market efficiency

According to the Efficient Market Hypothesis (EMH) [Malkiel and Fama, 1970] there are 3 types of market efficiency:

1. Weak Form Market Efficiency: implies that all past market information (historical prices and volumes) is incorporated in security prices and that prices follow a random walk pattern. Therefore, security prices cannot be predicted from historical price data.

2. Semi-Strong Form Market Efficiency: implies that all publicly available information is incorporated in security prices, so neither technical nor fundamental analysis can consistently yield excess returns.

3. Strong Form Market Efficiency: implies that all information, both public and private, is incorporated in security prices, so no investor can consistently earn excess returns.
