
Faculty of Economics and Business

Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently, the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)

(c) Introduction

(d) Theoretical background

(e) Model

(f) Data

(g) Empirical Analysis

(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the headings of the sections. You have a free choice of how to list your references, but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and the year of publication for the first reference and use the first name and et al. and the year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number

(d) Date of submission of the final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

Predicting the Future Performance of Companies Using Twitter Data

Master's Thesis - MSc in Econometrics: Big Data Business Analytics

Author: Stefan Bemelmans (10899510)
Supervisor: Dr. Katarzyna Lasak
Second Reader: Dr. Noud van Giersbergen

ABSTRACT

This thesis aims to predict the three-month performance of companies using their Twitter activity. For this research, the companies listed on the S&P 500 are used. The text from the tweets is modelled using text analysis techniques including GloVe, Word2Vec, LDA and Paragraph Vectors. The resulting 'tweet embedding vectors' are used as input for classification models, including logistic regression, naive Bayes and random forest. These models classify the companies into good and bad investments. The Twitter-based models are compared to a benchmark logit model. It is found that the Twitter activity of a company can be of added value: the best model found has a 27% improvement over the benchmark. This model uses Paragraph Vectors for text embedding and a random forest for classification, and has an AUROC of 0.68.

Keywords: Text Analysis, Classification, Machine Learning, Twitter, AUROC


Statement of Originality

This document is written by Student Stefan Bemelmans who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Theoretical Background
  1.1 Efficient Market Hypothesis
  1.2 Twitter and Company Tweets
  1.3 Text Analysis
2 Models
  2.1 Scraping the Twitter Data
  2.2 Text Analysis
    2.2.1 GloVe
    2.2.2 Word2Vec
    2.2.3 Paragraph Vector
    2.2.4 Latent Dirichlet Allocation
  2.3 Prediction Models
    2.3.1 Logistic Regression
    2.3.2 Naive Bayes
    2.3.3 Random Forest
  2.4 The Pipeline
  2.5 Benchmark Model
3 Data
  3.1 Twitter Data
  3.2 Non-Twitter Data
4 Results
  4.1 Benchmark Results
  4.2 Text Analysis Results
  4.3 Classification Results
5 Conclusion
References
A Derivation of Gibbs sampling formula


Introduction

For a long time the stock market has been the subject of much research, and many investors have tried to use the models from this research to make profits. Numerous methods have been tried to predict future stock prices, but none of them seems able to make consistently good predictions. Since the Efficient Market Hypothesis (EMH) was formulated, we know that stock markets are not predictable, at least not using historical data. Still, many researchers are looking for a way to predict stock prices. For this, they use all kinds of aspects of a company that might predict its stock price well. There are many aspects that contribute to a greater or lesser extent to a company's performance, but it is impossible to capture all of these aspects in a model.

One important aspect is the marketing of a company. Good advertisements and product placement are important for brand awareness, and this brand awareness has a big impact on sales. Another important aspect that contributes to the performance of a company is its relationship with its customers: good customer service improves the reputation of the company and attracts more clients. Other information that is required to predict the performance of a company can include decisions on the direction a company is heading, or other strategy statements. These aspects may seem chosen at random, but there is one source that gives us information on all three of them: social media, and more specifically Twitter. Companies use Twitter for advertisement, for customer care and for important announcements or press releases.

Many researchers have investigated whether the sentiment of social media is a good predictor of stock prices. Many of them find that stock prices can be predicted by negative sentiment in tweets about some stock (Risius, Akolk, & Beck, 2015), positive sentiment in tweets about some stock (Smailović, Grčar, Lavrač, & Žnidaršič, 2014), or by the sentiment of tweets about a stock from people who have few followers (Kee Sul, Dennis, & Yuan, 2016). Moreover, since Twitter introduced the ticker symbol ($), which makes it possible to refer to stocks in tweets, it has become easier for people to use Twitter when searching for information on investing. For example, using the 'cashtag' $AAPL, one can find all tweets that contain information about the Apple stock.

In this thesis another source of information on Twitter is investigated for its power to predict the performance of a company: not the sentiment of tweets about a company, but the Twitter accounts of the companies themselves are used to predict their stock prices. This leads to the research question of this thesis: is it possible to predict the three-month future performance of a company with its Twitter activity? Because companies use Twitter for marketing, customer care and as a communication tool, the hypothesis is that the Twitter activity of a company can be used to predict the future performance of this company and can be of added value to traditional investment strategies.

For this research, the companies listed on the S&P 500 are used, and their three-month performance is predicted over the period from 9 January 2018 until 9 April 2018. To obtain their Twitter data, their Twitter accounts are scraped. First, the text needs to be made usable for modelling. In this thesis it is chosen to use text analysis models that represent the tweets by (numerical) vectors. After preprocessing these tweets, four text analysis models are used to make these representations. Two text analysis models, the GloVe model and the Word2Vec model, make word embeddings, which are vectors representing the words. To capture a tweet, the word vectors are averaged over all words in this tweet. The other two text analysis models that are implemented, Latent Dirichlet Allocation (LDA) and the Paragraph Vector (PV) model, directly make a representation for the complete tweet. Since the word representations of the first two models need to be averaged, it is expected that the LDA and PV models give the best results.

The obtained data can then be used for predicting the performance of the companies. For this purpose three classification models are used, namely a logistic regression, a naive Bayes model and a random forest. Due to the complexity of the data it is expected that the random forest model will give the best results. For example, the more Twitter is used for advertisement, the better the brand awareness of a company. However, a company like Apple does not need advertising on Twitter to have high brand awareness. The contribution of advertisement on Twitter to the performance of a company is therefore not monotonic. The measure that is used to assess the results is the area under the receiver operating characteristic curve (AUROC).

The rest of this thesis is arranged as follows: first the existing literature on the use of Twitter accounts by companies is discussed, followed by the literature on text analysis models, in Section 1. Then the methodology that is used to answer the research question is described in Section 2. In Section 3 the data that is used is explained. The results of the models are presented in Section 4. Finally, Section 5 concludes this thesis.


1 Theoretical Background

This thesis combines social media, or more specifically Twitter, with machine learning techniques. To introduce this combination we give a brief overview of the literature in the two areas that this thesis covers. But first, the EMH is discussed to show that this thesis is not necessarily in contradiction with the EMH. Then, we discuss the literature on the Twitter activity of companies. This elucidates why a company's Twitter page might be a good source of information for investment decisions on this company. Because we want to apply the information on these Twitter pages, we need text analysis models to extract the information from the tweets. For this reason we give an overview of the literature on existing text analysis methods, which makes it possible to make a reasonable selection of models to use for our purpose.

1.1 Efficient Market Hypothesis

First we briefly discuss the EMH. The EMH is a hypothesis introduced by Fama. According to the EMH, the prices on the stock market fully reflect all the information that is available at that time. This means that the prices paid on the stock markets are the fair prices for the stocks and that only news can influence the stock prices. If this hypothesis holds, it means that it is not possible for people to find a way to beat the market or predict future stock prices. The idea is quite logical, since if the EMH does not hold, one could get infinite returns out of the market by buying stocks that are underpriced and selling stocks that are overpriced. One just needs the true information that is not available to others and that suggests the under- or overpricing of a stock. However, there are known cases in which people have traded with inside information about a company and in this way earned high returns. This brings us to a division in the strength of the EMH.

There are three versions of the EMH that differ in the strength of their assumptions. In Fama (1970) the different forms of the EMH are brought up. The strong version assumes that the stock markets are always efficient, which means that it does not matter if there is an information asymmetry between investors, so that some have more information than others: the stock prices reflect all information. The semi-strong version assumes that news is absorbed very quickly, but not immediately, into the stock price. This still means that it is not possible to beat the market by analysing the company or by looking at the historical data. The third version is the weak form of the EMH. This version only assumes that it is not possible to beat the market by using the historical data of a company. However, according to this version, it might be possible to predict future stock prices based on an analysis of the state of the company.

Fama (1970) found no evidence against the weak and the semi-strong forms of the EMH, but found some evidence against the strong form.

The EMH was widely believed in the earlier days, but the new generation of economists does not believe in the EMH as much as the previous generation anymore, and more and more researchers have tried to disprove the hypothesis. Malkiel (2003) discusses some arguments against the hypothesis. One example is a study that addresses a remarkable difference in monthly stock returns between January and other months, where the returns in January are higher than in other months. Malkiel (2003) states that this may have been the case in the past few years, but that this pattern might as well not hold for the coming January. In conclusion, he finds that the markets are not completely efficient and that there are always irregularities in the market, but that in the long run the markets are efficient.

This suggests that in the short run it could be possible to beat the market. And even though the EMH is commonly known, still many researchers and stock traders try to find ways to predict the future performance of companies. A possible explanation is given by Timmermann & Granger (2004). They found that it is likely that new financial prediction methods can be used to beat the market. However, when the new models become common and everyone starts using them, their results will diminish.

In short, the EMH in its strong form would make it a waste of time to search for prediction models for stock markets. However, if we presume the weak form of the EMH, it should be possible to find prediction models that are able to predict the performance of companies based on an analysis of the company. Apart from this, more recent views on the EMH concern the long run of the stock markets; in the short run, there might be some irregularities. In addition, a new prediction model could be used to beat the market before it becomes publicly used. In this thesis, the aim is to find such a new prediction model using the Twitter pages of companies. In the following subsection Twitter is discussed.

1.2 Twitter and Company Tweets

Since Twitter was launched in 2006, almost every company has adopted the social medium to connect with other companies, and of course with users around the world. Twitter is very useful for companies, because it is a cheap way to communicate with customers, whether their customers are individuals or other businesses. This is called a Virtual Customer Environment (VCE) and Culnan, McHugh, & Zubillaga (2010) state that businesses can create business value by communicating with their customers via their VCE. From the social media activity of a company, such as its Twitter page, we can thus retrieve information about how well it has adapted to modern ways of communicating with customers. But there is more information in a Twitter page.

Golbeck, Robles, Edmondson, & Turner (2011) did research on extracting personality traits of people from their Twitter pages. They used the information from Twitter to predict the personality dimensions from the Big Five model, which include (i) openness to experience, (ii) conscientiousness, (iii) extroversion, (iv) agreeableness and finally (v) neuroticism. These personality dimensions were scored for each individual by a personality test. They then found that, using the information on the Twitter page, it is possible to predict the five personality dimensions of people as scored by the personality test. Since the Twitter pages of companies are maintained by people, we might be able to extrapolate this result to the Twitter pages of businesses.

For companies, there does not yet exist such a personality test as for individuals. Still, research has been done on retrieving information from non-individual Twitter pages. One example is the paper by Al-Daihani & Abrahams (2016). They used text mining techniques on the Twitter pages of libraries and found, for instance, that most tweets were related to knowledge, insight and personal and cultural relationships. They conclude that text mining applications can help libraries to evaluate their own tweets and that these applications can be used for comparisons with other libraries.

The research of Wang, Pauleen, & Zhang (2016) shows that the use of social media boosts business performance for companies that have other businesses as customers. This means that social media are not only relevant for businesses that want to reach individuals through social media. Kim & Youm (2017) investigated the effect of the Twitter activity of companies on the recommendations of stock analysts on the companies' stocks. They found that for tweets generated by companies, the number of retweets and the content of the tweets are important aspects that influence these recommendations.

1.3 Text Analysis

Now we discuss the literature on the different text analysis models. There are several different kinds of models for making representations of words or documents, and often these word representations are a by-product of the main goal of a model. These goals can be to predict a word's surrounding words, or to make a topic distribution for documents. From these text analyses, we get word representations that are based on different assumptions, but that can be very useful for our understanding of the texts they are in, without having to actually read those texts.

A lot has been said about the different methods of text analysis, and people often have a preference for which text analysis model they would use. But it is hard to make a justified decision on which of these models to use, since little research has really compared the models. The research that does compare these models often consists of the papers in which the authors introduce a new model and compare it to the existing models. The authors always show that their models can beat previous models, so there might be a bias in this kind of literature. We describe the steps that have been taken in this research area. This leads us from the earlier models to the more state-of-the-art models, which we will use in this thesis.


One of the earlier examples of distributional semantic models is the Vector Space Model (VSM) by Salton, Wong, & Yang (1975). Their goal was to find similarities between documents, and between a search query and documents. For this, they needed to create word representations of the words inside the documents and the queries. They used the frequencies of words to make representations for these words and proposed the TF-IDF model. This model considers both the term frequency and the inverse document frequency of the word. This already gave a fourteen percent improvement in recall and precision over the standard term frequency model, in which only the frequencies of words are used.

After the VSM model, Landauer, Foltz, & Laham (1998) proposed another model to find similarities between words and text segments, the Latent Semantic Analysis (LSA) model. The LSA model uses a term-document matrix, where the entries of the matrix are the counts of each term in each document. It then uses singular value decomposition to lower the dimensions of this matrix and to create term vectors. The advantage of LSA over VSM is that it takes the distribution of a word over the documents into account, which means for example that a document that often contains the word 'tulip' probably often contains the word 'flower' as well.

Only one year later, Hofmann (1999) came up with a statistical latent class model, named Probabilistic Latent Semantic Analysis (PLSA). PLSA is a statistical extension of the previous LSA model. This model uses a latent variable to statistically model the joint probability distribution of words and documents. This latent variable can be seen as the topics of the documents. Some advantages of PLSA over LSA are that the dimensions of the matrix obtained by PLSA can be interpreted as word distributions, whereas the matrix obtained by SVD in the LSA model cannot be interpreted. Moreover, because PLSA is a statistical model, one can choose the number of topics using model selection; for LSA this is not possible. On the other hand, LSA is an exact method, while PLSA uses an approximation that searches for a local optimum and therefore might not find the global optimum. In an experiment on two different corpora, which are sets of documents, Hofmann (1999) found a significantly higher reduction in perplexity using PLSA than using LSA.

A few years later, Blei, Ng, & Jordan (2003) came up with a new model, Latent Dirichlet Allocation (LDA), which is an extension of the PLSA model. LDA also uses latent topics to statistically model the distribution of words and documents, and it also makes use of the topics as probability distributions over words. However, while PLSA assumes that the documents are sets of words drawn from the topic distributions, it does not model the documents themselves statistically; LDA introduces a probabilistic model for the documents. For this reason, LDA should give better results in the prediction of new documents. Blei et al. (2003) show better empirical results of LDA over PLSA in both document modelling and document classification. More details of the model are discussed in Section 2.2.4.

For distributional semantic models, there has been no tremendous improvement of the topic models since LDA. But more recently, Mikolov, Chen, Corrado, & Dean (2013) did come up with a new word embedding model based on completely different assumptions. This model is called Word2Vec. Mikolov et al. (2013) assume that words can be recognised by the words in their neighbourhood and they proposed a model that creates word representations based on this assumption. This model uses a rolling window over all words in the corpus and uses a neural network with one hidden layer for every word to predict the surrounding words. During this prediction task, the word representations are made. A more elaborate description of this model can be found in Section 2.2.2. Mikolov et al. (2013) tested the Word2Vec model on the Semantic-Syntactic Word Relationship test set. The accuracy for the semantics was 55% and for syntax even 59%, which was better than for other neural network models.

Not long after this, Pennington, Socher, & Manning (2014) also proposed a new word embedding model that is based on yet other assumptions. This model is called the Global Vectors for Word Representation (GloVe) model. It uses a word-word co-occurrence matrix, in which each entry represents the number of times a word is in the neighbourhood of another word. This is a different approach from the Word2Vec approach, which uses only the local context of words. GloVe uses a type of matrix factorisation to handle the size of this matrix, and in this respect it is similar to LSA. We describe this model more extensively in Section 2.2.1. In their paper, Pennington et al. (2014) conduct an experiment on a large dataset, in which GloVe performs better than Word2Vec both on the word analogy task and in the speed of obtaining the results.

So far, we have mostly considered models that make representations for words. But why would we use word representations when we are interested in the tweets of the companies, which are texts consisting of multiple words? The problem with text representations is the co-occurrence of texts: if we wanted to make a document-document matrix, the sparsity would be enormous, since most text passages are unique. However, Le & Mikolov (2014) have found a way to make a representation of a text passage, which they call the Paragraph Vector (PV) model; see Section 2.2.3 for more detailed information. This model is an extension of the Word2Vec model. It uses the local context of words just like the Word2Vec model, but on top of that, it uses a unique vector for each text passage. The rest of the PV model is the same as the Word2Vec model. It uses a rolling window over all words in the corpus and then uses the same kind of neural network, which optimises the word representation vectors and the text representation vector, called the paragraph vector. In a sentiment analysis experiment on a dataset of IMDB reviews, the PV model outperformed several models including the LDA model.

To summarise this section, Twitter is not a traditional criterion in an investment process. However, the Twitter page of a company does say something about the company and can be used for comparison with similar companies. Apart from that, Twitter is a source of value generation for businesses. The text analysis on the tweets can be done using many different word embedding models that are based on different assumptions. It is therefore favourable to use a few different models that are all based on a different set of assumptions. This might result in a new prediction model that could be used to predict the future performance of companies on the stock market.

2 Models

In this section the models that are used for the classification of the investments are discussed. We cover the complete process from obtaining raw data, to converting it into usable data with the text analysis models, and finally to using this output in our classification models. First, in Section 2.1, we explain how the data from the Twitter pages of the companies is obtained and subsequently how this data is preprocessed so that it can be used in the text analysis models. The text analysis models are then discussed in Section 2.2; they are used to extract information from the tweets of the companies. Hereafter, in Section 2.3, we elaborate on how the output of the text analysis models is used in the classification models. With the output of the text analysis models as explanatory variables, we try to explain whether a company is a good investment or not by using these classification models. In Section 2.4, the 'pipeline' is discussed in which the classification models are implemented. This pipeline can be seen as a few steps of scaling and transforming the data, after which the classification is done. To conclude the part on the models, Section 2.5 covers the benchmark model to which we compare the Twitter-based models.

2.1 Scraping the Twitter Data

We begin with collecting the Twitter data. For scraping the tweets and the information on a company's Twitter account, we use the Python Twitter library. With this library we can access the Twitter API through Python. From Twitter, we only need data from the Twitter page of each company, since this is the data that will be used for our prediction. More information on the meaning of the data can be found in Section 3.

To be able to scrape the Twitter page of a company, we need to know the user name of the company on Twitter, or its user id. We used the user name, and these user names were found manually. In addition to the user name, the maximum tweet id of a company is determined manually. We use this maximum tweet id because we need to scrape the tweets of a company up until three months ago, in order to be able to predict quarterly results. This maximum tweet id is used instead of a maximum date, since tweets are not provided with a date in the API. The following is done for each company.

First we scrape the statistics of a company's page. These statistics consist of the number of followers, the number of friends, the number of statuses and the number of favourites. After scraping them, the company statistics are stored. Then we look at the tweets. From the tweets we need two kinds of data. First we scrape the statistics of each tweet. These statistics include the number of retweets and the number of likes for each tweet. After scraping all tweets from a company, these statistics are averaged for this company. The statistics of the tweets and of the companies described above will be used in the classification models that are described in Subsection 2.3. The second kind of data we use from the tweets is the text of these tweets. For each company, all the tweets that are scraped are stored in one text document.
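As an illustration of this scraping step, a minimal sketch using the python-twitter library is given below. The credentials, the example screen name and the maximum tweet id are placeholders, only a single page of results is fetched for brevity (a real run would paginate), and attribute names may differ between library versions.

```python
import twitter  # python-twitter library

# Placeholder credentials; real keys come from the Twitter developer portal.
api = twitter.Api(consumer_key="KEY", consumer_secret="SECRET",
                  access_token_key="TOKEN", access_token_secret="TOKEN_SECRET")

def scrape_company(screen_name, max_id):
    """Collect account statistics, average tweet statistics and the combined tweet text."""
    user = api.GetUser(screen_name=screen_name)
    stats = {"followers": user.followers_count, "friends": user.friends_count,
             "statuses": user.statuses_count, "favourites": user.favourites_count}

    # One page of tweets up to the manually chosen maximum tweet id.
    tweets = api.GetUserTimeline(screen_name=screen_name, max_id=max_id, count=200)
    if tweets:
        stats["avg_retweets"] = sum(t.retweet_count for t in tweets) / len(tweets)
        stats["avg_likes"] = sum(t.favorite_count for t in tweets) / len(tweets)

    # Store all scraped tweet texts as one document per company.
    text = " ".join(t.text for t in tweets)
    return stats, text

# Hypothetical company account and maximum tweet id, for illustration only.
company_stats, company_text = scrape_company("example_company", max_id=950000000000000000)
```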

After the tweets are collected, the text in these tweets needs to be cleaned to be useful in our text analysis models. By cleaning the text data we mean several things. First of all, we want to remove some typical Twitter symbols such as hash tags (#) and tickers ($). Punctuation is removed as well and all capital letters are transformed to lower-case letters. These changes make sure that, for example, 'twitter,' and 'twitter' are not two unique words, but the same word: 'twitter'. Moreover, words of two letters or fewer are removed.

Then some larger changes are made. The hyperlinks in the tweets are removed, since these links are almost always unique. This means we remove quite a large part of a small tweet, but this part is not informative enough. Usernames in the tweet are removed as well, since these are unique too. Finally the emojis (the smileys) in the tweets are removed. This gives us for each company the final document that includes all preprocessed tweets. On these documents we will perform the text analysis models that are described in the following subsection.
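The preprocessing rules described above can be summarised in a small cleaning function. The sketch below is one possible implementation of these rules; the regular expressions and the crude ASCII-based emoji removal are our own choices, not taken from the thesis.

```python
import re

def clean_tweet(text):
    """Strip links, usernames, emojis, Twitter symbols and punctuation,
    lowercase the text and drop words of two letters or fewer."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)    # remove hyperlinks
    text = re.sub(r"@\w+", " ", text)                # remove usernames
    text = text.encode("ascii", "ignore").decode()   # crude removal of emojis
    text = re.sub(r"[#$]", " ", text)                # remove hash tag and ticker symbols
    text = re.sub(r"[^\w\s]", " ", text)             # remove punctuation
    text = text.lower()
    words = [w for w in text.split() if len(w) > 2]  # drop very short words
    return " ".join(words)

print(clean_tweet("Check out $AAPL news: https://t.co/xyz @someone #stocks!!"))
# -> "check out aapl news stocks"
```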

2.2 Text Analysis

For the machine learning models that we are going to use to make a prediction of the performance of the companies, we need numerical variables as input. However, the largest part of the data we are using consists of the tweets of these companies. In order to transform our tweets into numerical variables, we are going to use both word embedding models and a topic model. All text analysis models use a text corpus consisting of documents as input. We use the combined tweets of a single company as one document.

There are many different word embedding models and all of them have different structures with both advantages and disadvantages. One of the most important distinctions between the models is that some models create a (vector) representation for each word in the corpus, whereas other models create a representation for each document in the corpus. For the topic modelling we will only use the LDA model.

We will discuss the GloVe model first. After this we will discuss the Word2Vec model, followed by the Doc2Vec model. We conclude the text analysis section by describing the LDA model.

2.2.1 GloVe

In this part we give a brief description of the GloVe model. For all details of the model, we refer to Pennington et al. (2014). From the GloVe model, we get a vector representation of each word in the corpus as output. To obtain these vector representations we do the following. The input is the corpus in which each document consists of the combined tweets of one company, as mentioned before. From this, globally, a word-word co-occurrence matrix X is created. This matrix contains the number of times a word is in the context of another word. With globally, we mean that only one matrix is created with all the words in the corpus. The GloVe model uses matrix factorisation to handle this large amount of text.

A word j is in the context of another word i if it is within ten words of word i; it does not matter whether word j is to the left or to the right of word i. If word j is d words away from word i, then $1/d$ is added to $X_{i,j}$, where $X_{i,j}$ is the (weighted) number of times word j is in the context of word i. This means that a word that is further away gets less weight added in the matrix. The sum over all words in the neighbourhood of i is then defined as $X_i = \sum_{k \neq i} X_{i,k}$, which is the number of times any word is in the neighbourhood of word i.
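To make this weighting scheme concrete, the sketch below builds such a co-occurrence structure for a toy corpus, adding $1/d$ for every pair of words within a ten-word window. This is an illustrative reimplementation of the counting rule, not the text2vec code used in the thesis.

```python
from collections import defaultdict

def cooccurrence_matrix(documents, window=10):
    """Build a symmetric word-word co-occurrence table X[i][j],
    where a pair of words at distance d contributes 1/d."""
    X = defaultdict(lambda: defaultdict(float))
    for doc in documents:
        words = doc.split()
        for pos, w_i in enumerate(words):
            # Look only to the right; add the weight in both directions for symmetry.
            for d in range(1, window + 1):
                if pos + d >= len(words):
                    break
                w_j = words[pos + d]
                X[w_i][w_j] += 1.0 / d
                X[w_j][w_i] += 1.0 / d
    return X

X = cooccurrence_matrix(["the stock price of the company rises"])
print(X["stock"]["price"])    # 1.0: the words are adjacent
print(X["stock"]["company"])  # 0.25: the words are four positions apart
```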

Now that the word-word co-occurrence matrix has been created, the word vector representations need to be made. For word i, there are two word vectors, $w_i, \tilde w_i \in \mathbb{R}^d$. These vectors are essentially the same, but differ because of a random initialisation. Two word vectors are used because of optimisation advantages.

To train the word vectors, the variable $p_{i,j} = P(j|i) = \frac{X_{i,j}}{X_i}$ is introduced, which is the probability that word j is in the context of word i. Pennington et al. (2014) then try to encode meaning in the vector differences. For this, they look at the ratio $\frac{P(k|i)}{P(k|j)}$, since this takes out properties that i and j have in common or properties that do not have anything to do with either i or j. This ratio will thus give us information on a property specific to i if the ratio is much larger than one, or specific to j if the ratio is much smaller than one. To train the word vectors based on these ratios, we need a function F such that:

$$F(w_i, w_j, \tilde w_k) = \frac{p_{i,k}}{p_{j,k}}.$$

Since the model has to say something about the vector differences and the output of the function is a scalar, it can be written as:

$$F\big((w_i - w_j)^T \tilde w_k\big) = \frac{p_{i,k}}{p_{j,k}}.$$

The word vector $w_i$ and the context word vector $\tilde w_i$ need to be interchangeable. This also holds for $X$ and $X^T$, since $X$ is the word-word co-occurrence matrix. To handle these symmetries, Pennington et al. (2014) use the following formula:

$$F\big((w_i - w_j)^T \tilde w_k\big) = \frac{F(w_i^T \tilde w_k)}{F(w_j^T \tilde w_k)} \quad (1)$$

$$\Rightarrow \quad F(w_i^T \tilde w_k) = p_{i,k} = \frac{X_{i,k}}{X_i}. \quad (2)$$

And since the solution to (1) and (2) is that F is the exponential function, we get that:

$$e^{w_i^T \tilde w_k} = \frac{X_{i,k}}{X_i}.$$


Now absorbing $\ln(X_i)$ into a bias $b_i$ for $w_i$, and for symmetry inserting a bias $\tilde b_k$ for $\tilde w_k$, we arrive at the final equation:

$$w_i^T \tilde w_k + \tilde b_k = \ln(X_{i,k}) - b_i. \quad (3)$$

Finally, for GloVe, Pennington et al. (2014) chose to solve (3) by using weighted least squares. This comes down to minimising

$$J = \sum_{i,j=1}^{V} f(X_{i,j}) \left( w_i^T \tilde w_j + b_i + \tilde b_j - \ln(X_{i,j}) \right)^2, \quad (4)$$

where V is the number of words in the corpus and $f(X_{i,j})$ is the weighting function. Pennington et al. (2014) propose the following weighting function:

$$f(x) = \begin{cases} \left(\frac{x}{x_{max}}\right)^{\alpha} & x < x_{max} \\ 1 & x \geq x_{max} \end{cases},$$

where the authors chose $\alpha = 3/4$ and $x_{max} = 100$ for empirical reasons. This function has the advantage that the weight goes to zero if there are no co-occurrences between two words and that the weight is capped at one if there are many more than 100 co-occurrences. In between, the function goes smoothly from 0 to 1. On top of that, it is important to notice that it is a non-decreasing function.

Minimising (4) then gives us the optimal vectors that represent the words in the corpus. This is the output of the GloVe model.

The implementation of this model is done in R, using the text2vec package. This is a package for text analysis and natural language processing that includes the GloVe model. We set the hyperparameters as proposed in the article of Pennington et al. (2014), which means that we set $x_{max}$ equal to 100 and $\alpha$ equal to 3/4. Then we run the GloVe model for different word vector lengths: we run the model for word embedding dimensions of 25, 50, 100, 200 and 300. Each word now has an embedding vector. But we need a representation vector for each document in the corpus, which represents all the tweets of a company. In order to make a representation for each document, we average the word embedding vectors over all the words in the document. This gives us a representation that we can use in our machine learning models to classify companies into good and bad investments.

To conclude this part on GloVe, we take one other step. Since we use word embedding vectors that are trained on the Twitter data, this will probably give us good representations of the words and documents. However, there is not much data to train the text models on. Therefore, we also use pretrained word vectors that are trained on 27 billion tweets. These pretrained word vectors are available on the website of GloVe. Similar to our own trained vectors, we average the word representation vectors over all words in the document to get a document representation that is used to classify the companies.


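A sketch of this averaging step, applied to the pretrained GloVe Twitter vectors loaded through Gensim, is given below. The file name is the one distributed on the GloVe website, the conversion step and exact API may differ across Gensim versions, and company_documents is assumed to be the list of preprocessed tweet documents built earlier.

```python
import numpy as np
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Convert the pretrained GloVe Twitter vectors to word2vec format and load them.
glove2word2vec("glove.twitter.27B.100d.txt", "glove_twitter_100d_w2v.txt")
vectors = KeyedVectors.load_word2vec_format("glove_twitter_100d_w2v.txt")

def document_vector(document, vectors, dim=100):
    """Average the word embedding vectors over all words of a (preprocessed) document."""
    words = [w for w in document.split() if w in vectors]
    if not words:
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

# company_documents: one preprocessed tweet document per company (assumed from earlier steps).
doc_vecs_glove = np.vstack([document_vector(doc, vectors) for doc in company_documents])
```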

2.2.2 Word2Vec

Now we describe the Word2Vec model by Mikolov et al. (2013). This model comes in two different versions, namely the Continuous Bag of Words (CBOW) model and the Skip Gram (SG) model. Both models rely on an artificial neural network and capture representations of words in the corpus. However, the objective of the CBOW model is different from the objective of the SG model. The CBOW model uses a few words, say $(w_1, w_2, w_3)$, as input to predict which word in the corpus has the highest probability of being in the context of $(w_1, w_2, w_3)$, given $(w_1, w_2, w_3)$. The SG model works precisely the other way around: it predicts context words given an input word. In Figure 1 the structures of the SG and the CBOW model are shown.

During the optimisation of these models, the word representation vectors are adapted, and this makes these models well suited for our purpose. In other words, we use this model for the word representation vectors of each word in the corpus that are created during the optimisation. The question of which of these models is better to use for our research is rather a hard one. From the experience of Mikolov (2013), the CBOW model has the advantage of a faster training time and the SG model has the advantage of representing rare words or phrases better. Since the Twitter data of companies might involve some infrequent words, we choose to use the SG model here.

Figure 1: A schematic representation of the SG and the CBOW model. Figure reprinted from Mikolov et al. (2013).

The SG model is a feedforward neural network with one hidden layer that optimises its weights through backpropagation. An elaborate explanation of this model can be found in Rong (2014), and in this thesis we follow that notation for the largest part. The input of this neural network is a one-hot vector $x_{w_I}$ for input word $w_I$, which is the word for which we will calculate the most probable context words. Each word w in the vocabulary has its own one-hot vector $x_w \in \mathbb{R}^V$, where V is the number of words in the vocabulary. This vector consists of only zeros except for one element k that has the value one.

The next step is the hidden layer. In this layer, the output is $h = (x_{w_I})^T W$, where $W \in \mathbb{R}^{V \times N}$ is the weight matrix from the input layer to the hidden layer. Because of the nature of the input vector, the output of the hidden layer is the k-th row of $W$, which we call $v_{w_I}$, following Rong (2014). From this we take the step to the output layer, which consists of C context word vectors $y_c$, $c = 1, \ldots, C$, that contain the probabilities of each word $j = 1, \ldots, V$ being in the context of $w_I$. So C is the number of context words we try to predict. We then need the probability of the j-th word in context word vector c, $y_{c,j}$, to be equal to $P(w_{c,j} = w_{O,c} \mid w_I)$, and we find the words for which this probability is the highest, where $w_{O,c}$ is the true c-th word in the context of $w_I$. These probabilities are defined as:

$$y_{c,j} = P(w_{c,j} = w_{O,c} \mid w_I) = \frac{e^{u_{c,j}}}{\sum_{j'=1}^{V} e^{u_{c,j'}}}.$$

Instead of the linear activation function of the hidden layer, we now have a softmax activation function for the output layer, where $u_{c,j}$ is the input of the j-th unit (so of word j) in the c-th context word vector. Now this input $u_{c,j}$ is equal for every context word c, since the weight matrix $W'$ from the hidden layer to the output layer is the same for every output in the output layer. Therefore, we get that $u_{c,j} = u_j = (v'_{w_j})^T v_{w_I}$, for $c = 1, \ldots, C$, which is the dot product of the j-th column of $W'$, denoted $v'_{w_j}$, and the hidden layer output h.

After this, we define the loss function. We must maximise the probability of the true c-th output word for every context word $c = 1, \ldots, C$, which is $P(w_{1,j_1^*} = w_{O,1};\ w_{2,j_2^*} = w_{O,2};\ \ldots;\ w_{C,j_C^*} = w_{O,C} \mid w_I)$, where $j_c^*$ is the index of the true word in the context of $w_I$ at place c, for $c = 1, \ldots, C$. Now we have:

$$P(w_{1,j_1^*} = w_{O,1};\ w_{2,j_2^*} = w_{O,2};\ \ldots;\ w_{C,j_C^*} = w_{O,C} \mid w_I) = P(w_{1,j_1^*} = w_{O,1} \mid w_I)\, P(w_{2,j_2^*} = w_{O,2} \mid w_I) \cdots P(w_{C,j_C^*} = w_{O,C} \mid w_I)$$

$$= \prod_{c=1}^{C} P(w_{c,j_c^*} = w_{O,c} \mid w_I) = \prod_{c=1}^{C} \frac{e^{u_{c,j_c^*}}}{\sum_{j'=1}^{V} e^{u_{j'}}},$$

where the first equality holds since the events are independent of each other. Then we want to minimise the loss function

$$E = -\ln \left( \prod_{c=1}^{C} \frac{e^{u_{c,j_c^*}}}{\sum_{j'=1}^{V} e^{u_{j'}}} \right) = -\sum_{c=1}^{C} u_{c,j_c^*} + C \cdot \ln \left( \sum_{j'=1}^{V} e^{u_{j'}} \right).$$

The optimisation of the weights is done with backpropagation (Rumelhart, Hinton, & Williams, 1986). This goes as follows. First we look at $W'$. We want to minimise E, so we take the derivative of E with respect to $w'_{i,j}$, $\forall i, j$, and want this derivative to be equal to zero:

$$\frac{\partial E}{\partial w'_{i,j}} = \sum_{c=1}^{C} \frac{\partial E}{\partial u_{c,j}} \frac{\partial u_{c,j}}{\partial w'_{i,j}} = \sum_{c=1}^{C} \frac{\partial E}{\partial u_{c,j}} h_i = \sum_{c=1}^{C} (y_{c,j} - 1_{j=j_c^*}) \cdot h_i = EI_j \cdot h_i,$$

where $EI_j = \sum_{c=1}^{C} (y_{c,j} - 1_{j=j_c^*})$, $\forall j \in \{1, \ldots, V\}$, is the sum of the prediction errors for every word in the context. Secondly, we want to adjust the weights W, so we have:

$$\frac{\partial E}{\partial w_{k,i}} = \frac{\partial E}{\partial h_i} \frac{\partial h_i}{\partial w_{k,i}} = \sum_{j=1}^{V} \frac{\partial E}{\partial u_j} \frac{\partial u_j}{\partial h_i} x_{w_I,k} = \sum_{j=1}^{V} EI_j \cdot w'_{i,j} \cdot x_{w_I,k} = EH_i \cdot x_{w_I,k},$$

where $h_i = \sum_{j=1}^{V} x_{w_I,j} w_{j,i}$ is used in the second equality and where $EH_i = \sum_{j=1}^{V} EI_j \cdot w'_{i,j}$, $\forall i \in \{1, \ldots, N\}$. Then for the optimisation, we use stochastic gradient descent:

$$w'^{(new)}_{i,j} = w'^{(old)}_{i,j} - \eta \cdot \frac{\partial E}{\partial w'_{i,j}} = w'^{(old)}_{i,j} - \eta \cdot EI_j \cdot h_i,$$

$$w^{(new)}_{i,j} = w^{(old)}_{i,j} - \eta \cdot \frac{\partial E}{\partial w_{i,j}} = w^{(old)}_{i,j} - \eta \cdot EH_i \cdot x_{w_I,k},$$

where $\eta > 0$ is the learning rate. But since only one entry of $x_{w_I}$ is non-zero, namely the k-th entry which is equal to 1, we do not need the whole matrix W, but only its k-th row, $v_{w_I}$. Then we get:

$$v^{(new)}_{w_I} = v^{(old)}_{w_I} - \eta \cdot EH,$$

where $EH = (EH_1, \ldots, EH_N)$. The intuition behind the update function of $W'$ is that for each word in the vocabulary we check the sum of the prediction errors over the context words, $(y_{c,j} - 1_{j=j_c^*},\ c = 1, \ldots, C)$. If this sum is positive, we overestimate the probability of this word, which means we subtract $\eta \cdot h_i$ from the weights of this word. If this prediction error is negative, we underestimate the probability of this word being in the context of $w_I$, and then we add $\eta \cdot h_i$ to the weights of this word. For the update function of W, if we underestimate the probability, we add a weighted part of all output vectors $w'$ to the input vector w, and if we overestimate, we subtract this weighted part of all output vectors $w'$ from the input vector w. The key aspect of this is that every time the probability of a word $w_j$ being in the context of $w_I$ is overestimated, its output vector will shift away from the input vector of $w_I$, and vice versa.

Finally, this needs to be done for every word in the corpus and therefore, we can state the objective function that we have to maximise as:

$$J = \sum_{j=1}^{V} \sum_{c=1}^{C} P(w_{c,j_c^*} = w_{O,c} \mid w_I). \quad (5)$$

The training of the Word2Vec model is done in Python, for which we use the Gensim library of Řehůřek & Sojka (2010). Gensim is a library with many text analysis implementations, including Word2Vec. We train the word representations with a context of ten words, i.e. five words before and five words after the word concerned. Furthermore, we only use words that occur at least five times in the complete corpus. Similar to the GloVe word representations, we run the Word2Vec model for several dimensions: we run the model for vector dimensions of 25, 50, 100, 200 and 300. Furthermore, the document embedding vectors are obtained by averaging the word embedding vectors over all the words of the document.
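A minimal sketch of this training step with Gensim is shown below. The parameter names follow Gensim 4 (older versions use size instead of vector_size), and the tokenised corpus is assumed to come from the preprocessing step.

```python
import numpy as np
from gensim.models import Word2Vec

# tokenised_docs: list of token lists, one per company (assumed from the preprocessing step).
tokenised_docs = [doc.split() for doc in company_documents]

# Skip-gram (sg=1), five context words on each side, minimum word count of five.
w2v = Word2Vec(sentences=tokenised_docs, vector_size=100, window=5,
               min_count=5, sg=1, epochs=20, seed=1)

def average_vector(tokens, model):
    """Document embedding: average of the word vectors of all known words."""
    known = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(known, axis=0) if known else np.zeros(model.vector_size)

doc_vecs_w2v = np.vstack([average_vector(tokens, w2v) for tokens in tokenised_docs])
```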


Again, apart from the word representation vectors trained on our own text data, we use pretrained word vectors. These vectors are trained on 100 billion words from Google News and can be downloaded from Google Drive. These word embeddings are averaged as well to give us document embeddings. The self-trained and the pretrained document vectors will then be used as input in the classification models.

2.2.3 Paragraph Vector

The Paragraph Vector models, by Le & Mikolov (2014) are an extension of the Word2Vec models from Mikolov et al. (2013). Instead of only looking at words in their contexts, the Paragraph Vector models take the sentence, paragraph or even a whole document into account as well. Therefore the models are usually referred to as Doc2Vec models. The Paragraph Vectors include two models, namely the Distributed Memory (PV-DM) model and the Distributed Bag of Words (PV-DBOW) model.

The aim of the Doc2Vec models is to predict words that are in the context of a paragraph, given a randomly sampled word in this paragraph, which is the PV-DBOW model, or to predict the probability of a word given the paragraph and a few context words, which is the PV-DM model. As we can see, the extension to the Word2Vec models, is the information about the paragraph. This information is captured by a paragraph vector, which is essentially the same as a word vector, and which is unique for every paragraph. For our research, we are not interested in the aim of the model, but in the by-product: the paragraph vector. This vector is adapted during optimisation of the aim of this model, and we will use the final paragraph vector.

The PV-DM model can be compared to the CBOW model in Word2Vec, and the PV-DBOW model can be compared to the SG model. Since Le & Mikolov (2014) find that the PV-DM model gives better results than the PV-DBOW model, we continue to use PV-DM, which is schematically shown in Figure 2. In this section we will not describe the full PV-DM model, but we will briefly discuss the differences between the CBOW model and the SG model from Word2Vec and then give the extension of the CBOW model, which gives us the PV-DM model.

Figure 2: A schematic representation of the PV-DM model. Figure reprinted from Le & Mikolov (2014).


The CBOW model differs from the SG model in the sense that the input of this model consists of multiple words. Therefore, we get an output of the hidden layer that is equal to the average of the input word vector columns, $h = \frac{1}{C}(v_{w_1} + v_{w_2} + \cdots + v_{w_C})$. Then we want to find the most probable word in the context of these C words, so we maximise $P(w_O \mid w_{I,1}, w_{I,2}, \ldots, w_{I,C})$. This probability is then defined as a softmax function:

$$P(w_O \mid w_{I,1}, w_{I,2}, \ldots, w_{I,C}) = \frac{e^{u_{j^*}}}{\sum_{j'=1}^{V} e^{u_{j'}}},$$

where $u_j$ is defined as $v'_{w_j} \cdot h$ and $j^*$ is the index of the true word in the context. Now the extension of the CBOW model that gives us the PV-DM model is found in this softmax function, and more specifically in the hidden layer output h. The extension is the (already mentioned) paragraph vector d: a unique vector assigned to each group of words that we label 'a paragraph', such as a sentence, a paragraph, or a whole document. All paragraph vectors are stored in a paragraph matrix D. We can consider this vector as an extra word vector. The paragraph vector is the same for every word from the same paragraph and the word vector of a unique word is the same throughout the whole corpus.

Then there is another adaptation we need to mention. In the CBOW model, the output h was the average of the word vectors of the input words. In the PV-DM model, there are two options. The first option is to average the input word vectors and the paragraph vector. The second option is to concatenate the word vectors and the paragraph vector. The latter is said to produce better results (Le & Mikolov, 2014), but we use the averaging option, as the concatenating option consumes too much memory.

Finally, the optimisation is equal to the optimisation in the Word2Vec models: via backpropagation we optimise the weights with stochastic gradient descent. When the model is trained for every word in the corpus, we have the optimal paragraph vectors. The training of this text model is done with the Gensim library (Řehůřek & Sojka, 2010). For this model we chose to use only words that appear twice or more in the corpus. As said, we use averaging of the context vectors in the output of the hidden layer. And as in the Word2Vec model, we use a context of ten words.

This model gives us the optimal document embedding vectors, but next to these vectors it also trains the word embedding vectors. Therefore, we take two different outputs from this text model. First, we use the document embedding vectors. Second, we average the word embedding vectors over all words in each document, like we did for the Word2Vec model, to get a document representation. We create document representations in both ways for different vector lengths, namely for lengths of 25, 50, 100, 200 and 300. Since each paragraph or document is different, it is not possible to use pretrained document embedding vectors for the Paragraph Vector model. The final document embedding vectors are used as input in the classification models to predict the performance of the companies.
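The Doc2Vec training can be sketched with Gensim as below: dm=1 selects the PV-DM variant and dm_mean=1 averages rather than concatenates the context vectors, matching the choices above. The tag names and the reuse of tokenised_docs and average_vector from the previous sketch are assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One TaggedDocument per company, tagged with its index.
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(tokenised_docs)]

# PV-DM with averaged context vectors, five context words on each side,
# words kept if they occur twice or more in the corpus.
d2v = Doc2Vec(tagged, vector_size=100, window=5, min_count=2,
              dm=1, dm_mean=1, epochs=20, seed=1)

# First output: the learned paragraph (document) vectors, one per company.
doc_vecs_pv = [d2v.dv[i] for i in range(len(tagged))]

# Second output: averaged word vectors, as for Word2Vec.
doc_vecs_pv_avg = [average_vector(tokens, d2v) for tokens in tokenised_docs]
```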


2.2.4 Latent Dirichlet Allocation

The final text analysis model we are using is the LDA model by Blei et al. (2003). This model is very different from the previous models, since its aim is to find the underlying, latent, topics of the documents in the corpus. The model therefore does not focus on the word or document representations as the previous models did. However, we can see the topic distribution of a document as a document representation as well. The LDA model uses Bayesian statistics to find these topic distributions. Blei et al. (2003) used variational methods to approximate the posterior, but nowadays Gibbs sampling (Griffiths & Steyvers, 2004) is used more often. See Heinrich (2005) for a clear explanation of LDA including Gibbs sampling. Here, we give a brief idea of how LDA works.

The LDA model is in fact a generative model. This means that it describes a process of producing words to make a document. In the corpus, there are K topics modelled. First, for each topic $k \in \{1, 2, \ldots, K\}$, a vector $\phi_k$ of words in topic k is sampled from a Dirichlet distribution: $\phi_k \sim \mathrm{Dir}(\beta)$. Then we create a corpus of M documents $(\mathbf{w}_1, \ldots, \mathbf{w}_M)$. For each document m a number $N_m$ is drawn, where $N_m \sim \mathrm{Poisson}(\xi)$; $N_m$ is then the number of words in document m. Then for every document a topic distribution $\theta_m$ is sampled from $\mathrm{Dir}(\alpha)$. Finally, for each document m the words $(w_{m,1}, \ldots, w_{m,N_m})$ are created in two steps. In the first step, for word $w_{m,n}$ in document m a topic $z_{m,n}$ is drawn, such that $z_{m,n} \sim \mathrm{Multinomial}(\theta_m)$ with the number of trials equal to one. In the second step, a word $w_{m,n}$ is chosen from topic $z_{m,n}$ by drawing $w_{m,n} \sim \mathrm{Multinomial}(\phi_{z_{m,n}})$, again with the number of trials equal to one. If this is done for every word in every document, we have our corpus. The procedure is shown in Figure 3 as a graphical model; $\alpha$ and $\beta$ are hyperparameters.

Figure 3: The graphical model of LDA. Figure reprinted from Heinrich (2005).

However, we already have a corpus with documents filled with words. What we want is to get the topics of document m. Therefore, we need to use the LDA model backwards. The assumption is then that a corpus is created in the way just described, and we want to learn $P(z \mid \mathbf{w})$, which is our $\theta$. But since

$$P(z \mid \mathbf{w}, \alpha, \beta) = \frac{P(z, \mathbf{w} \mid \alpha, \beta)}{P(\mathbf{w} \mid \alpha, \beta)},$$

and $P(\mathbf{w} \mid \alpha, \beta)$ is computationally intractable (Blei et al., 2003), Gibbs sampling is used to find $P(z_i = k)$, the probability that k is a topic of document i.

The algorithm for Gibbs sampling uses the conditional probabilities $P(z_i \mid z_{-i}, \mathbf{w})$, $\forall i = 1, \ldots, W$, to simulate $P(z \mid \mathbf{w})$ with a Markov chain, where the subscript $-i$ stands for all $j \neq i$ and where W is the number of words in the total corpus. The algorithm then performs a Monte Carlo simulation over these conditional probabilities until they converge. For this, we need a formula for $P(z_i \mid z_{-i}, \mathbf{w})$. There are several ways of deriving such a formula, each differing in its starting point. For the optimisation procedure of LDA, we use, from the Gensim library (Řehůřek & Sojka, 2010), the LDA function from MALLET (McCallum, 2002). This function derives the conditional probabilities with $P(z_i \mid z_{-i}, \mathbf{w}) \propto P(w_i \mid \mathbf{w}_{-i}, z) P(z_i \mid z_{-i})$ as starting point. The derivation with this starting point is described in Griffiths (2002). From this we get the following equation:

$$P(z_i = j \mid z_{-i}, \mathbf{w}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n_{-i,j} + W\beta} \cdot \frac{n^{(m)}_{-i,j} + \alpha}{n^{(m)}_{-i} + K\alpha}, \quad (6)$$

where $n^{(w_i)}_{-i,j}$ is the number of times word $w_i$ is assigned to topic j, excluding this particular word. For clarification, the word "the" is probably mentioned multiple times in a corpus; $n^{(w_i)}_{-i,j}$ then counts the number of times this word is assigned to topic j, except for the "the" in the current iteration. Then $n_{-i,j}$ is the total number of words in the corpus assigned to topic j, excluding this particular word i, $n^{(m)}_{-i,j}$ is the number of words in document m assigned to topic j, excluding word i, and finally $n^{(m)}_{-i}$ is the total number of words in document m, excluding word i.

For the derivation of (6), see Appendix A. For the algorithm, the values of $z_m$ are initialised randomly as a number between 1 and K. Then the Monte Carlo iterations begin. In each iteration the following steps are taken: for each word in the corpus and for each topic, the value of (6) is calculated and stored as the new sample of z; then the counting variables are updated and the algorithm goes to the next iteration. In the Monte Carlo algorithm, the first iterations are discarded to avoid correlation with the initial values, and samples are taken only once every few iterations to avoid a high autocorrelation.

The result is a topic distribution for each document. The number of topics we choose to distinguish is the dimension of the vector that represents the document. Where in the previous text models these vectors represented meanings of words, we now have a distribution of topics. We run the LDA model for multiple numbers of topics, namely for 5, 10, 15 and so on up to 100 topics. Obviously, for LDA it is not possible to use pretrained vectors as we do for GloVe and Word2Vec, since the topic distribution is different for every document. The topic distributions will be used as explanatory variables in the classification models in the next subsection.
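A sketch of obtaining these topic distributions with Gensim is given below. The thesis uses Gensim's wrapper around MALLET, which requires a local MALLET installation and is only available in older Gensim versions, so Gensim's own LdaModel is shown here as a stand-in; the number of topics and passes are illustrative choices.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Bag-of-words corpus from the tokenised company documents (assumed from earlier steps).
dictionary = Dictionary(tokenised_docs)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenised_docs]

num_topics = 20  # the thesis varies this from 5 up to 100 topics
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=num_topics,
               passes=10, random_state=1)

def topic_vector(bow, model, k):
    """Dense topic distribution of one document, padded to length k."""
    dist = dict(model.get_document_topics(bow, minimum_probability=0.0))
    return [dist.get(topic, 0.0) for topic in range(k)]

doc_topics = [topic_vector(bow, lda, num_topics) for bow in bow_corpus]
```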


2.3 Prediction Models

In this subsection we will address the prediction models that we use for classifying the companies into good and bad investments. We use three different models for prediction that are all supervised learning algorithms: logistic regression, naive Bayes and, in the third place, a random forest. The implementation is done using Scikit-learn (Pedregosa et al., 2011), a Python library that includes a large number of machine learning models, including the three models we use. The predictions take the form of a 0/1 classification, where 0 stands for a bad investment and 1 stands for a good investment. For this we introduce the classification variable for the three-month performance of companies:

$$y = \begin{cases} 1, & \text{if the return of this company over the period 9 January - 9 April} > 0 \\ 0, & \text{otherwise} \end{cases}$$

2.3.1 Logistic Regression

The first classification model is the logistic regression, or the logit model. This model uses the explanatory variables to predict the probability of the dependent variable being equal to one (see Section 3 for the explanatory variables). For this prediction it uses a sigmoid function, namely the logistic function. This is an S-curved function that is defined as:

$$\sigma(t) = \frac{1}{1 + e^{-t}}.$$

This function returns a probability of 0.5 or higher if $t \geq 0$. To remind us, we have a vector X of explanatory variables $x_i$ and a dependent variable y, which is the classification. If we fill in the linear function of explanatory variables $\alpha + \beta X$, with $\alpha$ a constant, we get the following logit function:

$$P(y = 1 \mid X) = \frac{1}{1 + e^{-\alpha - \beta X}}.$$

Then using the property of the logit function, we get a classification of a good investment when α + βX > 0. The basic assumption of this model is that we do a linear regression on the logarithm of the so-called odds. The odds are defined as:

$$\text{odds} = \frac{P(y=1 \mid X)}{P(y=0 \mid X)} = \frac{\frac{1}{1 + e^{-\alpha - \beta X}}}{\frac{e^{-\alpha - \beta X}}{1 + e^{-\alpha - \beta X}}} = e^{\alpha + \beta X},$$

so $\log(\text{odds}) = \alpha + \beta X$. This gives us a better understanding of why we classify an investment as good when $\alpha + \beta X > 0$. The advantage of this model is that it is simple and a good starting point. It can also give us more information about the weights of the explanatory variables than other classification models. However, the logit model does suffer from overfitting; especially when there are few observations, this might be a problem for our prediction.
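As an illustration, a logistic regression on these features can be fitted with Scikit-learn as sketched below. The feature matrix X (document embeddings plus Twitter statistics) and target y are assumed outputs of the earlier steps, and the simple train/test split is a simplification of the pipeline described in Section 2.4.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# X, y: assumed to be built from the document embeddings, Twitter statistics
# and the three-month return indicator defined above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

scaler = StandardScaler().fit(X_train)
logit = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

probs = logit.predict_proba(scaler.transform(X_test))[:, 1]
print("Logit AUROC:", roc_auc_score(y_test, probs))
```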


2.3.2 Naive Bayes

The second model we use is naive Bayes. This model has a different starting point than the logistic regression. The most important assumption of naive Bayes is that the explanatory variables are all independent of each other. This is a drawback of the model and also the reason it is called naive. In Section 3 the explanatory variables are described extensively, but keeping in mind that two of them are the number of followers of a company and the number of favourites of this company, we can already see that the naive assumption does not hold. However, naive Bayes also has advantages. It works well with very little data, and it is a fast algorithm because of the simple calculations involved. Finally, in spite of the independence assumption, the algorithm also performs well with dependent variables (Zhang, 2004).

The naive assumption implies that
\[
P(X \mid y) = \prod_i P(x_i \mid y), \qquad \forall i \in \{1, 2, \ldots, n\},
\]
and that
\[
P(X) = \sum_i P(x_i), \qquad \forall i \in \{1, 2, \ldots, n\},
\]

where $n$ is the number of explanatory variables. We want to obtain the estimate $\hat{y} = \arg\max_y P(y \mid X)$. For this we need Bayes' rule:
\[
P(y \mid X) = \frac{P(y)\, P(X \mid y)}{P(X)} = \frac{P(y) \prod_i P(x_i \mid y)}{\sum_i P(x_i)}.
\]
This gives us the optimisation function of naive Bayes:
\[
\hat{y} = \arg\max_y \frac{P(y) \prod_i P(x_i \mid y)}{\sum_i P(x_i)} = \arg\max_y P(y) \prod_i P(x_i \mid y).
\]

Now $P(y)$ is simply the proportion of observations with that value of $y$ in the training set (see Subsection 2.4 for how we define our training set). For $P(x_i \mid y)$ we use the Gaussian distribution, so we have:

\[
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\, e^{-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}}, \qquad \forall i \in \{1, 2, \ldots, n\},
\]
where $\sigma_y$ and $\mu_y$ are estimated with maximum likelihood.
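A minimal scikit-learn sketch of Gaussian naive Bayes on synthetic data; the features and labels are illustrative stand-ins, not the thesis data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))                                  # stand-in explanatory variables
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=60) > 0).astype(int)

nb = GaussianNB()
nb.fit(X, y)
print(nb.theta_)                                              # per-class feature means mu_y
print(nb.predict_proba(X[:3]))                                # posterior class probabilities
```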

2.3.3 Random Forest

The third algorithm for classification is the random forest, introduced by Breiman (2001). A random forest consists of many decision trees. Such a decision tree is a graph consisting of nodes that carry a condition. We put an observation into the root node and, based on the condition in this node, it goes to the left or to the right child of the node, where it finds another condition. This continues until we reach a leaf of the tree, which is then the classification of the observation. The conditions, or the different levels in the tree, are based on the explanatory variables. Now let $Q$ denote the set of observations that arrive in node $m$. The conditions are trained as follows. For every combination of feature $i$ and threshold $t_m$ in node $m$, we have the following subsets:

\[
Q_{\text{left}}(\theta) = \{(x, y) \mid x_i \leq t_m\}, \qquad Q_{\text{right}}(\theta) = \{(x, y) \mid x_i > t_m\},
\]

where $\theta = (i, t_m)$. Now the decision is learned by:

\[
\hat{\theta} = \arg\min_\theta G(Q, \theta)
= \arg\min_\theta \left( \frac{n_{\text{left}}}{N_m} H\!\left(Q_{\text{left}}(\theta)\right) + \frac{n_{\text{right}}}{N_m} H\!\left(Q_{\text{right}}(\theta)\right) \right),
\]

where $N_m = |Q|$ and $n_{\text{left}}$, $n_{\text{right}}$ denote the number of observations that go to the left and right child of node $m$ respectively. $H(\cdot)$ is the Gini impurity function:

\[
H(X_m) = p_{m,0}(1 - p_{m,0}) + p_{m,1}(1 - p_{m,1}),
\]
where $p_{m,k} = \frac{n_{m,k}}{N_m}$ is the proportion of observations in node $m$ that are truly from class $k \in \{0, 1\}$ and $n_{m,k}$ is the number of observations in node $m$ from class $k$. This training of conditions is then repeated for both the left and the right subset of node $m$, and so on recursively, until the number of observations in the final node is equal to one.
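A minimal sketch of this split search, computing the Gini-weighted criterion $G(Q, \theta)$ by brute force over features and thresholds; it is illustrative only, as scikit-learn's trees implement the same idea far more efficiently.

```python
import numpy as np

def gini(y):
    # Gini impurity for binary labels: p0*(1-p0) + p1*(1-p1)
    p1 = y.mean()
    p0 = 1 - p1
    return p0 * (1 - p0) + p1 * (1 - p1)

def best_split(X, y):
    # Exhaustive search over features i and thresholds t_m, minimising the
    # weighted impurity G(Q, theta) from the text.
    n, d = X.shape
    best = (None, None, np.inf)
    for i in range(d):
        for t in np.unique(X[:, i]):
            left = X[:, i] <= t
            right = ~left
            if left.all() or right.all():
                continue
            g = left.mean() * gini(y[left]) + right.mean() * gini(y[right])
            if g < best[2]:
                best = (i, t, g)
    return best                              # (feature index, threshold, criterion value)

X = np.array([[2.0, 7.1], [1.0, 3.5], [3.0, 9.9], [0.5, 4.2]])
y = np.array([1, 0, 1, 0])
print(best_split(X, y))                      # e.g. split on feature 0 at threshold 1.0
```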

As mentioned earlier, the random forest algorithm is a combination of many decision trees. Random forest classification is a bagging technique, which means that it uses bootstrap aggregation for optimisation: B samples are drawn with replacement from the training data and a decision tree is trained on each sample. Furthermore, a random subset of the features is drawn from the full set of features in every iteration. At the end, the probabilistic predictions of the B classifiers are averaged to obtain the final classifier. The disadvantage of these sampling procedures is that they increase the bias; however, the reduction in variance compensates for this, and the overfitting problem of a single decision tree is thus avoided.
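A minimal scikit-learn sketch of a random forest on synthetic data; bootstrap aggregation over the B trees is the scikit-learn default, and the hyperparameter values shown are illustrative, not the tuned values of this thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))                                 # stand-in features
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=120) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50,                  # B trees in the forest
                            max_features="sqrt",              # random feature subset per split
                            bootstrap=True,                   # samples drawn with replacement
                            random_state=0)
rf.fit(X, y)
print(rf.predict_proba(X[:3])[:, 1])                          # averaged probabilistic predictions
print(rf.feature_importances_)                                # reused later in the RFE step
```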

2.4 The Pipeline

In this final subsection we bring all the models together and describe the process through which we arrive at our classification models. The input data here consists of the output of our text models and the non-textual data that has not been described yet but can be found in Section 3, such as the number of Twitter followers of a company.

Apart from the machine learning models, Scikit Learn is very useful for other tasks, such as model selection, feature selection and creating a clear overview of the steps to perform in the complete model. This last part can be done with the Pipeline function. A pipeline is a chain of steps, including the classification model, that can be treated as a classification model as a whole. Other steps in the pipeline can be the transformation of data, the selection of features, and so on. This has many advantages. For example, since the pipeline behaves as an estimator, we can do the fitting and predicting directly on the pipeline, so no information can be lost between the steps. Another advantage is that we can simultaneously do a grid search over the parameters of every part of the pipeline. Moreover, cross-validation can be done over the complete pipeline, instead of only on the final estimator.

In order to use the pipeline, we first divide the data into randomly chosen test and train sets. The train set is used to fit the models and the test set is used to compare the predictions of our models with the actual outcomes. The train set is then fed into our pipeline, which includes three steps: scaling of the data, feature selection, and the prediction model. Since we have three different prediction models, we build three different pipelines.

According to Hastie, Tibshirani, Friedman, & Franklin (2004, p. 63), it is common practice to standardise the data for the logistic regression model. Therefore we first apply the StandardScaler function from Scikit Learn, which corrects the variables for their mean and variance. Although standardisation is not necessary for the naive Bayes and random forest algorithms, it is not harmful either. The mean and variance are stored, so that when we apply our classification models to the test data, the test data is corrected with the training mean and variance.

The next step in the pipeline is recursive feature elimination (RFE) with cross-validation. Since we have many features and relatively little data, we want to reduce the number of features we put into the model to prevent overfitting. The RFE trains a classification model on the complete set of features and removes the feature with the least importance. It then trains the model again on the reduced set of features, and so on recursively, until the best number of features is left. Cross-validation is used to determine how many features to keep. The model used for training in the RFE depends on the classification model: for logistic regression and the random forest we use those same models in the RFE as well. However, it is not possible to use naive Bayes in the RFE, since it does not return feature importances or coefficients. Therefore we use the random forest algorithm for training in the RFE in the naive Bayes pipeline.
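A minimal sketch of the naive Bayes pipeline described here, using scikit-learn's Pipeline, StandardScaler and RFECV on stand-in data; the step names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=120, n_features=20, random_state=0)  # stand-in data

# Naive Bayes has no feature importances, so the RFE step is driven by a random forest.
nb_pipeline = Pipeline([
    ("scale", StandardScaler()),                                  # store mean and variance
    ("rfe", RFECV(RandomForestClassifier(random_state=0), cv=3)), # recursive feature elimination
    ("clf", GaussianNB()),                                        # final classifier
])
nb_pipeline.fit(X, y)
print(nb_pipeline.predict_proba(X[:3])[:, 1])
```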

The third step in the pipeline is the classification model itself. There are still many hyperparameters that may lead to better or worse models when they are changed. To find the best hyperparameters we perform a grid search over a range of possible values. Next to this grid search, cross-validation is performed over the train and test sets, where the folds are based on a cross-section of the companies. The range of the grid search is specific to each part of the pipeline and to each classification model, and in addition we vary the number of cross-validation folds. We elucidate these ranges now.

For the logistic regression, the first hyperparameter is the regularisation strength, for which we take a grid of $10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1$. For the penalty we search between a penalty using the Euclidean norm and a penalty using the taxicab norm. For the RFE we try a k-fold cross-validation, where $k = 2, 3, 4, 5$. For the naive Bayes model, we can only search over the grid of k-fold cross-validations; for this search we again use $k = 2, 3, 4, 5$. For the third algorithm, the random forest, there is one more hyperparameter to search through. First we search for the best number of trees in the forest, varying this number from 10 up to 100 in steps of 5. Then, once more, we use for the RFE the same types of cross-validation as in the logistic regression and naive Bayes RFE.
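A minimal sketch of how such a grid could be passed to scikit-learn's GridSearchCV for the logistic regression pipeline. Note that scikit-learn's C parameter is the inverse regularisation strength, and the data and step names are stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, n_features=20, random_state=0)  # stand-in data

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rfe", RFECV(LogisticRegression(max_iter=1000))),
    ("clf", LogisticRegression(solver="liblinear")),     # liblinear supports l1 and l2
])
param_grid = {
    "clf__C": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1],         # grid from the text (C = inverse strength)
    "clf__penalty": ["l2", "l1"],                        # Euclidean vs. taxicab norm
    "rfe__cv": [2, 3, 4, 5],                             # k-fold CV inside the RFE
}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```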

Finally, we perform this grid search with cross-validation. The cross-validation used here is similar to the one used for the RFE: we fit the models with k-fold cross-validation for $k = 2, 3, 4, 5$. The output of this grid search is the best model with its predictions for the performance of the companies. With this final step our classification model is complete. What remains is to distinguish between the models based on their performance.

input : Twitter data and non-textual data (see Section 3)
output: Classification of company and AUROC of the models

Split data into test and training set;
for every text model do
    for every dimension of the text model output do
        for each machine learning model do
            for each number k of the k-fold cross-validations do
                for every fold do
                    for all values of the hyperparameters do
                        Scale data;
                        Eliminate features recursively;
                        Train classification model;
                    end
                end
                Predict a classification using the test set;
                Store AUROC of this model;
            end
        end
    end
end

Algorithm 1: Process of obtaining the results of the classification models.

We measure the predictive performance of the models using the area under the receiver operating characteristic curve (AUROC). This is, as the name suggests, the area under the ROC curve, which plots the true positive rate against the false positive rate for varying thresholds. The thresholds are applied to the probability estimates to decide whether an observation is classified as a one or as a zero; for example, with a threshold of 0.5, all observations with a probability estimate of 0.5 or higher are classified as a one. The AUROC ranges from zero to one. An AUROC of 0.5 means that the model predicts as well as a random model, such as flipping a coin to make a classification, while an AUROC of one means that the model predicts perfectly.
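A minimal sketch of computing the ROC curve and AUROC with scikit-learn, using made-up labels and probability estimates.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_test = np.array([1, 0, 1, 1, 0, 0])                        # made-up true classifications
probs = np.array([0.81, 0.42, 0.66, 0.55, 0.48, 0.12])       # made-up probability estimates

fpr, tpr, thresholds = roc_curve(y_test, probs)              # one (FPR, TPR) point per threshold
auroc = roc_auc_score(y_test, probs)
print(auroc)                                                 # 0.5 = coin flip, 1.0 = perfect
```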

In order to compare the AUROCs, we use the DeLong test (DeLong, DeLong, & Clarke-Pearson, 1988). This is a non-parametric test that compares the AUROCs of correlated ROC curves. For computational reasons, the fast implementation of the DeLong test by Sun & Xu (2014) is used. We also use a paired t-test to compare the average AUROCs of the classification models and of the text analysis models.
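A minimal sketch of the paired t-test on two sets of AUROCs using SciPy; the AUROC values are made up, and the fast DeLong implementation itself is not reproduced here.

```python
import numpy as np
from scipy.stats import ttest_rel

auroc_model_a = np.array([0.61, 0.64, 0.59, 0.66, 0.63])     # illustrative paired AUROCs
auroc_model_b = np.array([0.57, 0.60, 0.58, 0.62, 0.59])

t_stat, p_value = ttest_rel(auroc_model_a, auroc_model_b)    # paired t-test on the averages
print(t_stat, p_value)
```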

To conclude the part on the Twitter-based models, Algorithm 1 gives an overview of the steps taken from the Twitter input up to the classification. To clarify the first for-loop: the text models include the four text models used and their pre-trained alternatives, as mentioned in Subsection 2.2.

2.5 Benchmark Model

Finally, all the Twitter-based models are compared to a benchmark model. This benchmark model does not use any Twitter data, which makes it possible to see whether the Twitter models are of added value. The benchmark is a logistic regression model that classifies the companies into good and bad investments (see Section 2.3.1 for an explanation of logistic regression). For the classification we use a lag of the quarterly performance of the companies. In addition, the sector of the companies and the location of their headquarters are used.

Next to this, a second benchmark model is used, which is the same as the first but additionally includes the square of the lagged quarterly performance. For both models we use k-fold cross-validation, where k ranges from 2 to 10. Finally, the AUROCs of the resulting models are calculated. These AUROCs form the benchmark results to which the Twitter-based models can be compared.
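A minimal sketch of the benchmark logit, assuming a small DataFrame with a lagged quarterly return, sector and headquarters location; all column names and values are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    "lag_return": [0.04, -0.02, 0.10, -0.07, 0.01, 0.03],    # lagged quarterly performance
    "sector":     ["IT", "Energy", "IT", "Health", "Energy", "Health"],
    "hq_state":   ["CA", "TX", "WA", "NJ", "TX", "CA"],       # headquarters location
    "y":          [1, 0, 1, 0, 0, 1],
})
df["lag_return_sq"] = df["lag_return"] ** 2                   # only in the second benchmark

X = pd.get_dummies(df[["lag_return", "lag_return_sq", "sector", "hq_state"]])
scores = cross_val_score(LogisticRegression(max_iter=1000), X, df["y"],
                         scoring="roc_auc", cv=2)             # k varies from 2 to 10 in the text
print(scores.mean())                                          # benchmark AUROC
```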
