
Bachelor Thesis

Econometrics

The impact of news on trading volume

Author: Joris Brehm (10432922)
Supervisors: K.J. van Garderen and M.J. van der Leij


Abstract

The aim of this thesis is to relate the amount of news, and especially the topics of this news, to the trade volume of stocks. The main results show that crisis-related news can be used to forecast the trade volume of certain portfolios. However, this effect seems to be caused by the crisis itself, as it disappears when the crisis period is excluded from the data. News on other topics whose frequency is highly correlated with the trade volume has no predictive power over the trade volume.

Contents

1 Introduction
2 Theory
  2.1 Topic models
  2.2 The Efficient Market Hypothesis
  2.3 News and Stocks
3 Methods
4 Results
  4.1 LDA
  4.2 Topic selection
  4.3 VAR estimation
  4.4 Analysis without the crisis period
5 Conclusion
References
Appendix A
Appendix B


1 Introduction

It is well known that media like newspapers, television and, more recently, social media have a large degree of influence on public opinion (Radford, 1996; Shirky, 2011). Mutz and Soss (1997) find that news organizations can change the environment in which policy changes occur and thereby indirectly change policy. The power of new social media should not be underestimated: Lotan et al. (2011) suggest Twitter played a key role in the Egyptian and Tunisian revolutions.

The amount of available media is increasing rapidly; for example, the New York Times (NYT) publishes over 5000 online articles weekly (New York Times API, 2015). The NYT is just one of many hundreds of news organizations worldwide, and on top of that there are many social media sites. Twitter users, for example, post around 500 million tweets on a daily basis (Twitter, 2015). To analyze this vast amount of data complex computerized algorithms are needed; here Latent Dirichlet Allocation (LDA) is used (Blei, Ng, & Jordan, 2003). LDA is a probabilistic model which describes a generating process for discrete collections of data. The main purpose of LDA is classifying documents by topic (see section 2 for a more detailed description).

According to the Efficient Market Hypothesis (EMH) all the information these media report translates directly into price changes on the stock market (Fama, 1970). The EMH states that a market is efficient if prices reflect all relevant available information. In other words, if information becomes available it is instantly incorporated into the price of a product. The EMH is often criticized, but Malkiel (2005) argues that there is no hard evidence that the EMH is not valid. Birz and Lott (2011) estimate the effect of several macroeconomic news topics on stock returns. They find a significant relation between news about unemployment and gross domestic product and stock returns. Mitchell and Mulherin (1994) look at the relation between securities markets and the daily Dow Jones & Company news announcements. They directly link the amount of articles to market activity; the topics of these articles are in line with topics that are known to influence security markets. However, the link between news and market activity is not significant.


The analysis of the Dow Jones databases they perform shows the difficulties of linking measurable information to trade volume and volatility.

The aim of this article is to use LDA to relate the amount of news and the topics of this news to the trade volume of stocks from the following three categories: oil and gas, the financial sector and consumer services. This is done by first applying LDA to a Reuters news corpus, and then analyzing the topic frequencies over time. Thereafter three portfolios are constructed to represent the three categories and finally the topic time series are compared to the trade volumes of the three portfolios. In the next section the theory is discussed in more detail, thereafter the methods used in this article are presented. Then the results are explained and analyzed, and finally everything is summarized in the conclusion.

2 Theory

In this section a selection of topic models and their flaws is discussed. Next the Efficient Market Hypothesis is explained, followed by a discussion of the link between news and stock markets.

2.1 Topic models

A topic model is, in the field of natural language processing and computer science, a model that captures the unobserved topic structure of a collection of documents. With the rapidly growing amount of text available it is useful to capture the essence of these texts in a short description. Information Retrieval (IR) scientists Baeza-Yates, Ribeiro-Neto, et al. (1999) lay the foundation for modern IR, presenting basic principles to describe text corpora. All topic models that are discussed in this thesis make the 'bag-of-words' assumption, i.e. the order of words is not relevant, as the topic information is contained in the frequency of words within documents and in the whole corpus. Another assumption is the exchangeability of documents: the order of documents does not hold any information regarding the topics.
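As a minimal illustration of the bag-of-words representation (a toy Python sketch, not part of the thesis pipeline), each document is reduced to its word counts and the word order is discarded:

from collections import Counter

# Two toy documents; under the bag-of-words assumption only the word
# counts matter, not the order in which the words appear.
docs = ["oil prices rise as demand for oil grows",
        "bank shares fall as credit losses grow"]

bags = [Counter(doc.split()) for doc in docs]
print(bags[0]["oil"])   # 2: 'oil' occurs twice in the first document
print(bags[0])          # the full word-count representation of document 1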


A basic technique from IR is term frequency-inverse document frequency (tf-idf) (Salton & McGill, 1983). The term frequency refers to the word frequencies per document and the inverse document frequency to the frequency relative to the corpus. The output of a tf-idf analysis of a corpus is a matrix which contains the tf-idf values per document in each column. This matrix has a fixed predefined number of rows, but the number of columns increases in proportion to the number of documents in the corpus. While this analysis does find words that are characteristic for certain documents in a collection, it does not decrease the description size of the documents by much, because for each document a large set of numbers remains. Another flaw of this analysis is its inability to describe statistical relations between documents, that is, the topic structure.
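As a concrete sketch of such a term-document tf-idf matrix (using scikit-learn and toy documents purely for illustration; the thesis itself does not rely on this library):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["oil prices rise as demand for oil grows",
        "bank shares fall as credit losses grow",
        "oil company reports record profits"]

# Fit the vocabulary and compute the tf-idf weights.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # shape: (number of documents, number of terms)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().T)                    # transposed: one column per document, as described above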

To overcome some of the flaws of tf-idf, Deerwester, Dumais, Landauer, Furnas, and Harshman (1990) present latent semantic indexing (LSI). LSI uses the singular value decomposition of the tf-idf matrix to find a linear subspace in the tf-idf space that captures most aspects of the collection. The orthogonal vectors that span this subspace can be seen as topics; however, this interpretation lacks statistical foundation (Hofmann, 1999).
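The projection behind LSI can be sketched with a truncated singular value decomposition of the tf-idf matrix (again an illustrative scikit-learn sketch with toy documents, not the original LSI implementation):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["oil prices rise as demand for oil grows",
        "bank shares fall as credit losses grow",
        "oil company reports record profits"]

tfidf = TfidfVectorizer().fit_transform(docs)

# Project the tf-idf vectors onto a 2-dimensional subspace; the singular
# vectors spanning this subspace play the role of "topics" in LSI.
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_topic = lsi.fit_transform(tfidf)
print(doc_topic)        # each document expressed in the low-dimensional space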

Nigam, McCallum, Thrun, and Mitchell (2000) take a different approach to model the latent topics: they start with a simple unigram model. In a unigram model it is assumed that every word of every document in the corpus is drawn independently from a single multinomial distribution. This extremely simple model will result in just one topic, since all words are drawn from the same distribution. The mixture of unigrams model is obtained by adding a discrete latent topic variable to the unigram model (Nigam et al., 2000). Now for every document a (latent) topic is drawn and then, conditional on that topic, the words of the document are drawn from a multinomial distribution. This modification allows different documents to have different topics; however, it does not allow a document to contain more than one topic.

Probabilistic latent semantic indexing (pLSI), another frequently used document model (Hofmann, 1999), allows for multiple topics per document, which is an improvement over the mixture of unigrams model. In contrast to LSI, pLSI has a statistical foundation, since pLSI is based on the likelihood principle and has a generative process. This generative process, however, is not well defined in terms of document structure, as the process generates co-occurrences and not documents (Blei et al., 2003). Another flaw is the manner in which pLSI learns topic mixtures: it can only distinguish topic distributions it has already seen in a training set. This way of identifying topic distributions leads to an increase in parameters linear in the number of training documents. If we consider a pLSI model with n topics we get n multinomial word distributions of size V, the number of words, and M mixtures of n topics. This gives a total of n(V + M) parameters, which is linear in M, the number of training documents, and leads to a quickly increasing number of parameters as the training set grows. Blei et al. (2003) show this linear growth empirically leads to overfitting. It is common to use a tempering heuristic to smooth the parameters for better predictive performance, but Popescul, Pennock, and Lawrence (2001) show that overfitting can occur even when this is done.

The last model that is discussed is Latent Dirichlet Allocation (LDA) (Blei et al., 2003); this model has a solid statistical generative process. Additionally, it is able to model multiple topics per document by using a hidden random variable instead of linking the topic mixture explicitly to the training set. On top of that, the hidden random variable approach also reduces the number of parameters to n + nV for an n-topic LDA model.

LDA as Blei et al. (2003) present it has the following generative process. For every document in a corpus the length of the document is chosen, after that a topic distribution is drawn from the Dirichlet distribution. This drawing is unique for every document and represents the topic mixture in this document. Next a topic is drawn, from the topic distribution, for each of the words in the document. The final step is to draw a word from a multinomial word distribution given the topic. This structure makes LDA a hierarchical Bayesian model with three levels where the prior topic distribution is Dirichlet.
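This three-level generative process can be made explicit with a small simulation (a toy sketch with made-up dimensions and hyperparameters, not the estimation procedure used later in the thesis):

import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size = 3, 8
alpha = np.full(n_topics, 0.1)                             # Dirichlet prior on the topic mixtures
beta = rng.dirichlet(np.full(vocab_size, 0.1), n_topics)   # one word distribution per topic

def generate_document(doc_length):
    # 1. draw the document-specific topic mixture from the Dirichlet prior
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(doc_length):
        # 2. draw a topic for this word position from the mixture
        z = rng.choice(n_topics, p=theta)
        # 3. draw the word from the chosen topic's word distribution
        words.append(rng.choice(vocab_size, p=beta[z]))
    return theta, words

theta, words = generate_document(20)
print(theta)   # the topic mixture of this document
print(words)   # word ids drawn according to the LDA generative process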


When estimating an LDA model from data, the posterior distribution is of interest. This presents a problem: Blei et al. (2003) show the posterior is intractable due to coupling between the topic distribution and the word distribution conditioned on the topic distribution. A wide range of approximation algorithms is available. Here the implementation by Hansen, McMahon, and Prat (2014) is used. This implementation uses a Markov Chain Monte Carlo algorithm called Collapsed Gibbs Sampling, introduced by Griffiths and Steyvers (2004).
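The thesis uses the collapsed Gibbs sampling implementation of Hansen et al. (2014); as an illustrative alternative only (a sketch with a toy corpus and a different inference algorithm), a topic model and its per-document topic distributions can also be obtained with gensim:

from gensim import corpora
from gensim.models import LdaModel

# Toy corpus of tokenised documents, for illustration only.
texts = [["oil", "prices", "rise", "demand"],
         ["bank", "credit", "losses", "bailout"],
         ["oil", "company", "profits", "energy"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a small LDA model; the thesis estimates 100 topics on the full TRC2 corpus.
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
               passes=10, random_state=0)

print(lda.print_topics())                        # most important words per topic
print(lda.get_document_topics(bow_corpus[0]))    # the topic distribution of the first document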

2.2 The Efficient Market Hypothesis

The Efficient Market Hypothesis states that a stock market is efficient if prices reflect all relevant available information. The EMH has three main forms: the strong form, the semi-strong form and the weak form (Fama, 1970). All forms assume individuals maximize their utility according to their own preferences, have rational expectations and update these expectations when new information becomes available. Furthermore it is assumed that individuals are, in aggregate, always correct. This allows for some individuals to under-correct and some to over-correct when new information becomes available. Next, the three forms of the EMH, as presented by Fama (1970), are discussed, followed by some criticism of the EMH.

The weak form of the EMH assumes all price movements are due to information not currently contained in the stock price, thus prices follow a random walk. This implies that technical analysis of historic prices cannot be used to forecast, and no excess returns can be earned this way. However fundamental analysis might still generate excess returns. The weak EMH does not require prices to move around an equilibrium but only assumes no individual beats the market due to market inefficiencies.

The semi-strong-form EMH states that stock prices change, almost instantly and unbiased, to new public information. This means no excess return can be made on new information, and neither technical nor fundamental analysis generates excess returns.


The strong-form EMH assumes that all information, public and private, is reflected in the stock prices; this implies no one is able to earn excess returns. Since there are laws that forbid insider trading, private information is not taken into account in share prices; this makes the strong-form EMH impossible, unless every individual breaks these laws.

The EMH is criticized a lot; for example, Shiller (2000) argues the behavior of American stock markets during the late 1990s was far from rational. However, Malkiel (2005) does not find hard evidence that the EMH is not valid.

2.3 News and Stocks

The use of topic models like LDA can simplify a large corpus of news articles to a relatively small amount of statistical data. According to the Efficient Market Hypothesis this summarized corpus is somehow included in the stock prices. Many studies empirically find a positive correlation between stock price and trade volume (Karpoff, 1987; Andreassen, 1988); this means a drop in stock price usually goes together with a low trade volume and a rise in price with a high trade volume. Hiemstra and Jones (1994) find evidence of a bidirectional nonlinear Granger causality between returns and volume. This suggests the information contained in the news data also has an impact on the trade volume of stocks.

There are many studies which try to link news to stock volume and price. For example, Ito and Roley (1987) show there is weak evidence that economic announcements have an effect on the dollar-yen exchange rate, in particular United States money announcements. Furthermore they find that positive production news in the U.S. translates into appreciation of the dollar. Ito and Roley (1987) suspect a similar relation between Japanese news and the yen, but the data is incomplete. Joulin, Lefevre, Grunberg, and Bouchaud (2008) find no evidence that either market news or idiosyncratic news explains the occurrence of price jumps. The news they look at is linked to stocks only if the name of the company or the stock ticker is included in the article. This approach might bias the results, as many articles are rendered useless while they might contain valuable information. For example, news about a war in an oil producing country might influence the stocks of a big oil company even though the name of this company does not necessarily appear in the article. Hisano, Sornette, Mizuno, Ohnishi, and Watanabe (2013) show that extremely large trade volume fluctuations can partially be explained by the information flow of news.

The aim of this paper is to use Latent Dirichlet Allocation, a topic model which can classify documents by topic, to create topic time series. These topic time series are followed over time and compared to trade volume time series of general portfolios for three sectors. As the literature shows, there is evidence of a link between the news flow and stock prices; because of the relation between stock price and volume I expect to find a clear relation between the topic frequencies and trade volumes.

3 Methods

The shortcomings of different linguistic models lead to the use of LDA. Combined with the Efficient Market Hypothesis LDA can help to quantify the link between the topics of news and trading volume. Next the news corpus, i.e. the text data, is described followed by the stock trade volume data. Thereafter the data manipulations necessary for the use of LDA, and the construction of the topic time series are discussed and finally the model to test whether the time series are linked is presented.

The news data that is used is the Thomson Reuters Text Research Collection (TRC2), obtained via the National Institute of Standards and Technology (NIST). NIST is a non-regulatory agency within the United States Department of Commerce. NIST's mission is to promote U.S. innovation and industrial competitiveness in order to enhance economic security and improve quality of life. The TRC2 corpus contains 1,800,370 news stories from the first of January 2008 to the 28th of February 2009 and was originally made available to supplement the BLOGS08 corpus, a large blog crawl result. Per entry the TRC2 corpus contains the date and time at which the article appeared, the headline of the article and the full body text.


Next the trade volumes of three stock portfolios are acquired via Thomson Reuters Datastream (Datastream). Datastream has constructed portfolios which represent the sectors of the United States economy. The following three portfolios are chosen. The first portfolio is US-DS Oil & Gas (OILGSUS) and represents the oil, gas and energy sector. This portfolio contains stocks from 88 companies including Exxon Mobil and the Chevron Corporation. The second portfolio is US-DS Financials (FINANUS) and includes stocks from 209 companies like JPMorgan Chase and the Bank of America. The third and last portfolio is US-DS Consumer Services (CNSVSUS) and is a combination of stocks from 148 different companies which provide consumer services, for example Wal-Mart Stores but also Walt Disney and Netflix. For the period for which the news data is available, the trading volumes of the portfolios are acquired via Datastream. This gives OILGSUS_t, FINANUS_t and CNSVSUS_t, the trade volumes of the portfolios.

The TRC2 corpus contains some articles which are just tables of numbers; since LDA can only use words these articles are deleted from the corpus. The corpus also contains entries which are incomplete. After deletion of the articles which contained only numbers and the incomplete ones, 1,675,704 of the 1,800,370 news stories remain. The final step is to remove all null bytes, commas within the article field, and extra quotes in the article and the headline field. All these steps are performed in Python; the scripts can be found in Appendix A (remove_headlines.py, remove_null.py and remove_quotes.py).

The next step is applying LDA to the cleaned corpus. LDA is only applied to the body of the article, as the headlines are often non-informative and biased by the author. A 100-topic LDA model is estimated. For each article this results in a vector of length 100, θ_i, the topic distribution, and a label which contains the date and time of appearance. Most of the companies included in the three portfolios are traded on either the New York Stock Exchange (NYSE) or the NASDAQ. Both exchanges open at 9:30 am and close at 4 pm. This is the reason to redefine the start and end points of days at 4 pm (and for early closing days at 1 pm). For example, an article published on day 1 at 5:30 pm will be labeled day 2.


Next the topic distributions are summed per day. This results in

x_{k,t} = \sum_{i \,\in\, \text{articles on day } t} \theta_{i,k},

the time series for each of the 100 topics on a daily basis, where \theta_{i,k} is the kth element of the topic distribution \theta_i of article i. Within these topic time series, x_{k,t}, the values of weekends are added to those of Mondays and the values of holidays are added to the respective next weekday. Besides the topic distributions, the number of articles per day is counted; this results in the time series AMOUNT_t. The Python scripts that perform these steps, alongside a list of holidays and early closing days, can be found in Appendix B.

The following time series are now available: the trade volume series

OILGSUS_t, FINANUS_t, CNSVSUS_t,

the 100 topic series

x_{k,t}, \quad k = 0, 1, \ldots, 99,

and, last, the number of articles each day,

AMOUNT_t,

where t = 1, 2, \ldots, 280. A Vector Autoregression (VAR) model is used to model the relation between the trade volumes and topics. The VAR model is chosen because it does not require knowledge of the dependence structure, only a reasonable suspicion about the relation between variables. Including all topics in a VAR model would lead to serious overfitting; in fact, in a VAR(1) model the number of coefficients would outnumber the number of observations.

This problem is dealt with as follows. First the non-lagged correlations between OILGSUS_t and all of the topic series divided by the daily number of articles, x_{k,t}/AMOUNT_t, are calculated. The same is done for the other two trade volume series. Next, for each of the three trade volume series, the three topics with the highest correlations are selected. For example, if topics 15, 67 and 83 have the highest correlation with the Oil & Gas portfolio, then x_{15,t}, x_{67,t} and x_{83,t} would be included in the VAR model.
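A sketch of this selection step in Python is given below. The daily topic sums and the article count are taken from the file produced by the Appendix B scripts (final_output_final.csv); the file name and column layout of the volume series are assumptions made purely for illustration:

import pandas as pd

# Daily topic sums x_{k,t} plus the article count produced by Appendix B.
topics = pd.read_csv("final_output_final.csv", parse_dates=[0], index_col=0)
# Hypothetical file with the Datastream volume series as columns.
volume = pd.read_csv("volumes.csv", parse_dates=[0], index_col=0)

# Normalise each topic series by the total number of articles that day.
topic_cols = [c for c in topics.columns if c != "amount_articles"]
shares = topics[topic_cols].div(topics["amount_articles"], axis=0)

# Non-lagged correlation of every normalised topic with each volume series;
# the three topics with the largest (absolute) correlation are selected.
for portfolio in ["OILGSUS", "FINANUS", "CNSVSUS"]:
    corr = shares.corrwith(volume[portfolio])
    print(portfolio, corr.reindex(corr.abs().nlargest(3).index))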

The final step is estimating the VAR(p) model for each of the three portfolios with the topics chosen by correlation and the hand-picked ones. For the US-DS Oil & Gas portfolio and the three topics from the previous example the model is as follows:

\begin{pmatrix} \mathrm{OILGSUS}_t \\ \mathrm{AMOUNT}_t \\ x_{15,t} \\ x_{67,t} \\ x_{83,t} \end{pmatrix}
= c + \Phi_1 \begin{pmatrix} \mathrm{OILGSUS}_{t-1} \\ \mathrm{AMOUNT}_{t-1} \\ x_{15,t-1} \\ x_{67,t-1} \\ x_{83,t-1} \end{pmatrix}
+ \ldots + \Phi_p \begin{pmatrix} \mathrm{OILGSUS}_{t-p} \\ \mathrm{AMOUNT}_{t-p} \\ x_{15,t-p} \\ x_{67,t-p} \\ x_{83,t-p} \end{pmatrix}
+ \varepsilon_t.

The number of articles per day, AMOUNT_t, is added to estimate the effect of the total amount of news separately from the topics. The number of lags p is selected based on the Schwarz criterion.
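A sketch of how such a system can be estimated with statsmodels is given below; the file name and DataFrame layout are hypothetical stand-ins for one portfolio's system of five series:

import pandas as pd
from statsmodels.tsa.api import VAR

# Hypothetical file holding one system: the portfolio volume, the daily
# article count and the three selected topic series.
df = pd.read_csv("oilgsus_system.csv", index_col=0)
# expected columns: OILGSUS, AMOUNT, x15, x67, x83

model = VAR(df)

# Pick the lag order p with the Schwarz/Bayesian information criterion.
order = model.select_order(maxlags=3)
p = order.selected_orders["bic"]
results = model.fit(p)
print(results.summary())

# Test whether one of the topic series Granger-causes the trade volume.
gc = results.test_causality("OILGSUS", ["x15"], kind="f")
print(gc.summary())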

In the next sections the results of the above steps are presented and analyzed.

4 Results

4.1 LDA

After cleaning the TRC2 corpus, LDA is applied to it. This produces several outputs. First, the 100 topics. The topics are estimated distributions over the words; for example, for topic 0 the three most frequent words are project, energi and develop. Based on these words a person could classify topic 0 as energy development; LDA itself, however, does not label topics. For the topics used in the further analysis a list of the 10 most frequent words is found in Appendix C. Second, the estimation of the LDA model gives the topic distribution per article: 100 numbers (θ_i) per article which sum to one. For example, for some article j the first of these numbers could be 0.00151 and 0.00331; this means a fraction of 0.00151 of the words of this article is identified as belonging to topic 0, a fraction of 0.00331 to topic 1, et cetera. Performing the manipulations described in the previous section on these estimated topic distributions yields the topic time series x_{k,t}.

4.2 Topic selection

To determine which topics are included in the VAR model the correlations with the trade volume time series are calculated. Table 1 shows these correlations. For OILGSUS_t topics 97, 98 and 58 have the highest correlations, for FINANUS_t topics 98, 10 and 81, and finally for CNSVSUS_t topics 98, 97 and 81.

OILGSUS_t              FINANUS_t              CNSVSUS_t
topic   correlation    topic   correlation    topic   correlation
97      0.4243         98      0.4830         98      0.5922
98      0.3971         10      0.4738         97      0.5220
58      -0.3581        81      -0.4177        81      0.4243

Table 1: Topics with the highest correlation with the volume time series OILGSUS_t, FINANUS_t and CNSVSUS_t and the correlation values.

It stands out that topic 98 is highly correlated with all three volume series. Among topic 98's most frequent words are govern, crisi, credit, billion, debt and bailout; this topic is clearly about the financial crisis. This could explain the high correlation with all three volume series, as Manda (2010) shows the market was more volatile during the 2008 financial crisis and this crisis was covered extensively by the media. About 30 percent of the data is from the crisis period.

The selection of topics based on correlations gives almost exclusively topics that are directly about the stock market. For example, among topic 97's most frequent words are NYSE, buy and sell, and topic 98 is about the financial crisis.

4.3 VAR estimation

After selecting the topics the next step is to determine the number of lags for each of the VAR models. For each of the portfolios, models with different numbers of lags are estimated. The Schwarz Criterion (or Bayesian Information Criterion) values are reported in Table 2.


Based on these values the 'best' number of lags for each of the models is 1.

p OILGSUS CNSVSUS FINANUS

1 65.744 64.577 68.041

2 65.747 64.774 68.072

3 66.196 65.236 68.508

Table 2: Schwarz Criterion (or Bayesian Information Criterion) values for the different order VAR models ordered per portfolio.
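For reference, the Schwarz criterion for a VAR(p) with K variables and T observations is typically of the form below (a standard textbook definition; software packages may add constants or scale differently, which is why only comparisons across p matter):

SC(p) = \ln\left|\widehat{\Sigma}_{\varepsilon}(p)\right| + \frac{\ln T}{T}\, p K^2,

where \widehat{\Sigma}_{\varepsilon}(p) is the estimated residual covariance matrix of the VAR(p) model.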

             OILGSUS                AMOUNT                 x98                     x97                     x81
OILGSUS(-1)  0.653371** (0.05101)   0.002183** (0.00053)   5.06E-05** (9.4E-06)    7.76E-05** (1.2E-05)    1.38E-06 (1.2E-05)
AMOUNT(-1)   -2.536832 (7.74151)    0.014412 (0.08000)     -0.006505** (0.00142)   -0.002017 (0.00184)     -0.015230** (0.00190)
x98(-1)      1014.250** (313.252)   3.450925 (3.23696)     0.677764** (0.05746)    0.237390** (0.07436)    -0.035332 (0.07672)
x97(-1)      -267.2890 (308.738)    2.298380 (3.19031)     0.107266* (0.05663)     0.068771 (0.07329)      0.224590** (0.07561)
x81(-1)      -105.5409 (132.036)    -1.800933 (1.36438)    -0.059622** (0.02422)   -0.025144 (0.03134)     0.827447** (0.03234)
C            116763.1** (31608.0)   4582.239** (326.617)   35.43932** (5.79779)    30.52177** (7.50284)    90.75103** (7.74123)

Table 3: VAR(1) estimation output for the topics chosen based on correlation for the Oil and Gas portfolio. 280 observations (whole sample), values with a * are significant at a 5 percent level, values with ** at a 1 percent level, standard errors in parentheses.

In Tables 3 to 5 the VAR(1) estimates are shown for the topics selected by correlation for OILGSUS, CNSVSUS and FINANUS respectively. Table 3 shows that the one day lagged amount of news of topic 98 has significant positive predictive power for the trade volume of the US-DS Oil and Gas portfolio. Furthermore, the one day lagged trade volume has significant predictive power for the amount of news of topics 97 and 98 and also for the total amount of news. A Granger causality test, with α = 0.05, rejects the null of topic 98 not being a Granger cause for the trade volume. Thus there is evidence that the amount of news of topic 98 is a Granger cause of OILGSUS. Table 4 shows similar results for the US-DS Consumer Services portfolio: topics 97 and 98 Granger cause CNSVSUS.


             CNSVSUS                AMOUNT                  x97                     x98                     x58
CNSVSUS(-1)  0.622938** (0.05333)   0.001663** (0.00034)    5.86E-05** (8.2E-06)    3.29E-05** (6.5E-06)    1.25E-05** (5.2E-06)
AMOUNT(-1)   -8.888738 (17.3506)    0.374906** (0.11165)    -0.000848 (0.00266)     -0.004894* (0.00211)    -0.000402 (0.00170)
x97(-1)      -953.3052* (468.664)   -1.876360 (3.01582)     0.048377 (0.07187)      0.103466* (0.05704)     0.035779 (0.04605)
x98(-1)      1602.931** (395.458)   4.750355* (2.54474)     0.337686** (0.06064)    0.785902** (0.04813)    -0.045767 (0.03885)
x58(-1)      940.1367 (930.749)     -29.09591** (5.98929)   -0.225784 (0.14273)     -0.299731** (0.11328)   -0.005305 (0.09144)
C            221969.6** (49805.9)   3830.676** (320.497)    18.92492** (7.63773)    27.31669** (6.06165)    48.44611** (4.89332)

Table 4: VAR(1) estimation output for the topics chosen based on correlation for the Consumer Services portfolio. 280 observations (whole sample), values with a * are significant at a 5 percent level, values with ** at a 1 percent level, standard errors in parentheses.

             FINANUS                AMOUNT                  x98                     x10                     x81
FINANUS(-1)  0.772761** (0.04947)   0.000420** (0.00011)    9.87E-06** (2.0E-06)    6.51E-06** (1.5E-06)    2.27E-06 (2.6E-06)
AMOUNT(-1)   -15.24197 (33.2205)    0.031565 (0.07314)      -0.005581** (0.00131)   -0.004476** (0.00102)   -0.013190** (0.00174)
x98(-1)      884.5568 (1415.50)     5.950672* (3.11664)     0.742049 (0.05587)      -0.022443 (0.04336)     -0.024326 (0.07430)
x10(-1)      1907.541 (1857.56)     -1.734850 (4.08997)     -0.020402 (0.07332)     0.681051** (0.05690)    0.106388 (0.09750)
x81(-1)      -1052.556 (645.341)    -2.035465 (1.42091)     -0.071514** (0.02547)   -0.014981 (0.01977)     0.802922** (0.03387)
C            373884.1** (133430.)   4849.726** (293.787)    41.13830** (5.26659)    36.44092** (4.08714)    87.02813** (7.00359)

Table 5: VAR(1) estimation output for the topics chosen based on correlation for the Financials portfolio. 280 observations (whole sample), values with a * are significant at a 5 percent level, values with ** at a 1 percent level, standard errors in parentheses.

In Table 5 the results of the VAR(1) model for FINANUS are shown. In contrast to the other two portfolios, topic 98 does not hold significant predictive power for the trade volume and neither do the other topics. Furthermore, a Granger causality test does not reject the null that topic 98 is not a Granger cause for the Financials trade volume.



Figure 1: Line graph of T98, the amount of news of topic 98

Three things stand out in these results. First, topic 98 Granger causes the OILGSUS and CNSVSUS trade volumes but not the FINANUS volume. In Figure 1 the time series of topic 98 is displayed; the amount of news of topic 98 seems to be much higher after mid September 2008. On the 14th of September 2008 Lehman Brothers Holdings filed for bankruptcy. This is often deemed to be the start of the '08-'09 financial crisis. Since topic 98 is labeled 'financial crisis related', this event could explain the increase in topic 98 news. Manda (2010) states the media coverage of the crisis played a big role in the crisis; the time series graph of topic 98 quantitatively shows the same. Secondly, it stands out that the lagged trade volume of all three portfolios influences the amount of crisis related news significantly. This could indicate that the trade volumes, and thus the stock market, are a driving force for the media. Combined with the previous result, this gives a spiral where the amount of crisis related news drives the trade volume and this in turn drives the amount of news. The third and last thing that stands out is the lack of evidence for a Granger cause between topic 98 and FINANUS. This could mean the financial sector is less influenced by the media than the Oil and Gas and Consumer Services sectors.


For the Oil & Gas and Consumer Services portfolios the amount of topic 98 news was thus found to Granger cause the trade volume. In order to examine this effect in more detail an impulse response analysis is performed. Figure 2 shows the impulse response graph of CNSVSUS to Cholesky one standard deviation innovations of all variables included in the model. The response of the trade volume to a shock in trade volume seems to be the largest and dies out over time.


Figure 2: Impulse response graph of CNSVSUS to Cholesky one standard deviation innovations of all variables included in the model.

In Figure 3(a) the response of the Consumer Services portfolio trade volume to topic 97 is shown together with a two standard deviation band. These standard deviations are based on a Monte Carlo simulation with 1000 repetitions. The trade volume seems to be lower for one day after the shock; after that day the effect does not differ significantly from zero. Figure 3(b) shows the response of the trade volume to a shock in topic 98. There is a positive effect on the trade volume which grows for the first 2 days; thereafter it decreases slowly and is still significantly different from zero after 9 days.
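A sketch of such an impulse response analysis with statsmodels is given below, assuming a hypothetical file holding the Consumer Services system used in Table 4:

import pandas as pd
from statsmodels.tsa.api import VAR

# Hypothetical file with columns CNSVSUS, AMOUNT, x97, x98, x58.
df = pd.read_csv("cnsvsus_system.csv", index_col=0)
results = VAR(df).fit(1)

# Orthogonalised (Cholesky) impulse responses over a 10-day horizon, with
# error bands based on 1000 Monte Carlo repetitions.
irf = results.irf(10)
irf.plot(orth=True, impulse="x98", response="CNSVSUS",
         stderr_type="mc", repl=1000, seed=0)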

The long lasting effect of a shock in topic 98 news on the trade volume of the Consumer Services portfolio might be due to the effect on consumer confidence. Alsem, Brakman, Hoogduin, and Kuper (2008) construct a variable which reflects the manner in which newspapers report economic news. They show that this variable has a significant impact on consumer confidence. Since topic 98, which is labeled 'financial crisis related', is clearly about economics, it is not unreasonable to assume it affects consumer confidence.



Figure 3: Impulse response graphs of CNSVSUS to Cholesky one standard deviation innovations. Graph (a) shows the response to topic 97 and (b) to topic 98 (blue lines). The red dotted lines are the responses plus and minus two Monte Carlo standard deviations.

The Consumer Services portfolio contains stocks from companies that sell the largest share of their services and products directly to consumers. Ludvigson (2004) shows that consumer confidence is a good indicator of consumer spending and thus relates to the stock movement of the portfolio. Consumer confidence, however, is a slowly moving indicator; this could explain the long lasting effect of the shock.

The estimations show that crisis related news might be used to forecast trade volumes. The increase in news of topic 98 during the crisis gives rise to the question whether this effect is general or mainly due to the crisis. In order to answer this, the VAR models are re-estimated with some changes in the next subsection.

4.4 Analysis without the crisis period

To explore the effect of topic 98 in more detail, a separate analysis of the crisis and non-crisis periods would be preferred. However, during the crisis period the market was more volatile than normal (Manda, 2010). Combined with the few observations available during the crisis period, analyzing this period on its own is not expedient. This leads to the following approach: first, the VAR models from the previous subsection are re-estimated without the crisis period. Next, the topic selection procedure is redone with the observations from before the 14th of September 2008. Finally, VAR models including these newly selected topics are estimated.


In the previous subsection it was found that crisis related news might be used to forecast trade volumes. In order to see whether these significant results are due to the crisis, the VAR models are estimated again with observations after the 14th of September 2008 excluded. The estimation outputs are shown in Tables 6 to 8. The significant predictive effect of topics 97 and 98 is gone in all three estimations. Furthermore, there is no evidence of Granger causality between any of the topics and the trade volumes of the three portfolios.

             OILGSUS                AMOUNT                  x98                     x97                     x81
OILGSUS(-1)  0.709990** (0.05626)   0.001256 (0.00076)      3.29E-06 (7.4E-06)      7.51E-05** (1.6E-05)    -3.97E-05 (2.2E-05)
AMOUNT(-1)   3.483656 (9.17140)     -0.207367* (0.12446)    -0.005968** (0.00121)   -0.002841 (0.00258)     -0.020799** (0.00351)
x98(-1)      955.5772 (791.848)     11.37826 (10.7461)      0.662243** (0.10427)    0.228622 (0.22241)      0.440231 (0.30297)
x97(-1)      27.66450 (331.711)     3.586411 (4.50161)      0.038300 (0.04368)      0.067982 (0.09317)      0.285974* (0.12692)
x81(-1)      11.45199 (152.117)     -0.233560 (2.06435)     -0.000543 (0.02003)     -0.034097 (0.04273)     0.667855** (0.05820)
C            32423.11 (29595.6)     5475.193** (401.638)    42.80795** (3.89726)    37.42224** (8.31258)    135.5231** (11.3236)

Table 6: VAR(1) estimation output for the Oil and Gas portfolio. 184 observations (1-1-2008 to 14-9-2008), values with a * are significant at a 5 percent level, values with ** at a 1 percent level, standard errors in parentheses.

             CNSVSUS                AMOUNT                  x97                     x98                     x58
CNSVSUS(-1)  0.633606** (0.06217)   0.000896* (0.00048)     4.00E-05** (1.0E-05)    8.68E-06* (4.6E-06)     4.51E-06 (7.9E-06)
AMOUNT(-1)   -15.19640 (20.6538)    0.131768 (0.15792)      0.001711 (0.00340)      -0.003310* (0.00154)    -0.004596* (0.00263)
x97(-1)      20.07743 (562.397)     -0.219366 (4.29999)     0.079280 (0.09258)      -0.004065 (0.04191)     0.067913 (0.07153)
x98(-1)      1580.865 (1380.96)     10.56900 (10.5586)      0.099859 (0.22732)      0.650614** (0.10291)    0.171103 (0.17565)
x58(-1)      1464.813 (965.874)     -20.98050** (7.38490)   -0.303401* (0.15899)    -0.154437* (0.07198)    0.049123 (0.12285)
C            161449.0** (54021.3)   4847.466** (413.037)    28.60644** (8.89241)    35.25140** (4.02572)    63.14219** (6.87105)

Table 7: VAR(1) estimation output for the Consumer Services portfolio. 184 observations (1-1-2008 to 14-9-2008), values with a * are significant at a 5 percent level, values with ** at a 1 percent level, standard errors in parentheses.

The reduction of the estimation sample to exclude the crisis period changes more than just the estimations: the correlations which were used to select the topics have also changed.


             FINANUS                AMOUNT                  x98                     x10                     x81
FINANUS(-1)  0.851774** (0.05038)   0.000143 (0.00014)      2.18E-06 (1.3E-06)      3.58E-06* (1.8E-06)     -2.85E-06* (4.0E-06)
AMOUNT(-1)   34.80555 (41.1950)     -0.121199 (0.11392)     -0.005105** (0.00109)   -0.004332** (0.00147)   -0.018328** (0.00325)
x98(-1)      2139.529 (4271.59)     5.244691 (11.8126)      0.584688** (0.11307)    0.093415 (0.15280)      0.466003 (0.33697)
x10(-1)      -3739.174 (2346.56)    3.260358 (6.48913)      0.022635 (0.06211)      0.504572** (0.08394)    0.100816 (0.18511)
x81(-1)      -237.5527 (716.877)    -1.246112 (1.98244)     -0.002056 (0.01898)     -0.017684 (0.02564)     0.641406** (0.05655)
C            123702.2 (132006.)     5666.021** (365.048)    40.97375** (3.49413)    43.57235** (4.72192)    125.5265** (10.4135)

Table 8: VAR(1) estimation output for the Financials portfolio. 184 observations (1-1-2008 to 14-9-2008), values with a * are significant at a 5 percent level, values with ** at a 1 percent level, standard errors in parentheses.

To see whether the selected topics are still the appropriate ones the correlations are recalculated. Table 9 shows the three topics with the highest correlation per portfolio. For OILGSUS_t topics 97, 69 and 81 have the highest correlations, for FINANUS_t topics 10, 98 and 97, and finally for CNSVSUS_t topics 15, 10 and 97. It stands out that topic 98 is no longer among the highest correlated topics for the Oil & Gas and Consumer Services portfolios. For the Financials portfolio, on the other hand, topic 98 is still one of the highest correlated. Another difference with the whole sample is that topic 97 is now among the highest correlated topics for all three portfolios; in the whole sample topic 97 was not selected for the Financials portfolio.

OILGSUS_t         FINANUS_t         CNSVSUS_t
x97, 0.4304       x10, 0.6658       x15, 0.4008
x69, 0.4054       x98, 0.5672       x10, 0.3641
x81, 0.1071       x97, 0.4384       x97, 0.3140

Table 9: Correlations between the volume time series OILGSUS_t, FINANUS_t and CNSVSUS_t and the topics with the highest correlation.

Again three VAR(p) models are estimated for different numbers of lags; based on the Schwarz criterion 1 lag is preferred for all three. Tables 10 to 12 show the estimation outputs.


             OILGSUS                AMOUNT                  x97                     x69                     x81
OILGSUS(-1)  0.682474** (0.05834)   0.001863** (0.00078)    8.30E-05** (1.6E-05)    3.16E-05* (1.6E-05)     9.16E-08 (2.0E-05)
AMOUNT(-1)   4.931214 (7.46974)     0.028389 (0.09979)      0.000901 (0.00209)      -0.004219* (0.00200)    -0.007861** (0.00251)
x97(-1)      109.8782 (334.044)     1.708580 (4.46245)      0.043265 (0.09365)      -0.057348 (0.08957)     0.163432 (0.11225)
x69(-1)      184.4950* (109.387)    -4.089868** (1.46129)   -0.053539* (0.03067)    0.890702** (0.02933)    -0.268023** (0.03676)
x81(-1)      271.3183 (210.769)     -5.581316* (2.81563)    -0.103092* (0.05909)    -0.152795** (0.05652)   0.313525** (0.07082)
C            21278.61 (29386.3)     5547.893** (392.568)    37.94732** (8.23881)    43.71083** (7.87999)    141.9228** (9.87452)

Table 10: VAR(1) estimation output for the Oil and Gas portfolio with topics based on selection without the crisis period. 184 observations (1-1-2008 to 14-9-2008), values with a * are significant at a 5 percent level, values with ** at a 1 percent level, standard errors in parentheses.

             CNSVSUS                AMOUNT                  x15                     x10                     x97
CNSVSUS(-1)  0.629464** (0.06975)   0.001716** (0.00053)    2.11E-05** (5.0E-06)    2.24E-05** (7.0E-06)    4.73E-05** (1.1E-05)
AMOUNT(-1)   25.50676 (17.6355)     0.073515 (0.13505)      -0.002614* (0.00127)    -0.001010 (0.00177)     -0.001529 (0.00290)
x15(-1)      -260.7289 (1327.23)    -19.50806* (10.1634)    0.220681* (0.09594)     -0.292466* (0.13356)    -0.071824 (0.21812)
x10(-1)      1.942163 (779.597)     2.560189 (5.96983)      0.004120 (0.05635)      0.536210** (0.07845)    -0.020834 (0.12812)
x97(-1)      -300.7410 (545.253)    2.385724 (4.17532)      0.061531 (0.03941)      -0.008090 (0.05487)     0.130911 (0.08961)
C            123254.9* (52774.3)    4913.072** (404.123)    47.22619** (3.81473)    34.82104** (5.31080)    31.85025** (8.67307)

Table 11: VAR(1) estimation output for the Consumer Services portfolio with topics based on selection without the crisis period. 184 observations (1-1-2008 to 14-9-2008), values with a * are significant at a 5 percent level, values with ** at a 1 percent level, standard errors in parentheses.

In Table 10 the estimation output for the US Oil & Gas portfolio is shown. Only the lagged series of topic 69 has a significant effect on the trade volume. However, there is not enough evidence that any of the topics Granger causes the trade volume. Table 11 shows the output for the Consumer Services portfolio. None of the coefficients of the lagged topic series on the trade volume are significant and again there is no evidence for Granger causality. The last table, Table 12, shows the estimations for the US Financials portfolio.


             FINANUS                AMOUNT                  x10                     x98                     x97
FINANUS(-1)  0.846766** (0.05109)   0.000123 (0.00014)      3.87E-06* (1.8E-06)     2.00E-06 (1.4E-06)      1.18E-05** (3.0E-06)
AMOUNT(-1)   19.78649 (45.3376)     -0.191307 (0.12524)     -0.004517** (0.00163)   -0.005422** (0.00120)   -0.000662 (0.00265)
x10(-1)      -3894.930* (2337.53)   2.499478 (6.45733)      0.499016** (0.08380)    0.020079 (0.06187)      -0.044331 (0.13642)
x98(-1)      2343.158 (4282.12)     6.056020 (11.8292)      0.080956 (0.15352)      0.591976** (0.11334)    -0.114958 (0.24990)
x97(-1)      966.0186 (1587.61)     4.395361 (4.38571)      -0.000349 (0.05692)     0.022818 (0.04202)      0.103909 (0.09265)
C            125393.5 (128379.)     5664.838** (354.643)    42.61705** (4.60250)    41.20468** (3.39790)    45.94173** (7.49215)

Table 12: VAR(1) estimation output for the Financials portfolio with topics based on selection without the crisis period. 184 observations (1-1-2008 to 14-9-2008), values with a * are significant at a 5 percent level, values with ** at a 1 percent level, standard errors in parentheses.

The coefficient of the lagged topic 10 series on the trade volume is significant at a 5 percent level. Once more there is not enough evidence for Granger causality between any of the topics and the trade volume.

Based on these results, news from the previous day cannot be used to forecast the trade volume of the three portfolios during periods without a crisis. This might be due to the speed at which articles reach people these days, with the rise of smartphones and online news sites. Another reason might be the amount of high frequency trading (HFT) that is taking place. HFT could possibly be the main reason why the market responds rapidly to information, as HFT companies trade on new information within milliseconds. The fact that information is processed so quickly is good in terms of efficiency; however, it is not without risk, as the recent flash crash has shown (Kirilenko, Kyle, Samadi, & Tuzun, 2014).


5 Conclusion

The power and influence of the media is large. With the rise of social media and other online media, this influence will only get bigger. This gives reason to investigate whether this influence is also present on the stock market. The aim of this thesis was to use LDA to relate the amount of news and topics of this news to the trade volume of stocks from the following three categories: oil and gas, financial sector and consumer services. I expected to find a clear relation between the amount of news of certain topics and the trade volume.

The data used in this article was the Thomson Reuters Text Research Collection, which contains news from January 2008 to February 2009. After preparing this data, LDA was used to acquire the topic distributions per day. The trade volume data of three portfolios representing the three sectors were extracted from Datastream. Next, topics were selected based on the non-lagged correlations with the trade volume series.

The estimation of different VAR-models yielded two main results. First a financial crisis related topic had predictive power for the trade volume of the portfolio representing the oil and gas sector and the consumer services sector. However this result was mainly due to the financial crisis. When the model was estimated excluding the crisis period the effect was gone. Furthermore evidence was found for Granger causality between this topic and the trade volume of these two portfolios.

After changing the period used for estimation to exclude the crisis, the correlations between the topics and trade volumes changed. Re-selecting topics based on these new correlations indeed yielded different topics. The estimations with these new topics gave one significant link, between topic 10 and the trade volume of the Financials portfolio. However, this effect was not strong enough to provide evidence for Granger causality.

Summarizing, it seems that during periods of financial crisis news reports related to the crisis can be used to forecast trade volume. However, opposed to my expectations, I found no evidence for this effect during non-crisis periods.


References

Alsem, K. J., Brakman, S., Hoogduin, L., & Kuper, G. (2008). The impact of newspapers on consumer confidence: does spin bias exist? Applied Economics, 40(5), 531–539.

Andreassen, P. B. (1988). Explaining the price-volume relationship: The difference between price changes and changing prices. Organizational Behavior and Human Decision Processes, 41(3), 371–389.

Baeza-Yates, R., Ribeiro-Neto, B., et al. (1999). Modern information retrieval (Vol. 463). ACM Press, New York.

Birz, G., & Lott, J. R. (2011). The effect of macroeconomic news on stock returns: New evidence from newspaper coverage. Journal of Banking & Finance, 35(11), 2791–2800.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391–407.

Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. Journal of Finance, 25(2), 383–417.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228–5235.

Hansen, S., McMahon, M., & Prat, A. (2014). Transparency and deliberation within the FOMC: A computational linguistics approach.

Hiemstra, C., & Jones, J. D. (1994). Testing for linear and nonlinear Granger causality in the stock price-volume relation. Journal of Finance, 49(5), 1639–1664.

Hisano, R., Sornette, D., Mizuno, T., Ohnishi, T., & Watanabe, T. (2013). High quality topic extraction from business news explains abnormal financial market volatility. PLOS ONE, 8(6), e64846.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (pp. 50–57).

Ito, T., & Roley, V. V. (1987). News from the US and Japan: which moves the yen/dollar exchange rate? Journal of Monetary Economics, 19(2), 255–277.

Joulin, A., Lefevre, A., Grunberg, D., & Bouchaud, J.-P. (2008). Stock price jumps: news and volume play a minor role. arXiv preprint arXiv:0803.1769.

Karpoff, J. M. (1987). The relation between price changes and trading volume: A survey. Journal of Financial and Quantitative Analysis, 22(01), 109–126.

Kirilenko, A. A., Kyle, A. S., Samadi, M., & Tuzun, T. (2014). The flash crash: The impact of high frequency trading on an electronic market. Available at SSRN 1686004.

Lotan, G., Graeff, E., Ananny, M., Gaffney, D., Pearce, I., et al. (2011). The Arab Spring | The revolutions were tweeted: Information flows during the 2011 Tunisian and Egyptian revolutions. International Journal of Communication, 5, 31.

Ludvigson, S. C. (2004). Consumer confidence and consumer spending. Journal of Economic Perspectives, 29–50.

Malkiel, B. G. (2005). Reflections on the efficient market hypothesis: 30 years later. Financial Review, 40(1), 1–9.

Manda, K. (2010). Stock market volatility during the 2008 financial crisis (Unpublished doctoral dissertation). Stern School of Business, New York.

Mitchell, M. L., & Mulherin, J. H. (1994). The impact of public information on the stock market. Journal of Finance, 49(3), 923–950.

Mutz, D. C., & Soss, J. (1997). Reading public opinion: The influence of news coverage on perceptions of public sentiment. Public Opinion Quarterly, 431–451.

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3), 103–134.

Popescul, A., Pennock, D. M., & Lawrence, S. (2001). Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the seventeenth conference on uncertainty in artificial intelligence (pp. 437–444).

Radford, T. (1996). Influence and power of the media. The Lancet, 347(9014), 1533–1535.

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval.

Shiller, R. J. (2000). Irrational exuberance. Princeton University Press.

Shirky, C. (2011). The political power of social media: Technology, the public sphere, and political change. Foreign Affairs, 28–41.


Appendix A

Listing 1: remove_headlines.py

import csv

# Write out only the second and third field of each row; the headline field is dropped.
with open("text.csv", "rb") as source:
    rdr = csv.reader(source)
    with open("text_this.csv", "wb") as result:
        wtr = csv.writer(result)
        for r in rdr:
            wtr.writerow((r[1], r[2]))

Listing 2: remove_null.py

import unicodecsv

# Strip null bytes from every line before parsing the csv.
with open('partial_fixed.csv', 'w') as correct:
    writer = unicodecsv.writer(correct)
    with open('partial_no_title.csv', 'rb') as mycsv:
        reader = unicodecsv.reader((line.replace('\0', '') for line in mycsv),
                                   encoding='utf-8')
        for row in reader:
            writer.writerow(row)

Listing 3: remove_quotes.py

import re

out = open('trc2_lda_ready.txt', 'w')
out.write("msg_dt\tstory_text\n")
num_empty = 0
file_num = open('amount_empty', 'w')

with open("text_this.txt", 'r') as data:
    for line in data:
        # Remove quotes, collapse repeated spaces and turn the first comma
        # into a tab followed by a quoted article field.
        line = line.translate(None, '"')
        line = line.translate(None, "'")
        line = re.sub(' +', ' ', line)
        line = re.sub(', ', ',', line)
        line = re.sub(',', '\t"', line, 1)
        line = line.strip() + "\"\n"
        if len(line) > 24:
            num_empty += 1
            out.write(line)

file_num.write(str(num_empty))
file_num.close()
out.close()


Appendix B

Listing 4: change_date_according_to_time.py

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# change_date_according_to_time.py                                           #
#                                                                            #
# This script changes the dates of the articles based on the time of         #
# publishing. Articles published between 4pm on day one and 4pm on day two   #
# are added to day two.                                                      #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

import datetime
import csv

# Set datetime format and create closing times
fmt = "%Y-%m-%d %H:%M:%S"
four_pm = datetime.time(16)
one_pm = datetime.time(13)

# Create list of dates at which trading stops at 1pm
with open('date_closed_early') as closed_early_file:
    closed_early_list = closed_early_file.read().splitlines()
    closed_early_set = frozenset(closed_early_list)

with open('final_output.csv', 'rb') as data, \
        open('final_output_new_days.csv', 'wb') as out:
    data_csv = csv.reader(data)
    out_csv = csv.writer(out)
    i = 0

    # Loop over the rows of the LDA output
    for row in data_csv:
        # Work around to skip headers
        if i:
            # Create datetime object of the date; row[0] is the date and time
            # as a string
            date = datetime.datetime.strptime(row[0], fmt)

            # If the article is published on an early closing day, set date to
            # publish day if published before 1pm, otherwise add to next day
            if str(date.date()) in closed_early_set:
                if date.time() <= one_pm:
                    row[0] = str(date.date())
                else:
                    row[0] = str((date + datetime.timedelta(days=1)).date())

            # On other days set date to publish day if published before 4pm,
            # otherwise add to next day
            elif date.time() <= four_pm:
                row[0] = str(date.date())
            else:
                row[0] = str((date + datetime.timedelta(days=1)).date())

            # Write row to output file
            out_csv.writerow(row)
        else:
            # Write headings to output
            out_csv.writerow(row)
        i += 1

Listing 5: sum_per_day.py

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# sum_per_day.py                                                             #
#                                                                            #
# This script sums the topic distributions per day and adds a variable to    #
# the output, the amount of articles per day                                 #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

import csv
import numpy as np
from os import remove

# Initialise variables
prev = "2008-01-01"
sommen = np.zeros(100)
som = np.zeros(100)
days = []
amount = []
count = 0
i = 0

# Loop over the entries in the data file and sum per day
with open('final_output_new_days.csv', 'rb') as f:
    input_csv = csv.reader(f)
    for row in input_csv:
        if i:
            curr = row[0]
            # If the current and previous entry were published on the same date
            # the respective topic distributions are added
            if prev == curr:
                count += 1
                add = np.asarray(row[1:]).astype(np.float)
                som = som + add
            # When the current and previous date differ the cumulative topic
            # distribution is saved for the completed day and reset, also the
            # amount of articles that day is saved
            else:
                sommen = np.vstack([sommen, som])
                days.append(prev)
                amount.append(count)
                prev = curr
                count = 0
                som = np.zeros(100)
        i += 1

# The last day is not saved in the loop, thus it is done manually
sommen = np.vstack([sommen, som])
days.append(prev)
amount.append(count)
sommen = np.delete(sommen, (0), axis=0)

# Saving the cumulative topic distribution to a temporary file
np.savetxt("temp.csv", sommen, delimiter=",", fmt='%.14f')

headings = csv.reader(open("final_output_new_days.csv", 'rb')).next()
headings.insert(1, 'amount_articles')

# Saving the amount of articles and cumulative topic distribution per day
with open("temp.csv", 'rb') as input:
    input_csv = csv.reader(input)
    with open('final_output_per_day.csv', 'wb') as output:
        output_csv = csv.writer(output)
        j = 0
        output_csv.writerow(headings)
        for row in input_csv:
            # Write the date, the article count and the 100 topic values
            output_csv.writerow((days[j], amount[j], row[0], row[1], row[2], row[3],
                row[4], row[5], row[6], row[7], row[8], row[9], row[10], row[11],
                row[12], row[13], row[14], row[15], row[16], row[17], row[18], row[19],
                row[20], row[21], row[22], row[23], row[24], row[25], row[26], row[27],
                row[28], row[29], row[30], row[31], row[32], row[33], row[34], row[35],
                row[36], row[37], row[38], row[39], row[40], row[41], row[42], row[43],
                row[44], row[45], row[46], row[47], row[48], row[49], row[50], row[51],
                row[52], row[53], row[54], row[55], row[56], row[57], row[58], row[59],
                row[60], row[61], row[62], row[63], row[64], row[65], row[66], row[67],
                row[68], row[69], row[70], row[71], row[72], row[73], row[74], row[75],
                row[76], row[77], row[78], row[79], row[80], row[81], row[82], row[83],
                row[84], row[85], row[86], row[87], row[88], row[89], row[90], row[91],
                row[92], row[93], row[94], row[95], row[96], row[97], row[98], row[99]))
            j += 1

# Delete temp file
remove('temp.csv')

Listing 6: sum_weekend_holidays.py

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# sum_weekend_holidays.py                                                    #
#                                                                            #
# This script adds the cumulative topic distributions for weekends to        #
# mondays. It does the same for holidays and the respective next day         #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

import datetime
import numpy as np
import csv
from os import remove

fmt = "%Y-%m-%d"

# Create set of holiday dates
with open('date_closed') as closed_file:
    closed_list = closed_file.read().splitlines()
    closed_set = frozenset(closed_list)

# Initialise variables
topics = np.zeros(100)
amount = 0
amount_out = []
topics_out = np.zeros(100)
date_out = []
dummy = 0

with open('final_output_per_day.csv', 'rb') as f:
    input_csv = csv.reader(f)
    i = 0
    for row in input_csv:
        if i:
            # Set date as datetime object
            dt = datetime.datetime.strptime(row[0], fmt)
            dt = datetime.date(dt.year, dt.month, dt.day)

            # If saturday or sunday add topic dist. and amount to temp. variable
            if dt.weekday() == 5 or dt.weekday() == 6:
                topics = topics + np.asarray(row[2:]).astype(np.float)
                amount = amount + int(row[1])
                dummy = 1
            # If it is a holiday add topic dist. and amount to temp. variable
            elif row[0] in closed_set:
                topics = topics + np.asarray(row[2:]).astype(np.float)
                amount = amount + int(row[1])
                dummy = 1
            # If the previous day was a sunday or holiday add todays
            # distribution and amount to the temp ones and save it
            elif dummy:
                topics = topics + np.asarray(row[2:]).astype(np.float)
                topics_out = np.vstack([topics_out, topics])
                amount_out.append(amount + int(row[1]))
                date_out.append(row[0])
                topics = np.zeros(100)
                amount = 0
                dummy = 0
            # If today is not a saturday, sunday, monday or day after a
            # holiday save topic distribution and amount
            else:
                topics = np.asarray(row[2:]).astype(np.float)
                topics_out = np.vstack([topics_out, topics])
                amount_out.append(int(row[1]))
                date_out.append(row[0])
                topics = np.zeros(100)
                amount = 0
        i += 1

topics_out = np.delete(topics_out, (0), axis=0)

# Save everything
np.savetxt("foo.csv", topics_out, delimiter=",", fmt='%.14f')
headings = csv.reader(open("final_output_per_day.csv", 'rb')).next()

with open("foo.csv", 'rb') as input:
    input_csv = csv.reader(input)
    with open('final_output_final.csv', 'wb') as output:
        output_csv = csv.writer(output)
        j = 0
        output_csv.writerow(headings)
        for row in input_csv:
            output_csv.writerow((date_out[j], amount_out[j],
                row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7],
                row[8], row[9], row[10], row[11], row[12], row[13], row[14], row[15],
                row[16], row[17], row[18], row[19], row[20], row[21], row[22], row[23],
                row[24], row[25], row[26], row[27], row[28], row[29], row[30], row[31],
                row[32], row[33], row[34], row[35], row[36], row[37], row[38], row[39],
                row[40], row[41], row[42], row[43], row[44], row[45], row[46], row[47],
                row[48], row[49], row[50], row[51], row[52], row[53], row[54], row[55],
                row[56], row[57], row[58], row[59], row[60], row[61], row[62], row[63],
                row[64], row[65], row[66], row[67], row[68], row[69], row[70], row[71],
                row[72], row[73], row[74], row[75], row[76], row[77], row[78], row[79],
                row[80], row[81], row[82], row[83], row[84], row[85], row[86], row[87],
                row[88], row[89], row[90], row[91], row[92], row[93], row[94], row[95],
                row[96], row[97], row[98], row[99]))
            j += 1

remove('foo.csv')

Listing 7: date_closed

2008-01-01
2008-01-21
2008-02-18
2008-03-21
2008-05-26
2008-07-04
2008-09-01
2008-11-27
2008-12-25
2009-01-01
2009-01-19
2009-02-16


Listing 8: date_closed_early

2008-07-03
2008-11-28
2008-12-24

Appendix C

Topic   10 most frequent words
10      insur, credit, loss, lehman, capit, merril, invest, citigroup, ub, bear
15      gain, index, fell, rose, trade, point, investor, gmt, higher, close
58      govern, protest, countri, peopl, presid, nation, right, state, leader, human
69      top, corpor, basic, fund, credit, view, stori, deal, debt, consum
81      top, stori, credit, fund, re, item, indic, merger, depend, subscript
97      order, side, sell, nyse, buy, imbal, excess, yahoo, halt, sub
98      govern, crisi, plan, billion, credit, packag, rescu, financ, help, stimulu

Table 13: 10 most frequent words per topic for the used topics
