
Modeling the Volatility of Earnings Calls

Using Latent Dirichlet Allocation to quantify the content of earnings calls

Bachelor’s Thesis to obtain the degree in Econometrics and Operations Research
University of Amsterdam

Faculty of Economics and Business
Amsterdam School of Economics

Author: Jan Meppe

Student no.: 10326316

Date: June 29, 2015


Abstract

This paper investigates whether the relationship between the content of earnings calls and the volatility of stock returns can be quantified using Latent Dirichlet Allocation (LDA). To estimate this relationship, a 20-topic LDA model is first estimated on all the earnings call transcripts for the companies listed in the S&P500 stock market index. After applying LDA, a GARCH(1,1) model is modified to control for the effect of an earnings call taking place and for the effect of a change in the main topic discovered by LDA. This model, alongside a standard GARCH(1,1) model, was tested on the daily logarithmic stock returns of Google, Apple, and Microsoft, with significant results for the latter two. I find evidence that earnings calls cause a volatility spike which lasts for only one day. Furthermore, using two stylized examples, I show that changes in the content of the earnings calls explain a significant portion of the shocks in the variance caused by the earnings calls. In these examples, a one standard deviation change in the main topic is able to explain 39.1% of the shock in variance caused by the earnings call for Apple and 44.7% for Microsoft. These results show that LDA proves to be a valuable tool to quantify the content of earnings calls.

Keywords: earnings calls, text modeling, volatility, GARCH, Latent Dirichlet Allocation

Acknowledgements

I would like to thank my supervisors, K.J. van Garderen and M.J. van der Leij, for their invaluable help and guidance throughout this project. I would also like to thank Joris Brehm and Wouter Griffioen for being great friends and additional sources of inspiration. Last but not least, I would like to thank my mother, father, and girlfriend, Gigi Laan, for their unconditional love and support, without which I would not have come this far.


Contents

1 Introduction

2 Theoretical framework
2.1 Economic theory
2.2 Previous text models
2.3 Latent Dirichlet Allocation
2.4 Model specification

3 Research design
3.1 Data gathering process
3.2 Dataset
3.3 Methods

4 Results and analysis
4.1 LDA topics
4.2 Model estimation

5 Conclusion


1 Introduction

Recent developments in technology have changed the way businesses communicate. It has become increasingly common for companies to issue so-called earnings calls. Every fiscal quarter, publicly traded companies are legally required to file a quarterly earnings report. In this report, the financial health of the company is disclosed to the shareholders. Accompanying this earnings report is the earnings call. An earnings call is a conference call between the management of a public company, its shareholders, analysts, and the media to discuss the most recent quarterly earnings.

Bowen, Davis, and Matsumoto (2002) find that these conference calls increase analysts’ ability to forecast earnings more accurately. Work by Mayew, Sharp, and Venkatachalam (2013) shows that analysts who participate by asking questions obtain superior information over those who merely attend the event. However, these public disclosures of earnings also have some disadvantages. Mayew (2008) shows that firms, realising that this information is valuable for the asking analyst, can use their discretion to discriminate amongst analysts by granting more participation to more favourable analysts. Furthermore, Mayew (2008) finds that favourable and prestigious analysts ask more questions than less favourable and less prestigious analysts.

Bowen et al. (2002) and Mayew et al. (2013) show that earnings calls contain valuable information. The question that arises is how to quantify this information. To do so, this paper uses a topic model. Topic models are algorithms that aim to find hidden semantic structures in large collections of documents. The topic model used in this paper is LDA, introduced by Blei, Ng, and Jordan (2003). The main assumption of LDA is that each document is characterised by a mixture over a fixed number of latent topics, where a topic is a probability distribution over a fixed vocabulary of words.

The main research goal of this paper is to investigate whether the relationship between the content of earnings calls and the volatility of stock returns can be quantified using LDA. To quantify the content of the earnings calls, a 20-topic LDA model is estimated on all 6064 available earnings call transcripts of the 502 companies listed in the S&P500 stock market index. This LDA estimation results in a list of topics and topic distributions over time for all the earnings calls. These topic distributions over time are used to create a measure of change for the content of the earnings calls for each company. This measure is then added to a GARCH(1,1) model to estimate the effect of a change in content on the volatility of the logarithmic returns for Google, Apple, and Microsoft.

The rest of this paper is organised as follows. Section 2 discusses the theoretical background, reviews previous text models, introduces LDA, and specifies the econometric model. Section 3 gives a description of the dataset and methods that were used. Section 4 contains the results and analysis. Section 5 summarises the main results of this paper and outlines possible future work.

2 Theoretical framework

The following section first discusses the economic theory and reviews previous literature. Then, after examining different text models, LDA is introduced. Finally, the econometric model is specified.

2.1 Economic theory

Measuring the effect of news on market activity has long interested researchers. At the core of this research is the widely accepted theory that new information is quickly absorbed into the current price of a stock. This is known as the efficient market hypothesis (EMH), which states that, at every moment in time, existing share prices fully reflect all available information (Fama, 1970).

For example, Mitchell and Mulherin (1994) study the daily number of news announcements reported by Dow Jones & Company from 1983 to 1990. They find robust links between the number of news stories and trading activity. This work by Mitchell and Mulherin (1994) uses a simple count as a proxy for the importance of information. Two other examples, more in line with this paper, are the works of Antweiler and Frank (2004) and Das and Chen (2007). Their research focuses on distinguishing bullish messages from bearish messages on internet message boards and in news stories, respectively. However, the text models used in Antweiler and Frank (2004) and Das and Chen (2007) are sentiment-based text models, as opposed to the topic model used in this paper.

Instead of focusing on news stories, this paper focuses on the qualitative linguistic content of the earnings calls. These earnings calls are of more interest than the regular earnings reports for three reasons. First, earnings calls are significantly less scripted than the actual earnings reports. The quarterly earnings report is a single document that is revised multiple times before publishing and contains mostly numbers. In contrast, earnings calls are live conference calls and are thus not only less formally constrained, but also contain more natural language. Second, earnings calls contain nuanced comments from the management of the company. These comments might contain valuable insights on the future implications of certain recent financial results. This information is not present in the earnings report and is thus novel information. Third, the Q&A session at the end might contain valuable information, as the shareholders and analysts are allowed to ask critical questions.

Previous research confirms the suspicion that there is qualitative content in these earnings calls. Davis, Piger, and Sedor (2006) explore how managers use optimistic and pessimistic language in earnings press releases. Their findings suggest that the tone of language that the management uses contains credible information about expected future firm performance. Bowen et al. (2002) study whether regular participation in earnings calls increases analysts’ ability to forecast and predict future earnings. Their results confirm that analysts who regularly attend earnings calls have an increased ability to forecast future earnings and make timelier predictions. This indicates that earnings calls increase the total amount of information about a firm. Mayew et al. (2013) go even further and investigate whether individual analysts who actively participate in earnings calls by asking questions are more informed than analysts who do not. Their research shows that analysts who participate in these earnings calls make more accurate forecasting predictions than analysts who merely attend. However, instead of answering these questions truthfully, the management of a company can prepare scripted responses ahead of time. Lee (2014) studies whether market participants infer negative information about future firm performance when managers use these scripted responses to answer questions during the Q&A session. His results suggest that scripted responses are negatively associated with future earnings and future cash flows. Lee (2014) also finds that scripted responses provide less information to market participants than unscripted responses.

In summary, a large body of research confirms that there is novel and valuable information contained in these earnings calls, yet the model used by Mitchell and Mulherin (1994) is fundamentally different from the model used by Antweiler and Frank (2004). This shows how much these models have changed and improved over time. Which model is best suited to model the content of the earnings calls? The next section gives an overview of different text models and shows why LDA is the best model for this purpose.


2.2 Previous text models

One of the oldest text models is the term frequency-inverse document frequency (tf-idf) model (Salton & McGill, 1986). In this model, after choosing a suitable vocabulary, a count is made of how often each word appears in the document (term frequency). After normalisation, this count is compared to the number of occurrences of the word in the entire corpus (inverse document frequency). The result is a single k × M matrix (the document term matrix) with the tf-idf scores per term, per document (where k is the number of terms and M the number of documents). Because tf-idf transforms a collection of text documents into a single matrix, it is a dimensionality reduction technique.
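As an illustration, the following is a minimal tf-idf sketch using scikit-learn (my choice of library; the thesis does not implement tf-idf itself). Note that scikit-learn returns the matrix in the transposed M × k orientation, with documents in the rows:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus of M = 3 documents
    corpus = [
        "revenue grew in the cloud segment",
        "advertising revenue declined this quarter",
        "cloud and advertising drive revenue growth",
    ]

    vectorizer = TfidfVectorizer()           # builds the vocabulary of k terms
    dtm = vectorizer.fit_transform(corpus)   # sparse M x k document-term matrix

    print(vectorizer.get_feature_names_out())
    print(dtm.toarray())                     # tf-idf score per term, per document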

A major drawback of tf-idf is that it is unable to capture semantic structure between documents. To address this shortcoming, Deerwester, Dumais, Landauer, Furnas, and Harshman (1990) introduce Latent Semantic Indexing (LSI). LSI uses a mathematical technique called singular value decomposition (SVD) on the document term matrix to estimate the structure of word usage across documents. Deerwester et al. (1990) note that while LSI largely solves the problem of synonymy (multiple words having similar meaning), it still faces problems with polysemy (single words having multiple meanings).

To address this, Hofmann (1999) proposes an alternative model called probabilistic Latent Semantic Indexing (pLSI). Hofmann explains that although LSI has been applied with varying degrees of success, it lacks a proper statistical foundation. pLSI borrows ideas from LSI, but is based on a probabilistic mixture decomposition of latent variables.

Blei et al. (2003) explain that there are several problems with pLSI, stemming from the fact that it lacks a well-defined generative model at the document level. The first problem is that the number of parameters grows linearly with the size of the corpus, which leads to overfitting. The second problem is that it is not clear how to extend pLSI to documents outside the training set. LDA solves these problems by treating the topic mixture as a random variable instead of a set of parameters linked to the training set (Blei et al., 2003). On top of that, Blei et al. show that LDA outperforms the aforementioned models, which is the main motivation for using LDA in this paper.


2.3 Latent Dirichlet Allocation

Blei et al. (2003) describe LDA as a “generative probabilistic model for collections of discrete data such as text corpora” (p. 993). LDA is a widely used model and has been cited more than 8,000 times (Hansen, McMahon, & Prat, 2014). Because LDA is a general model, it is easily applied to a wide range of domains.

For example, Quinn, Monroe, Colaresi, Crespin, and Radev (2010) apply a topic model similar to LDA to analyse the topics that Members of Congress talk about in political speeches. Curme, Preis, Stanley, and Moat (2014) apply LDA to a cross-section of Wikipedia to find general topics in the English language. Using the search frequencies of these topics, they create and backtest a simple trading strategy, which outperforms a random strategy. Hansen et al. (2014) use LDA to model government meeting transcripts to quantify the effect of increased transparency on the quality of debate. To quantify the content of earnings calls, this paper applies LDA to the written transcripts of earnings calls. Although LDA has been applied in many different contexts already, I was unable to find any studies that apply LDA to earnings call transcripts. In this regard, it is a novel application, and I hope this paper might serve as a springboard for other research.

LDA is a model that generates collections of discrete data, and is assumed to be the underlying generative process for the earnings calls. To formalise LDA, Blei et al. (2003) first define the following terms:

• A word is the basic unit of data, defined as an item from a vocabulary with V words. A word is represented by a (1 × V) unit vector with a single element equal to 1 and all other elements 0. Thus, the i-th word in the vocabulary is represented by the unit vector w_i = (0, . . . , 0, 1, 0, . . . , 0), where w_i^i = 1 and w_i^j = 0 for j ≠ i, with superscripts denoting elements.

• A document is a collection of N words. A document has the following form: d = (w_1, w_2, . . . , w_N), where w_j is the j-th word in the document for j = 1, . . . , N.

• A corpus is a collection of M documents. A corpus has the following form: D = {d_1, d_2, . . . , d_M}, where d_j is the j-th document in the corpus for j = 1, . . . , M.

LDA uses latent variables to model the abstract concept of topics. The basic idea is that documents are represented as distributions over K latent topics, where each topic is characterised by a probability distribution over a fixed vocabulary of words (Blei et al., 2003).

Latent variables are variables that cannot be directly measured, but have to be inferred from other (directly observed) variables. Given a document d, the only observables are the words (w_1, . . . , w_N). The topic allocations per word (z_1, . . . , z_N) are latent variables and have to be inferred from the observable words. The vector θ = (θ_1, . . . , θ_K) is the distribution of topics for a single document. LDA assumes the following generative probabilistic process for each document in the corpus (Blei et al., 2003):

1. Choose N ∼ Poisson(ξ)

2. Choose θ ∼ Dir(α)

3. For each of the N words w_n:

(a) Choose a topic z_n ∼ Multinomial(θ), the topic allocation

(b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n

For each of the M documents in the corpus, first the length (in words) is drawn from a Poisson distribution. In the second step, a topic distribution is drawn for that particular document from a Dirichlet distribution (see Appendix A1 for details about the Dirichlet distribution). The third step, done for each of the N words in the document, first assigns a (latent) topic to each word, and then generates a word from the word matrix, conditional on the assigned topic.
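To make this generative process concrete, the following is a minimal simulation in Python (an illustration of the process above, not the estimation software used in this paper; K, V, ξ, and the hyper-parameters are toy values):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V = 3, 8                        # number of topics, vocabulary size
    alpha, xi = 0.5, 20                # Dirichlet parameter, mean document length

    # Fixed K x V word matrix beta: each row is a topic, a distribution over words
    beta = rng.dirichlet(np.full(V, 0.1), size=K)

    def generate_document():
        N = rng.poisson(xi)                       # 1. draw the document length
        theta = rng.dirichlet(np.full(K, alpha))  # 2. draw the topic distribution
        words = []
        for _ in range(N):                        # 3. for each of the N words:
            z = rng.choice(K, p=theta)            #    (a) draw the topic allocation
            w = rng.choice(V, p=beta[z])          #    (b) draw a word given the topic
            words.append(w)
        return theta, words

    theta, doc = generate_document()              # one synthetic 'earnings call'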

Blei et al. (2003) make several assumptions in this model. First, LDA is a ’bag-of-words’ model, which means that the order of the words in a document is irrelevant. Second, the number of topics is assumed to be a fixed number K. Third, the K × V word matrix β with probabilities β_ij = p(w^j = 1 | z^i = 1) is assumed to be fixed and has to be estimated (see Appendix A2 for an example of a word matrix). Moreover, the assumption that N is Poisson distributed is not critical, and any other probability distribution can be chosen to suit different corpora.

LDA has one major drawback. The main goal of LDA is to find the topic distributions for each document and the topics (the rows of the word matrix), given a large collection of documents. However, because of the mathematical interaction between θ and β, exact statistical inference of these variables is intractable (Blei et al., 2003).


To estimate the topics and topic distributions, this paper uses software written in Python by Hansen et al. (2014), which uses a Markov chain Monte Carlo (Jordan, Ghahramani, Jaakkola, & Saul, 1999) technique called collapsed Gibbs sampling. Technical details can be found in Hansen et al. (2014).

2.4 Model specification

Now that LDA has been introduced, the econometric model can be specified. Financial time series often show serial correlation in the levels, clustered volatility, and excess kurtosis. To model these phenomena, Bollerslev (1986) proposed the generalised autoregressive conditional heteroskedasticity (GARCH) model, a generalisation of the ARCH model (Engle, 1982). This paper uses both a standard GARCH(1,1) model, and a modified GARCH(1,1) model with extra regressors in the variance equation.

Because LDA is a topic model, it is unable to distinguish between ’good’ and ’bad’ topics. Therefore, it is sensible to model the impact of certain topics on the volatility instead of the levels. Bollerslev, Chou, and Kroner (1992) show that a GARCH(1,1) model is often adequate for modeling empirical financial time series. In this paper, to model the time series of Google, Apple, and Microsoft the following modified GARCH(1,1) model is used

y_t = μ + ε_t,  ε_t ∼ (0, σ_t²)  (1)

σ_t² = α0 + α1 ε_{t−1}² + β1 σ_{t−1}² + Σ_{i=0}^{2} δi D_{t−i} + Σ_{j=0}^{2} γj D_{t−j} x_t  (2)

where y_t = log(p_t / p_{t−1}) is the daily logarithmic return for the stock price p_t at time t. The variable D_{t−i} for i = 0, 1, 2 is a (lagged) dummy variable that is 1 on the days of an earnings call and 0 on all other days.

Recall that estimating LDA results in a list of K topics and the topic distributions per document, where a topic was defined as a probability distribution over a fixed vocabulary of words. Resulting from the LDA estimation, let θ̂_{s,t} be the estimated proportion of topic s = 1, 2, . . . , K for the earnings call at time t, and θ̂_s its mean. The percentage deviation from the mean of that topic is then defined as

θ̂^Δ_{s,t} = (θ̂_{s,t} − θ̂_s) / θ̂_s × 100%

such that the variable x_t = {θ̂^Δ_{s,t} | s = arg max_s θ̂_s} is the percentage deviation from the mean of the most important topic for that company. Thus, the variable x_t measures the change in content of the earnings calls. Any further reference to a ’change in content’ or ’measure created with LDA’ refers back to this variable x_t.
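For concreteness, this computation in Python with pandas; the topic proportions below are hypothetical placeholder numbers, whereas in this paper they result from the 20-topic LDA estimation:

    import pandas as pd

    # Hypothetical topic proportions per earnings call (rows: calls, columns: topics)
    theta_hat = pd.DataFrame({
        "topic_6": [0.60, 0.68, 0.55, 0.65],
        "topic_8": [0.10, 0.08, 0.12, 0.09],
    })

    means = theta_hat.mean()
    main = means.idxmax()                  # most important topic: s = arg max of the means
    x_t = (theta_hat[main] - means[main]) / means[main] * 100

    print(main, x_t.tolist())              # percentage deviations from the mean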

According to the EMH (Fama, 1970), if these earnings calls contain new information, it is quickly reflected in the stock price, triggering a volatility jump on the day of an earnings call. To model this, the (lagged) dummies D_{t−i} for i = 0, 1, 2 are included in the variance equation. Furthermore, to capture the effect of a qualitative change in the most important topic of these earnings calls, the (lagged) interaction terms D_{t−j} x_t for j = 0, 1, 2 are included as well.

The coefficients γj for j = 0, 1, 2 measure the (lagged) effect on the volatility of a 1% deviation of the main topic from its mean. If these coefficients turn out to be significant, this is evidence for a relationship between the content of the earnings calls and the volatility of the stock returns.

3 Research design

3.1 Data gathering process

From the website http://www.seekingalpha.com, the available earnings call transcripts for the 502 companies listed in the S&P500 were scraped with a web scraper written in Python. Python is a general-purpose high-level programming language well suited to rapid prototyping. The full source code of these Python scripts is available on request. The following section describes this data gathering process in more detail.

The first step was to gather the list of ticker symbols for all the companies listed in the S&P500 stock market index. A ticker symbol is an abbreviation that identifies publicly traded shares of a particular stock on a particular market (for example, Google’s ticker is GOOG). This information was retrieved with a Python script from http://en.wikipedia.org/wiki/List_of_S%26P_500_companies on May 30, 2015.
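The scraper itself is not reproduced in the thesis; as a sketch of this first step, the constituents table can be read directly with pandas (the table position and the "Symbol" column name are assumptions about the page layout):

    import pandas as pd

    url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
    tables = pd.read_html(url)             # parses all HTML tables on the page
    constituents = tables[0]               # assumed: first table lists the constituents
    tickers = constituents["Symbol"].tolist()

    print(len(tickers), tickers[:5])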


For each ticker symbol X, a second script then scraped all the links from http://seekingalpha.com/symbol/X/transcripts. Then, after saving these links in a list, it looped over all the elements of this list, scraped the content of the individual links, and saved them to a text file. The scraper ran for over 8 hours and created a database of 6064 earnings call transcripts.

Preliminary LDA estimation revealed that the names of the analysts, executives, and companies heavily distorted the estimated topics. In hindsight this is obvious: unique names do not occur in many documents, but do occur with high frequency in a small number of documents. All the files in the database were analysed, and the names of the analysts, executives, and companies were extracted and grouped in a list. Stop words are the most commonly used words in a language and are filtered out before processing natural language data. This large list of names of analysts, executives, and companies was added to the list of basic stop words used in Hansen et al. (2014), creating a new list of 7794 stop words.

3.2 Dataset

The data used in the experiment are all the scraped earnings call transcripts, and the stock prices for Google, Apple, and Microsoft, retrieved from Thomson Reuters Datastream (Datastream).

The whole database of 6064 earnings call transcripts was used to estimate a 20-topic LDA model. Fifteen transcripts from Google’s earnings calls (Q3 2011 up until Q1 2015), sixteen transcripts from Apple’s earnings calls (Q3 2011 up until Q2 2015), and six transcripts from Microsoft (Q2 2014 up until Q3 2015) were used for their topic distributions over time.

Most of the earnings calls were held after the stock market had already closed, the exception being Apple’s earnings call on July 22, 2014. Because the shock in volatility of this news would not occur until the opening of the stock market on the next day, it was necessary to shift all the dummies (excluding Apple’s earnings call on July 22) one day forward.

Figure 3 (a), (c), and (e) in Appendix B show the development of the daily share price p_t for Google, Apple, and Microsoft over their respective timeframes. A clear upward trend is noticeable, and an Augmented Dickey-Fuller (ADF) test with a linear trend and intercept shows that the null hypothesis of a unit root cannot be rejected (p-values of 0.62, 0.92, and 0.38 respectively). This motivates modeling the daily logarithmic returns.
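A minimal sketch of this test with statsmodels, on a simulated price series standing in for the Datastream data:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(1)
    # Placeholder price series with an upward drift, standing in for p_t
    prices = 100 * np.exp(np.cumsum(0.0008 + 0.015 * rng.standard_normal(1000)))

    # ADF test with intercept and linear trend, as in the thesis
    stat, pvalue = adfuller(prices, regression="ct")[:2]
    print(f"ADF p-value: {pvalue:.2f}")    # a high p-value: unit root not rejected

    # This motivates modeling the daily logarithmic returns instead
    log_returns = np.diff(np.log(prices))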

Figure 3 (b), (d), and (f) in Appendix B show the daily logarithmic returns for Google, Apple, and Microsoft over their respective timeframes. The mean daily logarithmic returns are respectively 0.00093, 0.00061, and 0.00079, with standard deviations of 0.016, 0.013, and 0.014.


The daily logarithmic returns have minimum values of -0.132, -0.087, and -0.097 and maximum values of 0.085, 0.129, and 0.099. Compared to the means, the standard deviations are quite large. Furthermore, the graphs clearly show periods of low volatility and periods of high volatility, so-called volatility clustering.

3.3 Methods

Before LDA could be applied to the earnings call transcripts, some preprocessing was necessary: punctuation was removed, names were removed, stop words were removed, and a procedure called stemming was applied. Stemming reduces a word to its ’base’ form (stem). For example, it transforms the words “economist”, “economy”, and “economics” into “econ”. After stemming, the vocabulary contained more than 20,000 unique stems. Based on the suggestions and criteria given in Hansen et al. (2014), only the 5,000 most relevant unique stems were used for the vocabulary, such that V = 5000. To filter out stop words, Hansen et al. (2014) use a basic list of stop words; I extended this list with the names of the analysts, executives, and companies, as mentioned earlier.
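A minimal preprocessing sketch using NLTK’s Porter stemmer as a stand-in (the thesis does not name its stemmer, and the name list below is a hypothetical stand-in for the 7794 stop words):

    import re
    from nltk.corpus import stopwords          # requires nltk.download("stopwords")
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    stop.update(["smith", "jones"])            # hypothetical analyst/executive names

    def preprocess(transcript):
        text = re.sub(r"[^a-z\s]", " ", transcript.lower())   # strip punctuation
        tokens = [w for w in text.split() if w not in stop]   # remove stop words
        return [stemmer.stem(w) for w in tokens]              # reduce words to stems

    print(preprocess("The economists discussed the economy's growth."))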

After preprocessing the data, a K = 20 topic LDA model was fitted on the 6064 earnings call transcripts with the software provided by Hansen et al. (2014). The burn-in period was set to 1000 iterations, after which 20 samples were taken with 50 iterations in between, for a total of 2000 iterations. The main motivation behind the choice of fitting 20 topics was that fitting 5 topics resulted in topics that were too broad, while fitting 50 and 100 topics resulted in topics that were badly specified. The hyper-parameters α and η were set to α = 50/K = 50/20 and η = 0.025, as suggested in Griffiths and Steyvers (2004).
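As a stand-in for the collapsed Gibbs sampler of Hansen et al. (2014), here is a minimal sketch with gensim (my substitution; gensim estimates LDA by variational inference rather than Gibbs sampling, and the toy documents stand in for the real transcript stems):

    from gensim import corpora
    from gensim.models import LdaModel

    docs = [["cloud", "softwar", "subscript", "licens"],
            ["advertis", "user", "video", "app"],
            ["cloud", "analyt", "softwar", "subscript"]]

    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]

    K = 20
    lda = LdaModel(bow, num_topics=K, id2word=dictionary,
                   alpha=50 / K, eta=0.025, passes=10, random_state=0)

    print(lda.show_topics(num_topics=3, num_words=5))
    print(lda.get_document_topics(bow[0]))     # topic distribution for one document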

Finally, EViews 8 was used to estimate both the standard GARCH(1,1) model and the modified GARCH(1,1) model.
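The EViews code is not reproduced in the thesis; purely as an illustration of the estimation problem, a quasi-maximum-likelihood sketch of equations (1)-(2) in Python with scipy, on simulated data (all series and starting values below are placeholder assumptions):

    import numpy as np
    from scipy.optimize import minimize

    def neg_loglik(params, y, D, x):
        # Gaussian quasi-log-likelihood of the modified GARCH(1,1) in eq. (1)-(2)
        mu, a0, a1, b1 = params[:4]
        delta, gamma = params[4:7], params[7:10]
        eps = y - mu
        T = len(y)
        sigma2 = np.full(T, y.var())               # initialise at the sample variance
        for t in range(1, T):
            s = a0 + a1 * eps[t - 1] ** 2 + b1 * sigma2[t - 1]
            for i in range(3):                     # dummy and interaction lags 0, 1, 2
                if t - i >= 0:
                    s += delta[i] * D[t - i] + gamma[i] * D[t - i] * x[t]
            sigma2[t] = max(s, 1e-12)              # keep the variance positive
        return 0.5 * np.sum(np.log(sigma2) + eps ** 2 / sigma2)

    # Simulated stand-ins: y log returns, D earnings-call dummy, x content measure
    rng = np.random.default_rng(2)
    y = 0.015 * rng.standard_normal(500)
    D = np.zeros(500); D[100::100] = 1.0
    x = np.zeros(500); x[100::100] = 8.0           # x_t only nonzero on call days here

    start = np.r_[0.0, 0.1 * y.var(), 0.05, 0.85, np.zeros(6)]
    res = minimize(neg_loglik, start, args=(y, D, x),
                   method="Nelder-Mead", options={"maxiter": 5000})
    print(res.x[:4])                               # mu, alpha0, alpha1, beta1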

4 Results and analysis

The following section describes the results and analysis of fitting LDA and the econometric model.


4.1 LDA topics

Estimating LDA results in the topic distributions per document and the K × V word matrix β. Recall that the topics, the K rows of the word matrix, are probability distributions over the vocabulary V. Ordering the words (stems) in each row of the estimated word matrix in descending order of probability gives a good impression of what each topic is about.

Table 1: Topic description of fitted LDA model with 20 topics

Topic  Label            Words
0      Energy           transmiss megawatt coal nuclear environ
1      Real estate      rent squar occup tenant feet
2      Food (dining)    food restaur revpar traffic guest
3      Energy           coal crude export coast ton
4      Telecom          wireless lte nanomet smartphon churn
5      Drinks           wine drink spirit distributor africa
6      Advertising      advertis user app tv video
7      -                yeah inaud ice cigarett usa
8      Technology       cloud softwar analyt subscript licens
9      Insurance        claim underwrit cat retent reinsur
10     -                cultur doubt impair manner absorb
11     Materials        raw specialti steel ton metal
12     Aviation         aircraft defens aftermarket aerospac fleet
13     Fashion          shop commerc traffic merchandis women
14     -                advertis food venezuela snack sap
15     Medical          patient clinic studi hospit drug
16     Transport        vehicl car truck fleet rental
17     Financial        loan deposit mortgag card lend
18     Oil (off-shore)  rig barrel drill basin acreag
19     Financial        client institut consult affili outsourc

Table 1 shows the 20 topics estimated by LDA. It shows the five most relevant words (stems) per topic, along with the label that was given to each topic. While most of these topics can be labelled with a high degree of accuracy, topics 7, 10, and 14 are rather ambiguous and have been left unlabelled. Note that these labels were added manually and are not generated by LDA. This labelling process is necessarily subjective and is one of the disadvantages of using LDA.

Figures 4 (a), (c), and (e) show the topic proportions over time for the earnings calls of Google, Apple, and Microsoft. The graphs clearly show that the majority of each earnings call revolves around a single topic. For Google and Apple this is topic 6 (advertising), and for Microsoft this is topic 8 (technology). These topics can be interpreted as the most important topic for each company. Only the percentage deviations from the mean for these specific topics are used in the econometric model.


Another interesting observation is that while the main topic discussed in the earnings calls of Google and Apple appears to fluctuate heavily over time, the main topic discussed in the earnings calls of Microsoft is almost constant over time. This could be the result of the earnings calls of Microsoft containing a consistent amount of technology content, but it could also be the result of the small sample size for Microsoft, as only 6 earnings calls were found compared to the 15 and 16 earnings calls for Google and Apple.

Figures 4 (b), (d), and (f) show the percentage fluctuations around the mean for the most important topics in the earnings calls. The means of the most important topics are 0.62, 0.57, and 0.73 for Google, Apple, and Microsoft respectively. The interpretation of Google’s most important topic having a mean of 0.62 is that, on average, Google’s earnings calls consist of 62% content related to advertising. A surprising result is that Google’s earnings calls contain only slightly more (62% versus 57%) content related to advertising than Apple’s earnings calls.

4.2 Model estimation

Table 2 summarises the regression results for the standard GARCH(1,1) model and the modified GARCH(1,1) model. Values given in triplets stand for Google, Apple, and Microsoft respectively.

Immediately notable is the negative β1 = −0.08 in Table 2, the β1 for Google’s standard GARCH(1,1) model. Because all GARCH parameters should be non-negative, this shows that Google’s standard model is badly misspecified. Further investigation shows that this non-linear model is very sensitive to the initial starting values of the estimated parameters. This modeling error in the standard model might invalidate the results for the extended model as well, which has to be taken into consideration when interpreting the other results.

Table 2 reports values of α1 = (0.12, 0.07, 0.05) for the standard model and α1 = (0.02, 0.10, 0.14) for the extended model. Comparing the standard model with the extended model, the coefficient α1 increases for Apple and Microsoft, yet decreases for Google. Moreover, Table 2 shows values of β1 = (−0.08, 0.86, 0.73) for the standard model and values of β1 = (0.44, 0.57, 0.37) for the extended model. According to Campbell, Lo, MacKinlay, et al. (1997), normal values for α1 range from 0.05 to 0.10 and normal values for β1 range from 0.85 to 0.95. Higher values of α1 are often coupled with lower values of β1 and are typical characteristics of a ’spiky’ market (Campbell et al., 1997). Although β1 was on the lower side before modifying the model, the results show a clear decrease in β1 and an increase in α1, with the exception of Google.


Table 2: (Modified) GARCH(1,1) regression results for Google, Apple, and Microsoft.

Coefficient     Google       Google+      Apple        Apple+       Microsoft    Microsoft+
µ               0.000808*    0.000483     0.001514**   0.001336**   0.000745     0.000893
                (0.000399)   (0.000401)   (0.000486)   (0.000482)   (0.000765)   (0.000565)
α0              0.000179**   0.000073     0.000021**   0.000070**   0.000043     0.000062**
                (0.000027)   (0.000041)   (0.000004)   (0.0000020)  (0.000036)   (0.000023)
α1              0.123198**   0.023289     0.069342**   0.099268**   0.054007     0.137390*
                (0.044404)   (0.026302)   (0.017078)   (0.028019)   (0.045387)   (0.056378)
β1              -0.082212    0.441473     0.859961**   0.570471**   0.726810**   0.373488
                (0.133126)   (0.325257)   (0.028226)   (0.099880)   (0.209997)   (0.214511)
δ0                           0.002732*                 0.002299**                0.000471**
                             (0.001339)                (0.000871)                (0.000118)
δ1                           -0.001062                 -0.001142*                -0.000091
                             (0.001057)                (0.000575)                (0.000665)
δ2                           -0.000155                 -0.000133                 -0.000104
                             (0.000085)                (0.000126)                (0.000331)
γ0                           0.000152                  0.000176*                 0.000095**
                             (0.000128)                (0.000089)                (0.000030)
γ1                           -0.000057                 -0.000131*                -0.000106
                             (0.000076)                (0.000057)                (0.000205)
γ2                           -0.000004                 0.000018                  0.000043
                             (0.000010)                (0.000013)                (0.000105)
log likelihood  2569.985     2652.034     2739.882     2646.665     1000.131     1051.466
no. obs         889          880          1019         952          350          350
AIC             -5.772744    -6.004622    -5.369739    -5.539212    -5.692179    -5.951236

Notes: Mean equation: y_t = μ + ε_t. Variance equation (the bracketed terms are included only in the extended model): σ_t² = α0 + α1 ε_{t−1}² + β1 σ_{t−1}² + [Σ_{i=0}^{2} δi D_{t−i} + Σ_{j=0}^{2} γj D_{t−j} x_t]. Results for the extended model are marked with a ’+’. Standard errors are denoted inside parentheses. Significant coefficients (p<0.05) are marked with one asterisk, highly significant coefficients (p<0.01) with two.

If a shock results in a large residual, the relatively high value of α1 causes a large effect on the conditional volatility of the next day. However, this shock dissipates quickly through time because of the low β1. These results show that the conditional variance reacts quite strongly to shocks in the returns, but also quickly reverts back to its mean.

The results also show that the persistence of volatility (α1 + β1) decreases when adding the additional regressors in the variance equation. Comparing the standard model and extended model, Table 2 shows that α1 + β1 decreases from 0.93 to 0.67 for Apple and from 0.78 to 0.52 for Microsoft.

Table 2 reports values of δ0 = (0.0027, 0.0023, 0.0005) and values of δ1 = (−0.0011, −0.0011, −0.0001). These coefficients (δi for i = 0, 1, 2) measure the (lagged) effect of an earnings call day on the volatility of the logarithmic daily returns. There is a clear (inverse) relationship between δ0 and δ1. According to the EMH (Fama, 1970), new information is quickly absorbed into the price of a stock, causing a temporary increase in volatility. On the day of an earnings call this new information is available for the market to absorb and reflect. In line with this theory, the results show that δ0 is positive and significant in all three regressions. Accordingly, δ1 shows a negative response in volatility following the day of an earnings call, signifying that the new information has been absorbed and reflected in the price. None of the coefficients δ2 are significant, which suggests that the volatility shock of an earnings call lasts for only one day.

A similar inverse relationship was found for the coefficients γ0 and γ1. Table 2 reports values of γ0 = (0.00015, 0.00018, 0.00010) and γ1 = (−0.0001, −0.0001, −0.0001). These coefficients (γj for j = 0, 1, 2) measure the (lagged) effect of a 1% increase in deviation from the main topic on the conditional volatility. The unlagged γ0 is significant in the regressions for Apple (p<0.05) and Microsoft (p<0.01), but not for Google (p=0.23). The 1-day lagged coefficient γ1 is negative in all three regressions, but only significant for Apple. Just like δ2, none of the 2-day lagged coefficients γ2 turn out to be significant. These results suggest that changes in the content of the earnings calls have an effect on the conditional volatility of the daily logarithmic returns, and that this effect lasts for only one day.

Because it proves difficult to give a meaningful interpretation to these coefficients by simply looking at the numbers, I examine some stylized hypothetical scenarios. In these scenarios I examine what would happen if a new earnings call comes in which, in terms of content, deviates one standard deviation from the mean. Recall that Figures 4 (b), (d), and (f) show the deviations from the mean for the most important topics of Google, Apple, and Microsoft. These topics were respectively topic 6 (advertising) for Google and Apple and topic 8 (technology) for Microsoft. One standard deviation of these changes in the main topic is calculated at 8.0%, 8.4%, and 4.0% for Google, Apple, and Microsoft respectively.

In these stylized examples, only the effects of the dummy variable and the interaction term are taken into account; the feedthrough effect of previous volatility is neglected. This simplification is justified as the feedthrough effects are relatively small, with β1 = (0.44, 0.57, 0.37). Testing the linear coefficient restriction δ0 = γ0 = 0 with a Wald test shows that δ0 and γ0 are jointly significant for Apple (p=0.03) and Microsoft (p=0.00), but not for Google (p=0.12). Therefore, I will only discuss two stylized scenarios, for Apple and Microsoft, starting with Apple.


Figure 1: Conditional variance Apple

Figure 1 shows the conditional variance of Apple’s daily logarithmic returns. While the conditional variance has a mean of 0.00026, the shocks in variance caused by the earnings calls are very large. Imagine a new earnings call coming in for Apple which, in terms of content, deviates exactly one standard deviation (8.4%) from its mean. The contribution to the conditional variance from the earnings call dummy is δ0 · 1 = 0.002299, and the contribution from the deviation from the mean is 8.4 · γ0 = 8.4 · 0.000176 = 0.001478. This gives a total shock of 0.002299 + 0.001478 = 0.003777 on the day of that earnings call. The interaction term is able to explain 0.001478/0.003777 · 100 = 39.1% of this increase in volatility. The contributions of the dummy and the interaction term alone already account for a shock of 0.003777/0.00026 ≈ 14.5 times the mean variance. This example shows that the interaction term is able to explain a significant portion of the shock in variance caused by a new earnings call for Apple, but does this also hold for Microsoft?

Figure 2: Conditional variance Microsoft

Figure 2 shows the conditional variance of the logarithmic returns of Microsoft. The six peaks are the result of the earnings calls; compared to the conditional variance of Apple, Microsoft’s conditional variance seems more jagged, with a slight dip after each peak. The mean of the conditional variance of Microsoft is 0.0001978. Consider what happens when a new earnings call comes in for Microsoft which also happens to deviate one standard deviation (4.0%) from its mean. The positive shock in variance caused by the earnings call dummy is δ0 · 1 = 0.000471, and the interaction term is responsible for a 4.0 · γ0 = 4.0 · 0.000095 = 0.000380 increase in variance. With the total shock being 0.000471 + 0.000380 = 0.000851, the deviation from the mean is able to explain 0.000380/0.000851 · 100 = 44.7% of this shock. The spike in conditional variance from the dummy and the interaction term accounts for a shock of 0.000851/0.000198 ≈ 4.3 times the mean variance.
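The two decompositions follow directly from the coefficients in Table 2; a small sketch reproducing the arithmetic:

    # Shock decomposition for the two stylized scenarios (coefficients from Table 2)
    scenarios = {
        "Apple":     {"delta0": 0.002299, "gamma0": 0.000176, "one_sd_pct": 8.4},
        "Microsoft": {"delta0": 0.000471, "gamma0": 0.000095, "one_sd_pct": 4.0},
    }
    for name, s in scenarios.items():
        dummy_part = s["delta0"]                      # earnings-call dummy, D_t = 1
        content_part = s["one_sd_pct"] * s["gamma0"]  # interaction term, one s.d. in x_t
        total = dummy_part + content_part
        print(f"{name}: total shock {total:.6f}, "
              f"content explains {content_part / total:.1%}")
    # Apple: total shock 0.003777, content explains 39.1%
    # Microsoft: total shock 0.000851, content explains 44.7%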

In summary, I find relatively large values for α1 and relatively small values for β1, which are typical characteristics of ’spiky’ markets. The large spikes in the conditional variance can be explained by the earnings calls and in line with the theory, I find positive values for δ0 and negative values for δ1. Although δ0 is highly significant for all companies, δ1 is only significant for Apple. This shows that there is a significant positive shock in variance on the days of the earnings calls. Similarly, I find positive values for the interaction term coefficient γ0 and negative values for the lagged coefficient γ1. The unlagged interaction term coefficient is significant for Apple and Microsoft, but not for Google. These insignificant results for Google are probably caused by the fact that a standard GARCH(1,1) model does not fit the daily logarithmic returns for Google.

Moreover, to give a meaningful interpretation to the coefficients, I examined two stylized hypothetical scenarios for Apple and Microsoft. I show that the interaction term is able to explain respectively 39.1% and 44.7% of the total shock in variance of a new earnings call that deviates one standard deviation in terms of content. My findings suggest that it is indeed possible to quantify the content of earnings calls with LDA and relate it to the volatility of stock returns.


5 Conclusion

This paper investigated whether the relationship between the content of earnings calls and the volatility of the stock price could be quantified using LDA. This was done in two steps. First, a 20-topic LDA model was fitted to the 6064 earnings call transcripts for the companies listed in the S&P500. With the topics resulting from the LDA estimation, a measure for the relative change in the content of the earnings calls was constructed, namely the percentage deviation from the mean for the most important topic of each company. This measure was then added to a GARCH(1,1) model. Second, this modified GARCH(1,1) model was fitted to the logarithmic returns of Google, Apple, and Microsoft to estimate the effect of a change in content on the volatility of the returns.

The main results of this paper are summarised as follows. First, LDA is able to categorise the content of the earnings calls into broad topics quite accurately. Second, I find the surprising result that the bulk of the content of the earnings calls revolves around a single topic. These topics are ’advertising’ for Google and Apple, and ’technology’ for Microsoft. Third, the regression results are in line with the theory, and I find evidence that the volatility spikes caused by the earnings calls last for only one day. I find significant results for Apple and Microsoft, but not for Google, which I suspect is caused by the misspecification of Google’s standard model. I find that changes in the main topic are able to explain a significant portion of the shocks in variance caused by the earnings calls. More precisely, I examined two stylized examples in which a new earnings call comes in that deviates one standard deviation from the mean in terms of content. I show that for Apple and Microsoft the measure created with LDA is able to explain respectively 39.1% and 44.7% of the shock in variance caused by the earnings call. In this regard, LDA proves to be a valuable tool to quantify the content of earnings calls.

As for future work, the methods used in this paper are easily extended to other companies and industry sectors. Different combinations of time series models and text models could also be used. A suggestion for such a time series model is the Non-linear Asymmetric GARCH(1,1) (NAGARCH) model, introduced by Engle and Ng (1993), which incorporates asymmetric responses in volatility to different types of news. The NAGARCH model reflects the fact that negative returns often have a larger impact on the volatility than positive returns, something that is not taken into account in the GARCH model. For a different text model, instead of filtering out the names manually, one could use the LDA-dual model, proposed by Shu, Long, and Meng (2009). The LDA-dual model is an extension of LDA which models text corpora with two different sources of information: words and names.


References

Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? the information content of internet stock message boards. Journal of Finance, 59(3), 1259–1294.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3), 307–327.

Bollerslev, T., Chou, R. Y., & Kroner, K. F. (1992). ARCH modeling in finance: A review of the theory and empirical evidence. Journal of Econometrics, 52(1), 5–59.

Bowen, R. M., Davis, A. K., & Matsumoto, D. A. (2002). Do conference calls affect analysts’ forecasts? Accounting Review, 77(2), 285–316.

Campbell, J. Y., Lo, A. W.-C., MacKinlay, A. C., et al. (1997). The econometrics of financial markets (Vol. 2). Princeton, NJ: Princeton University Press.

Curme, C., Preis, T., Stanley, H. E., & Moat, H. S. (2014). Quantifying the semantics of search behavior before stock market moves. Proceedings of the National Academy of Sciences, 111(32), 11600–11605.

Das, S. R., & Chen, M. Y. (2007). Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science, 53(9), 1375–1388.

Davis, A. K., Piger, J. M., & Sedor, L. M. (2006). Beyond the numbers: An analysis of optimistic and pessimistic language in earnings press releases. Federal Reserve Bank of St. Louis Working Paper Series (2006-005).

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 987–1007.

Engle, R. F., & Ng, V. K. (1993). Measuring and testing the impact of news on volatility. Journal of Finance, 48(5), 1749–1778.

Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. Journal of Finance, 25(2), 383–417.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228–5235.


Hansen, S., McMahon, M., & Prat, A. (2014). Transparency and deliberation within the FOMC: A computational linguistics approach. Centre for Economic Performance Discussion Papers, dp1276.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (pp. 50–57).

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Lee, J. A. (2014). Scripted earnings conference calls as a signal of future firm performance. Available at SSRN 2426504.

Mayew, W. J. (2008). Evidence of management discrimination among analysts during earnings conference calls. Journal of Accounting Research, 46(3), 627–659.

Mayew, W. J., Sharp, N. Y., & Venkatachalam, M. (2013). Using earnings conference calls to identify analysts with superior private information. Review of Accounting Studies, 18(2), 386–413.

Mitchell, M. L., & Mulherin, J. H. (1994). The impact of public information on the stock market. Journal of Finance, 49(3), 923–950.

Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., & Radev, D. R. (2010). How to analyze political attention with minimal assumptions and costs. American Journal of Political Science, 54(1), 209–228.

Salton, G., & McGill, M. J. (1986). Introduction to modern information retrieval. New York, NY, USA: McGraw-Hill, Inc.

Shu, L., Long, B., & Meng, W. (2009). A latent topic model for complete entity resolution. In Proceedings of the 25th International Conference on Data Engineering (pp. 880–891).

Strawderman, R. L. (2001). Continuous multivariate distributions, volume 1: Models and applications. Journal of the American Statistical Association, 96(454), 782–783.


Appendix A

A1 - Dirichlet distribution

The k-dimensional Dirichlet random variable θ is defined on the (k − 1)-simplex (non-negative elements that sum to 1) and has the following probability density function on that simplex:

p(θ|α) = [Γ(α0) / ∏_{i=1}^{k} Γ(αi)] θ_1^{α1−1} · · · θ_k^{αk−1}  (3)

where α0 = Σ_{i=1}^{k} αi, and the k-vector α has all elements αi > 0. Because the Dirichlet distribution is a multivariate generalisation of the beta distribution (Strawderman, 2001), it can be interpreted as a multivariate distribution over distributions.
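As a quick illustration of these properties, drawing Dirichlet samples with numpy (the parameter values are toy choices; α = 2.5 matches the symmetric prior 50/K used in Section 3.3):

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.dirichlet(np.full(3, 2.5))    # one draw from Dir(2.5, 2.5, 2.5)
    print(theta, theta.sum())                 # non-negative elements summing to 1

    # Smaller alpha concentrates the mass on sparser mixtures
    print(rng.dirichlet(np.full(3, 0.1)))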

A2 - Example of a word matrix

Consider the following vocabulary {animal, play, food}. An example of a word matrix β for this vocabulary is:

        animal  play  food
cat       0.80  0.20  0
dog       0.50  0.50  0
fruit     0     0.05  0.95

The rows sum to one because the topics (cat, dog, and fruit) are probability distributions over words. This implies that p(w_n | z_n, β) is the probability of the word w_n being chosen from the word matrix, given that the topic is z_n. Furthermore, β_11 = p(w_n = animal | z_n = cat, β) = 0.80 is simply the probability of drawing the word ’animal’ from the word matrix, given that the topic (row) is cat.


Appendix B - figures and tables

Figure 3: Daily share prices and daily logarithmic returns
(a) Google share price  (b) Google log returns
(c) Apple share price  (d) Apple log returns
(e) Microsoft share price  (f) Microsoft log returns

Figure 4: Topic proportions and deviations from the mean per earnings call
(a) Topic proportions Google  (b) Deviations from mean (topic 6) for Google
(c) Topic proportions Apple  (d) Deviations from mean (topic 6) for Apple
(e) Topic proportions Microsoft  (f) Deviations from the mean (topic 8) for Microsoft
