
Expert versus machine: the predictive power of the sentiment of conference calls for future performance of the firm

Oswin Frans 10053166

University of Amsterdam, Faculty of Economics and Business

Master Thesis

November 2014

Abstract

Textual analysis has become more and more prevalent in a variety of domains, including finance. It is often employed to ascertain the tone of a corporate disclosure or document, with the aim of assessing the amount of information within the wording of said corpus. These techniques have also been applied to quarterly earnings conference calls because they are one of the most important vehicles for corporate disclosure. Recent research has found that the tone of these earnings conference calls has significant explanatory power for the abnormal returns both in the short term and over a longer 60 trading day period. This research looks at linguistic based tonal measures and Naïve Bayesian Classifier (NBC) based tonal measures. The author confirms a significant relationship between a linguistic based tone and the abnormal return of a firm in the short term, but not in the long term. Additionally, no significant improvement of the NBC based tones over context specific dictionary tones was found, nor for a linguistic tone to which a term frequency–inverse document frequency algorithm was applied. Moreover, the NBC based tonal measures exhibited almost an inverse relationship with the cumulative abnormal return in the quartile analysis.

Keywords: investor announcements, conference calls, machine learning, disclosure, textual analysis, Naïve Bayesian Classifier

Data Availability: The dataset is described in section IV of this paper.

MSc Business Economics, Finance
Thesis supervisor: Dr. Torsten Jochem


Acknowledgements:

I would like to thank my family and friends for their steadfast support throughout my studies. Many thanks especially to Erik Blokland and Floris Busscher for their invaluable advice and friendship; it has been a great boon to both my studies and my life. Additionally, I would like to profusely thank the staff of the Finance group, who have imparted to me an excellent education. Moreover, I would like to especially thank Dr. Torsten Jochem for his guidance and advice. Lastly, I would like to thank Sietse Boontje, Nikki Doorhof and Melvin Roest for their feedback on my thesis.

Statement of Originality:

This document is written by Oswin Frans who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Table of Contents

I. Introduction ... 4

II. Related Literature ... 5

III. Hypothesis Development ... 8

IV. Data and Methodology ... 10

IV.1 Sentiment ... 14

IV.1.A Linguistic Approach... 14

IV.1.B Naïve Bayes Classifier ... 15

IV.2 N-grams ... 16

V. Results ... 19

VI. Discussion ... 27

VII. Conclusion ... 30

Bibliography ... 31

Appendix I – Correlation of variables ... 35

Appendix II – VIF Factors ... 36

Appendix III - Shapiro-Wilk test for normal data ... 38

Appendix IV - Henry (2008) word list... 39


I. Introduction

An earnings conference call is a call between managers and analysts shortly after the release of the quarterly earnings, with the aim of providing context and additional explanation of the numbers released. Recent research has emphasized the importance of the conference call in disseminating information (Kimbrough, 2005; Frankel et al., 2010). The conference call allows the market to gain otherwise private information and to better predict future firm performance. However, discerning the informational content of such a conference call is difficult because it is harder to glean information from words than from numbers.

To quantify the informational content of the conference call, this research turns to sentiment analysis. This is the process of determining the sentiment of a particular sentence, word or text and has been extensively researched in the new millennium (Kumar and Sebastian, 2012; Pang et al., 2002). One would reason that the tone of the manager and of the conversation could reveal information to investors that is not included in the earnings release. Hence, if this argument holds, a link between the stock price and the tone of the conference call would be expected. This link has been demonstrated by Price et al. (2012), who have shown that the linguistic tone of a conference call holds predictive power over the abnormal return in both the short and the long run.

Much recent research has analyzed the sentiment and tone of various corporate documents. Loughran and McDonald (2010) developed a domain specific dictionary and find it to have explanatory power. Other finance related research has also utilized a linguistic method to ascertain the tone of various corpora (Demers and Vega, 2008). For example, Henry (2008) used a custom dictionary to find a significant relationship between the tone of the earnings release and the abnormal return. In this paper, the author focuses on six measures of tone and tries to ascertain whether they are significant predictors of abnormal return and which measure has the highest predictive value. Three of the six tonal measures are linguistic based, while the other three are based on machine learning.

There has been a rise in machine learning, partly caused by the recent ubiquity of data and the exponential increase in processing power. Thus, alternative ways of modeling data such as artificial neural nets and random forests become more feasible, powerful and interesting, as discussed in Sebastiani (2002). Machine learning techniques are based on mathematical procedures employing statistics and multidimensional spaces to find the best related matches. The Naïve Bayesian Classifier was chosen for this study because it is calculated by evaluating a closed form expression instead of an iterative procedure, thus saving computational power and time. Moreover, there has been longstanding success with Naïve Bayesian Classifiers in the textual analysis space (McCallum and Nigam, 1998; Schrauwen, 2010). Consequently, this research trains a Naïve Bayesian Classifier and compares this model to more linguistic based measures.

To evaluate the relationship between the tonal measures and stock prices, this research utilizes an event study. An event study is based on the assumption that new information is quickly incorporated in the stock price and looks closely at the abnormal return. The abnormal return is the return of the firm in excess of the expected return given no new information. It is calculated by taking the difference between the return of the firm in question and that of a broader sample of firms in the market (Brown & Warner, 1980, 1985).

In this study six tonal measures of conference calls are evaluated as predictors of the future performance of the firm. Three are based on a domain specific dictionary and three are based on a Naïve Bayesian Classifier (NBC). The NBC is a machine learning based algorithm that applies Bayes rule to a corpus to assign a probability of it belonging to a certain class after it has been trained on another data set. The methodology has had great success in other domains such as social media (Go et al., 2009) due to its ability to pick up on novel patterns and nuance in a text. The six measures are evaluated in a regression on the abnormal returns, while controlling for various firm characteristics, the presence of a dividends release and surprise earnings. Moreover, they are evaluated using the partial correlation given the various regression variables. This research has found partial out of sample confirmation of the results of Price et al. (2012) by confirming the significance of the tonal measures developed by Henry (2008) when predicting the abnormal returns. However, the long term effect they reported was not found. Moreover, the other tonal measures tested, including the machine learning based ones, did not display any significance. These results were found both in the regression outputs and in the partial correlation scores. While the findings of this research are subject to the limitations outlined in section VI, they suggest that current machine learning implementations for textual analysis in the domain of finance are still less effective than those based on expert knowledge. Future research can investigate novel machine learning techniques that push the boundaries of the current capabilities.

The collection of the conference calls was done via a web crawler1 programmed by the author. Moreover, all the tonal measures were extracted from said conference calls through original scripts written by the author. Furthermore, data munging and cleanup were also achieved through Python. The implementation was done in Python with the aim of efficiency and the code is available upon request. In the following sections the author first discusses the related literature. Secondly, the development of the hypotheses is discussed. Thirdly, the data and methodology are examined. Fourthly, the results are presented. Fifthly, weaknesses of the research and possible future avenues are discussed. Finally, the last section concludes the research.

1 A web crawler is a bot that is programmed to automatically surf the internet. The author also programmed it to scrape the conference calls.

II. Related Literature

A seminal work in the area of sentiment analysis was done by Loughran and McDonald (2010), in which they test different word lists on companies' 10-K documents. They find that most common dictionaries are not suited to financial texts, partly because words such as vice, mine and tax have a very different interpretation there. Consequently, they developed their own word list. Another interesting work was done by Hobson, Mayew and Venkatachalam (2012), who showed that vocal dissonance markers, which indicate cognitive dissonance and thus possibly financial misreporting, correlate positively with said misreporting. Engelberg, Reed and Ringgenberg (2012) showed that short sellers gain their main advantage over the rest of the market through their ability to quickly analyze publicly available data. This implies that the ability to quickly process information can provide an edge and be profitable.

Additionally, Ramalingegowda (2014) showed that investors with a longer horizon are better informed than their peers, hence illustrating the importance of information. Moreover, Matsumoto, Pronk and Roelofsen (2011) showed that the informational content of conference calls exceeds that of traditional documents and statements such as earnings releases, underscoring the importance of conference calls. Further underscoring this importance is research done by Mayew, Sharp and Venkatachalam (2013), which showed that analysts who participate more in earnings conference calls outperform their peers when controlling for other factors3.

Garcia (2013) researched the effect of sentiment, as dictated by financial columns, on the stock market during recessions. His research showed a significant relationship between sentiment and stock return, thus validating the type of relationship this study attempts to research. Furthermore, Larcker and Zakolyukina (2012) created a dictionary for detecting deceptiveness based on conference calls and show that deceptive executives use certain classes of words more often than their more honest counterparts. Additionally, Rogers, Van Buskirk, and Zechman (2011) analyzed the tone of earnings announcements to better forecast shareholder litigation. They find that litigation risk increases when managers are abnormally optimistic. Moreover, Loughran and McDonald (2013) also applied textual analysis to IPO filings and found that if the description of the firm's business strategy and operations was more uncertain, the stock would have higher first-day returns, absolute offer price revisions and volatility.

Other recent studies in the financial domain utilizing textual analysis are the works of Tetlock (2007) and Tetlock et al. (2008). Tetlock (2007) found that high media pessimism can predict a decline in market prices, and vice versa. Moreover, Tetlock et al. (2008) found that the number of negative words in the news can predict the accounting earnings and stock return of an S&P 500 firm. Engelberg (2008) delved further into this area by employing Tetlock's methodology, counting negative words in the Dow Jones News Service regarding earnings announcements. He found that the text based predictions added to the precision of predictions based on numerical earnings information. Thus, there is a significant amount of evidence suggesting the merit of using text based content analysis to examine finance related questions.

Part of the question of interest is the so called post earnings announcement drift. This occurs when abnormally high (low) returns follow positive (negative) earnings announcements. It is a phenomenon that has been studied since Ball and Brown (1968) and Bernard and Thomas (1989, 1990) without being satisfactorily explained. New research has proposed that this drift could be caused by uncertainty (Lewellen and Shanken, 2002; Brav and Heaton, 2002; Liang, 2003; Zhang, 2006; Francis et al., 2007). The argument is that when there is a large degree of uncertainty, investors want and need to learn about the future prospects of the firm, and the information could be incorporated into asset prices only slowly. This mechanism could be an explanation of the drift. The tone of the conference call may be the additional information the investors seek, and the drift could still be linked to this tone: tonal information may be harder to process and/or access, resulting in the drift under scrutiny.

This drift should not exist if the efficient market hypothesis holds: the informational content of the earnings release should be fully incorporated into asset prices shortly after the information becomes available to the public. Bernard and Thomas (1989, 1990) tried and failed to fully explain the drift in terms of risk adjustments, market frictions and psychological biases. Their work does suggest that the drift is a delayed response and that information simply is not fully incorporated in the short term. This could be caused by market participants not being fully able to process the information within the earnings release, which is what Abarbanell and Bernard (1992) suggest. This failure to fully incorporate the information immediately could be caused by the difficulty of processing non-numerical data, as has been suggested by Engelberg (2008). Hence, this research turns to textual analysis4 to provide a quantification of text based information.

3 They control for unexpected earnings, linguistic tone, size, growth and risk.

4 Textual analysis is the analysis of the word occurrences in a text. The frequencies are counted and one can compare and contrast with other texts. Moreover, inferences can be made on the basis of said occurrences. First used by Mosteller and Wallace (1963).


Let us now turn our attention toward the conference call and the research that has shown that it is relevant to study. Kimbrough (2005) notes that the conference call provides managers with a platform to comment on the recently released earnings and share their opinion on the implications for future firm performance. Kimbrough found that conference calls correlate with a significant lowering of the post announcement drift, which seems to imply that conference calls have a positive effect on the efficiency of the market. The importance of conference calls as a vehicle to disseminate information is further underscored by Irani (2004). Irani suggests that conference calls are used increasingly frequently to disclose information, and the National Investor Relations Institute indicates that conference calls are the second most widely used service to distribute information. Therefore, it is plausible that the conference call after an earnings release contains additional information not immediately processed by the market.

Previous research has looked at the tonal information of the earnings conference call. Frankel et al. (2010) used the tonal measure of the conference calls as a proxy for the relation between a firm and its investors when investigating investor relations cost. Their data seem to suggest a positive relation between conference call returns and the tonal measure. Also pertaining to this problem is the screening of the information given by the manager in the conference call. Dye (2001) found that this is a special case of game theory and that firms will only share information that is favorable to them and will thus screen out negative information. Hence, the interpretation of the information should be done with this incentive scheme of the manager in mind. However, a distinction can still be made between less positive and more positive. Moreover, Healy and Palepu (2001) discuss the incentives for managers to screen their voluntarily given information. This could imply that the information is heavily screened and is of limited value to investors. On the other hand, due to the question and answer format of the conference calls, they argue that new information is bound to surface. In the tonal area, Mayew and Venkatachalam (2012) have done notable research. They analyzed the audio files of the conference calls instead of the text. They found that the voice inflection of the manager, when controlling for firm fundamentals, firm performance and the number of negative words, had predictive power for the firm's financial future.

This research is closely related to that of Price et al. (2012), who also researched the incremental informativeness of the textual tone of earnings conference calls. Their model employs textual analysis to find that tone is a significant predictor of abnormal returns, testing both a context-specific and a general dictionary. This research replicates their method for a different, more recent time period (2008-2014 versus 2004-2007). An additional contribution to the literature is made by testing different methods of textual analysis, such as the Naïve Bayes method. These methods will be expounded on in the methodology section. This is a relevant contribution because machine learning models in general are not yet widely used in finance research. Moreover, many different methods are currently used to measure tone, and this research could illuminate the strengths and weaknesses of these methods in the area of finance, or at least pertaining to conference calls.

Outside of the finance context, much work has been done on sentiment analysis and its myriad applications. Kumar and Sebastian (2012), for example, heavily researched how one can determine the sentiment of a document, sentence or word. Moreover, Pang and Lee (2004) researched how to distinguish between objective and subjective statements. Furthermore, Bethard et al. (2004) researched how to ascertain information about the author of a statement to better determine whether there is an angle or not.

Other research tries to combat the limitations of sentiment analysis. For instance, Carvalho et al. (2009) found that heavy punctuation marks, quotation marks and emoticons can aid in discovering the presence of sarcasm within a text, which is an arduous task sometimes even in real life. Jia et al. (2009) tried to tackle the problem of negation. It is difficult to determine the range of negating words such as 'not' in a sentence. This is important because it can change the whole meaning of a sentence. The proposed solution is a combination of heuristics and exceptions to these heuristic rules. Moreover, it has been shown by Wiegand et al. (2010) that differences in language greatly impact sentiment analysis due to differences in characteristics such as synonym count and vocabulary usage. Another problem strongly related to this research is domain specificity. As has been previously discussed, words can have a totally different meaning in finance than they have in everyday use. A classifier will also have a hard time adjusting to this (Tan et al., 2009); however, in this research the author does not switch domains, so this should be a minimal problem.

Apart from the differences across domains, e.g. movies (Pang et al., 2002) and financial news (Das and Chen, 2007), one also needs to account for the medium in which the information is conveyed. It has already been shown that news articles (Yu and Hatzivassiloglou, 2003), blog posts (Godbole et al., 2007) and user-generated content (Schrauwen, 2010) all require a different approach. Hence, the need to account for the differences across the above dimensions creates the requirement for more customized models in sentiment analysis. This is part of the reason this research looks into Naïve Bayesian Classifiers. They are trained for a specific context and it is theorized that they will be able to pick up on more nuance, specific patterns and other phenomena than more traditional linguistic based classifiers.

III. Hypothesis Development

It is expected that the surprise earnings control variable (SURP) will be the most important predictor in both the initial reaction period and the post announcement drift and will yield a significantly positive coefficient estimate. Moreover, it is expected that the various tone measures will be significant, given that additional information is expected to be disclosed during conference calls, as has been shown by Price et al. (2012). The efficient market hypothesis implies that these two effects will no longer be significant in the larger time window because they will be incorporated in the price after a short time period. On the other hand, the results found by Price et al. and the presence of the post announcement drift seem to indicate that they are still present in the longer run. Another argument that strengthens this line of reasoning is the research done by Engelberg (2008), which suggests that soft information such as the tone of a conference call is a better predictor in the long run because qualitative information may be processed comparatively more slowly than quantitative information. In this research three different types of tonal measures are considered: the Henry tone5 measures, the dif tone6 measure and the Naïve Bayesian Classifier based tonal measures. Given the previous research and reasoning, the following hypotheses are formulated:

H1: The Henry tone measures have a positive relationship with a firm's future performance.

H2: The dif tone measure has a positive relationship with a firm's future performance.

H3: The Naïve Bayesian Classifier based tonal measures have a positive relationship with a firm's future performance.

This study looks at six tone measures, three of which, to this author's knowledge, have yet to be used in the finance literature, and three others which have been proven to effectively identify the signal of the tone. These three new measures are all based on Naïve Bayesian Classifiers at different levels of n-grams.

5 These are the tonal measures developed by Henry (2008) and will be expounded upon in the methodology section.

6 Dif refers to a difference measure of positive minus negative sentiment and will be explained in detail in the methodology section.

The main advantage of the Naïve Bayesian method compared to the linguistic approach is the fact that all words and word groupings (features) are considered, rather than only the single words present in the chosen dictionary. Thus, theoretically these classifiers should be able to identify deeper and unexpected relations within the corpus and should be better able to predict the nature of a corpus. A disadvantage of this method is overfitting, a problem common to a large set of machine learning implementations. Utilizing this method, one can conceive that not only the signal is identified, but that a large part of noise is also included in the training of the model. This criticism can be somewhat mitigated by the use of a large and randomly selected training set, because that will ensure that the classifiers are stronger and more properly trained. Moreover, the basis on which the proposed model is trained is the words used, and while it is conceivable that there is some noise among the words chosen, they do represent a large part of the signal of one's tone. As has been shown by Carvalho et al. (2009), machine learning approaches can detect more specific patterns and other phenomena such as abbreviations or trend and time related words. Hence, it can be argued that more, not less, signal is identified. On the other hand, as discovered by Mayew and Venkatachalam (2012), the use of one's voice can also reveal additional information not captured by the choice of words. Another problem commonly associated with machine learning models is the fact that they are not transparent: one is simply employing a black box without properly understanding what is actually happening or the logic behind the process. This criticism is less pertinent to the implementation in this study, because the classifiers are based on relatively simple Bayesian logic. Although there are drawbacks to the Naïve Bayesian method, due to the advantages of employing the whole corpus it is expected that the Naïve Bayesian Classifiers will better capture the tone of a conference call and thus be a more dominant predictor of the cumulative abnormal return while controlling for surprise earnings.

This study looks at three different levels of Naïve Bayesian Classifiers: one trained at the unigram level, one trained at the bigram level and one trained at the trigram level7. Evidence on which of these models performs best is mixed across different studies and domains. Pang et al. (2002) found that adding bigrams has a harmful effect on accuracy, while the research done by Bermingham et al. (2010) and Go et al. (2009) suggests that adding bigrams to the model will increase performance. Considering the conflicting findings and the underlying theory, the author expects the bigram classifier to outperform the other classifiers. One could argue that due to the nature and professionalism of the earnings conference calls, executives will avoid simple constructions and utilize more negation and convoluted speech patterns. The manager could use these patterns and habits to obscure information. Hence, a bigram classifier will be better suited to pick up on these nuances, but a trigram classifier will pick up too much noise.

The performance of the tone measures will be quantified through the partial correlation of the various tone measures with the cumulative abnormal return, partialling out the control variables. Partial correlation is a metric for the degree of association between two variables: it measures the marginal contribution of one explanatory variable when all others are already included in the model. Together with the regression output, it can thus be regarded as a good indicator of which tonal measure is most fit for capturing the extra information of an earnings conference call. Accordingly, the author arrives at the following hypothesis:

H4: The Naïve Bayesian Classifier based measures are better predictors than the other more linguistic based measures.

7 An n-gram is a sequence of n items, in this study words, and will be expounded upon in the methodology section.
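As an illustration, the partial correlation described above can be computed by residualizing both series on the controls and correlating the residuals. The sketch below is a minimal reconstruction of that procedure, not the thesis code; the function name and inputs are illustrative.

```python
import numpy as np
import statsmodels.api as sm

def partial_corr(y, x, controls):
    """Partial correlation of y and x given the controls: regress each on
    the controls, then correlate the two residual series."""
    Z = sm.add_constant(controls)
    resid_y = sm.OLS(y, Z).fit().resid
    resid_x = sm.OLS(x, Z).fit().resid
    return np.corrcoef(resid_y, resid_x)[0, 1]
```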


IV. Data and Methodology

The sample period of this research is 2008-2014, due to the availability of the conference calls, and focuses on North America. Firms in the sample are required to have sufficient stock data from CRSP and Bloomberg to measure the post-announcement drift (Bernard and Thomas, 1989) and calculate the abnormal return (at least 60 days), earnings-per-share (EPS) and other firm data from Compustat, and analyst data from Thomson Reuters. The transcripts were obtained from the Seeking Alpha website, utilizing a web crawler. Seeking Alpha has a staff dedicated to the release of the transcripts of earnings conference calls. They cover over 4500 firms, partly based on the interest of their user base in particular companies. Although Seeking Alpha has a very wide selection of conference calls, this study only includes a call in the sample if the firm was traded at least 60 trading days before and 60 trading days after the earnings conference call. Moreover, the firm also needed to have data available on the control variables included in this research. The web crawler was programmed to crawl the conference calls semi-randomly, taking time into account as a factor to ensure that the entire time period was properly represented in the sample. This was done by sorting the earnings conference calls by release date and then randomly selecting one of every ten conference calls. This pool was then winnowed by applying the criteria described above. In table 1 some information regarding the sample of firms is displayed. As can be seen, size and the measure of analyst coverage do not really differ across the sample, while especially the leverage fluctuates heavily. Most observations come from the middle years, with a relatively small contribution from 2008 and 2014. Only 7 firms have consecutive observations across the years; hence the capture of firm fixed effects in the regression could be problematic, and only a very large effect could be shown to be significant due to this feature of the data set.

Table 1: Descriptives of Sample of Conference Call Firms

Year   N    Unique Firms   Size     Leverage   ANL     Equity      ROA
2008     9     5           23.044   123.318    2.968   1583075       11.017
2009    60    58           23.084   114.661    2.728   1328965       -3.514
2010    45    45           23.199    38.115    2.853   1482213      324.800
2011    64    61           23.091    81.370    2.964    778044.7    120.192
2012    50    49           23.190    92.655    3.000   1364646      928.210
2013    33    32           23.036     2.001    2.974    886737.3   1373.812
2014     8     5           23.092     1.888    3.044   2789441        5.716
All    269   248           23.119    74.076    2.894   1208727      425.415

In this table an overview of the sample of conference calls is given. The number of observations and some descriptive statistics of the firms are given per year and for the total sample. The reported descriptive statistics are the means of the data of interest. Size is the log of firm market capitalization from the previous quarter. ROA is return on assets, in percent, calculated as net income divided by total assets, multiplied by one hundred. Leverage is expressed as the ratio of total liabilities to total assets. ANL is the log of the number of analysts who cover a given firm. Equity is the total equity of the firm at the end of the quarter.

A potential problem of this research is that the effect of the release of the earnings announcement is measured, rather than any additional information provided by the conference call. To prevent this problem, a surprise earnings measure is included in the regression. Moreover, the release dates have been examined; Price et al. (2012) have shown that over 85% of conference calls take place on the same day as the earnings announcement. This has further been corroborated by the author, who randomly sampled 15 conference calls and manually checked them to find that 13 of the 15 took place on the same day as the earnings announcement. A related issue is the release date of the conference call transcripts: while the conference call may be recorded on the same day as the earnings announcement, this does not imply the transcripts are released to the general public on the same day. This could imply that the information the conference call provides is taken into account much later and that some investors may be privy to private information on the basis of earlier access to the conference call.

Table 2: Correlations of control variables

            ROA       trad vol   leverage   Equity    dod       size      return sd  ANL
ROA         1
trad vol   -0.0067    1
leverage   -0.0254    0.0063     1
Equity      0.0317   -0.0114     0.092      1
dod        -0.0491   -0.078     -0.0604     0.0174    1
size       -0.0629    0.3292    -0.0513     0.0639    0.14      1
return sd   0.0495    0.0147    -0.0574     0.024     0.0219    0.1201    1
ANL         0.0565    0.2926    -0.0843     0.0176   -0.0135    0.2466    0.0898     1

This table provides correlations of the control variables for the full sample of 296 observations. Dod (declaration of dividends) is an indicator variable equal to 1 if the firm pays dividends, and zero otherwise. Size is the log of firm market capitalization from the previous quarter. ROA is return on assets, in percent, calculated as net income divided by total assets, multiplied by one hundred. Leverage is expressed as the ratio of total liabilities to total assets. Trad vol is the log of the total share trading volume on day zero. Return sd is in percent and is calculated as the standard deviation of daily returns for the ninety trading-day period ending 10 days prior to the conference call, multiplied by one hundred. ANL is the log of the number of analysts who cover a given firm. Equity is the total equity of the firm at the end of the quarter.

Displayed in table 2 are the correlations of the control variables. Furthermore, the correlations of the explanatory variables were checked to avoid overfitting (Babyak, 2004), because too many variables can negatively influence a regression. Thus the author calculated the correlations between the variables and would have eliminated any whose correlation was above 0.90, because such overlap would imply that they do not add significant predictive power to the regression. Fortunately, as can be seen from the table, this is not an issue. Moreover, as can be seen in appendix I, there is no issue with multicollinearity. The correlations of the independent variables of interest with the controls are low enough not to pose an issue and bias the estimators.

The cumulative abnormal returns are calculated to quantify the reaction of the market to the event under study. The abnormal return is calculated as the difference between the raw return for stock j on day t and the mean return of a portfolio of all firms in the same size decile (Bernard and Thomas, 1989, 1990), as in the following formula:

$$AR_{j,t} = R_{j,t} - R_{p,t} \qquad (1.1)$$

Where $AR_{j,t}$ is the abnormal return for firm j on day t, $R_{j,t}$ is the return for firm j on day t, and $R_{p,t}$ is the mean return on day t for all firms in the same size decile as firm j. These abnormal returns are then cumulated across two time windows: an initial 3 day window to measure the initial effect and a larger 60 day window to capture the post announcement drift. The goal of including this larger window is to more closely examine the relationship between the tone measures and the post announcement drift. The window of 60 days was chosen based on research done by Campbell et al. (2009), which found that the drift only survives 60 days. Hence, the cumulative abnormal return of stock j over each window is given by:


$$CAR_j(-1,1) = \sum_{t=-1}^{1} AR_{j,t} \qquad (1.2)$$

$$CAR_j(2,60) = \sum_{t=2}^{60} AR_{j,t} \qquad (1.3)$$
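As a concrete illustration of equations (1.1) through (1.3), the following minimal pandas sketch computes both CAR windows for a single firm. The series names and the event-time indexing (day 0 = conference call date) are assumptions for illustration, not the thesis code.

```python
import pandas as pd

def event_cars(firm_ret: pd.Series, decile_ret: pd.Series):
    """firm_ret and decile_ret hold daily returns indexed by event day,
    where day 0 is the conference call date."""
    ar = firm_ret - decile_ret      # (1.1): abnormal return AR_{j,t}
    car_short = ar.loc[-1:1].sum()  # (1.2): CAR(-1,1), initial reaction window
    car_drift = ar.loc[2:60].sum()  # (1.3): CAR(2,60), post-announcement drift
    return car_short, car_drift
```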

Before running the regressions, the power of the tone measures is examined more closely by means of quartiles created from these tone measures. To this end, the means and medians of the high and low quartiles are compared utilizing Wilcoxon rank-sum tests and t-tests.
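A sketch of these comparisons with SciPy follows; the quartile arrays are randomly generated placeholders, since the thesis data are not reproduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder CARs for the top and bottom tone quartiles; in the study these
# would come from sorting the sample of calls on each tonal measure.
car_high = rng.normal(0.01, 0.05, size=74)
car_low = rng.normal(-0.01, 0.05, size=74)

t_stat, t_p = stats.ttest_ind(car_high, car_low, equal_var=False)  # difference in means
w_stat, w_p = stats.ranksums(car_high, car_low)                    # Wilcoxon rank-sum
print(t_p, w_p)
```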

To measure the earnings surprise, the four quarter lagged earnings per share measure is chosen over a more analyst based one. This was done because Collins et al. (2009) have shown that analysts only forecast GAAP earnings 65% of the time, thus making an erroneous forecast a large percentage of the time. Moreover, Graham et al. (2005) have shown that managers believe their four quarter lagged earnings per share is their most important metric. This is especially relevant because this study scrutinizes the conference call tone, of which managerial tone could be an important part. Hence, unexpected earnings are calculated according to the following formula:

$$SURP_j = \frac{EPS_{q} - EPS_{q-4}}{P_{q-4}} \qquad (1.4)$$

The change in the earnings per share over a year is divided by the stock price of that same time period.
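In code, equation (1.4) reduces to a one-liner; the worked example below is purely illustrative.

```python
def surprise_earnings(eps_q: float, eps_q_lag4: float, price_lag4: float) -> float:
    """(1.4): year-over-year change in EPS scaled by the four-quarter-lagged price."""
    return (eps_q - eps_q_lag4) / price_lag4

# EPS rising from 1.00 to 1.10 with a lagged stock price of 40 gives 0.0025.
surp = surprise_earnings(1.10, 1.00, 40.0)
```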

Cross-sectional regressions are run to further investigate the effect of the tonal measures on the cumulative abnormal return, while taking into account surprise earnings. These are formalized in the following way:

$$CAR_j = \alpha_i + \beta_1 \, TONE_j + \beta_2 \, SURP_j + \gamma \, CONTROLS_j + \varepsilon_j \qquad (1.5)$$

Where $CAR_j$ represents the cumulative abnormal returns for conference call j, $\alpha_i$ represents a firm fixed effect, $TONE_j$ represents one of the 6 different tone measures for conference call j, $SURP_j$ represents the surprise earnings, $CONTROLS_j$ the various control measures and $\varepsilon_j$ represents the error term for conference call j. The control measures include size, which is the log of firm market capitalization from the previous quarter. Also included is the return on assets (ROA), calculated as net income divided by total assets multiplied by one hundred. This is included because it is assumed investors demand more information when a firm is not profitable. Additionally, leverage (leverage) is included as a control to account for the desire for more information when there is financial distress. This control is expressed as the ratio of total liabilities to total assets. Trad vol is the log of the total share trading volume on day zero. Volatility (return sd) is also included; it is in percent and is calculated as the standard deviation of daily returns for the ninety trading-day period ending 10 days prior to the conference call, multiplied by one hundred. This control variable is included because it could pertain to investors' possible search for more information when facing more uncertainty. Furthermore, the total equity of the firm is included (Equity), as well as the log of the number of analysts covering the stock (ANL). Because dividends can significantly impact stock prices, especially in the short run (Michaely, 1991), a binary variable indicating the declaration of dividends within the short three day window (dod) is considered. This variable is 1 if there is an announcement and 0 otherwise. Size, ROA, trad vol, return sd and equity were taken from CRSP, Bloomberg and COMPUSTAT, and ANL from Thomson Reuters. Standard errors are adjusted for heteroscedasticity following White (1980) and for clustering by firm following Petersen (2009).
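A minimal statsmodels sketch of regression (1.5) with firm-clustered standard errors could look as follows; the data frame and column names are assumptions, and the firm fixed effects are omitted here given the few repeat firms noted earlier.

```python
import statsmodels.formula.api as smf

# df is assumed to hold one row per conference call, with columns named
# after the variables in the text and a 'firm' identifier for clustering.
model = smf.ols(
    "car_short ~ h_tone_1 + surp + size + roa + leverage + trad_vol"
    " + return_sd + equity + anl + dod",
    data=df,
)
# Clustering by firm yields heteroscedasticity-consistent standard errors,
# in the spirit of White (1980) and Petersen (2009).
res = model.fit(cov_type="cluster", cov_kwds={"groups": df["firm"]})
print(res.summary())
```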


In order to draw statistically correct conclusions, the data must be checked against the six assumptions behind a multiple linear regression.

Firstly, a linear relationship must exist between the dependent and independent variables. To enhance model specification, a comparison between the fit of nominal values and natural logarithms of the independent variables under study is conducted. For every single variable, absolute values appear to describe a linear relationship more accurately. Hence none of the variables is transformed into a (natural) logarithm.

Secondly, the independent variables must not be random, and there must be no exact linear relation between any two or more independent variables. The latter condition concerns multicollinearity, which arises when two or more of the independent variables, or linear combinations of them, in a multiple regression are highly correlated with each other. In practice, multicollinearity is often a matter of degree rather than of absence or presence, especially for financial data (Li, 2014). Indeed, the correlation matrix indicates that there are significant correlations between independent variables. Therefore the variance inflation factors (VIFs) are also scrutinized. The VIF can be calculated by performing a regression of an independent variable on the other independent variables; the VIF is then $1/(1 - R^2)$ and can be computed relatively easily. It can be interpreted as a measure of how much the variance of a coefficient is inflated by the relations with the other predictors. As can be seen from Appendix II, most VIFs are not particularly high and are well below the rule of thumb of 5 or 10. Size, trad vol and ANL are quite a bit higher, with scores even far above 300. However, all three of these variables are control variables and thus do not need to be removed or partialled out. This is not a problem because they are not collinear with the variables of interest, i.e. surprise earnings and the tonal measures, and their collinearity does not impact their ability to serve as control variables.
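Such VIFs can be computed with statsmodels as in the sketch below, which reuses the hypothetical df from the regression sketch above.

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["size", "roa", "leverage", "trad_vol",
                        "return_sd", "equity", "anl", "dod"]])
# VIF_k = 1 / (1 - R_k^2), where R_k^2 comes from regressing column k
# on all the other columns.
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
```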

Thirdly, the expected value of the error term, conditional on the independent variables, must be zero. This assumption is tested through a regression of the error term on the respective independent variables in the model. A significant linear relationship, also referred to as endogeneity, would indicate a violation of this assumption. Potential solutions to this problem include a different model specification or a two-stage least squares regression with instrumental variables (Stock and Watson, 2011). The regressions of the error term did not indicate a problem.

Fourthly, it is assumed the variance of the error terms is constant for all observations; this is also known as homoscedasticity. When the error term is conditionally heteroskedastic, this can bias the estimated standard errors. This problem has been prevented through correction of the errors with the White method.

Fifthly, the error term for one observation must not be correlated with that of another observation. Serial correlation, also known as autocorrelation, arises when this is the case. Positive serial correlation is most common in economic and financial data (Li, 2014). This would typically cause the standard errors to be too small, resulting in inflated t-statistics. Autocorrelation in the residuals can be tested by means of the Durbin-Watson statistic. However, the standard errors in this research are made autocorrelation-consistent and thus do not suffer from this problem.

Sixthly, the error term should be normally distributed. Razali and Wah (2011) find that the Shapiro-Wilk test has the highest statistical power at a given significance level for testing normality; hence it is utilized in this study. As can be seen from Appendix III, the errors do not seem to be normally distributed. However, as per the Gauss-Markov theorem, the errors only need to have equal variance for ordinary least squares to be the best linear unbiased estimator. Moreover, Lumley et al. (2002) have found that normality is not required for a linear regression to have merit. Furthermore, because the errors are corrected for heteroscedasticity using the method proposed by White (1980), even the assumption of equal variances of the error term is not needed.
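The Shapiro-Wilk test itself is a one-liner in SciPy; the sketch below assumes the fitted res object from the earlier regression sketch.

```python
from scipy import stats

w_stat, p_value = stats.shapiro(res.resid)
# A small p-value rejects the null hypothesis that the residuals are normal.
```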


IV.1 Sentiment

Several approaches exist to determine the tone of the conference calls. One method to classify documents on sentiment is based on a linguistic approach, using a predetermined and often handmade lexicon of words that is expanded (Godbole et al., 2007) by utilizing online dictionaries and other linguistic resources such as WordNet. The lexicon contains information on the polarity of single words (a positive or negative score) and often other attributes like subjectivity and relevance. The information pertaining to the words present in the document is used to determine the total polarity score of the text. A second option is to employ machine learning (ML) techniques such as the Naïve Bayes method or Support Vector Machines. These techniques are based on mathematical procedures, making use of statistics and/or multidimensional spaces and closest related matches. The classifiers are trained with labeled data (corpora with known polarity) and learn which words and structures tend to be used more often in a positive way, and which in a negative way. This is called supervised learning (Pang et al., 2002). These different approaches are compared with the aim of finding the best possible way to classify the tone of a body of text. The author primarily focuses on the Naïve Bayes method and two linguistic approaches.

IV.1.A Linguistic Approach

For the linguistic approach, the Henry measure based on the Henry (2008) dictionary will be used in two forms for robustness, as well as the dictionaries created by Loughran and McDonald (2011) with their accompanying approach.

The Henry measure consists of the categories Positive and Negative, both measured according to a simple relative measure $RM_{i,j}$ for category i and conference call j:

$$RM_{i,j} = \frac{n_{i,j}}{W_j} \qquad (2.1)$$

Where $n_{i,j}$ is the number of words in conference call j belonging to category i and $W_j$ is the total number of words in conference call j.

These Positive and Negative relative measures are then used to calculate two ratios to represent tone:

$$\text{h tone 1}_j = \frac{RM_{Pos,j}}{RM_{Neg,j}} \qquad (2.2)$$

$$\text{h tone 2}_j = \frac{RM_{Pos,j} - RM_{Neg,j}}{RM_{Pos,j} + RM_{Neg,j}} \qquad (2.3)$$
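A minimal sketch of the two Henry measures follows; the miniature word lists are placeholders, whereas the study uses the full Henry (2008) dictionary reproduced in Appendix IV.

```python
# Placeholder word lists; the study uses the full Henry (2008) dictionary.
POSITIVE = {"growth", "improve", "strong", "success"}
NEGATIVE = {"decline", "weak", "loss", "failure"}

def henry_tones(tokens):
    pos = sum(t.lower() in POSITIVE for t in tokens)
    neg = sum(t.lower() in NEGATIVE for t in tokens)
    h_tone_1 = pos / neg if neg else float("inf")               # (2.2)
    h_tone_2 = (pos - neg) / (pos + neg) if pos + neg else 0.0  # (2.3)
    return h_tone_1, h_tone_2
```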

Loughran and McDonald (2011) employ a different weighting in their approach and a different dictionary. Their weighting scheme is formalized as follows:

$$w_{i,j} = \begin{cases} \dfrac{1 + \log(tf_{i,j})}{1 + \log(a_j)} \, \log\dfrac{N}{df_i} & \text{if } tf_{i,j} \ge 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2.4)$$

Where N represents the number of documents in the sample, $df_i$ the number of documents containing at least one instance of word i, $tf_{i,j}$ the count of word i in document j, and $a_j$ the average word count in document j. Due to the log transformation, the first term of the equation dampens the impact of high frequency terms. Conversely, the second term of the equation regulates impact based on the number of documents a word occurs in: if a word occurs in fewer documents, it will have a higher weight. This weighting scheme is often referred to as term frequency–inverse document frequency (tf-idf) and is employed to control for the fact that some words are simply used more than others. On the other hand, one could reason that more frequently used words still convey more information than less frequently used words, which would invalidate this measure.
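Expressed in code, weighting scheme (2.4) looks roughly as follows; this is a direct transcription of the formula, not the thesis implementation.

```python
import math

def lm_weight(tf_ij: int, a_j: float, df_i: int, n_docs: int) -> float:
    """Term weight per (2.4): dampened term frequency, normalized by the
    document's average word count, times the inverse document frequency."""
    if tf_ij < 1:
        return 0.0
    return (1 + math.log(tf_ij)) / (1 + math.log(a_j)) * math.log(n_docs / df_i)
```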

Loughran and McDonald also created six distinct dictionaries solely for the field of finance, pertaining to negative words, positive words, uncertainty words, litigious words, strong modal words and weak modal words. This research will solely focus on the first two dictionaries, containing the negative and positive words, because the aim of this research is to assess a methodology's ability to capture the positivity and/or negativity of a text. Hence, a difference measure (dif) was constructed, which is simply the negative score subtracted from the positive score.

The matrices used to calculate the scores were constructed utilizing the so called bag of words representation of a document. The first step is tokenization of the whole corpus. This entails that all documents are split into tokens (unigrams) consisting of individual words or punctuation. Punctuation characters are considered separate tokens because the presence of, for example, a question or exclamation mark may also indicate certain sentiment.

Then the presence or absence of the words in the specific word list is counted. In this way the relevant data can be constructed as a matrix, where the rows are the documents and the columns the tokens. Once the counting is complete, one can calculate the weight of each individual word utilizing the aforementioned formula. Then, the score of a document can be computed by simply adding up the weights of the relevant words.
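scikit-learn's CountVectorizer can build such a document-term matrix directly, as sketched below; the token pattern that keeps punctuation as separate tokens is an assumption chosen to mirror the tokenization described above.

```python
from sklearn.feature_extraction.text import CountVectorizer

calls = ["The project was a success !", "The project was a failure ."]  # toy corpus
vec = CountVectorizer(token_pattern=r"[\w']+|[.,!?;]", lowercase=True)
dtm = vec.fit_transform(calls)  # rows: documents, columns: token counts
print(vec.get_feature_names_out())
```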

IV.1.B Naïve Bayes Classifier

The Naïve Bayes Classifier applies Bayes rule to calculate the probability of a class given the current knowledge.

$$P(C \mid F_1, \dots, F_n) = \frac{P(C) \, P(F_1, \dots, F_n \mid C)}{P(F_1, \dots, F_n)} \qquad (2.5)$$

Where C is the class variable (positive or negative) and each $F_i$ is a feature variable indicating the presence or absence of a feature. The division has no influence on selecting the right class, since the denominator has the same value for all classes. The value of $P(F_1, \dots, F_n \mid C)$ is calculated by assuming conditional independence of the features:

$$P(F_1, \dots, F_n \mid C) = \prod_{i=1}^{n} P(F_i \mid C) \qquad (2.6)$$

Where $P(F_i \mid C)$ represents the probability of a certain feature value being part of the class in C. The final formula is:

$$classify(f_1, \dots, f_n) = \underset{c}{\operatorname{arg\,max}} \; P(C = c) \prod_{i=1}^{n} P(F_i = f_i \mid C = c) \qquad (2.7)$$

The feature vector is the input of the Naïve Bayes Classifier and contains n feature variables signaling the presence or absence of features in a document. All features are included that pass a certain threshold in the corpus, filtering out words that are used too rarely. $P(F_i \mid C)$ is estimated using Maximum A Posteriori (MAP) estimation. The MAP estimate is a mode of the posterior distribution and is used in this instance to obtain a point estimate of $P(F_i \mid C)$.
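The sketch below applies this classification rule to toy binary feature vectors using scikit-learn's BernoulliNB, whose Laplace-smoothed estimates of $P(F_i \mid C)$ correspond to a MAP-style point estimate; the tiny data set is invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB  # presence/absence features

# Toy feature vectors (rows) and sentiment labels: 1 = positive, 0 = negative.
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]])
y = np.array([1, 0, 1, 0])

clf = BernoulliNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing
clf.fit(X, y)
print(clf.predict_proba([[1, 1, 1]]))  # class probabilities per (2.7)
```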


An example will clarify the nature of the features. Let us examine two sentences:

The project was a success!

The project was a failure.

The feature vectors of these two sentences are given in Table 3.

Feature     fv1   fv2
the          1     1
project      1     1
was          1     1
a            1     1
success      1     0
failure      0     1
.            0     1
!            1     0
sentiment    1     0

Table 3: Example feature vectors

As can be seen, each vector contains all words and punctuation present in the body of work, including the tokens absent from the particular sentence. In addition to the words, a sentiment feature is added to indicate whether the vector has a positive sentiment (1) or a negative sentiment (0). One should also note that the numbers in the feature vector indicate frequency in general and are not necessarily binary; a feature vector for a longer text will have far larger numbers for commonplace words.

A strength of this method is that features can span multiple words; an n-gram can be a bigram such as 'very good' or a trigram such as 'not very bad'. Furthermore, a feature can also encode the length of the document, e.g. 0 for documents that have fewer than 750 words and 1 for those with more. Features can even check whether a certain n-gram appears in the last part of the text.

To be able to generate results, the classifier must first be trained on a set of conference calls of which the sentiment is known. By training on a known dataset, the classifier learns which value of each feature variable F corresponds to which class and will be able to predict the sentiment corresponding to a feature vector whose sentiment feature is unknown.

IV.2 N-grams

An n-gram is a sequence of words and/or punctuation found in a text and can be of any length. A unigram is an n-gram consisting of a single word, while a bigram is an n-gram consisting of two words, and so on. n-grams of differing lengths are all treated similarly; the only difference is the number of possible features in existence. Formally, an n-gram is defined as:

$$g = (w_1, w_2, \dots, w_n) \qquad (2.8)$$

Where $w_1, \dots, w_n$ are tokens.

A set of bigrams is defined as:

$$bigrams(w_1, \dots, w_m) = \{(w_i, w_{i+1}) \mid 1 \le i \le m-1\} \qquad (2.9)$$

A set of trigrams is defined as:

$$trigrams(w_1, \dots, w_m) = \{(w_i, w_{i+1}, w_{i+2}) \mid 1 \le i \le m-2\} \qquad (2.10)$$

An example at the word level will help clarify. If we take the sentence ‘Reading a book is very fun.’ we get the following possible n-grams:

Unigram: Reading, a, book, is, very, fun, .

Bigram: Reading a, a book, book is, is very, very fun, fun .

Trigram: Reading a book, a book is, book is very, is very fun, very fun .

It is important to realize that this example is at the word level and the n-grams used in this research are all words and punctuation themselves. However, one can also dissect individual words into n-grams.
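A generic n-gram extractor takes only a few lines of Python; the helper below is illustrative and mirrors definitions (2.8) through (2.10).

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences, per definitions (2.8)-(2.10)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Reading", "a", "book", "is", "very", "fun", "."]
bigrams = ngrams(tokens, 2)   # [('Reading', 'a'), ('a', 'book'), ...]
trigrams = ngrams(tokens, 3)  # [('Reading', 'a', 'book'), ...]
```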

The strength of the Naïve Bayes method compared to the linguistic approach is that features need not be single words but can be a large variety of n-grams. For instance, a feature can be a bigram like 'very bad' or 'not good', thus capturing more nuance than the linguistic method. The choice of the specific n-gram determines multiple variations of this method.

As stated, one can use different models based on the different n-grams with the classifier. The first is the simple unigram model, which makes use of examples and statistical information to assign scores. The feature vectors in this model are established in three steps. The first step is the tokenization of the document: everything is split into tokens, in this case unigrams, consisting of individual words and/or punctuation. The second step is feature collection, in which all unique tokens are examined and collected and the rare ones are discarded. The third and final step is feature vector creation, in which the vector indicating the presence of certain features is created. After these three steps, the classifier will calculate the probability of the class (positive or negative) to which the feature vector belongs.

Another model is the bigram model. In this model every word can be matched with its neighbors to create new features. For example, the unigram model can have only one feature for the word 'good', while the bigram model can add many more, such as 'very good' and 'not good'. This is applicable to all words, but the rarity filter will still curtail the number of features. This model can capture a lot of extra information that is missed by the unigram model. For example, the word 'good' will often appear in a positive context; however, the addition of a word such as 'very' can strengthen this connotation, while 'not' can negate it. Hence, bigrams can add a great deal of value to the model.

Trigrams are similar to bigrams and can convey additional information; consider for instance the trigram 'not really good'. On the other hand, trigrams will rarely pass the rarity threshold due to the large number of possible combinations.

The NBC based classifiers were created using the following procedure. Firstly, half of the sample was randomly chosen as the training set, while the other half was assigned a tonal measure based on the trained classifier. The state of each conference call in the training set was determined as either positive or negative on the basis of the linguistic tonal measures. When the linguistic tonal measures did not both indicate the same state, the author manually checked the transcript to ascertain positivity or negativity. A weakness of this approach is of course that if the linguistic tonal measures are weak, then the training of the classifiers would also be weak and the classifiers would be of no use. However, previous research has validated the linguistic measures, making them an appropriate baseline, and it is expected that these measures capture at least some semblance of the sentiment. Moreover, it was deemed unfeasible to manually annotate every earnings conference call. Secondly, tokenization took place. The next step was feature collection, in which the whole corpus except the training data is examined and all unique tokens are collected. Then the feature vectors are created, with a feature vector for each conference call consisting of tuples indicating whether or not a feature is present in that conference call. Finally, the classifier is trained on the training set and then gives a measure of the likelihood of a given document being positive or negative.

The implementation of these methods was done in Python because of its time efficiency and large number of available packages. The Anaconda distribution was utilized to aid with managing the data and creating the tonal measures. Scikit-learn was the module used for the machine learning implementations. Scikit-learn builds on the other scientific libraries of Python present in the Anaconda distribution and contains tools to implement most modern machine learning applications.
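End to end, the training and scoring procedure described above can be sketched as a scikit-learn pipeline, as below; texts and labels stand in for the transcripts and the linguistically derived training labels, and the specific parameter values (bigram features, a rarity filter of min_df=5) are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# texts: list of transcript strings; labels: 1/0 tone from the linguistic
# measures (with manual annotation where those measures disagreed).
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.5, random_state=0)

clf = make_pipeline(
    CountVectorizer(ngram_range=(2, 2), min_df=5, binary=True),  # bigram features
    BernoulliNB(),
)
clf.fit(train_x, train_y)
scores = clf.predict_proba(test_x)[:, 1]  # probability each call is positive
```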


V. Results

The correlations and descriptive statistics of the cumulative abnormal returns, earnings surprise and tonal measures can be seen in table 4. The range of CAR(-1, 1) is -0.3011 to 0.2516, which is relatively small; this is corroborated by the 25th and 75th percentiles, which are -0.0365 and 0.0275 respectively. The mean of CAR(-1, 1) is negative at -0.0054, indicating a small net loss on average in the short term window. The range of CAR(2, 60) is more than twice that of CAR(-1, 1), running from -0.7755 to 1.4924. However, the interquartile range paints a less extreme picture, with the 25th and 75th percentiles at -0.0979 and 0.0996 respectively. This larger spread of CAR(2, 60) seems to indicate that the market does not fully digest the information provided by the release of the earnings and the conference call within the shorter window of CAR(-1, 1). The surprise earnings variable has a range of -12.4998 to 7.8501 and a mean of -0.0936. Hence, these numbers suggest that the surprise earnings are on average more negative, indicating a slightly bearish market within the sample period. Of the six tonal measures examined, four suggest that the overall tone of the conference calls is positive. This is in line with expectations, because a manager would want to represent his or her company as best as possible.

Table 4: Correlations and descriptive statistics of tonal measures

Panel A: Correlations

             CAR(-1,1)  CAR(2,60)  h tone 1  h tone 2      dif  uni Prob  bi Prob  tri Prob  SURP
CAR(-1,1)     1
CAR(2,60)     0.0786     1
h tone 1      0.0712     0.1257     1
h tone 2      0.1227     0.0551     0.6964    1
dif           0.0233    -0.0084     0.4572    0.4679    1
uni Prob     -0.0776     0.0029     0.4579    0.4478    0.6868    1
bi Prob      -0.0748    -0.0050     0.4342    0.4783    0.5512    0.9129   1
tri Prob     -0.0624     0.0509     0.3312    0.4184    0.3999    0.7388   0.8797    1
SURP          0.0878     0.1470     0.0515    0.0569   -0.0304    0.1370   0.1455    0.0927    1

Panel B: Descriptive statistics

             CAR(-1,1)  CAR(2,60)  h tone 1  h tone 2      dif  uni Prob  bi Prob  tri Prob      SURP
Mean          -0.0054     0.0017     5.1905    0.5351      -49   0.49073  0.51603   0.50940   -0.0936
Min           -0.3011    -0.7755     0.2      -0.6667    -1554   0.08905  0.23752   0.26131  -12.4998
P25           -0.0365    -0.0979     2.063     0.3469     -249   0.34979  0.45259   0.47849   -0.00009
P50           -0.0036    -0.0038     4.024     0.6019      -46   0.48466  0.51723   0.50825    0.00005
P75            0.0275     0.0996     6.414     0.7302      166   0.63539  0.58379   0.54093    0.00212
Max            0.2516     1.4924    34         0.9429      939   0.91352  0.76327   0.74238    7.8501
Std. Dev       0.0687     0.2616     4.5861    0.2656      352   0.17933  0.09415   0.05444    1.3371
N            296        296        296       296          296   189      189       189        296

This table provides correlations (Panel A) and descriptive statistics (Panel B) for the full sample of 296 conference calls. CAR(-1,1) is the 3-day cumulative abnormal return, in percent, where day 0 is the conference call date and the abnormal returns are estimated using size-adjusted returns calculated as ARj,t = Rj,t - Rp,t; that is, the abnormal return for firm j on day t is the difference between the return for firm j on day t and the mean return on day t for all firms. CAR(2,60) is calculated in the same manner as CAR(-1,1), only it is cumulated from days 2 through 60. CARs represent percentage points. SURP is the earnings surprise, in percent, calculated as {[EPS(qtr) - EPS(qtr-4)]/Stock Price(end of qtr-4) x 100}. H tone 1 is the ratio of positive words to negative words and h tone 2 is the ratio (positive - negative)/(positive + negative), where each category reflects the proportion of words in a given call as defined by the custom earnings dictionary of Henry (2008). Dif is the difference between the positive and negative scores of a corpus using the term frequency-inverse document frequency measure and the dictionary of Loughran and McDonald (2011). Uni Prob is the probability that a corpus is positive in tone given a Naïve Bayesian classifier trained at the unigram level, bi Prob is that probability given a classifier trained at the bigram level, and tri Prob is that probability given a classifier trained at the trigram level. The NBC measures have only 189 observations because 107 of the observations were used as the training set.
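As a reading aid for the CAR definitions in the table note, the following pandas sketch computes the size-adjusted abnormal returns and the short-window CAR; the column names and toy returns are illustrative assumptions, not the thesis's data.

import pandas as pd

# AR(j,t) = R(j,t) - R(p,t): firm return minus the mean return of all
# firms on the same event day, cumulated over the (-1, 1) window.
returns = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "event_day": [-1, 0, 1, -1, 0, 1],
    "ret": [0.010, -0.020, 0.005, 0.000, 0.010, -0.010],
})
returns["benchmark"] = returns.groupby("event_day")["ret"].transform("mean")
returns["ar"] = returns["ret"] - returns["benchmark"]
car_short = returns[returns["event_day"].between(-1, 1)].groupby("firm")["ar"].sum()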


Let us first look at h tone 1, which has a range of 0.2 to 34 and a mean of 5.1905. This implies that positive words are used on average five times as often as negative words. The next tonal measure, h tone 2, is bounded between -1 and 1, has a range of -0.6667 to 0.9429 and a mean of 0.5351, once again showing a substantial skew towards the positive. The tonal measure dif has a range of -1554 to 939 and a mean of -49, indicating that on average the corpus is more negative than positive once one corrects for the number of appearances. The remaining three tonal measures, uni Prob, bi Prob and tri Prob, are bounded between 0 and 1, so a value above 0.50 indicates more positivity. The range of uni Prob is 0.0891 to 0.9135 with a mean of 0.4907; the mean indicates that on average this measure is slightly negative. The tonal measure bi Prob has a minimum of 0.2375 and a maximum of 0.7633. Its mean of 0.5160 is slightly above 0.5 and thus indicates that bi Prob, in contrast to uni Prob, is on average slightly positive. The last tonal measure, tri Prob, has a mean of 0.5094, a maximum of 0.7424 and a minimum of 0.2613. The range and the mean of this variable further underscore the trend of the tone being slightly positive on average.
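For clarity, the two linguistic measures can be computed as in the sketch below; the word lists here are placeholders, whereas the thesis uses the earnings dictionary of Henry (2008).

POSITIVE = {"growth", "strong", "record", "improved"}
NEGATIVE = {"decline", "weak", "loss", "impairment"}

def henry_tones(tokens):
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    h_tone_1 = pos / neg if neg else float("inf")               # ratio form
    h_tone_2 = (pos - neg) / (pos + neg) if pos + neg else 0.0  # bounded in [-1, 1]
    return h_tone_1, h_tone_2

tokens = "strong growth offset by a small decline in margins".split()
print(henry_tones(tokens))  # (2.0, 0.333...)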

The correlations of the discussed variables can be seen in panel A of table 4. The correlations between the various tonal measures are all fairly high, but not close to 1, which indicates that the measures differ substantially from one another. This is further corroborated by the different correlations the metrics have with CAR(-1,1) and CAR(2,60). H tone 1 and h tone 2 have, as one would expect, a strong correlation with each other at 0.6964 and maintain strong correlations with the rest of the metrics: all values except the one with tri Prob are above 0.4. The dif factor holds similar correlations with bi Prob, at 0.5512, and uni Prob, at 0.6868. The classifier based measures all correlate very highly with each other; for example, uni Prob has a correlation of 0.9129 with bi Prob, and bi Prob a correlation of 0.8797 with tri Prob.

The measures are further scrutinized in the analysis in table 5, which looks at high-minus-low CAR differences when the sample is separated into quartiles. To create this table, quartiles based on the tonal measures were formed, the CARs within these quartiles were calculated, and the highest-quartile CAR was subtracted from the lowest-quartile CAR. This was done with the aim of checking the discriminatory power of the metrics: if the difference in CARs between the highest and the lowest quartile is significantly different from zero, then the CARs in the high tonal measure quartile differ from those in the low tonal measure quartile, indicating some strength of the instrument. As can be seen immediately from the table, 4 of the 6 metrics appear to be significant in at least the short term window; only uni Prob and the dif tonal measure do not seem to have any effect. When one looks at the mean and median progression of the NBC based classifiers, it becomes apparent that they do not follow the logical progression: the quartile with the lowest scores tends to have the highest cumulative abnormal return rather than the lowest. Hence, this undermines the reliability of the NBC based tonal measures, and one cannot infer that the findings support the conclusion of Bermingham et al. (2010) that adding bigrams is helpful to a classifier. H tone 2 seems to perform best, retaining significance in the short term as well as the drift period, while h tone 1 only produces a significant high-minus-low CAR difference in the initial window. Consequently, one would expect h tone 2 to be the most dominant predictor at the firm level going forward, in both the regression and the partial correlation. A sketch of this quartile test follows below.
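The quartile test can be reproduced along the following lines; this is a sketch under assumed column names and synthetic data, not the thesis's code.

import pandas as pd
from scipy.stats import ttest_ind, ranksums

df = pd.DataFrame({
    "tone": [0.1, 0.3, 0.5, 0.7, 0.2, 0.6, 0.4, 0.8],
    "car_short": [-0.020, -0.010, 0.000, 0.010, -0.015, 0.005, -0.002, 0.020],
})
df["quartile"] = pd.qcut(df["tone"], 4, labels=[1, 2, 3, 4])

low = df.loc[df["quartile"] == 1, "car_short"]
high = df.loc[df["quartile"] == 4, "car_short"]

diff = low.mean() - high.mean()      # the reported Q1 - Q4 difference
t_stat, t_p = ttest_ind(low, high)   # test of difference in means
z_stat, z_p = ranksums(low, high)    # Wilcoxon rank-sum test for medians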


Table 5: Tests of differences of means and medians

Panel A

h tone 1              CAR(-1,1)                       CAR(2,60)
Quartile        mean     median     N           mean     median     N
1            -0.0101    -0.0098    88         0.0353    -0.0047    88
2            -0.0067    -0.0092    61        -0.0069     0.0102    61
3            -0.0092    -0.0040    64        -0.0092    -0.0042    64
4             0.0034    -0.0031    83        -0.0194    -0.0153    83
Mean Q1-Q4   -0.0136 *                        0.0547
t-Statistic  (-1.3360)                       (1.1989)
Wilcoxon rank-sum test
Median Q1-Q4 -0.0066                          0.0106
z-Statistic  (-0.504)                        (0.668)

h tone 2              CAR(-1,1)                       CAR(2,60)
Quartile        mean     median     N           mean     median     N
1            -0.0172    -0.0067    89         0.0147     0.0096    89
2            -0.0061    -0.0053    59        -0.0215    -0.0221    59
3             0.0009    -0.0029    72         0.0011     0.0035    72
4             0.0029     0.0035    76         0.0049    -0.0131    76
Mean Q1-Q4   -0.0202 **                       0.0098
t-Statistic  (-1.7523)                       (0.2039)
Wilcoxon rank-sum test
Median Q1-Q4 -0.0102 *                        0.0227
z-Statistic  (-1.648)                        (0.337)

Panel B

dif                   CAR(-1,1)                       CAR(2,60)
Quartile        mean     median     N           mean     median     N
1            -0.0032    -0.0047    89         0.0243     0.0128    89
2            -0.0182    -0.1174    69        -0.0499    -0.0217    69
3            -0.0016     0.0016    54        -0.0152    -0.0044    54
4             0.0003    -0.0025    84         0.0308    -0.0087    84
Mean Q1-Q4   -0.0035                         -0.0065
t-Statistic  (-0.3425)                       (-0.1467)
Wilcoxon rank-sum test
Median Q1-Q4 -0.0035                          0.0215

uni Prob              CAR(-1,1)                       CAR(2,60)
Quartile        mean     median     N           mean     median     N
1            -0.0047    -0.0028    51         0.0083     0.0015    51
2            -0.0032    -0.0072    43         0.0217     0.0083    43
3            -0.0153    -0.0184    48        -0.0088    -0.0295    48
4            -0.0048    -0.0106    47         0.0171     0.0233    47
Mean Q1-Q4    0.0078                         -0.0087
t-Statistic  (0.5628)                        (0.1353)
Wilcoxon rank-sum test
Median Q1-Q4  0.0078                         -0.0218


Panel C

bi Prob               CAR(-1,1)                       CAR(2,60)
Quartile        mean     median     N           mean     median     N
1             0.0035    -0.0036    48         0.0288     0.0133    48
2            -0.0146    -0.0042    41         0.0187    -0.0040    41
3            -0.0067    -0.0100    44        -0.0295    -0.0197    44
4            -0.0199    -0.0085    56         0.0159     0.0167    56
Mean Q1-Q4    0.0234 *                        0.0234 *
t-Statistic  (1.5784)                        (1.5784)
Wilcoxon rank-sum test
Median Q1-Q4  0.0049                         -0.0034
z-Statistic  (1.539)                         (-0.124)

tri Prob              CAR(-1,1)                       CAR(2,60)
Quartile        mean     median     N           mean     median     N
1             0.0022    -0.0032    43         0.0178     0.0015    43
2            -0.0032    -0.0077    42         0.0213     0.0073    42
3            -0.0125    -0.0032    54         0.0088    -0.0066    54
4            -0.0225    -0.0107    50        -0.0079    -0.0003    50
Mean Q1-Q4    0.0247 *                        0.0256
t-Statistic  (1.5329)                        (0.3953)
Wilcoxon rank-sum test
Median Q1-Q4  0.0075 *                        0.0015
z-Statistic  (1.657)                         (0.408)

This table shows the differences in CARs when the sample is sorted into h tone 1, h tone 2, dif, uni Prob, bi Prob and tri Prob quartile portfolios. Test statistics (t and z) are in parentheses. CAR(-1,1) is the 3-day cumulative abnormal return, in percent, where day 0 is the conference call date and the abnormal returns are estimated using size-adjusted returns calculated as ARj,t = Rj,t - Rp,t; that is, the abnormal return for firm j on day t is the difference between the return for firm j on day t and the mean return on day t for all firms. CAR(2,60) is calculated in the same manner as CAR(-1,1), only it is cumulated from days 2 through 60. H tone 1 is the ratio of positive words to negative words and h tone 2 is the ratio (positive - negative)/(positive + negative), where each category reflects the proportion of words in a given call as defined by the custom earnings dictionary of Henry (2008). Dif is the difference between the positive and negative scores of a corpus using the term frequency-inverse document frequency measure and the dictionary of Loughran and McDonald (2011). Uni Prob is the probability that a corpus is positive in tone given a Naïve Bayesian classifier trained at the unigram level, bi Prob is that probability given a classifier trained at the bigram level, and tri Prob is that probability given a classifier trained at the trigram level. *** p < 0.01. ** p < 0.05. * p < 0.1.


Table 6 displays the regression results of the tonal measures and the earnings surprise on the CAR in the short term window and the CAR in the long term window, with and without control variables. Panel A holds the results for the initial reaction and panel B those for the post announcement drift. As can be seen in panel A, the earnings surprise is significant in every specification. H tone 1 and h tone 2 are also both significant in the regressions without controls; however, h tone 2 loses its significance when the control factors are introduced. The significance of the earnings surprise is in line with what one would expect from the high-minus-low CAR differences. Moreover, the significance of h tone 1 and SURP corroborates the results of Price et al. (2012), who found that finance related dictionaries have strength in explaining the cumulative abnormal return. The main difference with the study by Price et al. (2012) is that h tone 2 is not significant in either the short or the long term window. They found that the results for h tone 1 and h tone 2 are nearly identical, which is what one would expect given that they are ratios that should measure the same thing. However, this was not found in this study and could possibly be explained as a quirk of the data.
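The form of the panel A specification can be sketched with statsmodels as follows. The variable names, the control set (taken from the discussion below: dod, ROA and leverage) and the synthetic data are all assumptions, so this illustrates the shape of the regression rather than reproducing Table 6.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data standing in for the thesis sample (names illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "car_short": rng.normal(0.0, 0.07, 100),
    "h_tone_1": rng.uniform(0.2, 34.0, 100),
    "surp": rng.normal(0.0, 1.3, 100),
    "dod": rng.integers(0, 2, 100),       # dividend announcement dummy
    "roa": rng.normal(0.05, 0.10, 100),
    "leverage": rng.uniform(0.0, 1.0, 100),
})

# Short-window CAR regressed on tone and earnings surprise, with controls.
model = smf.ols("car_short ~ h_tone_1 + surp + dod + roa + leverage", data=df).fit()
print(model.summary())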

That h tone 1 does not retain significance in the long run window seems to contradict the research done by Engelberg (2008) and Demers and Vega (2008). They have shown that qualitative information is processed less efficiently and more slowly than quantitative information, which would imply that the tonal measures would be significant in the larger window while the earnings surprise variable would not be. This is not what the data tell us, and thus the aforementioned studies are not corroborated. An alternative explanation for the lack of a long term effect could simply be a power issue: the winnowed sample used in this study could be too small to pick up the effect in the long run. The lack of significance of the other tonal measures seems to indicate that they do not provide an improvement in describing the tone of a conference call. Although the partial correlation still needs to be examined, it stands to reason that this metric will paint a similar picture. These results suggest that a classifier trained at the unigram, bigram or trigram level, or a term frequency based model, does not provide an improvement over a simpler ratio constructed on the basis of a domain specific dictionary. This could be caused by the training set being too small for the classifiers to pick up a satisfactory level of information, by the conference calls lacking sufficiently uniform characteristics to be captured by a Naïve Bayesian classifier, or by a lack of sophistication in the classifier implementation.10 Another reason could be that the training data set was insufficiently attuned to the sentiment of the earnings conference calls, because its states were determined by other textual analysis processes. A possible reason that the dif tonal measure was not a significant predictor is that Loughran and McDonald (2011) did not create their dictionaries for documents related to earnings releases; the dictionaries were originally meant for the analysis of 10-Ks, and thus they may not have been medium specific enough. Another reason could be that tf-idf does not take semantic similarity into account when correcting for document presence. While the other tonal measures also do not account for this, it is far more pertinent to the dif tonal measure because of its weighting scheme: it is conceivable that some terms have their weight reduced while others that share only semantic, not string, similarity remain relatively untouched, which could result in a bias. Given the significance levels, one can reject hypotheses 2 and 3, and one can ascertain that there is mixed evidence for hypothesis 1.

When one looks at the control variables in the regression, a couple of things are immediately clear. In the short term window, dod is significant for the linguistic method based tonal measures and loses significance in the long term. This result suggests that Kane et al. (2012) and Sponholtz (2005) were correct in their assessment that announcements of dividends play a significant role in the cumulative abnormal return when one accounts for the earnings surprise. The control variables ROA and leverage are significant in the short period for the classifier based tonal measures and in the drift
