The predictive power in forecasting default with market and accounting data using machine learning

Leon van Vliet

University of Amsterdam, Amsterdam Business School

MSc Finance

Quantitative Finance

August 2018


Statement of Originality

This document is written by Student Leon van Vliet, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

This study examines the relative predictive power in calculating a Probability of Default (PD) using three methods. Firstly, we employ the model of Merton (1974) to derive a forward-looking PD based on market data. Secondly, we use machine learning and logistic regression on accounting data from the leasing firm DLL to derive a backward-looking PD. We find that all methods provide substantial predictive power, but that the differences in predictive power in terms of the Accuracy Ratio and the AUC are small. A surprising result is that the machine learning method does not perform better than the logistic regression method. This result has implications for banks and leasing companies in their credit decisioning.


Contents

1 Introduction 4

2 Literature 7

2.1 Predicting default using accounting data . . . 7

2.2 Predicting default using market data . . . 9

2.3 Comparing methods . . . 11

3 Methodology 13

3.1 The accounting methods . . . 13

3.1.1 Logistic regression . . . 13

3.1.2 Machine learning methods . . . 13

3.1.3 General modelling choices . . . 17

3.2 The Merton method . . . 18

3.3 Performance measures . . . 19

4 Data description 21

4.1 Market data . . . 21

4.2 Accounting data . . . 22

5 Results 24

6 Discussion 28


1 Introduction

In financial markets any risk can, and should, be quantified. Better estimates of the risk being taken mean that financial instruments can be priced more fairly. This is also the case for default risk, or credit risk: the kind of risk that arises whenever a counterparty fails to honour an obligation. In that case the lender often does not recover the loan value; whenever bankruptcy occurs, large amounts of money are lost in the default process.

Being able to forecast a failure is therefore very valuable. Improved forecasting allows the lender to form a better estimate of the risk being taken, which in turn allows for better pricing of debt and better decisions on the amount of risk to accept. This should lead to a better risk-return ratio for the lender.

Predicting default can be approached in two ways. Firstly, an analyst could study the market: as financial market participants should price financial instruments fairly, the possibility of default should be incorporated in the price of any financial instrument whose price is affected by it. This is not only the case for bonds, but also for stocks (Vassalou & Xing, 2004). The market contains signals indicating a higher likelihood of failure, like the volatility of equity prices. Combined with the level of the debt, this can tell a lot about the probability of default. We could call this a forward-looking method, as the market contains signals about the expectations of investors.

Likewise, a financial analyst can also tell a lot about the probability of default (PD) from the current financial data of the company. This can be information like credit data: what industry is the company in? What is the company's paying behaviour? What are last year's profits? The analyst could then find relationships between these firm characteristics and default, and infer that whenever a firm displays these characteristics, the probability that it defaults is higher.

Numerous studies have examined these two methods. The market approach often uses a method developed by Merton (1974). This method uses market data to infer the volatility of the asset value, and combines this with data on the value of debt to calculate the Distance to Default (DD). This method is used both by practitioners (e.g. KMV) and academics (e.g. Vassalou & Xing, 2004; Bharath & Shumway, 2008).

On the other hand, studies have examined how accounting data can be used to predict default. To do this one must find a relationship between the firm's characteristics and default. This can be done either using logistic regression (Altman, 1968; Ohlson, 1980; Chava et al., 2004; Titan & Tudor, 2011) or machine learning (e.g. Kruppa et al., 2013; Angelini et al., 2008).

Although studies have investigated and compared the predictive value of using market data and using logistic regression on accounting data (Campbell et al., 2008), no studies have investigated how the machine learning method using accounting data compares to the market method of deriving a Probability of Default (PD) in terms of predictive power.

Machine learning methods often achieve better accuracy than logistic regression in predicting default, because they are better able to extract the information inherent in the data, especially when there are many variables with many interactions (Breiman, 2001). This comparison thus gives a better view of whether accounting or market data has better predictive capabilities.

The goal of this paper is to compare the predictive power of three methods: the market method, the logistic regression method, and the machine learning method. The question answered in this paper follows from this: which method, predicting default by means of accounting data or by means of market information, is more powerful?

This question is of both practical and theoretical value. Banks and leasing companies that use accounting data when a new applicant requires a loan could improve their forecasting and credit decisioning by employing market data in their analyses. Better predictions lead to better pricing of loans and leases, less value lost in case of defaults, and thus higher profits and a better risk-return ratio. If methods other than the logistic regression or machine learning method prove valuable, a bank or leasing company could incorporate this additional information in the application process.

Predicting defaults also adds value for market participants. All creditors, including bondholders, take a risk whenever they lend money to a company. The company can default, and when it does the lender does not get the required amount back. Thus, investors require a spread above the risk-free rate of interest as compensation for taking this risk. A better estimate of the risk taken leads to better pricing and thus better functioning, fairer markets.

The PD is relevant not only for bonds, but also for Credit Default Swaps (CDS), and asset pricing. Vassalou & Xing (2004) have shown that default is a relevant variable for asset pricing, above and beyond the size and the book-to-market ratio of the firm. Getting a better estimate of the probability of default might prove very valuable; for example, when a swap is priced cheaply, while the correctly estimated PD is high, one could profit from buying the swap.

This question is also theoretically interesting as a test of the efficient market hypothesis; one does not expect the accounting method to perform better if the accounting data is freely available, has predictive power, and the efficient market hypothesis is true. Whereas the accounting method is inherently backward-looking, and thus only has past information available to predict default, the market incorporates multiple sources of information; thus the market method should perform better, if the Merton DD method is appropriate. Another theoretical contribution lies in the comparison between the machine learning and logistic regression methods: if the machine learning method performs better than the logistic regression method, this can mean that there are more complex interactions between the independent and dependent variables than expected, and that the logistic regression function could be misspecified.

Literature on forecasting defaults using market information often uses Merton's (1974) model. This model uses option pricing methods to derive the probability of default. The model views the equity of a firm as a long call option on the firm's assets, with the strike price at the book value of debt; whenever the value of the firm falls below the value of the debt, shareholders receive nothing, whereas whenever the value of the firm is above the value of debt, the shareholders' value rises linearly.

Equivalently, the value of debt is like a short put option: when the value of the firm is above the value of debt, bondholders do not receive additional returns above the interest rate on the bond. Whenever the value of the firm is below the face value of the debt, bondholders lose money in a linear fashion; for every dollar the assets lose, the bondholder loses a dollar. The asset value of the firm is equal to the debt plus the equity. Thus, the value of the long call option and the short put option can be combined, making the total value of the firm linear. These facts, combined with other assumptions, like a log-normal distribution of the asset value and efficient markets, can be used to obtain a probability that the value of the firm falls below the face value of the debt. In this scenario the firm cannot pay its obligations any more, and the firm defaults.

Whereas other literature has studied the market method, or the accounting method using logistic regression, this study compares the predictive ability of the machine learning method applied to accounting data with that of the method applying the model of Merton (1974).

To make a comparison between the machine learning, logistic regression, and Merton methods, we create PD predictions for a range of companies. For the accounting methods, data is used from the leasing firm DLL. DLL provides leases mostly to smaller, non-listed companies. As the machine learning method requires larger amounts of data to create consistent results, we use the data from these smaller companies to create the backward-looking PD, for both the machine learning and the logistic regression methods.

For the Merton method we use data from larger, listed companies, as data on the stock price and the debt level, needed to create a market-based PD, is available only for these companies. The PD is created using the method from Merton (1974). For both the accounting and market methods we use data from the United States over the period 2008 to 2014.

To make a comparison we assume that the populations of smaller and larger companies are homogeneous enough to extrapolate the predictive power of the machine learning method to larger companies. It is assumed that, as larger companies have to divulge more information to the markets, and there are more checks on this information, the predictive power of the machine learning method is equal or better for larger companies.

The predictive ability is then tested using the Accuracy Ratio (AR), proposed by Moody's and described in Vassalou & Xing (2004), and the Area Under Curve (AUC) of the Receiver Operator Curve (ROC), as described by Davis & Goadrich (2006). A measure that is commonly used to indicate predictive power is the root mean squared error (RMSE). However, the RMSE is inappropriate for this comparison, as it is scale dependent and should only be used when the forecasts are on the same dataset. We will still report it, however, as an indication.
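As an illustration of how these two measures relate, a minimal sketch (with made-up default labels and PDs, not the paper's data) computes the AUC with scikit-learn and converts it to the Accuracy Ratio via the identity AR = 2·AUC − 1:

```python
# Sketch: computing AUC and Accuracy Ratio for a set of PD forecasts.
# Labels and PDs are illustrative; in the paper these would be the
# model's PDs and the realised default indicator.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 0, 1, 0, 0, 1]                        # 1 = defaulted
pd_hat = [0.02, 0.10, 0.60, 0.05, 0.08, 0.15, 0.01, 0.70]  # forecast PDs

auc = roc_auc_score(y_true, pd_hat)
ar = 2 * auc - 1   # the Accuracy Ratio (Gini) equals 2*AUC - 1
print(auc, ar)
```

Because the AR is a linear transformation of the AUC, the two measures always rank models identically; reporting both mainly aids comparison with prior studies.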

The AR and the AUC give an indication of how well the created PD can predict actual default, and can thus provide an estimation of which method has higher predictive power. For the accounting methods both in-sample and out-of-sample estimates of performance are given. The market method does not require an out-of-sample estimate, as it does not use the default indicator to infer a relationship between the dependent default indicator and the independent variables.

The paper is organized as follows. Section 2 begins by addressing the theories of credit risk: predicting default using accounting data, both by logistic regression and machine learning; and Merton's Distance to Default, and how it can be used to obtain a probability of default. Section 3 presents the methods applied in this paper; special attention is given to machine learning in this section. Section 4 discusses the data and displays summary statistics. Section 5 discusses the results. The paper concludes in section 6 with a discussion.

2 Literature

It has been shown that predicting defaults is of high value. So how can companies make an estimate of the risk they are taking? How can the probability of default best be estimated? As explained, the probability that a company defaults can be derived in a couple of ways, each with its own benefits and drawbacks. These methods for deriving the probability of default can be categorised into methods using accounting data and methods using market data. We will discuss both here, and make a comparison at the end.

2.1 Predicting default using accounting data

Accounting data often tells analysts a lot about the state a company is in. One way to predict a future default of any company is to look at all companies, including the ones that defaulted, and look at all the obtainable characteristics of these companies. If the companies that default display different characteristics than firms that do not default, one could make inferences about the relationship between characteristics of the company and default. For example, if the sales have been dropping and the Cost of Goods Sold has been increasing, the analyst may infer that the probability of default has increased. Thus, the PD can be determined by means of accounting data.

One good way to study these relationships is by using logistic regression. Here a relationship is found between independent variables x and a dependent variable y, where y ranges from 0 to 1, as the probability of default cannot be negative or higher than 100%.

A lot of research has been done finding relationships between default and firm characteristics using logistic regression. The most prominent accounting methods were developed by Altman (1968) and Ohlson (1980). Altman (1968) developed a Z-score based on five ratios: working capital over total assets, retained earnings over total assets, earnings before interest and taxes over total assets, market value of equity over total liabilities, and sales over total assets.


Likewise, Ohlson (1980) developed a similar score, called the O-score, which takes into account different accounting ratios.

This tradition has continued with Chava et al. (2004) and Campbell et al. (2008). These studies find that firm characteristics such as higher leverage, lower profitability and lower past stock returns are predictive of failure, especially in the short term (Campbell et al., 2008).

Modelling the relationship between the characteristics and default by logistic regression has a few pitfalls. For example, the interactions between independent variables have to be correctly specified, as the functional form of the logistic regression assumes that the relationships between the independent and dependent variables are pre-specified, for instance by means of a dummy variable. More complex interactions between variables have to be specified in advance (e.g. a term β3 · low profit × high leverage could capture an interaction between profit and leverage).

Related to this is another disadvantage of logistic regression: the variable effects are globally the same. The logistic regression does not allow for a different interaction between two variables when a third variable exceeds some value, unless this is pre-specified. Also, strong assumptions have to be made regarding the normality of the distributions of the variables, and the exogeneity of the included variables.

One solution to this could be using non-parametric methods like machine learning. Machine learning is the collective name of many different methods like random forests, gradient boosting or neural networks. Numerous studies have been done modelling the relationship between the characteristics of firms and their default using machine learning.

Angelini et al. (2008) found that the new Basel regulation enabled banks to implement their own risk models. The authors study the use of Neural Networks, a machine learning technique, to forecast defaults and find that the model achieves very low error rates, provided that careful training is performed. Likewise, Khandani et al. (2010), using machine learning, find an R2 of forecasted delinquencies of 0.85, and estimate that 6 to 25 percent of total losses could be saved compared to logistic regression, indicating the possible added value of machine learning in credit scoring.

Whereas Angelini et al. (2008) and Khandani et al. (2010) study Neural Networks and Support Vector Machines (SVM) on accounting data of firms, Addo et al. (2018) investigate the use of tree-based methods, and find that these produce more stable and reliable results. They conclude that the choice of method, whether different machine learning techniques or logit or probit, has a large impact on the results, and that this is very much dependent on the structure of the data; they recommend developing, if possible, multiple models based on multiple methods. Sandberg (2017) confirms these findings, concluding that different datasets, with different levels of complexity, require different methods to create the most accurate forecast. In her study the more advanced machine learning methods were only slightly better than the logistic regression method.


2.2 Predicting default using market data

Using accounting data brings a few problems when estimating the risk that a firm will default. As accounting information is retrieved from financial statements, which have been created in the past, the accounting method is inherently backward-looking, and does not take into account the market's view of the firm. For example, if the firm's equity displays high volatility, the accounting method registers no increase in default risk, as long as the firm's financial statements give no reason for panic.

The latter approach, where the probability of default is determined by means of market data, can be further categorised into different methods. One method uses the rating of corporate debt. Such ratings are provided by companies like Standard & Poor's. Each rating bucket, like 'BAA', signifies a range of probabilities of default. Thus, using ratings from these companies creates forecasts of default that are not very precise: each rating bucket (e.g. AAA) contains companies that have a PD in a range (e.g. 0.1% to 0.5%), such that an exact PD cannot be obtained.

The rating method makes assumptions about the extent to which the bond and equity markets are integrated and efficient. If the bond and equity markets are not well integrated, the PDs created from the rating method give outdated and possibly biased results. Vassalou & Xing (2004) note that other methods can predict a deterioration of creditworthiness months before a credit rating agency's rating gets adjusted. This gives an indication of the limited extent to which, and the speed with which, quickly changing company information gets incorporated into credit ratings. Credit ratings thus are not ideal for predicting default.

The price of corporate debt can also be used: as bondholders require a return for the risk they take in case a company defaults, the price of corporate debt is lower than that of default-free debt, like debt from sovereign countries. The credit spread is the yield on the corporate bond minus the yield on a treasury bond. From this an implied hazard rate can be calculated. The hazard rate is the PD in one year, given that the company has not defaulted earlier. This implied hazard rate, however, is not equal to the empirical hazard rate, because of the risk premium.
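The step from spread to hazard rate can be sketched with the standard "credit triangle" approximation, hazard ≈ spread / (1 − recovery). The yields and recovery rate below are made-up inputs, not values from the paper:

```python
# Sketch: implied hazard rate from a credit spread via the standard
# "credit triangle" approximation: lambda ~= spread / (1 - recovery).
# All numbers are illustrative.
corporate_yield = 0.055   # yield on the corporate bond
treasury_yield = 0.025    # yield on a maturity-matched treasury bond
recovery_rate = 0.40      # assumed recovery in default

spread = corporate_yield - treasury_yield   # credit spread (0.030)
hazard = spread / (1 - recovery_rate)       # implied annual hazard rate
print(round(hazard, 4))                     # 0.05
```

As the text notes, this risk-neutral hazard rate embeds the risk premium and therefore overstates the empirical (real-world) hazard rate.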

Another method, which does not rely on information from the bond market or ratings from rating agencies, applies the framework of Merton (1974). This model has been adopted by KMV, and applied in academic research (Vassalou & Xing, 2004; Duffie et al., 2007; Bharath & Shumway, 2008).

The Merton model relates the PD to the movements of the value of the firm. The value of the firm is equal to the value of the equity and debt combined: A_t = E_t + D_t. The firm will default when the value of the firm in the future, A_{t+T}, becomes lower than the face value of the debt F.

The Merton model assumes that the equity of the firm is like a call option that can be exercised as long as the value of the firm is above the face value of the firm's debt. This is because the equity holders are the residual claimants on the assets of the firm. Thus, the strike price of this option is equal to the book value of debt F, and, using option pricing, the value of the equity is:

E_{t+T} = max(A_{t+T} − F, 0)   (2.1)

Here E_{t+T} is the future value of the firm's equity, A_{t+T} is the future value of the firm's assets, and F is the face value of the debt.

Likewise, the value of the debt is like a short put option with a strike price of F . This is because whenever the value of the firm is above the value of the debt, the lenders don’t receive additional returns above the interest rate of the debt. However, when the firm value is below the face value of debt the lenders only receive the money that is left.

The pay-off of the debt thus is equal to

D_{t+T} = F − max(F − A_{t+T}, 0)   (2.2)

Here D_{t+T} is the future market value of the firm's debt.
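The option-like payoffs in equations (2.1) and (2.2) can be sketched directly; the face value and asset values below are made-up numbers chosen to show both the in-default and no-default cases:

```python
# Sketch: equity payoff (eq. 2.1) and debt payoff (eq. 2.2) at horizon T,
# and the identity that the two payoffs always sum to the asset value.
def equity_payoff(a, f):
    return max(a - f, 0.0)       # long call on the assets, strike F

def debt_payoff(a, f):
    return f - max(f - a, 0.0)   # F minus a put, i.e. min(A, F)

F = 80.0                          # face value of debt (illustrative)
for A in (50.0, 80.0, 120.0):     # asset values below, at, and above F
    e, d = equity_payoff(A, F), debt_payoff(A, F)
    assert e + d == A             # payoffs sum to the asset value
    print(A, e, d)
```

The assert line restates the text's point: combining the long call and the short put makes the total value of the firm linear in the asset value.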

Using these comparisons, the model makes strong assumptions, like Black & Scholes (1973): that the asset price of the firm is log-normally distributed, and that the asset value of the firm follows a geometric Brownian motion:

dA = µA dt + σ_A A dW   (2.3)

Here A is the asset value of the firm, with drift µ and volatility σ_A, and W is a standard Wiener process.

This way the option pricing techniques from Black & Scholes (1973) can be used, where

E_t = A_t φ(d) − e^{−r_f T} D φ(d − σ_A √T)   (2.4)

and

D_t = e^{−r_f T} D φ(d − σ_A √T) + A_t φ(−d)   (2.5)

where

d = [ln(A_t / D) + (r_f + σ_A² / 2) T] / (σ_A √T)   (2.6)

Here r_f is the risk-free rate and φ is the cumulative distribution function of the standard normal distribution.

To implement this, knowledge is required of A_t and σ_A. These, unlike E_t and σ_E, are not directly observable. Ito calculus can be used to derive:

E_t = A_t φ(d) − e^{−r_f T} D φ(d − σ_A √T)   (2.7)

Thus:

σ_E E_t = φ(d) σ_A A_t   (2.8)

With equations 2.8 and 2.4 we have two equations and two unknowns. Thus we can solve for A_t and σ_A.
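Solving this two-equation system numerically can be sketched as follows; the equity value, equity volatility, debt level, and starting guesses are illustrative inputs, and a generic root-finder stands in for whatever solver an implementation would use:

```python
# Sketch: jointly solving equations (2.4) and (2.8) for the unobserved
# asset value A_t and asset volatility sigma_A, given the observed equity
# value E_t and equity volatility sigma_E. Inputs are illustrative.
import numpy as np
from scipy.optimize import fsolve
from scipy.stats import norm

E, sigma_E = 40.0, 0.60      # observed equity value and volatility
D, rf, T = 70.0, 0.02, 1.0   # face value of debt, risk-free rate, horizon

def equations(x):
    A, sigma_A = x
    d = (np.log(A / D) + (rf + 0.5 * sigma_A**2) * T) / (sigma_A * np.sqrt(T))
    eq1 = A * norm.cdf(d) - np.exp(-rf * T) * D * norm.cdf(d - sigma_A * np.sqrt(T)) - E  # (2.4)
    eq2 = norm.cdf(d) * sigma_A * A - sigma_E * E                                         # (2.8)
    return [eq1, eq2]

# Start from the naive guesses A ~ E + D and sigma_A ~ sigma_E * E / (E + D).
A_hat, sigma_A_hat = fsolve(equations, x0=[E + D, sigma_E * E / (E + D)])
print(A_hat, sigma_A_hat)
```

The solved asset value exceeds the equity value, and the asset volatility comes out below the equity volatility, reflecting the leverage of the firm.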

The risk-neutral probability of default is then:

Pr(A_{t+T} < D) = φ(−(d − σ_A √T))   (2.9)

Here d − σ_A √T is equal to the Distance to Default (DD). The DD signifies the number of standard deviations the asset value is removed from default. The model thus uses the volatility of the share price and the face value of debt to infer a Distance to Default (DD). The smaller the DD, the bigger the probability of default.

Obtaining the market value of the assets and the volatility of the asset price is a complex process. Bharath & Shumway (2008) have studied Merton's model and its functional form, and found that a naive solution to the Merton model has a high correlation with the original solution, used by e.g. Vassalou & Xing (2004). The predictive power of this solution is sometimes higher than that of the full solution. The naive solution uses the z-score specified by Merton (1974), but takes the face value of debt as the market value of debt, and uses the volatility of the equity returns of the last year as the volatility of the asset value. Bharath & Shumway (2008) have shown that the predictive power of the Merton model lies in its functional form, and not in the variables used in the solution.
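The naive alternative described above can be sketched in a few lines: plug the face value of debt and last year's equity volatility straight into the distance-to-default formula, with no equation solving. All input values are illustrative:

```python
# Sketch of the "naive" shortcut described in the text: use the face value
# of debt as the market value of debt and last year's equity volatility as
# the asset volatility, then evaluate Merton's z-score directly.
# Inputs are illustrative.
import math

E = 40.0        # market value of equity
F = 70.0        # face value of debt (taken as its market value)
sigma_E = 0.60  # equity volatility over the past year (taken as sigma_A)
rf, T = 0.02, 1.0

A = E + F       # naive asset value
d = (math.log(A / F) + (rf + 0.5 * sigma_E**2) * T) / (sigma_E * math.sqrt(T))
dd = d - sigma_E * math.sqrt(T)   # naive distance to default
print(round(dd, 3))
```

Because no system of equations has to be solved, the naive version is far cheaper to compute over a large panel of firms, which is part of its appeal in Bharath & Shumway (2008).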

2.3 Comparing methods

There are stark differences between the two approaches. The accounting model is a more traditional modelling method, where relationships between the independent variables and the dependent variable are found using either logistic regression or machine learning.

The accounting method thus is backward-looking and does not incorporate expectations of what will happen with the company in the future. For example, a pharmaceutical firm that is in the process of developing a new medicine might have excellent financial reports, but could also be on the brink of bankruptcy if the new medicine does not get accepted by medical regulators. The accounting methods do not recognize this kind of information.

The accounting model assumes that past relationships between variables should also hold in the future. For example, if in the past high growth and low debt indicated a low probability of default, this should also be the case in the future. This of course does not have to be true, especially for more complex interactions.

The Merton model takes a different approach, and assumes that the volatility of the past, and the debt level should be telling enough about the probability that a firm will not be able to pay its obligations in the future. Thus, a firm might hypothetically have a very high debt, but a low volatility of its equity prices and thus a low probability of default.

The model assumes that the firm volatility is stationary; we expect the past volatility to be the same as the future volatility. It has been shown that volatility has a high autocorrelation: tomorrow's volatility is likely to be high if today's volatility is high. Still, as volatility tends to move a lot, one could take an iterative procedure to derive it, like a GARCH method. With such a method the share price volatility could be inferred with today's price movements weighing more heavily than price movements from the distant past. This could result in better estimates of tomorrow's volatility. However, Bharath & Shumway (2008) have shown that more advanced techniques for deriving the volatility do not result in increased predictive power when predicting default using the Merton method.
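The idea of weighting recent price movements more heavily can be sketched with an EWMA (RiskMetrics-style) variance recursion, a simpler relative of GARCH; the return series and decay factor are illustrative:

```python
# Sketch: an EWMA volatility recursion (a simple stand-in for GARCH) in
# which today's squared return weighs more heavily than older returns.
# Returns and decay factor are illustrative.
import math

returns = [0.01, -0.02, 0.015, -0.03, 0.025]  # daily equity returns
lam = 0.94                                    # decay factor (RiskMetrics convention)

var = returns[0] ** 2          # initialise with the first squared return
for r in returns[1:]:
    var = lam * var + (1 - lam) * r ** 2      # recent returns get weight 1 - lam
print(math.sqrt(var))          # latest volatility estimate
```

Each pass discounts older information by the factor lam, so the estimate adapts quickly when volatility shifts, which is exactly the property the text describes.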

Hillegeist et al. (2004) found that the probability of default estimates from the Merton method had significantly more power than the two accounting-based methods from Altman (1968) and Ohlson (1980). Likewise, Vassalou & Xing (2004) find that the market-based prediction method is more powerful than the method based on ratings from, for example, S&P.

In contrast, Du & Suo (2007) find that the Merton predictor is not a good predictor of credit scores, and does not adequately capture the information about credit quality that should be in the stock market. However, Du & Suo (2007) use credit ratings as a measure of credit quality. It is possible that credit ratings do not capture a firm's credit quality in a timely manner, as explained by Vassalou & Xing (2004).

This research both coincides with and deviates from earlier research in a number of ways. An important similarity is that we compare the accounting-based method with the Merton method to derive a PD. Another similarity lies in the fact that we use machine learning to derive a PD on the accounting data. This has been done multiple times, and the predictive ability of machine learning has been shown.

One of the most important differences lies in the comparison. Other studies have investigated the use of machine learning and compared it to more traditional modelling methods, like logistic regression. There, results show that the machine learning method often performs better in terms of predictive power. Other studies investigated the use of the method from Merton (1974) and compared this to logistic regression. Here diverging results have been observed: some research shows that the accounting method works better, other research shows that the market method has more predictive power. No earlier study, however, has investigated the use of machine learning and compared this to the market-based method.

Research on machine learning has found that machine learning methods often achieve better predictive power than logistic regression methods. Breiman (2001) notes that, especially when a dataset has many variables, where each variable contains only a small part of the information, the predictive power of logistic regression or simple decision trees is only slightly better than random selection. Random forests obtain lower error rates in these situations. Thus, there might still be hidden complexities in the accounting data that could explain the default of a company.

Literature on the prediction of default by means of market data shows that the functional form of the method from Merton (1974) has significant predictive power. Some research (Du & Suo, 2007) has found that the market-based method does not have predictive power. This could, however, be due to the fact that the dependent variable in this research was not actual default, but rather the credit rating given by rating agencies. These ratings are not flawless, and probably are not able to achieve perfect predictions.


It is unclear which method, the accounting or the market method, has more predictive power. The machine learning method, however, has often been shown to have superior predictive power compared to the logistic regression method. Thus, we hypothesize the following:

Hypothesis 1: The accounting method, using machine learning, has higher predictive power than the market method from Merton (1974).

We will now elaborate on the methods used to test this hypothesis. Firstly we will discuss the accounting methods. After explaining both the logistic regression and the machine learning methods, we will elaborate on the market method created by Merton (1974) and adopted by Bharath & Shumway (2008).

3 Methodology

3.1 The accounting methods

The goal of the accounting method is to find the default probability given the independent variables, P(y_i = 1 | x_i). Here y_i is a dichotomous variable, equal to 1 if a company is in default and 0 otherwise, and x_i is the vector containing the independent variables of company i.

We will first elaborate on the logistic regression and machine learning methods and their specific modelling choices. After that, we will discuss general modelling issues like data manipulation, data selection, selection bias, and missing data.

3.1.1 Logistic regression

As the dependent variable is binary (either 0 or 1), logistic regression is used to model the probability of default. The logistic function can be written as

P(x) = e^{β_0 + β_1 x} / (1 + e^{β_0 + β_1 x})   (3.1)

here

P(x) = P(default = 1 | x)   (3.2)

In a logistic regression model a one unit increase in the variable x changes the log odds by β_1, where the log odds is log(P(x) / (1 − P(x))). The logistic regression model is fit by maximum likelihood estimation.
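Fitting such a model by maximum likelihood can be sketched with scikit-learn; the two features and the default labels below are synthetic stand-ins, not DLL data:

```python
# Sketch: fitting a logistic regression PD model by maximum likelihood.
# The two features stand in for accounting ratios; data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.3],
              [0.9, 0.2], [0.7, 0.4], [0.3, 0.7]])  # e.g. leverage, profitability
y = np.array([0, 0, 1, 1, 1, 0])                    # 1 = default

model = LogisticRegression().fit(X, y)              # maximum likelihood fit
pd_hat = model.predict_proba([[0.85, 0.25]])[:, 1]  # P(default = 1 | x)
print(pd_hat)
```

The fitted coefficients in `model.coef_` are the β terms of equation (3.1): each one shifts the log odds per unit change in its variable.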

3.1.2 Machine learning methods

It has been shown (Malley et al., 2012) that probabilities from nonparametric learning machines are consistent if the mean squared error converges to 0 as the number of observations approaches infinity, i.e. lim_{n→∞} E(f̂(x_i) − f(x_i))² = 0. Here f̂(x) is the modelled relationship, and f(x) is the true relationship, between the vector x and y. This consistency holds for multiple machine learning methods, like random forests or nearest neighbours, but not for all.


For example, gradient boosting has been shown to have high predictive power, but its probability estimates to be inconsistent.

To acquire consistent PD estimates of the companies we should therefore use a consistent method, like random forests or k-nearest neighbours. It has been shown that nearest neighbours is less powerful in estimating PDs (Sandberg, 2017). This is because the nearest neighbours method is an unsupervised learning method. An unsupervised learning method is a method where the true class, or the dependent variable, of the observation is unknown; for supervised methods the true class is known. For predicting the default probability we have data available on defaults, so a supervised learning method is more appropriate. Therefore, the machine learning method used in this paper is the random forest.

The random forest method was introduced by Breiman (2001). He explains that using an ensemble of classification trees, where each individual tree is inferior to a single large tree, results in more consistent results. Breiman (2001) defines random forests as follows: "A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θ_k), k = 1, ...} where the Θ_k are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x". Random forests are thus similar to Bootstrap Aggregation, or bagging (Breiman, 2001). To understand random forests (and bagging) it is best to first understand decision trees, as a random forest consists of a multitude of decision trees.

Decision trees are a method that splits a dataset into continuously smaller parts, where splits are made such that the resulting smaller sets are more homogeneous with regard to the dependent variable. This can be done for categorical as well as numeric dependent variables. In the former case the decision tree is called a classification tree. This kind of decision tree is the relevant one for this paper, as we try to create a model where the dependent variable is either 'default' or 'no default'.

A simple fictional decision tree is displayed in figure 1. The decision tree algorithm first splits the full dataset into a set where the applicant age is larger than twenty-five years and one where it is smaller. Next it splits the set where the applicant age is larger than 25 further into two sets: one where the company assets are more than 500,000, and one where they are less. Every split is made based on a criterion that minimizes error. The error in the classification tree is calculated by the log loss:

LL = Σ_{i=1}^{N} −(y_i log(p_i) + (1 − y_i) log(1 − p_i))    (3.3)

Here N is the number of observations, y_i is the response variable, and p_i is the probability that the model assigns to observation i. When the true value is 1, and the model assigns a probability of 0.98, the error, or log loss, increases by a minimal amount. Likewise, when the true value of an observation is 0, and the model assigns a probability of 0.5, the log loss increases by −log(0.5). The probability that the algorithm assigns to each observation is derived


from the ratio of positive values to the total number of observations in each bucket. For example, the probability assigned to the bucket where the applicant age is less than 25 is 10/100 = 0.1.
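The log loss of equation (3.3) and the bucket probabilities can be sketched as follows; the inputs are illustrative, matching the examples in the text.

```python
import math

def log_loss(y, p):
    """Equation (3.3): sum of -(y_i log p_i + (1 - y_i) log(1 - p_i))."""
    return sum(-(yi * math.log(pi) + (1 - yi) * math.log(1 - pi))
               for yi, pi in zip(y, p))

# Bucket probability: share of defaults among observations in a bucket,
# e.g. 10 defaults among 100 applicants aged under 25.
p_bucket = 10 / 100  # 0.1

# A confident correct prediction adds little loss ...
print(round(log_loss([1], [0.98]), 4))   # 0.0202
# ... an uncertain one adds -log(0.5):
print(round(log_loss([0], [0.5]), 4))    # 0.6931
```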

A resulting visualization of the mapping is depicted in figure 2. Here you can see that only when the applicant age is above 25 years and the company worth is above 500k, the applicant will be accepted. These relationships are harder to model with logistic regression, especially in multiple dimensions, and with more complex interactions. Whereas logistic regression requires ex ante knowledge, or intuition about the variable interactions, the random forest algorithm is able to infer these interactions from the data.

Figure 1: A decision tree

Figure 2: Plot of decisions of decision tree

Decision trees tend to over-fit the data by modelling too many interactions, thereby not only modelling the general relationships, but also the noise. E.g. in the earlier example a decision tree could be built splitting the leftmost node further by age, until all final nodes contain only declines, or only accepts. This


is called overfitting. Using this model on unseen data will likely result in a higher classification error.

To circumvent this problem we can trim the trees to avoid modelling the noise and to only model the general, relevant relationships. This is called pruning. With pruning an additional cost, or classification error, is added: the more complex the tree becomes, the higher this cost is. Pruning however is not ideal, as the error rate tends to remain relatively high (Khandani et al., 2010). An alternative is to use random forests. Breiman (2001) has shown that random forests are resistant to overfitting and generally achieve low error rates. Random forests build on the concept of the decision tree by combining multiple trees, where each individual tree performs worse than a single large tree, as it is trained on a sample of the total dataset. The decision trees are built on different bootstrapped datasets, and only a random sample of the independent variables is considered as candidates for each split in the decision trees.

The following algorithm is repeated k times, where k is the number of trees in the random forest:

1. Select m independent variables randomly (column-wise)

2. Select a random subsample of the total observations (row-wise)

3. Using this subsample make the best splits with respect to the log loss, creating a decision tree. The creation of the decision tree stops when a criterion is met, like a minimum improvement in log loss.

When all k trees are created the model is ready. The prediction of a new entry is then the predictions of each individual tree added together and divided by the number of trees. E.g. in a forest with two trees, tree one might predict that the PD of company A is 0.05, and tree two might predict a PD of 0.10. The PD predicted by the whole model is then 0.075.
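A minimal sketch of the three-step algorithm above, using one-split trees (stumps) on a toy dataset echoing the applicant-age/company-assets example. The data, feature names, and hyperparameters are illustrative; a real implementation would grow full decision trees.

```python
import random

def fit_stump(rows, features):
    """Fit a one-split tree: choose the (feature, threshold) pair that best
    separates defaults (y = 1) from non-defaults, and store each resulting
    bucket's default rate as its PD."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [r for r in rows if r[f] <= t]
            right = [r for r in rows if r[f] > t]
            if not right:
                continue
            # error proxy: per bucket, count the minority class
            err = sum(min(sum(r["y"] for r in side),
                          len(side) - sum(r["y"] for r in side))
                      for side in (left, right))
            if best is None or err < best[0]:
                pd_left = sum(r["y"] for r in left) / len(left)
                pd_right = sum(r["y"] for r in right) / len(right)
                best = (err, f, t, pd_left, pd_right)
    if best is None:  # degenerate bootstrap sample: constant PD
        base = sum(r["y"] for r in rows) / len(rows)
        return lambda r: base
    _, f, t, pd_left, pd_right = best
    return lambda r: pd_left if r[f] <= t else pd_right

def fit_forest(rows, features, k=25, m=1, seed=7):
    rng = random.Random(seed)
    trees = []
    for _ in range(k):
        sample = [rng.choice(rows) for _ in rows]   # 1. bootstrap rows
        chosen = rng.sample(features, m)            # 2. random feature subset
        trees.append(fit_stump(sample, chosen))     # 3. best split on sample
    return trees

def predict_pd(trees, row):
    """Average the individual tree predictions, e.g. (0.05 + 0.10) / 2 = 0.075."""
    return sum(t(row) for t in trees) / len(trees)

# Toy applications: young, low-asset firms defaulted (y = 1).
rows = [{"age": 22, "assets": 100, "y": 1}, {"age": 23, "assets": 200, "y": 1},
        {"age": 24, "assets": 600, "y": 1}, {"age": 30, "assets": 700, "y": 0},
        {"age": 35, "assets": 900, "y": 0}, {"age": 40, "assets": 800, "y": 0}]
forest = fit_forest(rows, ["age", "assets"])
pd_risky = predict_pd(forest, {"age": 21, "assets": 150})
pd_safe = predict_pd(forest, {"age": 38, "assets": 850})
print(pd_risky > pd_safe)
```

The averaging step at the end is exactly the two-tree example in the text; the bootstrap and random feature subset are what distinguish the forest from plain bagging of identical trees.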

To develop a random forest a few additional choices regarding the data have to be taken into consideration to create an unbiased, accurate model. Besides the general modelling choices, we also have to consider the fact that the data is imbalanced. Imbalanced data means that one of the two categories of the dependent variable is more abundant than the other. In the case of default prediction the non-default category is far more abundant than the default category. This class imbalance creates problems for the machine learning algorithm, as the algorithm could classify every instance as non-default and still obtain a low error rate. Such a model however will not perform well on unseen data, and it will perform much better if the class imbalance problem is solved. This can be done in two ways.

The class imbalance can be solved by randomly resampling the training set. One option is to undersample the abundant category. Another option is to oversample the scarce category. Both have downsides: oversampling leads to a longer computation time, and undersampling leads to a loss of information. A more advanced solution combines Tomek links (T-link) with the Synthetic Minority Oversampling Technique (SMOTE), developed by Chawla et al. (2002). These techniques are also applied here, and should lead to lower bias in the predictions.
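The two resampling ideas can be sketched as below. The SMOTE step is reduced to its core idea, interpolating between a minority row and its nearest minority-class neighbour; in practice a library implementation (e.g. imbalanced-learn) would be used, and all data here is illustrative.

```python
import random

def undersample(majority, minority, rng):
    """Randomly drop majority-class rows until the classes are balanced."""
    return rng.sample(majority, len(minority)) + minority

def smote_like(minority, n_new, rng):
    """SMOTE-style oversampling sketch: place a synthetic minority point on
    the line segment between a minority row and its nearest minority
    neighbour (Euclidean distance)."""
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = min((m for m in minority if m is not a),
                key=lambda m: sum((x - y) ** 2 for x, y in zip(a, m)))
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

rng = random.Random(0)
defaults = [(0.9, 0.8), (0.8, 0.7), (0.95, 0.9)]                  # scarce class
non_defaults = [(rng.random() * 0.5, rng.random() * 0.5) for _ in range(100)]

balanced = undersample(non_defaults, defaults, rng)
print(len(balanced))              # 6
extra = smote_like(defaults, 5, rng)
print(len(extra))                 # 5
```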


3.1.3 General modelling choices

The creation of the accounting methods requires some additional considerations. We should account for data manipulation, data selection, selection bias, and missing data. The accounting method requires some pre-processing, as the raw data acquired from DLL might not be specified in the required way, or might not be usable for other reasons. Firstly, there is data that is excluded for ethical reasons: these are variables like names, addresses, or sensitive data that is irrelevant for the credit process.

Other variables are excluded because they were unknown at the relevant moment. Variables, like the contract period or the amount of credit required, are excluded because, even if they have predictive power, they could easily be manipulated by the applicant, creating a biased view of the default probability of the applicant. Highly correlated variables (correlation is higher than 0.8) are also excluded.

Like Addo et al. (2018), to determine which variables should be included in the model, we first include all variables in a machine learning model. Next, the variable importance is derived by calculating the relative improvement in the error rate whenever a variable is used. Variables that have relatively little predictive power are excluded from the model. The exact number of included and excluded observations in the data is stated in the data description section. The variables that are included are winsorized at the 1% and 99% quantiles, after which they are normalized and standardized. For this a min-max linear transformation is used: x' = (x − min(x)) / (max(x) − min(x)). The scaling is done to ensure that the machine learning method treats all independent variables the same. If they are not scaled, the machine learning method will tend to overemphasize the importance of the variables that have a bigger range.
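The winsorizing and min-max transformation described above can be sketched as below; the values are illustrative and the quantiles are computed by crude index lookup for brevity.

```python
def winsorize(xs, lo=0.01, hi=0.99):
    """Clip values to the empirical lo/hi quantiles (here 1% and 99%).
    Crude index-based quantiles, for illustration only."""
    s = sorted(xs)
    ql = s[int(lo * (len(s) - 1))]
    qh = s[int(hi * (len(s) - 1))]
    return [min(max(x, ql), qh) for x in xs]

def min_max_scale(xs):
    """x' = (x - min(x)) / (max(x) - min(x)), mapping the variable to [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

values = [5.0, 7.0, 9.0, 11.0, 1000.0]    # 1000 is an outlier
scaled = min_max_scale(winsorize(values))
print(min(scaled), max(scaled))           # 0.0 1.0
```

After winsorizing, the outlier no longer dictates the range, so the scaled variable uses the full [0, 1] interval for the bulk of the data.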

Also relevant is setting the default definition. Defining when a legal entity is in default is often a difficult issue. Whereas in the stock market default often means that some judicial path is chosen to divide the residual value of the firm, like a chapter 11 liquidation, for smaller firms it is often harder to determine default exactly. The standard in credit scoring is to declare a firm as defaulted whenever it has not paid its obligations 90 days past the due date. The default variable is therefore set to 1 whenever the firm is 90 days in arrears.

The data has to be partitioned into two different sets to ensure the validity of the model. 80% of the data is taken for the machine learning algorithm to learn from. This set is called the training set, or the in-sample set. The other 20% is taken as a test set, or the out-of-sample set. This set is created to ensure that the model is robust to new data. If the model heavily overfits, this will show in the results on the test dataset. The test set is only used at the final stage.
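The 80/20 partitioning described above can be sketched as below; the seed is arbitrary and the rows are placeholders.

```python
import random

def train_test_split(rows, test_share=0.20, seed=42):
    """Shuffle the data and hold out 20% as the out-of-sample test set;
    the remaining 80% is the in-sample training set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(1000))
train, test = train_test_split(rows)
print(len(train), len(test))   # 800 200
```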

We also use 10-fold cross validation for the machine learning method. x-fold cross validation involves retraining the model x times, each time on (1 − 1/x) · 100% of the training dataset. For every retraining a different part of the training set is excluded from training. After retraining the model x times we can compute performance statistics on the held-out (1/x) · 100% portions, as we have x different performance measures. It is likely that the performance on the training set is


very high, as the method learns the specifics of the data; slightly lower on the cross-validation set; and slightly lower again on the test set. A much lower performance measure on the test set indicates that the model overfits the data.

Selection bias is also a big problem for modelling the probability of default of firms that apply for a lease. All applicants with good characteristics should get accepted, and the applicants with bad characteristics should get rejected. The leasing firm will only record the performance of the accepted firms, and thus a selection bias is created. We want to estimate the probability of default for the population of firms that get accepted and declined. As the accepted population probably has better characteristics than the non-accepted population, we can assume that the predicted probabilities of default are more optimistic than they should be.

To account for this bias in the predictions we could also create performance measures for the declined, non-booked population. This is called reject inference. For reject inference additional data is acquired that is not used in the machine learning or logistic regression models. This data is used to create a default indicator for the population that did not get a lease. Reject inference is applied at DLL by using D&B data.

Missing data is a problem for any model. Coping with missing data requires knowledge of the reason why the data is missing. Rubin (1976) researched missing data and concluded that there are three types of missing data: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).

These three types of missing data are categorized on the basis of why the data is missing. MCAR signifies that the data is missing completely at random: it is not due to any observable or unobservable variable. MAR signifies data that is missing due to some observable variable. For example, with the collection of income data, some individuals may be more willing to share this information based on how high the income actually is; higher-income individuals may be more willing than lower-income individuals. If all individuals with a high income live in one area, and all individuals with a low income in another, we could include an area variable to account for the fact that some data is missing. MNAR is missing data for which there is a non-random reason that cannot be captured by observable variables. Using the earlier example, this would be the case if information on the area is unknown or unobtainable.

We assume that the missing data is either MCAR or MAR. This assumption cannot be verified, and thus possibly introduces a bias in the results. Using this assumption, we can replace the missing data with the mean of the variable.
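Mean imputation under this MCAR/MAR assumption can be sketched as below, with `None` marking a missing entry.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values,
    as is reasonable under the MCAR/MAR assumption made above."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

print(impute_mean([2.0, None, 4.0]))   # [2.0, 3.0, 4.0]
```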

3.2 The Merton method

For the Merton method we utilize the findings from Bharath & Shumway (2008). In their article they compare a naive solution of the Merton (1974) model to the actual model from Merton (1974). The naive model finds the asset volatility, market value of debt, and growth rate of assets in a quicker, more naive way. The extended solution requires an iterative procedure to find the


return volatility, and thus entails a bigger process involving more assumptions. The authors find that the naive solution works equally well, if not better, than the extended version.

The market value of the debt of each firm (D) is set equal to the face value of the debt (F ). The volatility of the debt is set equal to a quarter of the return volatility plus 5%:

D_naive = F    (3.4)

σ_{D,naive} = 0.05 + 0.25 σ_E    (3.5)

The five percent represents term structure volatility, and the 25% represents volatility associated with default risk. The volatility of the returns is calculated by taking the standard deviation of the past year's returns. The volatility of the assets then becomes a weighted average of the equity volatility and the naive debt volatility:

σ_{V,naive} = (E / (E + D_naive)) σ_E + (D_naive / (E + D_naive)) σ_{D,naive}    (3.6)

The expected return is set equal to the return of the equity over the last year:

μ_naive = r_{i,t−1}    (3.7)

This results in a calculation of the Distance to Default:

DD_naive = [ln((E + F)/F) + (r_{i,t−1} − 0.5 σ²_{V,naive}) T] / (σ_{V,naive} √T)    (3.8)

The Distance to Default states the number of standard deviations the future asset value is from going into default. The probability of default thus is:

PD_naive = Φ(−DD_naive)    (3.9)

or

PD_naive = Φ(−[ln((E + F)/F) + (r_{i,t−1} − 0.5 σ²_{V,naive}) T] / (σ_{V,naive} √T))    (3.10)

This PD gives the probability that a firm will default within the coming year.
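The naive calculation of equations (3.4)–(3.9) can be collected in one function; the firm inputs below are hypothetical, and Φ is computed from the error function.

```python
import math

def norm_cdf(x):
    """Standard normal CDF Phi, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def naive_pd(E, F, sigma_E, r_prev, T=1.0):
    """Naive Merton PD following Bharath & Shumway (2008), eqs. (3.4)-(3.10)."""
    D = F                                          # (3.4)
    sigma_D = 0.05 + 0.25 * sigma_E                # (3.5)
    sigma_V = (E / (E + D)) * sigma_E + (D / (E + D)) * sigma_D   # (3.6)
    mu = r_prev                                    # (3.7)
    dd = (math.log((E + F) / F) + (mu - 0.5 * sigma_V ** 2) * T) / (
        sigma_V * math.sqrt(T))                    # (3.8)
    return norm_cdf(-dd)                           # (3.9)

# Hypothetical firm: equity 300m, debt due 150m, 40% equity vol, 2% past return.
print(round(naive_pd(E=300.0, F=150.0, sigma_E=0.40, r_prev=0.02), 4))
```

Raising the equity volatility raises the asset volatility, shrinks the distance to default, and so raises the PD, as the formulas imply.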

3.3 Performance measures

As we use different datasets to compare the predictive performance of the methods, we require a performance metric that is scale-independent and not dataset-specific. For example, the root mean squared error (RMSE) calculates the squared error for every observation; because we are making estimates on two different datasets, the RMSE is not an ideal measure to compare performance.


The performance of the three methods will be tested with two metrics: the Accuracy Ratio, described by Vassalou & Xing (2004), and the Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC). We will first elaborate on the AUC.

To understand the ROC curve it is necessary to understand the confusion matrix. An example of a confusion matrix is displayed in table 1. Here, TP stands for true positive, FP for false positive, FN for false negative, and TN for true negative. The ROC curve plots the false positive rate against the true positive rate. Here, the false positive rate is equal to FP / (FP + TN), and the true positive rate is TP / (TP + FN).

The ROC curve is created by changing the threshold level from 1 to 0 in small steps (the smaller the steps, the more the graph looks like a curve). The threshold level determines which predictions will be classified as positive and which as negative: if a prediction is above the threshold it is classified as positive, and vice versa. For all threshold levels we calculate the false positive rate and the true positive rate. An example of an ROC curve is shown in figure 3. The AUC is then the area under the resulting curve. The higher the AUC, the better the model predicts defaults. An AUC of 0.5 corresponds to the straight diagonal line also depicted in figure 3; it indicates that the model has no predictive power and chooses default or no default by random chance.

Figure 3: An ROC curve
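The threshold sweep and the AUC (here computed by the trapezoidal rule) can be sketched as follows, on illustrative predictions.

```python
def roc_points(y, p):
    """Sweep the threshold from high to low; at each level compute
    FPR = FP / (FP + TN) and TPR = TP / (TP + FN)."""
    pts = []
    thresholds = [2.0] + sorted(set(p), reverse=True) + [-1.0]
    for t in thresholds:
        tp = sum(1 for yi, pi in zip(y, p) if pi > t and yi == 1)
        fp = sum(1 for yi, pi in zip(y, p) if pi > t and yi == 0)
        fn = sum(1 for yi, pi in zip(y, p) if pi <= t and yi == 1)
        tn = sum(1 for yi, pi in zip(y, p) if pi <= t and yi == 0)
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return pts

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

y = [1, 1, 0, 1, 0, 0]                 # 1 = default
p = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]     # predicted PDs
print(round(auc(roc_points(y, p)), 3))  # 0.889
```

The value 8/9 here equals the share of (default, non-default) pairs in which the default received the higher predicted PD, which is the standard interpretation of the AUC.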

The Accuracy Ratio is similar to the AUC. The default frequency in the full dataset is θ = M/N. Here M is the number of defaulted firms in the dataset, and N is the total number of firms. After the probability estimates have been created, starting at the firms with the highest default risk, for every integer λ between 0 and 100 we calculate the number of firms that actually defaulted among the top λ percent of firms. This number is then divided by M; the result is f(λ). As λ increases, f(λ) increases, and the quicker f(λ) increases, the better the algorithm. A curve similar to the ROC curve can be created where λ is plotted against f(λ). Again, the area under this curve is calculated to derive a performance measure. A confidence interval for the performance measures can be given by means of bootstrapping.

Table 1: A confusion matrix

                    Truth
                    1     0
Prediction   1      TP    FP
             0      FN    TN
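The construction of f(λ) can be sketched as below, on toy data; a real Accuracy Ratio would then rescale the area under this curve against the perfect model.

```python
def accuracy_curve(y, p):
    """Sort firms from highest to lowest predicted PD; f(lam) is the share of
    all M actual defaults captured in the riskiest lam percent of firms."""
    ranked = [yi for _, yi in sorted(zip(p, y), reverse=True)]
    M, N = sum(y), len(y)
    f = []
    for lam in range(101):
        top = ranked[: round(lam / 100 * N)]
        f.append(sum(top) / M)
    return f

y = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.15, 0.1, 0.05, 0.01]
f = accuracy_curve(y, p)
print(f[0], f[100])          # 0.0 1.0
```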

4 Data description

The different ways of deriving the probability of default require different datasets. The accounting data originates from the leasing company DLL. The market data is obtained from both CRSP and Compustat. We obtain data for the period 2008-01 to 2014-01 on both leases and market data in the United States. First we provide descriptive statistics of the market data.

4.1 Market data

For the Merton DD method we need two types of data: data from the stock market on the equity prices, and data on the face-value of the debt. We obtain the data in the intersection of the daily stock prices from CRSP and the data on the debt level of the firms, obtained from Compustat. This data also ranges from 2008 to 2014. Financial firms, with SIC codes 6021, 6022, 6029, 6035, 6036, are excluded.

From CRSP we obtain the number of shares, the share price, and the default indicator. The default indicator is created with the delisting code variable. The delisting code 574 indicates ’bankrupt, declared insolvent’. A variable is created that indicates ’bankrupt within a year’. The market value of the equity is calculated as the share price times the number of shares outstanding. The volatility of the market returns is calculated as the standard deviation of the daily returns in the last year, or 260 business days.
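The equity-value and volatility calculations just described can be sketched as below; the price series and share count are illustrative.

```python
import math
import statistics

def market_equity(price, shares_outstanding):
    """E = share price times the number of shares outstanding."""
    return price * shares_outstanding

def annualized_volatility(daily_prices, trading_days=260):
    """Standard deviation of daily log returns over the past year,
    scaled by sqrt(260) to an annual figure."""
    rets = [math.log(b / a) for a, b in zip(daily_prices, daily_prices[1:])]
    return statistics.stdev(rets) * math.sqrt(trading_days)

prices = [100.0, 101.0, 99.5, 100.5, 102.0, 101.0]   # illustrative
print(market_equity(101.0, 2_000_000))
print(round(annualized_volatility(prices), 3))
```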

From Compustat we obtain the face value of the debt. Like Vassalou & Xing (2004); Bharath & Shumway (2008), we calculate the face value of debt that has to be paid in one year as the debt to be paid in one year, or the ’current liabilities’ plus half of the ’long term debt’. The long term debt is included because over this debt interest will have to be paid. We also include the one year risk free rate, which is obtained from the Federal Reserve Board.

The extreme values in the CRSP and Compustat datasets are winsorized at 99% to ensure that outliers don't influence the results too heavily. Table 2 provides the summary statistics for the CRSP and Compustat data. Table 3 reports the number of bankruptcies


Table 2: Descriptive statistics of the market data

Variable   Mean      Std. dev   Min      0.25     Median   0.75      Max
cl         1063.69   3324.36    0.50     19.071   93.88    462.97    24656.95
ltd        1309.97   3801.81    .00      .023     67.47    684.19    26531.24
F          1667.99   4854.47    .58      23.66    149.67   831.34    35064.47
E          2189.54   6143.43    2.60     70.98    291.22   1246.20   43293.57
V          4575.32   10528.15   10.54    165.15   683.26   3136.32   60440.60
σE (%)     52.14     37.01      6.38     26.67    42.73    66.19     209.70
σD (%)     18.04     9.25       6.60     11.67    15.68    21.55     57.27
σV (%)     47.2      27.15      12.4     28.00    40.60    59.30     155.70
µ (%)      -.030     3.13       -11.10   -1.23    .00      1.12      11.09
r (%)      .31       .57        .00      .06      .10      .16       3.19

This table reports summary statistics for all the variables used in the Merton model. cl is the current liabilities in millions of dollars. ltd is the long term debt in millions of dollars. F is the assumed face value of debt due in one year, estimated as the current liabilities plus 0.5 times the long term debt. E is the market value of the equity in millions, calculated as the product of the share price and the number of shares outstanding. V is the market value of the firm in millions of dollars, calculated as the sum of the face value of debt and the equity value. σE is the yearly volatility of the firm's equity, calculated as the standard deviation of the log returns per year per firm times √260. σD is the volatility of the firm's debt, calculated naively as 0.05 + 0.25 σE. σV is the volatility of the firm's assets in percentages per year, calculated as (E/V) σE + (F/V) σD. µ is the expected return on assets measured in percentages, calculated as the log return over all firms over all years. r, the risk-free interest rate, is the three-month treasury bill rate, denoted in percentages.

per year in the market data. We record a total of 114 defaults in 6 years. Other research, like Vassalou & Xing (2004), often uses data with more default occurrences. The CRSP delisting code thus may not be the most appropriate default indicator.

4.2 Accounting data

DLL provides leases to a wide range of companies, but mostly to smaller firms that are not listed on the stock market. Because the machine learning method requires larger amounts of data, we obtain lease data on the full range of firms, not only larger firms. Table 4 reports the selection steps taken to obtain a representative dataset. The main reason for excluding applications is that they were not booked and there was no data available for reject inference, i.e. we could not infer whether these applicants would have defaulted or not. Also excluded are applications from governmental agencies, as these are not representative of the target population. The final dataset includes 155,409 applications for which default data is available.


Table 3: Bankruptcies per year

year Number of firms Number of bankruptcies Percent bankrupt

2008    7082   18    .25
2009    6838   46    .67
2010    6818   16    .23
2011    6848   12    .18
2012    6866   14    .20
2013    6909   8     .11
Total   9204   114   1.24

This table reports the number of firms and bankruptcies in the sample per year. A bankruptcy is indicated by a delisting code of 574 in the CRSP database. Percent bankrupt denotes, for each year, the percentage of firms active in that year that went bankrupt.

Table 4: Data selection steps

Step Selection Excluded Remaining

Full data, no cleaning 974,169

1 Exclude non-booked applications with no RI data 131,654 842,515

2 Exclude existing customers 292,770 549,754

3       Exclude applications after 2012/06. No data available     55,565    494,180
4       Exclude applications before 2010/10. No RI possible       224,256   269,924
5       Exclude government, and irrelevant applications           114,515   155,409

Remaining applications 155,409

This table reports the number of observations, and the selection steps for the accounting data. Most data is removed for the purpose of Reject Inference (RI). This is also the reason why existing customers are not included.

Table 5 reports the variable selection steps taken on the accounting data. Including all variables results in a dataset with 515 variables. Not all of these variables could be used, for a couple of reasons. Firstly, the variables with personal information should be excluded. These are variables like name or address. Other variables are removed because they are not available at the moment of scoring. Second, variables that have no predictive power are removed. This is done using the variable-importance method described in the methods section. Variables that have high correlation are also removed, because a variable that is highly correlated with another variable likely doesn't add any predictive power. The final dataset contains 33 independent variables.


Table 5: Variable selection accounting data

Step Selection Excluded Remaining

Full data, no cleaning 515

1 General exclusions 144 371

3 Exclusions based on predictive power 238 133

4 Exclusions based on correlation 100 33

Remaining variables 33

This table reports the number of variables used in the accounting method. General exclusions are because of private or administrative information or because they were not available at the moment of scoring. The exclusions based on the predictive power were done using the machine learning variable importance method. Variables of which the correlation is higher than 0.8 were also removed.

5 Results

We will present a number of empirical findings. As mentioned, we employed the naive solution from Bharath & Shumway (2008) to derive a PD for the market method. As such, the market method doesn't require training on the observed dependent variable, and its results are therefore effectively out-of-sample; we report them on the full dataset throughout. Firstly, we note the estimates of the probabilities of default in the out-of-sample dataset for all three methods. The next sections present the in-sample and out-of-sample results.

Table 6 provides summary statistics for the probabilities of default calculated by the Merton, logistic regression, and machine learning methods. The logistic regression and machine learning rows report statistics for out-of-sample data; the Merton method reports statistics for all firms. As can be seen, all methods are able to differentiate between firms going bankrupt and firms that remain solvent. All methods obtain results that are significant at the 1% level. The mean predicted PD for non-defaulting firms is relatively high, at almost 6% for the machine learning method.

Table 7 reports the AUC, AR and RMSE, and their standard deviations, of the market method and the logistic regression and machine learning methods on the in-sample (cross-validation) set. The AUC, AR and RMSE of all methods are very close. For example, the AUC of the Merton method is 87.65%, for the logistic regression method 87.11%, and for the machine learning method 87.14%. Likewise, the rounded RMSE of each method is around 23%.

For the market method, estimates of the standard deviation of the mean are calculated by bootstrapping. For the accounting methods, estimates of the standard deviation of the mean are calculated by 10-fold cross validation. In each of the ten folds an AUC is calculated on the 10% left out of the training set, yielding ten AUC estimates. We perform a t-test on the means with the data from table 7. The machine learning method is not significantly different from the market or the logistic regression method


in terms of AUC, AR, or RMSE (p = .820). This means that all methods have the same predictive power when trying to forecast a default.

As an extra test of the reliability of the results, we also test the performance of the accounting models on unseen data in the test set. Table 8 reports the out-of-sample test-set AUC, AR and RMSE of the accounting methods, and the full-sample results of the market method. The AUC of the Merton method is 87.65%, for the machine learning method it is 87.15%, and for the logistic regression it is 87.35%. This indicates that the performance on these measures is relatively close, also in the test set.

An estimate of the reliability of these performance measures is given by bootstrapping the performance measures for all methods. Using the metrics on the spread of the performance measures, a test on the means of the performance metrics shows that none of the methods are significantly different from each other (p = 0.594), on any performance metric. A plot of the ROC curve for the market method is displayed in figure 4. Figure 5, showing the ROC curves of the accounting methods, shows similar results.

These out-of-sample results verify the findings on the in-sample set that all three methods have the same predictive power when forecasting default. All methods are able to differentiate between firms going bankrupt within a year, and firms that remain solvent for the coming year, as indicated by the mean PD of each method.


Table 6: Probability of default estimates for bankrupt and solvent firms

Variable       Status     Mean       st. dev   Min    0.25    Median   0.75    Max

Merton prob    Solvent    2.69       4.63      .00    .0032   .47      3.27    100
               Bankrupt   10.51***   4.79      .00    7.77    9.93     13.38   100
Log-reg prob   Solvent    5.72       9.64      .19    1.04    2.27     5.64    85.23
               Bankrupt   31.72***   25.36     .45    8.69    25.4     50.82   92.30
ML prob        Solvent    5.89       10.27     .00    .37     2.11     6.43    96.00
               Bankrupt   35.54***   27.15     .00    10.90   30.65    56.22   99.33

This table reports summary statistics of the probability of default for the Merton, logistic regression and machine learning methods. For each method the PD is given for both the solvent and the non-solvent firms. A t-test for equal means is performed on the mean PDs of each method to test whether the solvent firms have a significantly different mean. (*** significant at the 1% level, ** significant at the 5% level, * significant at the 10% level)


Table 7: Performance metrics on the cross-validation set for the Merton, logistic regression and machine learning methods

Variable   Merton        Logistic regression   Machine learning
AUC (%)    87.65 (.48)   87.11 (.78)           87.14 (.77)
AR (%)     59.22 (.39)   57.34 (.83)           58.03 (.65)
RMSE (%)   22.54 (.22)   23.08 (.37)           22.55 (.98)

This table reports the Area Under Curve, Accuracy Ratio and Root Mean Squared Error of the two accounting methods and the Merton method to derive a PD. The statistics for the accounting methods are reported on the cross-validation set. The statistics for the Merton method are reported on the full dataset. The standard deviation for the Merton method is obtained by bootstrapping. Standard deviations are noted in parentheses.

Table 8: Performance metrics on the out-of-sample set for the Merton, logistic regression and machine learning methods

Variable   Merton        Logistic regression   Machine learning
AUC (%)    87.65 (.48)   87.35 (.90)           87.15 (.79)
AR (%)     59.22 (.39)   56.29 (.78)           55.11 (.54)
RMSE (%)   22.54 (.22)   22.67 (.62)           22.86 (.50)

This table reports the Area Under Curve, Accuracy Ratio and Root Mean Squared Error of the three methods to derive a PD. The statistics are reported on the test set. Standard deviations are reported in parentheses, and are calculated by bootstrapping.


6 Discussion

This thesis looked at the predictive power of using different techniques on market and accounting data to derive the probability of default. The techniques used were the Merton market method, machine learning and logistic regression. We employed a random forest as the machine learning method.

We hypothesized that the machine learning method would have higher predictive power than either the logistic regression or the market method. The results show, however, that the three methods are very similar in their power to distinguish between defaulting and non-defaulting companies. Contrary to expectations, the random forest machine learning method is not better than the logistic regression method. The out-of-sample predictions confirm these findings: the machine learning and logistic regression methods both achieve performance similar to the Merton market method. Bootstrapping provides an estimate of the confidence in these measures and of the extent to which they differ; it shows that the machine learning method applied to the accounting data is not better than the market method.
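The bootstrapping used for the parenthesised standard deviations in Tables 7 and 8 can be sketched as follows. This is an illustrative implementation on invented toy data, not the thesis code: firms are resampled with replacement, the metric (here the AUC) is recomputed on each resample, and the spread of the resampled values estimates the metric's precision.

```python
# Illustrative sketch (toy data, not the thesis code): bootstrap
# estimate of the standard deviation of the AUC of a PD model.
import random

def auc(pds, defaults):
    """Pairwise AUC: chance a defaulter outranks a survivor."""
    pos = [p for p, d in zip(pds, defaults) if d == 1]
    neg = [p for p, d in zip(pds, defaults) if d == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_sd(pds, defaults, n_boot=1000, seed=0):
    """Resample firms with replacement, recompute the AUC each time,
    and return the standard deviation of the resampled AUCs."""
    rng = random.Random(seed)
    n = len(pds)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        s = auc([pds[i] for i in idx], [defaults[i] for i in idx])
        if s == s:  # drop resamples that contain only one class
            stats.append(s)
    mean = sum(stats) / len(stats)
    var = sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)
    return var ** 0.5

# Toy sample of 100 firms: high-PD firms mostly default, with two
# deliberately misranked labels so the AUC is high but imperfect.
pds = [i / 100 for i in range(100)]
defaults = [int(p > 0.70) for p in pds]
defaults[95], defaults[30] = 0, 1

sd = bootstrap_auc_sd(pds, defaults)
print(round(auc(pds, defaults), 3), round(sd, 3))
```

Comparing two methods then amounts to checking whether the difference in their metrics is large relative to these bootstrapped standard deviations, which is the sense in which the machine learning and market methods are "not different" above.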

How do these results compare to the existing literature? Khandani et al. (2010) found that machine learning offers better predictive power than logistic regression in predicting default, and Du & Suo (2007), for example, found that the Merton predictor does not adequately capture the information in stock prices. However, several researchers (Hillegeist et al., 2004; Vassalou & Xing, 2004) found that the market method has significantly more predictive power than accounting methods. Vassalou & Xing (2004) report Accuracy Ratios similar to those found in this research.

The most surprising finding is that the machine learning method did not achieve better predictive power than the logistic regression method. These findings thus both coincide and differ with existing research. They coincide partly with earlier research in that the market method has high predictive power, similar to that of the accounting methods. They differ from earlier research in that the machine learning method did not achieve higher predictive power than the logistic regression method; this might, however, be due to the size of the sample.

This study had a couple of limitations regarding the data. Firstly, the fact that two different datasets were used to create PD estimates for the market and accounting methods creates two problems. We have to assume that the predictive power of the accounting methods on smaller companies carries over to bigger companies; that is, we have to assume homogeneity of the total population of bigger and smaller companies, at least with regard to which variables predict default.

The fact that we use different datasets also creates a methodological problem: we cannot use the market variable as an independent variable in a logistic regression or a machine learning algorithm to see what its variable importance is, and whether it still has predictive power. It might be the case that all information in the accounting variables is already captured by the PD variable created with the Merton method. We are not able to test this, as we have


no accounting data on large companies, or stock data on small companies. Another limitation regarding the data concerns the default indicator: other studies (e.g., Vassalou & Xing, 2004) have more default cases and use default-indicator data from multiple paid sources for their market method. For this study such data was unavailable, making the CRSP delisting code the most reliable source for default. Thus, there might be a bias in the dependent variable of the market method. The default indicator for the accounting data is also defined differently from that for the market data: whereas default in the market data is defined as legal bankruptcy, in the accounting data it is defined as being 90 days past the due date. We assume that the inability to pay for 90 consecutive days leads to default; this might create another bias in the results. One limitation regarding the method is that the naive solution created by Bharath & Shumway (2008) might not be as powerful as the full solution to the model of Merton (1974).

These findings have implications for business practice and theory. Banks and leasing companies that receive applications from consumers and businesses have to determine whether these agents are reliable and likely to pay back the money they borrow. These results imply, if some information in the market model is not already captured by the accounting model, that predictions of default can be improved by incorporating market information in the application process. This will, however, only be useful for banks and leasing companies that offer loans and leases to listed companies. Likewise, as research (Vassalou & Xing, 2004) has shown that asset prices are affected by the probability of default, incorporating a probability of default created by machine learning methods applied to accounting data may improve the returns of investors.

The results of this paper also have implications for financial theory, specifically the efficient market hypothesis, which states that asset prices reflect all available information. Assuming that the accounting data captures some information not captured by the market method, this research gives an indication that the market does not incorporate accounting data perfectly efficiently.

We suggest that future research apply the machine learning method to accounting data for which stock data is also available, and compare the added informational value of the Merton DD variable with that of the accounting data.


References

Addo, P., Guegan, D., & Hassani, B. (2018). Credit Risk Analysis Using Machine and Deep Learning Models. Risks, 6 (2), 38. Retrieved from http://www.mdpi.com/2227-9091/6/2/38 doi: 10.3390/risks6020038

Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23 (4), 589–609.

Angelini, E., Di Tollo, G., & Roli, A. (2008). A neural network approach for credit risk evaluation. The quarterly review of economics and finance, 48 (4), 733–755. doi: 10.1016/j.qref.2007.04.001

Bharath, S. T., & Shumway, T. (2008). Forecasting default with the Merton distance to default model. Review of Financial Studies, 21 (3), 1339–1369. doi: 10.1093/rfs/hhn044

Black, F., & Scholes, M. (1973). The Pricing of Options and Corporate Liabilities. Journal of Political Economy, 81 (3), 637–654. Retrieved from http://www.journals.uchicago.edu/doi/10.1086/260062 doi: 10.1086/260062

Breiman, L. (2001). Random forests. Machine Learning, 45 (1), 5–32. doi: 10.1023/A:1010933404324

Campbell, J. Y., Hilscher, J., & Szilagyi, J. (2008). In search of distress risk. Journal of Finance, 63 (6), 2899–2939. doi: 10.1111/j.1540-6261.2008.01416.x

Chava, S., & Jarrow, R. A. (2004). Bankruptcy Prediction with Industry Effects. Review of Finance, 8 (4), 537–569.

Chawla, N., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16 , 321–357. doi: 10.1613/jair.953

Davis, J., & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning (pp. 223–240). Retrieved from http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf

Du, Y., & Suo, W. (2007). Assessing credit quality from the equity market: Can a structural approach forecast credit ratings? Canadian Journal of Administrative Sciences, 24 (3), 212–228. doi: 10.1002/cjas.27

Duffie, D., Saita, L., & Wang, K. (2007). Multi-period corporate default prediction with stochastic covariates. Journal of Financial Economics, 83 (3), 635–665. doi: 10.1016/j.jfineco.2005.10.011


Hillegeist, S. A., Keating, E. K., Cram, D. P., & Lundstedt, K. G. (2004). Assessing the probability of bankruptcy. Review of Accounting Studies, 9 (1), 5–34. doi: 10.1023/B:RAST.0000013627.90884.b7

Khandani, A. E., Kim, A. J., & Lo, A. W. (2010). Consumer Credit Risk Models via Machine-Learning Algorithms. Journal of Banking & Finance, 34 , 2767–2787.

Kruppa, J., Schwarz, A., Arminger, G., & Ziegler, A. (2013). Consumer credit risk: Individual probability estimates using machine learning. Expert Systems with Applications, 40 (13), 5125–5131. Retrieved from http://dx.doi.org/10.1016/j.eswa.2013.03.019 doi: 10.1016/j.eswa.2013.03.019

Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability Machines: Consistent Probability Estimation Using Nonparametric Learning Machines. Methods of Information in Medicine, 51 (1), 74–81. doi: 10.3414/ME00-01-0052

Merton, R. C. (1974). On the Pricing of Corporate Debt: The Risk Structure of Interest Rates. The Journal of Finance, 29 (2), 449. Retrieved from http://www.jstor.org/stable/2978814?origin=crossref doi: 10.2307/2978814

Ohlson, J. A. (1980). Financial Ratios and the Probabilistic Prediction of Bankruptcy. Journal of Accounting Research, 18 (1), 109. Retrieved from http://www.jstor.org/stable/10.2307/2490395?origin=crossref doi: 10.2307/2490395

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63 (3), 581–592. doi: 10.1093/biomet/63.3.581

Sandberg, M. (2017). Credit Risk Evaluation using Machine Learning (Unpub-lished doctoral dissertation).

Titan, E., & Tudor, A. L. (2011). Conceptual and Statistical Issues Regarding the Probability of Default and Modeling Default Risk. Database Systems Journal, 2 (1), 13–22. Retrieved from http://www.doaj.org/doaj?func=abstract&id=851891

Vassalou, M., & Xing, Y. (2004). Default Risk in Equity Returns. The Journal of Finance, 59 (2), 831–868. doi: 10.1111/j.1540-6261.2004.00650.x
