
Rijksuniversiteit Groningen

Master Thesis

Econometrics, Operations Research & Actuarial Studies

Credit risk for peer to peer consumer credit.

Author:

Gijs Immerman

Supervisor Rijksuniversiteit Groningen:

Prof. dr. L. Spierdijk, Dr. C. Praagman

Supervisors Topicus:

R. Bos, MSc, S. Kaastra, MSc

August 20, 2016


Abstract

Peer-to-peer (P2P) loans are a growing market for consumer credit.

A P2P website for credit is a digital marketplace where credit is issued and the credit sum is supplied by investors. This research focuses on the risk the investor bears while investing in P2P loans. There are differences between traditional consumer credit and credit issued by a P2P website; these differences include data availability, fraud, and information asymmetry.

This paper investigates whether defaults of P2P loans can be predicted as well as defaults of consumer credit issued by banks.

This is done by comparing the predictive power of (additive) logistic regression, discriminant analysis, neural networks, and support vector machines on the P2P dataset with the predictive power of the same methods on ten other datasets. The ten other datasets concern consumer credit issued at banks. In general, the P2P data seems to suffer from these differences; consequently, defaults are harder to predict.


1 Introduction

In the past few years the total household debt, excluding mortgages, in the Netherlands has been growing. In 2004 the debt was 66 billion euro and by the end of 2015 it had grown to 92 billion euro¹. The Netherlands is not the only country where consumer debt is increasing; in the United Kingdom² and the United States³, for example, the demand for debt is rising as well. The growing market in consumer credit increases the interest in credit models. Profits can increase by $1000 per loan if firms have a good credit scoring model for consumer credit (Einav et al., 2013).

Consumer credit is not only issued by banks but also by car companies and webshops that allow customers to pay the purchase amount in instalments. For all these types of financing it is relevant to know the probability that a potential customer will default.

The interest rate of a loan is partially based on the risk of default. High risk means that investors want a high return. There have been studies that estimate the probability of default. For example, Baesens et al. (2003) and Lessmann et al. (2015) give an overview of 40 studies that researched default risk of companies and consumers. But none of them included loans issued by a peer-to-peer (P2P) website.

P2P sites are a new way for consumers to obtain credit. The websites connect consumers that need a loan and investors that are willing to invest in these loans. The first company offering this service was Zopa, founded in the United Kingdom in 2005. In total Zopa has lent over £1.54 billion, of which almost 30% was issued in the last 12 months.

Zopa is not the only platform that has shown an increase. Lending Club and Prosper, both founded in the United States in 2006, issued a total of $18.7 billion and $6 billion respectively, of which 40% and 33% was issued last year. Hernandez et al. (2015) estimate that the total market for P2P consumer credit can reach 150 billion dollars in 2025.

Along with the growing market, an understanding of the P2P credit market becomes more relevant. There are some differences between traditional bank credit and P2P credit. These differences have already been studied.

One of these differences is information asymmetry. Not all information on the borrower can easily be confirmed, since it is an online service. The problem is less likely to occur for banks, because they are able to obtain more information about their lenders than P2P websites can (Lee and Lee, 2012). A bank has information on the bank account, mortgage, and other debts of a client, or about the financial status of the family over a longer time period (Hand and Henley, 1997). This information is not always available to the P2P websites, or the information is a snapshot at one specific moment instead of a time period (Emekter et al., 2015).

¹ http://www.cbs.nl/nl-NL/menu/themas/macro-economie/publicaties/artikelen/archief/2015/4505-schulden-huishoudens-nemen-weer-iets-toe.htm
² http://www.ft.com/cms/s/0/236d2728-9cd9-11e4-adf3-00144feabdc0.html#axzz46RskUNgO
³ http://www.federalreserve.gov/releases/g19/current/

Emekter et al. (2015) also conclude that, compared with traditional financial institutions, borrowers in the P2P market have lower incomes and credit ratings.

Fraud, such as identity theft, can be a problem for P2P loans as well (Verstein, 2011). Therefore, the trust that potential investors have in the P2P website and the loan seekers plays an important role in the success of P2P consumer credit (Greiner and Wang, 2010), since there is no intermediary verifying the credit risk.

Greiner and Wang (2009) conclude that social capital, which is related to trust, can influence the chance of being funded. Duarte et al. (2012) add to this that the appearance of the borrower influences the funding chance as well.

However, there is no research that looks at the risk for the investors. The differences in, for example, information asymmetry (Lee and Lee, 2012), data availability (Hand and Henley, 1997; Emekter et al., 2015), and loan characteristics (Livingston, 2012) between traditional consumer credit and P2P credit have been observed, but there has been no research into whether these differences influence the credit risk for an investor. Therefore, this research investigates whether these differences influence the credit risk compared to traditional credit. The better an investor can distinguish between a loan that will default and a loan that will not, the lower the credit risk is. The question of this research is: can consumer credit default be predicted equally well for credit issued by a P2P website as for credit issued at a bank? This will give an insight into whether the information asymmetry, fraud, lower income, and lower credit scores have an effect on the risk of the investor.

In the next section a more detailed explanation of P2P lending is given.

This section will, among other things, elaborate on the differences between bank credit and P2P credit. Thereafter, the methods that are used in consumer credit scoring are explained.

The data that is used in this research is from Lending Club. After the explanation of the data, the methods are applied to the data and a comparison between the methods' performances is made. Furthermore, a comparison with other research that predicted default in consumer credit is made. After this the research question is answered.

2 P2P Lending

P2P lending is a recent development in finance. It is an online marketplace that brings together individuals who want to invest in loans and those who seek them, without the involvement of a financial institution. Other names for this type of financing are people-to-people lending and social lending. P2P loans are a subtype of crowdfunding where the crowd can invest in loans which will be repaid with interest.

Since P2P credit is a recent development, regulation still has to develop (Verstein, 2011). In the United States, the Securities and Exchange Commission (SEC) is responsible for the regulation of the P2P websites. In the Netherlands, the regulation is done by the Autoriteit Financiële Markten (Authority for the Financial Markets, AFM).

For borrowers, P2P can be interesting because it can offer lower interest rates than banks do. The interest rate can be lower since P2P websites do not have to collect deposits and have lower overhead costs than a traditional bank (Namvar, 2014). Because of the lower interest rate, P2P loans are used, among other things, to refinance current debt. Moreover, banks are not always interested in issuing small loans to consumers (Hernandez et al., 2015).

Another benefit of the P2P loan is that the borrower can usually make early repayments without paying a fine.

If an individual decides that he needs a loan, he can put it on one of the P2P websites. Most P2P websites operate in a similar way. For a credit application at Lending Club, the borrower needs to fill in basic information such as loan amount, term, name, address, date of birth, and annual income. With this information an indication of the interest rate for the potential borrower is determined.

The interest rate consists of two parts, namely the interest that the investors will receive and the fee for the website. The fee is the only revenue the P2P website has. If the potential borrower accepts the interest rate, he/she can officially apply for the loan.

If the offer is accepted by the borrower, the consumer needs to provide additional information such as a social security number, job information, and house ownership. The website requests additional information about the borrower from a credit agency. This information consists of historical credit behaviour.

Next to this information, the borrower is also asked to write why he/she needs the credit. As Greiner and Wang (2009) showed, social capital can increase the probability that the loan gets funded. Social capital is a connection between the borrower and lender based on some common characteristic or interest. Due to this resemblance, people feel socially connected and therefore trust each other.

This information, together with the credit history, borrower characteristics, and loan characteristics, is placed on the website so that investors can decide whether to invest in the loan or not. After the total loan sum is funded by investors, it is transferred to the borrower's bank account.

P2P lending is interesting for investors because the return on the credit is between 3.5% and 26%. Investors can invest in a loan with a contribution of as little as 25 dollars, which makes it easy to diversify the investment over multiple loans. The relatively high interest rate and low minimum investment make an investment in P2P loans a good alternative to stocks and/or savings accounts (Namvar, 2014).

Anyone can invest at P2P websites and therefore the investors are not necessarily financial experts.

Investing in P2P loans carries certain risks. One of them is the credit risk.

The loans are unsecured and therefore a default means the investment is lost.

A second risk is the liquidity risk. The duration of the loans is mostly between three and five years and there is almost no secondary market to sell the investments before maturity (Namvar, 2014).

There is also the risk that the P2P website goes bankrupt. For Lending Club and Prosper.com this is important, because as an investor you are not directly the owner of the loan. Their prospectuses note that investing in a loan actually means buying a promissory note of Lending Club or Prosper. The most recent prospectus of Lending Club states:

“Investors under this prospectus have the opportunity to buy Member Payment Dependent Notes (Notes) issued by Lending Club and designate the corresponding loans to be facilitated through our platform. The Notes will be special, limited obligations of Lending Club only and not obligations of any borrower member. The Notes are unsecured and holders of the Notes do not have a security interest in the corresponding member loans or the proceeds of those corresponding member loans, or in any other assets of Lending Club or the underlying borrower member.”

Hence, if the website goes bankrupt, the investor has no rights to the loan he or she invested in.

The penalty-free early repayment option for the borrower is a risk for the investor, since the investor is uncertain about the future cash flows.

Since there are differences in term structure, loan amount, collateral, interest rate, and application process, but also differences in information asymmetry, borrower characteristics, and fraud risk between P2P loans and bank loans, the credit risk of traditional bank loans and P2P loans probably also differs. This will be investigated further in this paper.

3 Probability of default methods

Despite the differences in characteristics between a bank loan and a P2P loan, the goal of modelling the probability of default, and therefore the methods that can be used, are the same. In the early 1990's banks based credit scores on interviews with the loan applicant (Johnson, 1992). With the development of computers and the increase in the data that is available, companies started to estimate the probability of default with more complex models. After some notation is introduced, an overview of common probability of default models is given.

3.1 Notation

The total sample size is denoted by n and for every observation i ∈ {1, ..., n} there is an outcome y_i ∈ {0, 1}, where 0 means default and 1 no default.

Furthermore, n_0 and n_1 denote the number of defaults and the number of non-defaults. x_i is a vector with k variables, which can be continuous, ordinal, or binary. All methods aim to estimate the probability p_i = P(y_i = 1 | x_i), which is used to score the default risk.

3.2 Logistic regression

One of the most common and oldest methods still used today to predict the probability of default is the logistic regression. Almost all research on the probability of default for consumer credit uses logistic regression as a benchmark model; among others, Kraus (2014), Baesens et al. (2003), Bellotti and Crook (2009), and West (2000) use it as a benchmark.

The easy interpretation of the logistic regression is a big advantage. The logistic regression model is defined as follows:

\[ \log\frac{P(y_i = 1 \mid x_i)}{P(y_i = 0 \mid x_i)} = \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta' x_i, \tag{1} \]

where \beta is a parameter vector which contains k elements and p_i/(1 - p_i) is called the odds. From this it follows that:

\[ p_i = P(y_i = 1 \mid x_i) = \left(1 + e^{-(\beta_0 + \beta' x_i)}\right)^{-1}. \tag{2} \]

Equation (1) gives a linear relation for the log-odds which is easy to estimate and interpret (Hand, 2001). Another advantage is that the computation time of the logistic regression remains short, even when the number of observations increases (Kraus, 2014). This explains why the logistic regression is still a commonly used model. The estimation is done by maximum likelihood, which is known to be consistent and asymptotically normally distributed. Therefore, statistical inference on the parameters can easily be done.

However, the restriction of linearity in the model can be a drawback. For example, with consumer credit scoring one might expect that an increase in annual income from €28,000 to €30,000 has a different effect on the log-odds than an increase from €60,000 to €62,000, keeping all other variables fixed. With the linearity assumption this effect is the same for both increases. Since there might be a non-linear relation, a non-linear model is estimated in Section 3.3.

Another drawback of the logistic regression is that there is no interaction between the variables. A solution can be to add these interactions by including combinations, for example products of variables, in the model. This works, but when the number of variables increases, the number of interaction variables that needs to be included grows quickly: a model with five variables has ten interaction terms of two variables, but a model with, for example, 40 variables has 780 such interaction terms. Therefore, this is not a practical solution. The neural network does not have this problem (Kuan and White, 1994): it can find interactions without them being explicitly defined in the model, because it assumes a more complex structure between the variables. This method will be further explained in Section 3.4.

To estimate the logistic regression, maximum likelihood is used. The log-likelihood is given in Equation (3):

\[ l(\beta; x_1, \dots, x_n; y_1, \dots, y_n) = \sum_{i=1}^{n} y_i \log(p_i) + (1 - y_i) \log(1 - p_i). \tag{3} \]

To maximize the likelihood, the Newton-Raphson method is used. The calculations can be found in Appendix A. The estimate of \beta is denoted by \hat{\beta}.

The estimated \hat{\beta} also has a clear interpretation. The derivative of the log-odds is:

\[ \frac{\partial}{\partial x_j} \ln\left(\frac{p_i}{1 - p_i}\right) = \hat{\beta}_j. \tag{4} \]

The log-odds change linearly in the variables x_i. In terms of change in probability, this means that the odds ratio of two observations, which differ only in one variable by \Delta x, is given by:

\[ \frac{e^{x_i^{*\prime} \hat{\beta}}}{e^{x_i' \hat{\beta}}} = e^{\hat{\beta}_j \Delta x}. \tag{5} \]

Hence, the ratio does not depend on the value of x_j but only on the size of the change in x_j and on \hat{\beta}_j. This is what makes the logistic regression easy to interpret, but it comes at the cost that the log-odds behave linearly.
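As a small illustration of the estimation and the odds-ratio interpretation above, the following Python sketch fits a logistic regression by maximum likelihood and reports exp(\hat{\beta}_j); the simulated data and variable names (income, debt ratio) are placeholders, not the Lending Club variables.

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in data: two borrower characteristics and a 0/1 outcome
# (1 = non-default), purely for illustration.
rng = np.random.default_rng(42)
n = 5000
income = rng.normal(50, 15, n)          # annual income in thousands (hypothetical)
debt_ratio = rng.uniform(0, 1, n)       # debt-to-income ratio (hypothetical)
logit = -1.0 + 0.04 * income - 2.0 * debt_ratio
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([income, debt_ratio]))
res = sm.Logit(y, X).fit(disp=False)    # maximum likelihood, Newton-type solver

print(res.params)                        # beta_0, beta_income, beta_debt_ratio
# Equation (5): the odds ratio for a one-unit change in a variable is exp(beta_j),
# regardless of the starting value of that variable.
print(np.exp(res.params[1:]))
```

The printed exp(\hat{\beta}_j) values are the odds ratios of Equation (5) for a one-unit change in the corresponding variable.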

To summarize, the logistic regression has as advantages that it is easy to interpret and quick to estimate. The disadvantages are that it assumes a linear relation between the variables and the log-odds, which may be too simplistic, and that adding interaction variables causes the model to grow quickly.

Therefore, in the next section the generalized additive model is discussed, which solves the issue of linearity. To handle the interaction effects, the neural network will be used.


3.3 Generalized Additive Model

A possible solution to incorporate a non-linear effect in the logistic regression is the additive logistic regression. It is a direct implementation of the generalized additive model (GAM) introduced by Hastie and Tibshirani (1986). It allows non-linear functions of the variables to be estimated within the logistic model and it has been used for credit scoring by Kraus (2014), who shows that it performs better than the standard logistic regression.

Instead of using the linear form given in Equation (1), the additive logistic regression is defined by:

\[ \log\frac{P(y_i = 1 \mid x_i)}{P(y_i = 0 \mid x_i)} = \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{k} f_j(x_{ij}). \tag{6} \]

Similarly as in Equation (2), it follows that:

\[ p_i = P(y_i = 1 \mid x_i) = \left(1 + e^{-z_i}\right)^{-1}, \tag{7} \]

where z_i = \beta_0 + \sum_{j=1}^{k} f_j(x_{ij}) and each f_j is a univariate smooth function. The sum of univariate functions is an additive model as introduced by Friedman and Stuetzle (1981). Stone (1985) showed that using univariate smooth functions gives a good trade-off between interpretability and flexibility. The univariate smooth functions f_j can be estimated with different techniques, two of which are kernel regression and the natural cubic spline. Silverman (1984, 1985) showed that there exists an equivalence relation between kernel and spline regression. Since the computation time of the cubic spline is more efficient (Unser et al., 1993), it is chosen to estimate the univariate functions f_j.

A cubic spline is constructed of piecewise third-order polynomials by minimizing:

\[ \sum_{i=1}^{n} \left( \tilde{z}_i - f(x_{i,j}) \right)^2 + \lambda \int_{x_{\min}}^{x_{\max}} f''(x)^2 \, dx, \tag{8} \]

where \tilde{z}_i is the target variable and \lambda is a penalty that determines the smoothness of the curve. If \lambda \to \infty the solution is the ordinary least squares fit, and if \lambda \to 0 the fitted curve goes through all the data points. In Reinsch (1967) a solution to find the minimum is shown.

One way of estimating the additive logistic regression is to use the backfitting algorithm of Friedman et al. (2001). The pseudo code for this estimation is given in Figure 1 and Figure 2. The idea is that the Newton-Raphson algorithm is used to iteratively calculate target variables, which are then modelled using the additive model. It uses an iterative procedure over all the variables to estimate the corresponding smoothing splines.


Estimating an Additive Logistic Model

1. Set \hat{\beta}_0 = \log\left(\frac{\bar{y}}{1 - \bar{y}}\right), with \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, and set \hat{f}_j = 0.

2. Let
\[ \hat{\xi}_i = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{f}_j(x_{ij}), \qquad \hat{p}_i = \frac{1}{1 + e^{-\hat{\xi}_i}}, \qquad \tilde{z}_i = \hat{\xi}_i + \frac{y_i - \hat{p}_i}{\hat{p}_i(1 - \hat{p}_i)}. \]

3. Estimate an additive model as described in Figure 2. This gives new estimates for \hat{\beta}_0 and \hat{f}_j.

4. Repeat steps 2 and 3 until the changes are below a given threshold.

Figure 1: Pseudo code for estimating the Additive Logistic Model. The \tilde{z}_i are the target variables for the additive model.

The estimation procedure can handle dummy variables: for these, not a cubic spline but a linear function is estimated. A more detailed description of the estimation process is given in Appendix B.

In this way the k univariate functions are estimated. These functions are used to estimate the probability of default of a consumer. The derivative of the log-odds is:

\[ \frac{\partial}{\partial x_j} \ln\left(\frac{p_i}{1 - p_i}\right) = \frac{\partial f_j}{\partial x_j}, \tag{9} \]

which is not necessarily linear in x_j; hence the log-odds are no longer a linear function of x. Because the functions f_j are smooth and univariate, the estimated model can still easily be interpreted (Stone, 1985; Kraus, 2014).

However, there is still no interaction between the variables. Using multivariate instead of univariate smoothing splines in the additive model would make the interpretation of the estimated smoothing splines difficult (Stone, 1985; Kraus, 2014). The drawback is therefore that the additive logistic regression also lacks interaction effects. To incorporate both interactions and non-linear functions, the neural network is used in the next section.


Estimating an Additive Model using Backfitting

1. Set \hat{\alpha}_0 = \frac{1}{n}\sum_{i=1}^{n} \tilde{z}_i and \hat{g}_j = 0.

2. Repeat for j = 1, 2, 3, \dots, k, 1, 2, 3, \dots, k, 1, 2, 3, \dots:
\[ \hat{g}_j = S_j\left(\left\{\tilde{z}_i - \hat{\alpha}_0 - \sum_{p \neq j} \hat{g}_p(x_{ip})\right\}_{1}^{n}\right). \]
Repeat until the change in \hat{g}_j is less than the given threshold.

Figure 2: Pseudo code for the backfitting of the Additive Model. Here S_j(\{\tilde{z}_i\}_1^n) = \hat{g}_j means fitting a cubic smoothing spline to the target variables \tilde{z}_i, and \hat{g}_j is the estimated spline.
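To make the two pseudo-code boxes concrete, the following Python sketch implements the outer Newton-Raphson/working-response loop of Figure 1 around a simple backfitting loop with cubic smoothing splines (Figure 2). It is a minimal illustration, not the estimation code used in the thesis; the smoothing factor, the iteration counts, and the assumption of continuous predictors without tied values are simplifying choices.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def additive_logistic(X, y, n_outer=15, n_backfit=5, smooth=None):
    """Backfitting sketch of the additive logistic model (Figures 1 and 2).
    Assumes continuous predictors; `smooth` is passed to UnivariateSpline as
    its smoothing factor, a stand-in for the penalty lambda in Equation (8)."""
    n, k = X.shape
    beta0 = np.log(y.mean() / (1.0 - y.mean()))     # step 1 of Figure 1
    f_hat = np.zeros((n, k))                        # current f_j(x_ij) values
    splines = [None] * k

    for _ in range(n_outer):                        # Figure 1, steps 2-4
        xi = beta0 + f_hat.sum(axis=1)
        p = 1.0 / (1.0 + np.exp(-xi))
        z = xi + (y - p) / (p * (1.0 - p))          # working targets z~_i

        alpha0 = z.mean()                           # Figure 2, step 1
        for _ in range(n_backfit):                  # Figure 2, step 2
            for j in range(k):
                partial = z - alpha0 - (f_hat.sum(axis=1) - f_hat[:, j])
                order = np.argsort(X[:, j])
                spl = UnivariateSpline(X[order, j], partial[order], s=smooth)
                fj = spl(X[:, j])
                f_hat[:, j] = fj - fj.mean()        # centre for identifiability
                splines[j] = spl
        beta0 = alpha0

    return beta0, splines
```

In practice one would use an established GAM implementation; this sketch only mirrors the structure of the pseudo code.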

3.4 Neural Network

Neural networks are a technique that originated in machine learning. They are a non-parametric approach. The strength of neural networks is that they can handle complex non-linearities and/or interactions in a data set (West, 2000).

In West (2000) the neural network performs similarly to the logistic model in predicting default, but in Baesens et al. (2003) the neural network outperforms all classification methods that are studied. Neural networks do not only work for consumer credit; in the analysis of corporate bankruptcies it has also been shown that neural networks obtain good results (Atiya, 2001).

Neural networks were introduced by McCulloch and Pitts (1943). The concept is inspired by the nervous system and the human brain. In this system a neuron has two outcomes, namely on or off. The decision to be on or off is made by an activation function: the neuron is on if the sum of its inputs exceeds a threshold. In Figure 3 an example of a neuron is given. The input variables x_j are summed together in a = \sum_{i=1}^{k} w_i x_i. The output is y = f(a), where f is the activation function. In McCulloch and Pitts (1943) this function was defined as in Equation (10), also known as the unit step function:

\[ f(a) = \begin{cases} 1 & \text{if } a > 0, \\ 0 & \text{if } a \leq 0. \end{cases} \tag{10} \]

Since the introduction of the neural network, the basic neuron has not changed; only the activation function f and the algorithm to find the optimal weights received attention in further research. For example, Williams et al. (1986) suggest the multilayer perceptron and describe backpropagation to fit the weights. The multilayer perceptron does not use the step function as activation function; Williams et al. (1986) suggest using the logistic function instead.


Figure 3: A single neuron. x_1, x_2 and x_3 are inputs and w_1, w_2 and w_3 are the weights for these variables. f(a) is an activation function, with a = \sum_{i=1}^{k} w_i x_i, and y is the output of the neuron.

Furthermore, neurons are combined into a network as shown in Figure 4. In the multilayer perceptron the input is first combined in the hidden layer. The hidden layer consists of neurons with an activation function h. The neurons in the hidden layer can best be described as latent variables. The values of these latent variables are forwarded to the output layer to produce the final probability.

It is also possible to have two hidden layers. The output of the first hidden layer is then directed to a second hidden layer, after which it is sent to the output layer.

Due to this structure, the interpretation of the marginal effects of the variables is hard. Furthermore, there is no theory on how to decide on the number of hidden layers and the number of neurons in these layers, other than grid search (Hsu et al., 2003). Grid search will be explained in Section 5.

Kuan and White (1994) and Bishop (1995) have shown that a multilayer perceptron can approximate almost any non-linear function, given that the number of hidden neurons is large enough. However, the interpretation of the effect of the variables within the model is difficult.


Figure 4: A multilayer perceptron. Each neuron in the hidden layer is constructed as in Figure 3, with an activation function h. The last neuron is the output layer, with activation function f.

3.4.1 Estimation of the neural network

In Williams et al. (1986) the backpropagation method is used to estimate the weights in the neural network. Using the logistic function, as in Equation (2), as activation function for f and h yields:

\[ p_i = \left(1 + e^{-\left(w_0 + \sum_{j=1}^{l} w_j h_{j,i}\right)}\right)^{-1}, \tag{11} \]
\[ h_{j,i} = \left(1 + e^{-\left(v_{j,0} + \sum_{t=1}^{k} v_{j,t} x_{t,i}\right)}\right)^{-1} \quad \text{for } j = 1, \dots, l, \tag{12} \]

where p_i in this case is the outcome of the neural network and h_{j,i} is the activation of the j-th neuron in the hidden layer. The weights w and v need to be estimated by backpropagation. The objective is to minimize:

\[ O = -\sum_{i=1}^{n} y_i \log(p_i) + (1 - y_i) \log(1 - p_i), \tag{13} \]

which is similar to Equation (3). The estimation is initialized with random values for all the model weights. With these weights p_i can be calculated. Using the true observations y_i, the model weights are adapted using the gradient descent algorithm. The new estimated weights are used to calculate a new p_i and the process is repeated until the weights have converged.


The gradient descent update rules for the weights are:

\[ w_j^{\tau+1} = w_j^{\tau} - \eta \frac{\partial O}{\partial w_j}, \tag{14} \]
\[ v_{j,t}^{\tau+1} = v_{j,t}^{\tau} - \eta \frac{\partial O}{\partial v_{j,t}}. \tag{15} \]

In these update rules, \eta is the learning rate. It is a parameter that needs to be set before estimating the neural network weights. Setting \eta too low can lead to slow convergence, while setting it too high can lead to divergence from the minimum. The derivatives are calculated in Appendix C.

The motivation behind the update rule is that, if the derivative of the objective function is positive, decreasing the weight will improve the objective; if the derivative is negative, increasing the weight will improve the objective. The change in the weight also depends on the absolute size of the derivative: the larger the absolute value, the larger the change in the weight.

It is not guaranteed that this algorithm reaches the global minimum (Sontag and Sussmann, 1989; Gori and Tesi, 1992; Randall and Martinez, 2003).

Therefore, mini-batch backpropagation is used instead of normal backpropagation.

The difference between mini-batch and normal backpropagation is the number of observations that is used in Equations (62) and (63). Backpropagation uses the whole sample, whereas the mini-batch algorithm uses a sub-sample of the observations in every update step. The advantages of the mini-batch algorithm are that it is computationally less demanding and less likely to get stuck in a local minimum (Randall and Martinez, 2003).
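The following Python sketch illustrates mini-batch backpropagation for a one-hidden-layer perceptron with logistic activations, mirroring Equations (11)-(15); the layer size, learning rate, batch size, and initialisation are illustrative choices, not those used in the thesis.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, y, n_hidden=5, eta=0.1, batch_size=64, n_epochs=200, seed=0):
    """Mini-batch backpropagation sketch for a one-hidden-layer perceptron
    with logistic activations; a simplified illustration, not the thesis code."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    V = rng.normal(scale=0.1, size=(n_hidden, k + 1))   # hidden weights v_{j,t} (incl. bias)
    w = rng.normal(scale=0.1, size=n_hidden + 1)         # output weights w_j (incl. bias)

    Xb = np.hstack([np.ones((n, 1)), X])                 # prepend a bias column
    for _ in range(n_epochs):
        for idx in np.array_split(rng.permutation(n), max(n // batch_size, 1)):
            xb, yb = Xb[idx], y[idx]
            h = sigmoid(xb @ V.T)                        # hidden activations h_{j,i}, Eq. (12)
            hb = np.hstack([np.ones((len(idx), 1)), h])
            p = sigmoid(hb @ w)                          # network output p_i, Eq. (11)
            delta_out = p - yb                           # gradient of O w.r.t. output pre-activation
            grad_w = hb.T @ delta_out / len(idx)
            delta_hid = np.outer(delta_out, w[1:]) * h * (1 - h)
            grad_V = delta_hid.T @ xb / len(idx)
            w -= eta * grad_w                            # update rules (14)-(15)
            V -= eta * grad_V
    return w, V
```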

3.5 Discriminant analysis

Another traditional technique is discriminant analysis, a classification method based on Bayes' rule. Linear discriminant analysis (LDA) was first used in finance for the prediction of bankruptcies (Altman, 1968).

If the assumption of normality holds, discriminant analysis is more efficient than the logistic regression (McFadden, 1976; Lo, 1986). But even when the assumptions are not likely to be met, discriminant analysis can still have good predictive power.

The basis of discriminant analysis is Bayes' rule, which gives:

\[ P(y_i = 1 \mid x_i) = \frac{P(x_i \mid y_i = 1) P(y_i = 1)}{P(x_i)}, \tag{16} \]
\[ P(y_i = 0 \mid x_i) = \frac{P(x_i \mid y_i = 0) P(y_i = 0)}{P(x_i)}. \tag{17} \]

The probability of belonging to the class y_i = 1 is given by:

\[ p_i = \frac{P(y_i = 1 \mid x_i)}{P(y_i = 0 \mid x_i) + P(y_i = 1 \mid x_i)}. \tag{18} \]


In theory any distribution function can be used for the distribution of x | y = 1 and x | y = 0. The two most often used methods, LDA and quadratic discriminant analysis (QDA), both assume that the data are normally distributed.

3.5.1 Linear Discriminant analysis

LDA was first introduced in Fisher (1936), which solved the classification problem by maximizing the distance between the means of class 1 (\mu_1) and class 0 (\mu_0) relative to the within-group variance. For the calculations see Appendix D.

When LDA is used to calculate a probability as in Equation (18), there are two assumptions, namely that the variables x_i are multivariate normally distributed and that the covariance matrices of both classes are equal. Hence:

\[ x \mid y = 1 \sim N(\mu_1, \Sigma), \tag{19} \]
\[ x \mid y = 0 \sim N(\mu_0, \Sigma). \tag{20} \]

Therefore the probability that an observation belongs to class 1 is:

\[ p_i = \frac{f_1(x) P(y_i = 1)}{f_0(x) P(y_i = 0) + f_1(x) P(y_i = 1)}, \tag{21} \]

where f_l for l = 0, 1 is the multivariate normal density with mean \mu_l and covariance matrix \Sigma. The estimated probability \hat{p}_i is calculated with \hat{\mu}_0, \hat{\mu}_1, and \hat{\Sigma} as defined in Equations (22), (23), and (24):

\[ \hat{\mu}_0 = \frac{1}{n_0} \sum_{y_i = 0} x_i, \tag{22} \]
\[ \hat{\mu}_1 = \frac{1}{n_1} \sum_{y_i = 1} x_i, \tag{23} \]
\[ \hat{\Sigma} = \frac{1}{n - 2} \sum_{l = 0}^{1} \sum_{y_i = l} (x_i - \hat{\mu}_l)(x_i - \hat{\mu}_l)'. \tag{24} \]

Here \hat{\Sigma} is the within-group covariance matrix.

3.5.2 Quadratic Discriminant analysis

An extension of LDA is QDA. The assumption that both groups have the same covariance matrix is dropped, but the assumption of multivariate normality still holds. The probability is calculated in a similar way as in Equation (21), except that the normal densities f_l are calculated with their corresponding \hat{\Sigma}_l and \hat{\mu}_l, as defined in Equations (22) and (23), and:

\[ \hat{\Sigma}_0 = \frac{1}{n_0 - 1} \sum_{y_i = 0} (x_i - \hat{\mu}_0)(x_i - \hat{\mu}_0)', \tag{25} \]
\[ \hat{\Sigma}_1 = \frac{1}{n_1 - 1} \sum_{y_i = 1} (x_i - \hat{\mu}_1)(x_i - \hat{\mu}_1)'. \tag{26} \]
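A minimal Python sketch of these plug-in discriminant probabilities is given below, assuming the multivariate-normal model of Equations (19)-(26); the function and argument names are illustrative, not part of the thesis.

```python
import numpy as np
from scipy.stats import multivariate_normal

def discriminant_probabilities(X, y, X_new, pooled=True):
    """Plug-in posterior probabilities of Equation (21).
    pooled=True gives LDA (shared covariance, Eq. (24));
    pooled=False gives QDA (class covariances, Eqs. (25)-(26))."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)        # Equations (22)-(23)
    prior1 = len(X1) / len(X)
    prior0 = 1.0 - prior1

    if pooled:
        # within-group covariance, Equation (24)
        S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (len(X) - 2)
        S0 = S1 = S
    else:
        S0 = np.cov(X0, rowvar=False)                  # Equation (25)
        S1 = np.cov(X1, rowvar=False)                  # Equation (26)

    f0 = multivariate_normal(mu0, S0).pdf(X_new)
    f1 = multivariate_normal(mu1, S1).pdf(X_new)
    return f1 * prior1 / (f0 * prior0 + f1 * prior1)   # Equation (21)
```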

3.6 Support Vector Machine

The last method that is used to estimate the probability of default is the support vector machine (SVM). The SVM is a mathematical technique to separate the defaulted loans from the non-defaulted loans. Instead of coding the defaulted loans as y = 0, the SVM uses y = -1 for the defaulted loans; the non-defaulted loans are still coded as y = 1. The initial idea of the SVM was to find a linear hyperplane that separates the classes such that the minimum distance between the two classes is maximized (Vapnik, 1963; Vapnik and Chervonenkis, 1964). In other words, the SVM looks for a hyperplane that maximizes the margin between the two classes. An example is given in Figure 5⁴. The SVM finds a hyperplane such that for all i = 1, \dots, n:

\[ b + w' x_i \geq 1 \quad \text{if } y_i = 1, \qquad b + w' x_i \leq -1 \quad \text{if } y_i = -1. \tag{27} \]

If the constraint is binding for an observation, the distance between the separating hyperplane, given by b + w'x_i = 0, and that observation equals \frac{1}{\|w\|}. This is the distance that the SVM maximizes by solving:

\[ \min \tfrac{1}{2} w'w \quad \text{such that} \quad y_i(b + w' x_i) \geq 1. \tag{28} \]

This problem can be solved with the Kuhn-Tucker conditions. The observations for which the constraints are binding are called the support vectors. Instead of producing a probability, as the logistic regression, discriminant analysis, and neural network do, the SVM assigns every observation a score, score_i = \hat{b} + \hat{w}' x_i, where \hat{b} and \hat{w} are the solutions of the minimization.

The problem, however, with this model is that there is only a solution if the data is separable by a hyperplane.

⁴ Source of picture: http://blog.pengyifan.com/tikz-example-svm-trained-with-samples-from-two-classes/


Figure 5: An example of the SVM model. The filled dots and empty circles are the two classes that the SVM tries to separate. The solid line is the separating hyperplane that maximizes the distance between the two classes. The two dashed lines indicate the constraints. The red points are the support vectors.


In many cases this is not possible. Therefore, an extension of the model is suggested by Cortes and Vapnik (1995). They include a soft margin in the model, allowing the hyperplane to not perfectly separate the data. This is done by relaxing the constraints in (27) at the cost of a penalty. The model becomes:

\[ \min \frac{1}{2} w'w + C \sum_{i=1}^{n} \epsilon_i \quad \text{such that} \quad y_i(b + w' x_i) \geq 1 - \epsilon_i, \quad \epsilon_i \geq 0. \tag{29} \]

Here C is the penalty assigned to the cost of relaxing the constraints, and it needs to be set by the researcher. Using the Kuhn-Tucker Lagrange conditions the problem can be solved; see Appendix E. Finding the maximum of the objective function is done by the SMO algorithm, as explained by Platt (1999).

The constraints, as defined in Equations (28) and (29), are linear in x_i, which might not be optimal. For this, Cortes and Vapnik (1995) suggest the use of a kernel function. Let:

\[ \phi(x_i): \mathbb{R}^k \to \mathbb{R}^d, \tag{30} \]

where \phi(x_i) can be any function. The SVM conditions can then be written as:

\[ \min \frac{1}{2} w'w + C \sum_{i=1}^{n} \epsilon_i \quad \text{such that} \quad y_i(b + w' \phi(x_i)) \geq 1 - \epsilon_i, \quad \epsilon_i \geq 0. \tag{31} \]

Doing similar calculations as in Appendix E gives:

\[ \max_{\lambda \geq 0} \; -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i y_i \lambda_j y_j \phi(x_i)'\phi(x_j) + \sum_{i=1}^{n} \lambda_i, \quad \text{such that} \quad \sum_{i=1}^{n} \lambda_i y_i = 0, \quad C \geq \lambda_i \geq 0. \tag{32} \]

Cortes and Vapnik (1995) did not define the function \phi explicitly but defined:

\[ \phi(x_i)'\phi(x_j) = K(x_i, x_j). \tag{33} \]


The function K is called a kernel function. The most commonly used is the RBF kernel function, given by:

\[ K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}. \tag{34} \]

In the overview provided by Lessmann et al. (2015) it can be seen that almost one third of the estimated models use the RBF kernel. Hsu et al. (2003) argue that the RBF kernel is a good choice, because the value of K is bounded between 0 and 1 and it has only one parameter that needs to be set, namely \gamma. But most importantly, it is very flexible. To show this, the RBF kernel in a one-dimensional space is elaborated. Take:

\[ x_1, x_2 \in \mathbb{R}, \tag{35} \]
\[ K(x_1, x_2) = e^{-\gamma (x_1 - x_2)^2} = e^{-\gamma (x_1^2 + x_2^2)} e^{2\gamma x_1 x_2}, \tag{36} \]

and using the Taylor expansion gives:

\[ K(x_1, x_2) = e^{-\gamma (x_1^2 + x_2^2)} \sum_{i=0}^{\infty} \sqrt{\frac{(2\gamma)^i}{i!}}\, x_1^i \; \sqrt{\frac{(2\gamma)^i}{i!}}\, x_2^i = \phi(x_1)'\phi(x_2). \tag{37} \]

Hence:

\[ \phi(x) = e^{-\gamma x^2} \left[ 1, \; \sqrt{\frac{2\gamma}{1!}}\, x, \; \sqrt{\frac{(2\gamma)^2}{2!}}\, x^2, \; \sqrt{\frac{(2\gamma)^3}{3!}}\, x^3, \; \dots \right]'. \tag{38} \]

The kernel function thus maps \mathbb{R} to an infinite-dimensional space. The parameter \gamma can be seen as a weight that determines the influence of the higher dimensions.
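A quick numerical check of Equations (36)-(38) can be done by truncating the infinite feature map; the sketch below is only an illustration, with the truncation length chosen arbitrarily.

```python
import numpy as np
from scipy.special import factorial

def rbf_kernel(x1, x2, gamma):
    # Equation (34) in one dimension
    return np.exp(-gamma * (x1 - x2) ** 2)

def truncated_feature_map(x, gamma, n_terms=30):
    """Truncated version of the feature map in Equation (38)."""
    i = np.arange(n_terms)
    return np.exp(-gamma * x ** 2) * np.sqrt((2 * gamma) ** i / factorial(i)) * x ** i

x1, x2, gamma = 0.8, -0.3, 1.5
exact = rbf_kernel(x1, x2, gamma)
approx = truncated_feature_map(x1, gamma) @ truncated_feature_map(x2, gamma)
print(exact, approx)   # the two values agree up to truncation error
```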

Similar calculations can be done when x_1 \in \mathbb{R}^k and x_2 \in \mathbb{R}^k. Rewriting Equation (32) gives the mathematical problem:

\[ \max_{\lambda \geq 0} \; -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i y_i \lambda_j y_j K(x_i, x_j) + \sum_{i=1}^{n} \lambda_i, \quad \text{such that} \quad \sum_{i=1}^{n} \lambda_i y_i = 0, \quad C \geq \lambda_i \geq 0. \tag{39} \]

The score for an observation is given by score_j = \hat{b} + \sum_{i=1}^{n} \lambda_i y_i K(x_i, x_j). Note that \lambda_i = 0 for all observations whose constraints are not binding; hence it is enough to sum over the support vectors. The infinite-dimensional space of the RBF kernel can also be a reason not to use it, since it is unclear on the basis of which dimensions the separation between the two classes is made. The SVM has also been used for the prediction of default, where it performs better than logistic regression and neural networks in predicting the bankruptcies of companies (Min and Lee, 2005; Van Gestel et al., 2006). In Bellotti and Crook (2009) logistic regression and SVM models perform similarly for consumer credit.

As with the size of the neural network, there is no general theory that indicates which values to choose for \gamma and C. The only practical way to find values for C and \gamma is grid search, which will be further explained in Section 5.
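A sketch of how C and \gamma could be selected by grid search with cross-validated AUROC is given below; the grid values, the preprocessing, and the simulated data are placeholders, not the settings used in the thesis.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data: X is a borrower-feature matrix, y is 1 for non-default
# and 0 for default (scikit-learn handles the internal +1/-1 coding itself).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF-kernel SVM, Equation (34); C and gamma tuned by grid search.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
scores = search.decision_function(X_test)   # SVM scores b + sum_i lambda_i y_i K(x_i, x)
```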

4 Performance measure

In default analysis the datasets are commonly unbalanced: in Baesens et al. (2003), Bellotti and Crook (2009), and Kraus (2014) default rates lie between 2.5% and 44%, and in this research the default rate is 19%. Since this research is about differences in the performance of the methods compared with other research, a performance measure that is not influenced by different prior class distributions is selected.

The area under the receiver operating characteristic curve (AUROC) has this property (Bradley, 1997). The AUROC was first suggested as a performance measure for binary classification by Hanley and McNeil (1982).

The Receiver Operating Characteristic (ROC) is a curve that plots, for different thresholds, the true positive rate (TPR) against the false positive rate (FPR). The threshold decides whether an observation is classified as default or non-default. The logistic regression outcome is used as an example of how the ROC curve can be drawn.

With the logistic regression, every observation has a probability as outcome. Setting the threshold at 0.5 means that all observations with a probability higher than 0.5 are classified as non-default and all others as default. Using these predicted classes, a contingency table can be completed as in Table 1. This table gives an overview of the relation between the true classes and the predicted classes. The corresponding TPR and FPR are calculated as TPR = TP/n_1 and FPR = FP/n_0, where TP is the number of true positives and FP is the number of false positives.

Similar calculations can be done for other thresholds between zero and one. If the threshold is increased, the TPR and FPR both decrease; a decrease of the threshold leads to an increase in both the TPR and the FPR. Doing this for a range of thresholds gives combinations of TPRs and FPRs which can be plotted to obtain the ROC curve.
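The threshold sweep described above can be written down directly; the following sketch computes (FPR, TPR) pairs for a grid of thresholds, with y = 1 denoting non-default as in Section 3.1, and is an illustration rather than the evaluation code of the thesis.

```python
import numpy as np

def roc_points(y, p_hat, thresholds):
    """TPR/FPR pairs for a grid of thresholds (classify non-default if p_hat > t)."""
    points = []
    for t in thresholds:
        pred = (p_hat > t).astype(int)
        tp = np.sum((pred == 1) & (y == 1))      # correctly predicted non-defaults
        fp = np.sum((pred == 1) & (y == 0))      # defaults predicted as non-default
        tpr = tp / np.sum(y == 1)
        fpr = fp / np.sum(y == 0)
        points.append((fpr, tpr))
    return points
```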

The ROC curve gives an overview of the "benefits" (correctly classified non-defaulted loans) and the corresponding "costs" (misclassified defaulted loans) needed to obtain those benefits. An example of an ROC curve is given in Figure 6. On the horizontal axis the FPR can be found and on the vertical axis the corresponding TPR. In the lower left corner the threshold equals 1: all loans are classified as default, hence TPR = 0 and FPR = 0. In the upper right corner the threshold is 0: all observations are classified as non-default, hence TPR = 1 and FPR = 1.

The area under the ROC curve is the AUROC. The AUROC is the probability that a randomly selected observation with y_i = 1 has a higher score than a randomly selected observation with y_i = 0. Its minimum is 0.5, which means that the classification performs as well as randomly assigning scores to observations. The maximum value is one, meaning perfect separation of the scores between the two classes.

One way of calculating the AUROC is via the Mann-Whitney U statistic, which is calculated by:

\[ U = \sum_{y_i = 1} \sum_{y_j = 0} \mathbf{1}[\hat{p}_i > \hat{p}_j], \tag{40} \]

where \mathbf{1}[\hat{p}_i > \hat{p}_j] is an indicator function and \hat{p} is the estimated probability. U is the number of times that a non-defaulted loan gets a better score in a pairwise comparison than a defaulted loan. The AUROC is then:

\[ AUROC = \frac{U}{n_0 n_1}. \tag{41} \]

The AUROC can be interpreted as the probability that a random non-defaulted loan gets a higher score than a random defaulted loan. There is also a direct link between the AUROC and the Gini coefficient, namely Gini = 2 × AUROC − 1 (Kraus, 2014).
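As an illustration of Equations (40)-(41), the sketch below computes the AUROC by pairwise comparison and checks it against scikit-learn; ties are counted as one half, which is the usual Mann-Whitney convention and an addition to the formula as stated above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_mann_whitney(y, p_hat):
    """AUROC via the pairwise comparison of Equations (40)-(41)."""
    pos = p_hat[y == 1]          # scores of non-defaulted loans (y = 1)
    neg = p_hat[y == 0]          # scores of defaulted loans (y = 0)
    diff = pos[:, None] - neg[None, :]
    u = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return u / (len(pos) * len(neg))

# Small illustrative check with simulated probabilities.
rng = np.random.default_rng(1)
y = rng.binomial(1, 0.8, size=500)
p_hat = np.clip(0.6 * y + rng.normal(0, 0.3, size=500), 0, 1)
print(auroc_mann_whitney(y, p_hat), roc_auc_score(y, p_hat))
```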

              Predicted non-default    Predicted default      Total
Non-default   True Positive (TP)       False Negative (FN)    Total positives (n_1)
Default       False Positive (FP)      True Negative (TN)     Total negatives (n_0)
                                                              n

Table 1: An example of a contingency table. It gives an overview of the numbers of predicted defaulted and non-defaulted loans against the true numbers of defaulted and non-defaulted loans.


Figure 6: An example of an ROC curve. The diagonal line is the ROC curve of a classifier using random scores. The curved line is the ROC curve of an estimated model. On the horizontal axis the FPR can be found and on the vertical axis the corresponding TPR. In the lower left corner the threshold equals 1: all loans are classified as default, hence TPR = 0 and FPR = 0. In the upper right corner the threshold is 0: all observations are classified as non-default, hence TPR = 1 and FPR = 1.


4.1 Testing difference between two AUROCs

To compare the AUROC scores of different methods, the test procedure introduced by DeLong et al. (1988) is used. This test, also called DeLong's test, compares two AUROCs that are generated from the same data. DeLong's test will be used to compare the methods' performance on the Lending Club data set. Furthermore, it will be used in the forward and backward selection method, explained in Section 5.1, to select the variables that are included in the logistic regression, additive logistic regression, LDA, and QDA.

The null hypothesis of the test is that r's = 0, where r is a contrast vector and s a vector with the AUROC scores. Since this paper tests whether two AUROC scores differ significantly or not, r = (1, -1)'.

The test statistic, also called the Z-score, is given by:

\[ Z = \frac{r's}{\sqrt{r'Sr}}, \tag{42} \]

where S is the covariance matrix of the AUROC scores. If the absolute value of the Z-score is too large, the null hypothesis that the two AUROCs are equal is rejected. DeLong et al. (1988) have shown that the test statistic Z follows a standard normal distribution.
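The sketch below is a simplified Python illustration of DeLong's test for two correlated AUROCs estimated on the same data; it follows the placement-value construction from DeLong et al. (1988) but is not the implementation used in the thesis, and the tie handling (counted as one half) is an added convention.

```python
import numpy as np
from scipy.stats import norm

def delong_test(y, p_a, p_b):
    """DeLong-style comparison of the AUROCs of two score vectors p_a and p_b."""
    pos, neg = y == 1, y == 0
    n1, n0 = pos.sum(), neg.sum()

    def placements(scores):
        # psi = 1 if a non-default scores above a default, 0.5 if equal, 0 otherwise
        psi = (scores[pos][:, None] > scores[neg][None, :]).astype(float)
        psi += 0.5 * (scores[pos][:, None] == scores[neg][None, :])
        return psi.mean(axis=1), psi.mean(axis=0)   # V10 (per positive), V01 (per negative)

    v10_list, v01_list = [], []
    for p in (p_a, p_b):
        v10_i, v01_i = placements(p)
        v10_list.append(v10_i)
        v01_list.append(v01_i)
    v10, v01 = np.vstack(v10_list), np.vstack(v01_list)

    aucs = v10.mean(axis=1)                          # the vector s of AUROC scores
    S = np.cov(v10) / n1 + np.cov(v01) / n0          # covariance matrix of the AUROCs
    r = np.array([1.0, -1.0])                        # contrast vector
    z = r @ aucs / np.sqrt(r @ S @ r)                # Equation (42)
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return aucs, z, p_value
```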

5 Methodologies

To see which of the methods makes the best prediction of whether a credit will default or not, the data of Lending Club is split into two sets: a train set and a test set.

The train set is used to estimate or "train" the models described in Section 3. The test set is then used to calculate the AUROC and compare the performance of the methods. The split is made to prevent overfitting and is suggested in Baesens et al. (2003); Hand and Henley (1997); Lessmann et al. (2015). Overfitting means estimating a model so specific to the train data that it does not perform well on predictions outside the train set. For an investor, the prediction of whether a loan will default is also made out of sample; therefore, testing the models' out-of-sample performance is important as well.
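The split-and-evaluate workflow described here can be sketched as follows; X and y merely stand in for the Lending Club feature matrix and default indicator, and the logistic regression is only used as a placeholder model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative data (1 = non-default).
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=2000) > -1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # train set
p_test = model.predict_proba(X_test)[:, 1]                        # test set
print("out-of-sample AUROC:", roc_auc_score(y_test, p_test))
```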

Overfitting can occur for all the methods. For logistic regression and discriminant analysis it can occur when variables that do not influence the probability of default are included in the estimation. This can be solved by the forward or backward selection method, explained in Section 5.1.

For the neural network and the SVM, parameters need to be chosen before the models can be estimated. Using the test set to decide how many neurons are needed in the neural network can still lead to overfitting. Therefore cross-validation is used, as in Baesens et al. (2003); Hand and Henley (1997); Lessmann et al. (2015). Because of the computational
