
Stochastics and Financial Mathematics

Master Thesis

Credit risk and survival analysis:

Estimation of Conditional Cure Rate

Author: Just Bajželj
Supervisor: dr. A.J. van Es
Daily supervisor: R. Man MSc
Examination date: August 30, 2018

Korteweg-de Vries Institute for Mathematics


Abstract

Rabobank currently uses a non-parametric estimator for the computation of the Conditional Cure Rate (CCR), and this method has several shortcomings. The goal of this thesis is to find a better estimator than the one currently used.

This master thesis looks into three CCR estimators. The first is the currently used method. We analyze its performance with the bootstrap and later develop a method with better performance. Since neither the newly developed nor the currently used estimator is theoretically correct with respect to the data, a third method is introduced. However, according to the bootstrap, the latter method exhibits the worst performance. The programming language Python is used for the modeling and data analysis.

Title: Credit risk and survival analysis: Estimation of Conditional Cure Rate
Author: Just Bajželj, just.bajzelj@gmail.com, 11406690

Supervisor: dhr. dr. A.J. van Es

Second Examiner: dhr. dr. A.V. den Boer
Examination date: August 30, 2018

Korteweg-de Vries Institute for Mathematics
University of Amsterdam
Science Park 105-107, 1098 XG Amsterdam
http://kdvi.uva.nl

Rabobank
Croeselaan 18, 3521 CB Utrecht


Acknowledgments

I would like to thank my parents, who made it possible for me to complete the two-year Master's in Stochastics and Financial Mathematics in Amsterdam, which helped me become the person I am today.

I would also like to thank the people at Rabobank and the Risk Analytics department, thanks to whom I wrote this thesis during my six-month internship, and who showed me that work can be more than enjoyable.

In particular, I would like to acknowledge all my mentors: Bert van Es, who always had time to answer all my questions; Viktor Tchistiakov, who gave me challenging questions and ideas, which form the core of this thesis; and Ramon Man, who always supported me and made sure this thesis was finished on schedule.


Contents

Introduction
  0.1 Background
  0.2 Research objective and approach

1 Survival analysis
  1.1 Censoring
  1.2 Definitions
  1.3 Estimators
  1.4 Competing risk setting

2 Current CCR Model
  2.1 Model implementation
    2.1.1 Stratification
  2.2 Results
  2.3 Performance of the method
    2.3.1 The bootstrap

3 Cox proportional hazards model
  3.1 Parameter estimation
    3.1.1 Estimation of β
    3.1.2 Baseline hazard estimation
  3.2 Estimators
  3.3 Model implementation
  3.4 Results
  3.5 Performance of the method
  3.6 Discussion

4 Proportional hazards model in an interval censored setting
  4.1 Theoretical background
    4.1.1 Generalized Linear Models
  4.2 Model implementation
    4.2.1 Binning
  4.3 Results
  4.4 Performance of the method
  4.5 Discussion


Introduction

Under Basel II banks are allowed to build internal models for the estimation of risk parameters. This is known as the Internal Ratings Based (IRB) approach. Risk parameters are used by banks to calculate their regulatory capital. At Rabobank, the Loss Given Default (LGD), Probability of Default (PD) and Exposure at Default (EAD) are calculated with the IRB approach.

Loss given default describes the loss of the bank in case a client defaults. Default happens when the client is unable to make the monthly mortgage payments for some time, or when one of the other default events, usually connected with the client's financial difficulties, occurs. A missed payment is also known as an arrear. After the client defaults, his portfolio is non-performing and two events can happen. The event in which the client's portfolio returns to performing is known as cure. Cure happens if the customer pays his arrears and has no missed payments during a three-month period, or if he pays the arrears after a loan restructuring and has no arrears during a twelve-month period. The event in which the bank needs to sell the client's security in order to cover the loss is called liquidation.

Two approaches are available for LGD modeling. The non-structural approach estimates the LGD from historical loss and recovery data. Rabobank uses the other approach, the so-called structural approach. When the bank treats LGD in a structural way, different probabilities and outcomes of default are considered. The model is split into several components that are developed separately and later combined to produce the final LGD estimate, as can be seen in Figure 0.1. In order to calculate the loss given default we first need to calculate the probability of cure, the loss given liquidation, the loss given cure and the indirect costs. Rabobank assumes that the loss given cure equals zero, since cure is offered to the client only if there is zero loss to the bank, and any indirect costs incurred during the cure process are taken into account in the indirect costs component. Therefore, in this thesis the term loss is sometimes used instead of the term liquidation.

Indirect costs are usually costs that are incurred internally by the departments of the bank involved in processing defaults, e.g., salaries paid to employees and administrative costs. The equation used for the LGD calculation can be seen in Figure 0.1.

The parameter probability of cure (P_Cure) is the probability that the client will cure before his security is liquidated. All model components, including P_Cure, depend on covariates of each client.

A large proportion of the cases used to estimate P_Cure are unresolved cases. A client that has defaulted and has not yet been liquidated or cured is called unresolved. The easiest way to picture the unresolved state is that every client needs some time after default in order to cure or to sell the property. The non-absorbing state before cure or liquidation is called the unresolved state.

Figure 0.1: LGD = P_Cure · LGC + IC + (1 − P_Cure) · LGL.

Unresolved cases can be treated in two ways:

• Exclusion: the cases can be excluded from the P_Cure estimation.
• Inclusion with expected outcome: we can include unresolved cases in the P_Cure estimation by assigning an expected outcome, or Conditional Cure Rate, to them.

If unresolved cases are excluded from the P_Cure estimation, the parameter estimator may be biased. In other words, P_Cure would be estimated on a sample where clients are cured after a short time. Consequently, clients who would need more time to be cured would get a smaller value of P_Cure than they deserve. Since P_Cure gives the probability that a client will be cured after default, and such a probability is not time dependent, treating unresolved cases by exclusion would be wrong. One approach within Rabobank is to treat unresolved cases by assigning them a value called the Conditional Cure Rate. The Conditional Cure Rate (CCR) can be estimated with a non-parametric technique. This thesis is about developing a new model that is able to eliminate the shortcomings of the currently employed model.

0.1 Background

CCR gives the probability that a client will cure after a certain time point, conditioned on the event that the client is still unresolved at that point. Rabobank's current model estimates CCR with survival analysis, a branch of statistics specialized in the distribution of lifetimes. The lifetime is the time to the occurrence of an event; in our case it is the time between the default of the client and cure or liquidation. The current CCR estimator is a combination of the Kaplan-Meier and Nelson-Aalen estimators, two of the most recognized and simplest estimators of survival distributions. The current CCR model has some shortcomings. The goal of the present research is to develop a new CCR model which is able to outperform the current one.


0.2 Research objective and approach

The objective of this thesis is to find new ways of estimating CCR that improve on the current estimation. The new techniques will be compared with the currently used CCR estimator. The research goal of this thesis is:

Develop an alternative CCR model which is better than the current one.

In order to reach this research goal, we first need to answer the first research question:

What type of model is natural for the problem?

To answer this question we first study the currently used model. Second, we get familiar with basic and more advanced concepts of survival analysis. Once this is done, we will be able to derive more advanced and probably better estimators. In particular, we will get some basic ideas about the techniques that can be used for CCR modeling. In order to find a better model than the one currently used, we have to know what better means. In other words, we need to be able to answer the second research question:


1 Survival analysis

Survival analysis is a branch of statistics which specializes in the distribution of lifetimes. The random variable denoting a lifetime will be denoted by T. Survival analysis has its roots in medical studies, which look into the lifetime of patients or the time to cure of a disease. In our case the lifetime of interest is the time a client spends in the unresolved state, and the events of interest are cure and liquidation. One of the most attractive features of survival analysis is that it can cope with censored observations.

The estimator currently used for CCR estimation is a combination of two estimators of two different quantities that are specific to survival analysis. In order to understand and model CCR as it is currently modeled by Rabobank, we first have to know which quantities Rabobank models, how these quantities are estimated, and how to model them when a client can become resolved for two reasons. Censoring is presented in Section 1.1. The most basic concepts of survival analysis and its specific quantities are introduced in Section 1.2. In Section 1.3 we look into basic estimators of survival analysis. In Section 1.4 the theory and the estimators for the case with two possible outcomes are presented, since our model assumes that a client can become resolved for two reasons: cure and liquidation.

1.1 Censoring

When lifetime distributions are modeled with a non-survival-analysis approach, only observations where the event of interest took place are used. For simplicity we look into an example from medical studies. For instance, if a study on 10 patients has been made and only 5 of them died while the other 5 are still alive, then only 5 observations can be used to model the distribution of lifetimes. Survival analysis is able to extract information about the distribution from the other patients as well. Patients who are still alive at the end of the study are called censored observations, and survival analysis is able to cope with censored data. A visualization of censored observations can be seen in Figure 1.1. The time variable on the x axis represents the month of the study, while the y axis shows the number of each patient. The line on the right side represents the end of the study. The circle represents the month of the study in which the patient got sick, while the arrow represents the death of a patient. It can be seen that the deaths of patients 6 and 4 are not observed, because the study ended before the event of interest happened. The event of interest is also not seen for patient 2, because the patient left the study before its end. Such observations are called censored. Despite the fact that the event of interest did not happen, they tell us that the event of interest happened after the last time we saw the patient alive, and they consequently have an impact on the estimated distribution.

Figure 1.1: Censored and uncensored observations

Censoring in which we know the time when the patient got sick, but do not know the time of the event of interest, is called right censoring (Lawless 2002).

For now we will assume that we have just one type of censoring: right censoring. In our case this means that a client who has defaulted is observed, but the observation period finished before the client was cured or liquidated. To understand why this happens, note that Rabobank's data contain observations where a client cures after 48 months (4 years). Consequently, if clients who defaulted in 2016 are observed, some of them are still unresolved, despite the fact that they will be cured or liquidated sometime in the future.


1.2 Definitions

In order to use techniques from survival analysis, some quantities need to be defined; they will be modeled directly or indirectly. For simplicity it will be assumed that a client can become resolved for one reason only: cure. Later this assumption will be relaxed. All definitions and theory in this chapter can be found in any book on survival analysis. For a review of survival analysis see Lawless (2002), Kalbfleisch and Prentice (2002) and Klein and Moeschberger (2003).

Definition 1.2.1. Distribution function: Since T is a random variable it has a distribution function and a density function, denoted by F and f, with

$$F(t) = P(T \le t) = \int_0^t f(s)\,ds.$$

The distribution function tells us the probability that the event of interest, cure, will happen before time t.

In survival analysis we can also be interested in the probability that the client is still unresolved at time t.

Definition 1.2.2. The survival function represents the probability that the event of interest did not happen up to time t and it is denoted by S or

$$S(t) = P(T > t) = 1 - F(t). \qquad (1.1)$$

The survival function is a non-increasing, right-continuous function of t with S(0) = 1 and $\lim_{t\to\infty} S(t) = 0$.

Since we can also be interested in the rate of events in a small time step after time t, we define the following quantity.

Definition 1.2.3. The hazard rate function is a function that gives the rate of cures in an infinitesimally small step after t, conditioned on the event that the client is still unresolved at time t. The hazard rate function is denoted by λ, with

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}.$$

It is important to note that the hazard rate function can take values larger than 1. If λ(t) ≥ 1, the event is likely to occur shortly after time t. In order to interconnect all the defined quantities we introduce the cumulative hazard function, defined as

$$\Lambda(t) = \int_0^t \lambda(s)\,ds.$$

Let us now see how these quantities are connected. We know that $\{t \le T < t + \Delta t\} \subseteq \{T \ge t\}$. Consequently we get

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t \cdot P(T \ge t)} = \frac{f(t)}{S(t)} = \frac{\frac{d}{dt}\big(1 - S(t)\big)}{S(t)} = \frac{-\frac{d}{dt}S(t)}{S(t)} = -\frac{d}{dt}\log S(t). \qquad (1.2)$$

From the last equality it follows that

$$S(t) = \exp(-\Lambda(t)). \qquad (1.3)$$

We can see that the hazard rate function, the survival function, the density function and the distribution function uniquely define the distribution of T .

1.3 Estimators

Until now we have looked into the quantities that define the distribution of T. In this chapter simple estimators of these quantities will be presented. Later these estimators will be used for the modeling of CCR.

The assumption that a client can become resolved for one reason only is still in force. The data contain n defaults. The observed times are denoted by t_i, i ∈ {1, 2, ..., n}. Each of these times can represent either the time between default and cure or liquidation, or a censoring time. The variable δ_i is called the censoring indicator and takes the value 1 if client i was not censored. Consequently, in the case where there is only one event of interest, the data are given as {(t_i, δ_i) | i ∈ {1, 2, ..., n}}. Once the observed times are obtained, we order them in ascending order, t_(1) < t_(2) < · · · < t_(k), k ≤ n. With D(t) we denote the set of individuals that experienced the event of interest at time t, i.e.

$$D(t) = \{\, j \mid T_j = t,\ j \in \{1, 2, \ldots, n\} \,\}.$$

With d_i we denote the size of D(t_(i)). The set of individuals at risk at time t is denoted by N(t). The individuals in N(t) are those that experienced the event of interest at time t or are still in the study at time t, i.e.

$$N(t) = \{\, j \mid T_j \ge t,\ j \in \{1, 2, \ldots, n\} \,\}.$$

With n_i we denote the size of N(t_(i)).

The most basic estimator of the survival function is the Kaplan-Meier estimator, which is also used in the current Rabobank model. The Kaplan-Meier estimator is given by the following formula,

$$\hat{S}(t) = \prod_{j:\, t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right). \qquad (1.4)$$

The estimator used for the estimation of the cumulative hazard rate in the current CCR model is the Nelson-Aalen estimator, given by

$$\hat{\Lambda}(t) = \sum_{j:\, t_{(j)} \le t} \frac{d_j}{n_j}. \qquad (1.5)$$

Using the Nelson-Aalen estimator we can also model the hazard rate as a point process which takes the value d_i/n_i at time t_(i). Let us look at the following example. We have observations on 10 clients. The censoring indicators and observed times can be seen in Figure 1.2; the time each client needs to be cured after default is observed. In order to compare traditional statistical techniques with survival techniques, the survival curve is estimated with both the empirical distribution function and the Kaplan-Meier estimator.

Figure 1.2: Example data

The survival curve estimated with the Kaplan-Meier estimator is shown in Figure 1.3. In the figure it can be seen that jumps happen only at times when an event of interest happens. Observed censored times are denoted with small vertical lines; they do not create jumps of their own, but they do influence the size of the jumps, since censored observations only affect the risk set size, which appears in the denominator in equation (1.4). The more censored observations occur before an observed event time, the smaller the risk set will be and the bigger the corresponding jump of the survival curve.
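To make the estimators (1.4) and (1.5) concrete, the following minimal Python sketch computes both from a list of observed times and censoring indicators. The thesis uses Python for its modeling; the data values below are made up for illustration and are not the observations of Figure 1.2.

```python
import numpy as np

# Illustrative (made-up) data: observed times in months and censoring indicators
# (1 = cure observed, 0 = censored). These are NOT the observations of Figure 1.2.
times = np.array([1, 2, 2, 3, 3, 4, 4, 5, 6, 6])
events = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

def km_and_na(times, events):
    """Kaplan-Meier survival (eq. 1.4) and Nelson-Aalen cumulative hazard (eq. 1.5)."""
    event_times = np.unique(times[events == 1])   # ordered event times t_(1) < ... < t_(k)
    surv, cumhaz = [], []
    s, H = 1.0, 0.0
    for t in event_times:
        d = np.sum((times == t) & (events == 1))  # d_j: number of events at t_(j)
        n = np.sum(times >= t)                    # n_j: size of the risk set N(t_(j))
        s *= 1.0 - d / n                          # Kaplan-Meier product term
        H += d / n                                # Nelson-Aalen sum term
        surv.append(s)
        cumhaz.append(H)
    return event_times, np.array(surv), np.array(cumhaz)

t, S_hat, Lambda_hat = km_and_na(times, events)
print(np.round(S_hat, 3), np.round(Lambda_hat, 3))
```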

Figure 1.3: Survival curve estimated with the Kaplan-Meier estimator

Alternatively, we can use the empirical distribution function, which is formulated as

$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{T_i \le t},$$

and obtain the survival curve from the identity (1.1). Doing so we obtain an estimator of the survival function equal to

$$\hat{S}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{T_i > t}.$$

It is important to add that if we use the empirical distribution function we can only use observations where the event of interest did happen. In our case we can only take the observations where cure happened, i = 4, 6, 7. The resulting estimate of the survival curve with the empirical distribution is shown in Figure 1.4.

From Figure 1.4 it can be observed that bigger jumps occur in comparison with the Kaplan-Meier estimator. An intuitive explanation for this phenomenon is that censored observations also provide important information to the estimator. For instance, if we want to estimate the survival curve at time t and we only have censored observations that happened after time t, then the probability of an event happening before time t is probably low.

At this point note that if we use the Kaplan-Meier estimator on non-censored observations only, and the empirical distribution function for the estimation of the survival curve, we obtain the same results. If we have times t_1 < t_2 < · · · < t_k and d_i events of interest happen at time t_i, then for t ∈ [t_i, t_{i+1}) it follows that

$$\hat{S}(t) = \left(1 - \frac{d_1}{n}\right)\left(1 - \frac{d_2}{n - d_1}\right)\cdots\left(1 - \frac{d_i}{n - d_1 - \cdots - d_{i-1}}\right) = \frac{(n - d_1)(n - d_1 - d_2)\cdots(n - d_1 - \cdots - d_i)}{n\,(n - d_1)(n - d_1 - d_2)\cdots(n - d_1 - \cdots - d_{i-1})} = \frac{n - d_1 - \cdots - d_i}{n} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{1}_{t_j > t}. \qquad (1.6)$$

Figure 1.4: Survival curve estimated with the empirical distribution
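As a quick numerical check of identity (1.6), the sketch below (with made-up, fully uncensored event times) confirms that the Kaplan-Meier estimate coincides with the empirical survival function when there is no censoring.

```python
import numpy as np

# Made-up event times with no censoring at all.
times = np.array([1, 2, 2, 3, 5, 5, 7])

def km_no_censoring(times, t):
    """Kaplan-Meier estimate at time t when every observation is an event."""
    s = 1.0
    for u in np.unique(times[times <= t]):
        d = np.sum(times == u)       # events at time u
        n = np.sum(times >= u)       # risk set at time u
        s *= 1.0 - d / n
    return s

def empirical_survival(times, t):
    """Empirical survival function: fraction of observations larger than t."""
    return np.mean(times > t)

for t in [0, 2, 4, 6, 8]:
    assert np.isclose(km_no_censoring(times, t), empirical_survival(times, t))
print("Kaplan-Meier and the empirical survival function agree without censoring.")
```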

1.4 Competing risk setting

Up to this point we assumed that clients can become resolved only by being cured, but in reality that is not the case. A client can become resolved for two reasons: cure and liquidation. The modeling approach in the CCR model is that the client becomes resolved due to the reason that happens first. In other words, there are two types of events which influence the survival function. If all liquidation events were modeled as censored, the survival function would be overestimated. Since one of the building blocks of CCR is the survival function, CCR estimates in which liquidation events are treated as censored would give wrong results. Modeling CCR in a setting where two outcomes are possible is known in survival analysis as the competing risk setting. In this chapter we look into the relevant quantities in the competing risk setting. For a review of the competing risk model see M.-J. Zhang, X. Zhang, and Scheike (2008).


In order to model CCR(t) we have to introduce more notation. The variable e_i represents the reason for which subject i failed. In our case e_i = 1 if subject i was cured, and e_i = 2 if subject i was liquidated. Consequently, the data are given as (T_i, δ_i, e_i), i ∈ {1, 2, ..., n}. Since a subject can fail for two reasons, cause specific hazard rates need to be used. By D_i(t) we denote the set of individuals that failed due to reason i at time t, i.e.

$$D_i(t) = \{\, j \mid T_j = t \ \wedge\ e_j = i,\ j \in \{1, 2, \ldots, n\} \,\}.$$

With $d^i_j$ we denote the size of the set $D_i(t_{(j)})$.

Definition 1.4.1. The cause specific hazard rate is a function that gives the rate of events of type e = i in an infinitesimally small step after t, conditioned on the event that the subject did not fail up to time t. The cause specific hazard rate function is denoted by λ_i(t), with

$$\lambda_i(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\ e = i \mid T \ge t)}{\Delta t}.$$

In order to calculate the survival function in a competing risk setting we need the cause specific cumulative hazard function, given by

$$\Lambda_i(t) = \int_0^t \lambda_i(s)\,ds.$$

The analogue of formula (1.3) in the competing risk setting is then

$$S(t) = \exp(-\Lambda_1(t) - \Lambda_2(t)). \qquad (1.7)$$

It is important to add that when a subject can fail due to two reasons, the Kaplan-Meier estimator of the survival function takes the following form:

$$\hat{S}(t) = \prod_{j:\, t_{(j)} \le t} \left(1 - \frac{d^1_j + d^2_j}{n_j}\right).$$

The Nelson-Aalen estimator of the cause specific cumulative hazard rate is given by

$$\hat{\Lambda}_i(t) = \sum_{j:\, t_{(j)} \le t} \frac{d^i_j}{n_j}. \qquad (1.8)$$

The value $d\hat{\Lambda}_i(t_j) = d^i_j / n_j$ gives the probability that a subject experiences event i at time t_j, conditioned on the event that he is still alive just before t_j. Once we obtain all the cause specific hazard rates we can calculate the probability of failing due to a specific reason.

The cumulative incidence function gives the probability of failing due to cause i before time t and is denoted by F_i,

$$F_i(t) = P(T \le t,\ e = i).$$

The cumulative incidence function can be expressed as

$$F_i(t) = \int_0^t \lambda_i(s)\,S(s)\,ds = \int_0^t \lambda_i(s) \exp\!\left(-\sum_{k=1}^{2} \Lambda_k(s)\right) ds. \qquad (1.9)$$

Using the Nelson-Aalen and Kaplan-Meier estimators we obtain a non-parametric estimator of the CIF, given by

$$\hat{F}_i(t) = \sum_{j:\, t_j \le t} \hat{S}(t_{j-1})\, d\hat{\Lambda}_i(t_j). \qquad (1.10)$$

An intuitive explanation of formula (1.10) is that for a subject to fail due to reason i at time t_j, it has to remain unresolved up to time t_{j-1} and then fail due to reason i at the next time instance.

We will continue with the example from Section 1.3, but this time we assume that some observations which were censored were actually liquidated. We take all the steps needed to calculate F̂_1(t) from equation (1.10), which will later be needed in order to estimate CCR.

Figure 1.5: Survival curve estimated with empirical distribution

In order to estimate the survival curve, the following steps are taken. No event of interest happens on the interval [0, 1), so for t ∈ [0, 1) it holds that

$$\hat{S}(t) = 1.$$

At time 1 one event of interest happens and the size of the risk set is 10. Since no events of interest happen until time 2, for t ∈ [1, 2) it holds that

$$\hat{S}(t) = 1 \cdot \left(1 - \tfrac{1}{10}\right) = \tfrac{9}{10}.$$

At time 2 two events of interest happen and the size of the risk set is 9, so for t ∈ [2, 3) it holds that

$$\hat{S}(t) = 1 \cdot \left(1 - \tfrac{1}{10}\right)\left(1 - \tfrac{2}{9}\right) = \tfrac{7}{10}.$$

In a similar fashion we obtain the following values for $\hat{S}(t)$:

$$\hat{S}(t) = \begin{cases} \tfrac{7}{10}\left(1 - \tfrac{2}{7}\right) = \tfrac{1}{2} & t \in [3, 4) \\ \tfrac{1}{2}\left(1 - \tfrac{2}{5}\right) = \tfrac{3}{10} & t \in [4, 5) \\ \tfrac{3}{10}\left(1 - \tfrac{0}{3}\right) = \tfrac{3}{10} & t \in [5, 6) \\ \tfrac{3}{10}\left(1 - \tfrac{0}{1}\right) = \tfrac{3}{10} & t \in [6, \infty). \end{cases}$$

Figure 1.6: Survival curve in competing risk setting

The Kaplan-Meier curve for the competing risk setting is shown in Figure 1.6. Compared with Figure 1.3 it can be seen that it has more jumps, which makes sense, since we have two events of interest. At the same time we can see that if we consider liquidated observations as censored, the survival curve overestimates the survival probability.


Consequently, the CCR will be modeled in a competing risk setting.

In order to calculate an estimator of F_1(t) we need to calculate $d\hat{\Lambda}_1(t_j)$. The value of $d\hat{\Lambda}_1(1)$ is simply the number of individuals that were cured at time 1 divided by the number of individuals at risk at time 1, i.e. $d\hat{\Lambda}_1(1) = \tfrac{0}{10}$. In a similar fashion we obtain the other values of $d\hat{\Lambda}_1(t_i)$:

$$d\hat{\Lambda}_1(t) = \begin{cases} \tfrac{1}{9} & t = 2 \\ \tfrac{2}{7} & t = 3 \\ \tfrac{0}{5} & t = 4 \\ \tfrac{0}{3} & t = 5 \\ \tfrac{0}{1} & t = 6 \\ 0 & \text{else.} \end{cases}$$

Figure 1.7: Cause specific hazard rate for cure estimated with $d\hat{\Lambda}_1$

Now we can finally calculate $\hat{F}_1(t)$. Since we do not have any cures in the interval [0, 1), it holds that $\hat{F}_1(t) = 0$ for t ∈ [0, 1). For t ∈ [1, 2) it holds that

$$\hat{F}_1(t) = \hat{S}(0)\,d\hat{\Lambda}_1(1) = 1 \cdot \tfrac{0}{10} = 0.$$

For t ∈ [2, 3) it holds that

$$\hat{F}_1(t) = \hat{S}(0)\,d\hat{\Lambda}_1(1) + \hat{S}(1)\,d\hat{\Lambda}_1(2) = \hat{F}_1(1) + \hat{S}(1)\,d\hat{\Lambda}_1(2) = 0 + \tfrac{9}{10}\cdot\tfrac{1}{9} = \tfrac{1}{10}. \qquad (1.11)$$

In a similar fashion we obtain the values of $\hat{F}_1(t)$ for the other t,

$$\hat{F}_1(t) = \begin{cases} \hat{F}_1(2) + \tfrac{4}{5}\cdot\tfrac{2}{7} = \tfrac{23}{70} & t \in [3, 4) \\ \hat{F}_1(3) + \tfrac{4}{7}\cdot\tfrac{0}{5} = \tfrac{23}{70} & t \in [4, 5) \\ \hat{F}_1(4) + \tfrac{12}{35}\cdot\tfrac{0}{3} = \tfrac{23}{70} & t \in [5, 6) \\ \hat{F}_1(5) + \tfrac{12}{35}\cdot\tfrac{0}{1} = \tfrac{23}{70} & t \in [6, \infty). \end{cases}$$

Figure 1.8: Estimated cumulative incidence function
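The same calculation can be written compactly in Python. The sketch below implements the estimator (1.10) for an arbitrary cause; the data values are hypothetical and chosen only to illustrate the mechanics, not to reproduce the example above.

```python
import numpy as np

# Hypothetical competing-risk data: event code 0 = censored, 1 = cure, 2 = liquidation.
# These values are illustrative only and do not reproduce the example above.
times  = np.array([1, 2, 2, 3, 3, 4, 4, 5, 6, 6])
events = np.array([2, 1, 2, 1, 1, 2, 2, 0, 0, 0])

def cumulative_incidence(times, events, cause):
    """Non-parametric CIF of equation (1.10), evaluated at the observed event times."""
    event_times = np.unique(times[events > 0])
    S_prev, F = 1.0, 0.0          # S_prev plays the role of S_hat(t_{j-1})
    values = []
    for t in event_times:
        n = np.sum(times >= t)                            # risk set size n_j
        d_cause = np.sum((times == t) & (events == cause))
        d_all = np.sum((times == t) & (events > 0))
        F += S_prev * d_cause / n                         # S_hat(t_{j-1}) * dLambda_i(t_j)
        S_prev *= 1.0 - d_all / n                         # Kaplan-Meier update, all causes
        values.append((t, F))
    return values

print(cumulative_incidence(times, events, cause=1))
```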


2 Current CCR Model

The CCR at time t gives the probability that a client will be cured after time t, conditioned on the event that he is still unresolved at time t. The probability of curing after time t can be expressed as F_1(∞) − F_1(t). It is important to understand that some cases will never be cured; consequently the probability of cure does not necessarily tend to 1 as t → ∞, i.e. F_1(∞) ≤ 1. Since CCR is conditioned on the event of being unresolved up to time t, it follows that

$$CCR(t) = \frac{F_1(\infty) - F_1(t)}{S(t)}. \qquad (2.1)$$

Using the estimator (1.10) we can estimate the probability of curing after time t ∈ [t_i, t_{i+1}) as

$$\hat{F}_1(\infty) - \hat{F}_1(t_i). \qquad (2.2)$$

We assume that the probability of cure after the 58th month is equal to zero. Consequently it holds that $\hat{F}_1(\infty) = \hat{F}_1(58)$ and $\widehat{CCR}(58) = 0$. Another assumption is that every case will be resolved as t → ∞, which corresponds to requiring that the survival function satisfies S(∞) = 0. Consequently, we consider every case with an observed time larger than 58 as liquidated. Since CCR is conditioned on the event of being unresolved up to time t ∈ [t_i, t_{i+1}), the estimator of CCR takes the following form:

$$\widehat{CCR}(t) = \frac{\hat{F}_1(58) - \hat{F}_1(t_i)}{\hat{S}(t_i)} = \frac{d^1_{i+1}}{n_{i+1}} + \left(1 - \frac{d^1_{i+1} + d^2_{i+1}}{n_{i+1}}\right)\widehat{CCR}(t_{i+1}). \qquad (2.3)$$

If we define $\hat{F}_1(0) = 0$, since the probability of being cured before time zero is equal to zero, then $\widehat{CCR}(0)$ is equal to $\hat{F}_1(58)$. This gives us the probability of being cured, i.e. the value of P_Cure in Figure 0.1.
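Given step-function estimates of F_1 and S on the monthly grid, the CCR curve of equation (2.1) with the 58-month convention can be computed as in the following sketch. The input arrays below are toy placeholders, not Rabobank estimates.

```python
import numpy as np

# Toy monthly-grid inputs: in practice F1_hat and S_hat come from the estimators
# of Chapter 1; the arrays below are placeholders, not Rabobank estimates.
t_grid = np.arange(0, 59)
F1_hat = np.minimum(0.6, 0.02 * t_grid)             # toy cumulative incidence of cure
S_hat  = np.maximum(0.05, np.exp(-0.08 * t_grid))   # toy survival curve

def conditional_cure_rate(t_grid, F1_hat, S_hat, horizon=58):
    """CCR(t) = (F1(horizon) - F1(t)) / S(t) as in equation (2.1), with the
    convention that no cures occur after the horizon, so CCR(horizon) = 0."""
    F1_inf = F1_hat[np.searchsorted(t_grid, horizon)]
    ccr = (F1_inf - F1_hat) / S_hat
    ccr[t_grid >= horizon] = 0.0
    return ccr

print(np.round(conditional_cure_rate(t_grid, F1_hat, S_hat)[:5], 3))
```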

2.1 Model implementation

In order to estimate the CCR for each time point, Rabobank uses data consisting of mortgage defaults from Lokaal Bank Bedrijf (LBB). Each default observation consists of the time the client spent in default and the status after the last month in default, which can be cure, liquidated or unresolved. In the data we can also find the following variables:

• LTV,
• High LTV indicator,
• NHG indicator,
• Bridge loan indicator.

The variable LTV represents the Loan To Value ratio, calculated as

$$LTV = \frac{\text{Mortgage amount}}{\text{Appraised value of the property}}.$$

If the LTV is higher than 100%, the part of the mortgage that has not been paid back by the client is larger than the value of the security. The indicator High LTV takes the value 1 if the LTV is high. NHG is the abbreviation of the National Mortgage Guarantee, in Dutch "Nationale Hypotheek Garantie". If a client with an NHG-backed mortgage cannot pay the mortgage due to specific circumstances, the NHG provides support to both the bank and the client. If the client sells the house below the outstanding mortgage amount, the NHG covers the difference, and consequently neither the client nor the bank suffers the loss. Since both sides get support in case of liquidation, we can expect the cause specific hazard rate for liquidation to be higher, the cause specific hazard rate for cure to be lower, and consequently the CCR to be lower.

Bridge loans are short-term loans with a term between 2 weeks and 3 years. A client usually uses them until he finds longer-term and larger financing. This kind of loan provides immediate cash flow to a client at a relatively high interest rate. Such loans are usually riskier for a bank and consequently have higher interest rates.

2.1.1 Stratification

Rabobank thus has data on different clients with different variables, but it uses a CCR estimator that is unable to incorporate those variables. Rabobank solves this problem with a method called stratification, or segmentation. This method separates the original data frame into smaller data frames based on the variables and then calculates the CCR on each of them. For instance, if segmentation is based on the variable "Bridge loan indicator", the original data frame is separated into two data frames: one containing only clients with bridge loans and the other clients without bridge loans. Once this step is made, the CCR is calculated for each segment and two CCR estimates are obtained, one for clients with and one for clients without a bridge loan. Rabobank separates the original data frame into four buckets, as can be seen in Figure 2.1. The segment with the most observations is Non-Bridge-Low LTV, which contains almost all observations of the original data. It is followed by the segment Non-Bridge-High LTV-non-NHG, which has about ten times fewer observations than the Low LTV segment. The smallest segments are the Bridge and Non-Bridge-High LTV-NHG segments, which have about 600 observations.


Figure 2.1: Segmentation of original data frame
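A stratified calculation of this kind is easy to express with pandas. The sketch below splits a data frame into the four buckets of Figure 2.1 and would then apply the CCR estimator to each group separately; the column names, values and exact bucket labels are assumptions made for this illustration, not the actual LBB data.

```python
import pandas as pd

# Hypothetical portfolio data frame; all column names and values are assumptions
# made for this sketch, not the actual LBB data.
df = pd.DataFrame({
    "months_in_default": [3, 12, 7, 25, 4, 40],
    "status":   ["cure", "liquidated", "unresolved", "cure", "unresolved", "liquidated"],
    "bridge":   [0, 0, 0, 1, 1, 0],
    "high_ltv": [0, 1, 0, 1, 0, 1],
    "nhg":      [0, 0, 1, 0, 0, 1],
})

def segment(row):
    """Assign one of the four buckets of Figure 2.1 (labels assumed)."""
    if row["bridge"]:
        return "Bridge"
    if not row["high_ltv"]:
        return "Non-Bridge-Low LTV"
    return "Non-Bridge-High LTV-NHG" if row["nhg"] else "Non-Bridge-High LTV-non-NHG"

df["segment"] = df.apply(segment, axis=1)
for name, group in df.groupby("segment"):
    # The CCR estimator of this chapter would be applied to each group here.
    print(name, len(group))
```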

2.2 Results

In this chapter we look into the estimation of the survival quantities that are needed to model CCR. In order to estimate the nonparametric cumulative incidence function we need the cause specific hazard rate for cure, which follows from equation (1.10). Hazard rates estimated with the $d\hat{\Lambda}(t)$ estimator are point processes, as can be seen in Figure 2.2. The two biggest segments behave regularly, while the segments with fewer observations behave irregularly, with jumps at the end of the period. Big jumps from zero to above 0.02 occur because only a small number of individuals is at risk. For instance, at time 53 in the segment High LTV-NHG the cause specific hazard rate takes the value 0.029: at that time only one cure happens, while only 34 individuals are at risk.

The next step is the estimation of the survival function with the Kaplan-Meier estimator. Results can be found in Figure 2.3.

From the figure it is visible that the segment which becomes resolved at the lowest rate is the segment representing clients with high LTV and without NHG, while clients with bridge loans become resolved at the highest rate.

Once we obtain the survival functions and the cause specific hazard rates for cure, we can model the cumulative incidence functions for cure, which can be found in Figure 2.4. Results of the nonparametric CCR estimator can be found in Figure 2.5. The curves are most irregular, with the most jumps, for the segments with the fewest observations. The explanation for this phenomenon can be found in the recursive part of equation (2.3): the big jumps happen because the cause specific hazard rates for cure are irregular.


Figure 2.2: Cause specific hazard rate for cure estimated with $d\hat{\Lambda}_1(t)$

Figure 2.3: Survival function estimated with Kaplan-Meier estimator

Clients without NHG have larger CCR estimates than clients with NHG, since the bank and the client are then more motivated to cure the default. From the figure it is not completely clear whether clients with bridge loans or clients with high LTV and without NHG have higher CCR estimates.

2.3 Performance of the method

In this chapter we look into the variance, bias and confidence intervals of the nonparametric CCR estimator. Since the derivation of the asymptotic variance of the CCR estimator is out of the scope of this thesis, these quantities will be estimated with the method known as the bootstrap.


Figure 2.4: Cumulative incidence functions for cure

Figure 2.5: CCR estimated with the nonparametric estimators

2.3.1 The bootstrap

The bootstrap was introduced by Efron in 1979. The method is used to estimate the bias, variance and confidence intervals of estimators by resampling. In this chapter we look at how this method is used for the estimation of bias, variance and confidence intervals. Later the method is applied to the mortgage data in order to estimate the previously mentioned quantities for each segment. For a review of bootstrap estimators and techniques see Fox (2016). In order to bootstrap, data frames need to be resampled from the original data frame. These data frames are created by selecting random default observations from the original data frame. Let us assume that n random data frames are simulated. Simulated data frames need to be of the same size as the original data frame, and a simulated data frame is allowed to contain the same observation more than once. Once the simulated data frames are obtained, segmentation is done as in Figure 2.1. Finally, estimates CCR^i(t), i ∈ {1, 2, ..., n}, are calculated for each segment and each time point, as can be seen in Figure 2.6.

Figure 2.6: Estimation of quantities with bootstrap.

Once the estimates CCR^i(t) for each data frame are obtained, the bootstrap mean $\overline{CCR}(t)$, t ∈ {0, 1, ..., 58}, for each segment can be calculated as

$$\overline{CCR}(t) = \sum_{i=1}^{n} \frac{CCR^i(t)}{n}.$$

The bootstrap estimate of the variance of $\widehat{CCR}(t)$ is equal to

$$\frac{1}{n-1} \sum_{i=1}^{n} \left(CCR^i(t) - \overline{CCR}(t)\right)^2.$$

An estimator $\hat{\theta}$ is called unbiased if $E(\hat{\theta}) = \theta$; otherwise the bias of the estimator is defined as

$$B_\theta = E(\hat{\theta}) - \theta.$$

Bias of an estimator is undesirable. With the bootstrap the bias can be estimated by

$$\overline{CCR}(t) - \widehat{CCR}(t),$$

where $\widehat{CCR}(t)$ is the estimate of CCR from the original data frame.

For the estimation of confidence intervals a method called the bootstrap percentile confidence interval is used. In order to obtain a 100(1 − α)% interval for a fixed time t we take $\widehat{CCR}(t)_{\alpha/2, L}$, which denotes the value below which a fraction α/2 of the CCR^i(t) lies. In a similar fashion $\widehat{CCR}(t)_{\alpha/2, R}$ denotes the value above which a fraction α/2 of the CCR^i(t) lies. For the confidence interval the following values are taken:

$$\left[\, \widehat{CCR}(t)_{\alpha/2, L},\ \widehat{CCR}(t)_{\alpha/2, R} \,\right].$$
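The whole bootstrap procedure — resampling clients, recomputing the CCR curve, and extracting the mean, variance, bias and percentile bands — can be sketched in a few lines of Python. The function below assumes a generic `estimator` callable that maps a resampled data array to a CCR curve; it is an illustration of the procedure, not Rabobank's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ccr(data, estimator, n_boot=500, alpha=0.05):
    """Percentile bootstrap for a CCR curve.

    data      : array of default observations (one row per client),
    estimator : placeholder callable mapping a data set to a CCR curve (1-d array).
    Returns the bootstrap mean, variance, bias estimate and percentile bands.
    """
    point = estimator(data)                          # CCR from the original data frame
    curves = np.empty((n_boot, point.shape[0]))
    for b in range(n_boot):
        idx = rng.integers(0, len(data), size=len(data))  # resample clients with replacement
        curves[b] = estimator(data[idx])
    mean = curves.mean(axis=0)
    var = curves.var(axis=0, ddof=1)                 # bootstrap variance estimate
    bias = mean - point                              # bootstrap mean minus original estimate
    lower = np.percentile(curves, 100 * alpha / 2, axis=0)
    upper = np.percentile(curves, 100 * (1 - alpha / 2), axis=0)
    return mean, var, bias, (lower, upper)
```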

Rabobank decides whether it makes sense to make a segmentation or not based on the size of the confidence intervals. If two curves lie in each other's 95% confidence intervals, then the CCR curves are pragmatically treated as the same. If we look at the CCR curves for each segment, as can be seen in Figure 2.7, we see that the segments LBB-Low LTV and LBB-High LTV-non-NHG are treated as significantly different, while LBB-Low LTV and Bridge are not. Furthermore, segments with a bigger population have narrow confidence intervals, while segments with a small population size have wide confidence intervals. It follows that nonparametric CCR estimators are not a good choice when the CCR has to be estimated on a population of small size.

Figure 2.7: Estimation of confidence intervals with the bootstrap.

From Figure 2.8 it can be seen that the biggest variance is obtained by the segments with the smallest population and that the variance grows with time. This happens because of the variability of the term $d^1_{i+1}/n_{i+1}$ in the recursive part of equation (2.3), as can be seen from Figure 2.2. Since there are no cures at the end of the observation period in the Bridge and LBB-High LTV-NHG segments, CCR^i(t) always takes the value zero there. The same holds for $\widehat{CCR}(t)$. Consequently, the variance estimated with the bootstrap is equal to zero for the CCR estimates at the end of the observation period.


Figure 2.8: Estimation of variance with bootstrap.

If we compare the estimated bias with the size of the CCR estimates from Figure 2.5, it can be concluded that the bias is more than 100 times smaller than the size of $\widehat{CCR}$, so the estimator is not problematically biased.


3 Cox proportional hazards model

Until now we have been looking only at nonparametric estimators of the survival and hazard rate functions, which are unable to incorporate additional variables. Rabobank uses a method called stratification in order to estimate the CCR curves of individuals with different covariates, and this method has certain shortcomings. Since a large part of the data consists of unresolved (censored) cases, we look for a suitable regression method from survival analysis.

At the beginning of this chapter we review the theory behind the so-called Cox model. In Section 3.1 we see how to estimate the coefficients of the explanatory variables and the baseline function. In Section 3.2 the formula for the cumulative incidence function is derived. In Section 3.3 it is explained how to estimate the coefficients in order to get results comparable with segmentation. In Section 3.4 we look into the quantities estimated with the Cox model that are needed to get CCR estimates. In Section 3.5 the performance of the method is analyzed with the bootstrap. In the last section of this chapter the method is compared with the non-parametric estimator.

In order to model CCR(t) we need a model which is able to model the cause specific hazard rate. Since we expect different behavior from clients with different covariates, we look into one of the most popular regression models in survival analysis. The Cox proportional hazards model was presented in 1972 by Sir David Cox. For a review of the Cox model see Kalbfleisch and Prentice (2002), Lawless (2002) and Weng (2007). A hazard rate modeled with the Cox model takes the following form,

$$\lambda(t \mid X) = \lambda_0(t) \exp(\beta^T X), \qquad (3.1)$$

where β = (β_1, β_2, ..., β_p) is a vector of coefficients which represents the influence of the covariates X = (X_1, ..., X_p) on the hazard rate function λ(t | X). We denote the covariates of individual i by X_i. The baseline hazard function is denoted by λ_0(t) and it can take any form, Weng (2007). The baseline hazard function can be interpreted as the hazard rate of an individual whose covariate values are all equal to zero. In a similar way the baseline survival function can be defined as

$$S_0(t) = \exp\!\left(-\int_0^t \lambda_0(s)\,ds\right), \qquad (3.2)$$

according to equation (1.3). The survival function of individual j then takes the following form,

$$S(t \mid X_j) = [S_0(t)]^{\exp(\beta^T X_j)}, \qquad (3.3)$$

Corrente, Chalita, and Moreira (2003). Since the baseline function can take any form, the Cox model is a semi-parametric estimator of the hazard rate. In a similar way the cause specific hazard rate for cause i can be modeled as

$$\lambda_i(t \mid X) = \lambda_{0i}(t) \exp(\beta_i^T X), \quad i = 1, 2.$$

Here β_i represents the effect of the covariate vector X on the cause specific hazard rate, and λ_{0i}(t) represents the cause specific baseline function.

In the following subsections we will look at how to estimate parameters in the Cox model.

3.1 Parameter estimation

Since the proportional hazards model is a semi-parametric model, the β coefficients and the baseline hazard function need to be estimated. In Section 3.1.1 we look into the partial likelihood and how to estimate the coefficients. In Section 3.1.2 we introduce the Breslow estimator of the baseline function. For a review of baseline and β estimation see Weng (2007).

3.1.1 Estimation of β

Firstly, it will be assumed that the data consist of n individuals and that each individual has a different observation time t_i, i.e. there are no ties in the data. These observation times are ordered in ascending order, so t_1 < t_2 < · · · < t_n. In 1972 Cox proposed to estimate β using the partial likelihood, Weng (2007). The partial likelihood of individual i, PL_i, is simply the hazard rate of individual i divided by the sum of the hazard rates of all individuals that are at risk at time t_i:

$$PL_i = \frac{\lambda(t_i \mid X_i)}{\sum_{j \in N(t_i)} \lambda(t_i \mid X_j)} = \frac{\exp(\beta^T X_i)}{\sum_{j \in N(t_i)} \exp(\beta^T X_j)}. \qquad (3.4)$$

The partial likelihood of individuals that were censored is equal to 1. It follows that the partial likelihood function of the data is equal to

$$PL(\beta) = \prod_{i=1}^{n} \left( \frac{\exp(\beta^T X_i)}{\sum_{j \in N(t_i)} \exp(\beta^T X_j)} \right)^{\delta_i}. \qquad (3.7)$$

Instead of maximizing PL we maximize its logarithm, log(PL) = pl,

$$pl(\beta) = \sum_{i:\, \delta_i = 1} \left[ \beta^T X_i - \ln\!\left( \sum_{j \in N(t_i)} \exp(\beta^T X_j) \right) \right].$$


3.1.2 Baseline hazard estimation

Breslow proposed the following estimator for the baseline hazard function. We still assume that there are no ties in the data, i.e. t_1 < t_2 < · · · < t_n, and the estimator is constant on subintervals where no event happened, Weng (2007). Rabobank's data show whether an event of interest happened at the end of each month. That means that if an observation has observed time t_i, the customer was cured, liquidated or censored in the interval (t_i − 1, t_i]. In the case of an event of interest, the Breslow baseline function is constant on the interval (t_i − 1, t_i], where it takes the value λ_{0i}. As in equation (1.2), and by using the equality f(t) = λ(t)S(t), we get that

$$L(\beta, \lambda_0(t)) \propto \prod_{i=1}^{n} \lambda(t_i \mid X_i)^{\delta_i}\, S(t_i \mid X_i).$$

Using equation (1.3) we get that

$$L(\beta, \lambda_0(t)) = \prod_{i=1}^{k} \left(\lambda_0(t_i) \exp(\beta X_i)\right)^{\delta_i} \exp\!\left(-\int_0^{t_i} \lambda_0(s) \exp(\beta X_i)\,ds\right).$$

If we take t_i + 1 = t_{i+1}, so equidistant t_i, we get that

$$\int_0^{t_i} \lambda_0(s)\,ds = \sum_{j=1}^{i} \lambda_{0j}.$$

Taking the logarithm of L gives us

$$l(\beta, \lambda_0(t)) = \sum_{i=1}^{k} \delta_i\left(\ln(\lambda_{0i}) + \beta X_i\right) - \sum_{i=1}^{k} \lambda_{0i} \sum_{j \in N(t_i)} \exp(\beta X_j). \qquad (3.8)$$

Once we obtain $\hat{\beta}$ from the partial likelihood, we insert it into l. From the second term in (3.8) we see that a λ_{0i} corresponding to a time with δ_i = 0 only decreases l. Consequently, we take λ_{0i} = 0 for every t_i at which no event of interest occurred. Going through the steps above we get

$$l(\lambda_0(t)) = \sum_{i:\, \delta_i = 1} \left(\ln(\lambda_{0i}) + \hat{\beta} X_i\right) - \sum_{i:\, \delta_i = 1} \lambda_{0i} \sum_{j \in N(t_i)} \exp(\hat{\beta} X_j).$$

Differentiation with respect to λ_{0i} gives that l(λ_0(t)) for t ∈ (t_{i−1}, t_i] is maximized by

$$\lambda_0(t_i) = \lambda_{0 t_i} = \frac{1}{\sum_{j \in N(t_i)} \exp(\hat{\beta} X_j)},$$

Weng (2007). In continuous time it is impossible for two individuals to have the same observed time, but in reality it will most likely happen that some individuals share an observed time, since we usually check the state of individuals at monthly intervals. Consequently, many individuals will have the same observed times. It follows that a different estimator for the baseline function and a different partial likelihood function have to be used. In 1974 Breslow proposed the following partial likelihood,

$$L(\beta) = \prod_{i=1}^{k} \frac{\exp(\beta X_i^+)}{\left(\sum_{j \in N(t_i)} \exp(\beta X_j)\right)^{d_i}},$$

where $X_i^+ = \sum_{j \in D(t_{(i)})} X_j$ is the sum of the covariates of the individuals experiencing the event at $t_{(i)}$. Using the same methodology as in the case without ties, we get that the Breslow baseline estimator with ties in the data is equal to

$$\lambda_0(t_i) = \frac{d_i}{\sum_{j \in N(t_i)} \exp(\hat{\beta} X_j)}. \qquad (3.9)$$
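Equation (3.9) translates directly into code. The following sketch computes the Breslow baseline hazard increments at the observed event times, given a covariate matrix and an already fitted coefficient vector; all inputs are placeholders.

```python
import numpy as np

def breslow_baseline(times, events, X, beta):
    """Breslow baseline hazard increments of equation (3.9).

    times  : observed times, events : 1 if the event of interest occurred (0 otherwise),
    X      : covariate matrix of shape (n, p), beta : fitted Cox coefficients of shape (p,).
    All inputs are placeholders; returns (event_times, baseline_hazard_values).
    """
    risk_scores = np.exp(X @ beta)
    event_times = np.unique(times[events == 1])
    baseline = []
    for t in event_times:
        d = np.sum((times == t) & (events == 1))   # d_i: number of (tied) events at t_i
        denom = risk_scores[times >= t].sum()      # sum of exp(beta X_j) over the risk set
        baseline.append(d / denom)
    return event_times, np.array(baseline)
```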

3.2 Estimators

From Chapter 2 we know that we have to be able to model the cause specific hazard rates, the survival function and the cumulative incidence function in order to estimate CCR. For the estimation of the survival function the identity (1.7) will be used. In order to model the cumulative incidence function we have to integrate equation (1.9), which is possible by using the fact that the baseline hazard rate is a step function. For t ∈ (t_{i−1}, t_i] we get

$$F_1(t) = \int_0^t \lambda_1(s)S(s)\,ds = \int_0^t \lambda_1(s)\exp\!\left(-\int_0^s \big(\lambda_1(u)+\lambda_2(u)\big)\,du\right)ds = F_1(t_{i-1}) + \int_{t_{i-1}}^t \lambda_1(s)\exp\!\left(-\int_0^s \big(\lambda_1(u)+\lambda_2(u)\big)\,du\right)ds. \qquad (3.10)$$

First let us look into the integral $\int_0^s (\lambda_1(u)+\lambda_2(u))\,du$. We know that $\lambda_i(u)$ is a step function which takes the value $\lambda^i_{t_i}$ on the interval $(t_{i-1}, t_i]$. For $s \in (t_{i-1}, t_i]$ it follows that

$$\int_0^s \big(\lambda_1(u)+\lambda_2(u)\big)\,du = \sum_{j=1}^{i-1}\big(\lambda^1_{t_j} + \lambda^2_{t_j}\big) + (s - t_{i-1})\big(\lambda^1_{t_i} + \lambda^2_{t_i}\big) = \Lambda_1(t_{i-1}) + \Lambda_2(t_{i-1}) + (s - t_{i-1})\big(\lambda^1_{t_i} + \lambda^2_{t_i}\big). \qquad (3.11)$$

From equations (3.10) and (3.11) it follows that

$$\begin{aligned} F_1(t) &= F_1(t_{i-1}) + \lambda^1_{t_i} \int_{t_{i-1}}^t \exp\!\Big(-\big(\Lambda_1(t_{i-1}) + \Lambda_2(t_{i-1}) + (s - t_{i-1})(\lambda^1_{t_i} + \lambda^2_{t_i})\big)\Big)\,ds \\ &= F_1(t_{i-1}) + \frac{\lambda^1_{t_i}}{\exp\!\big(\Lambda_1(t_{i-1}) + \Lambda_2(t_{i-1})\big)} \int_{t_{i-1}}^t \exp\!\big((t_{i-1} - s)(\lambda^1_{t_i} + \lambda^2_{t_i})\big)\,ds \\ &= F_1(t_{i-1}) + \frac{\lambda^1_{t_i}}{\exp\!\big(\Lambda_1(t_{i-1}) + \Lambda_2(t_{i-1})\big)\big(\lambda^1_{t_i} + \lambda^2_{t_i}\big)} \Big(-\exp\!\big((t_{i-1} - s)(\lambda^1_{t_i} + \lambda^2_{t_i})\big)\Big)\Big|_{t_{i-1}}^{t} \\ &= F_1(t_{i-1}) + \frac{\lambda^1_{t_i}\Big(1 - \exp\!\big((t_{i-1} - t)(\lambda^1_{t_i} + \lambda^2_{t_i})\big)\Big)}{\exp\!\big(\Lambda_1(t_{i-1}) + \Lambda_2(t_{i-1})\big)\big(\lambda^1_{t_i} + \lambda^2_{t_i}\big)}. \qquad (3.12) \end{aligned}$$

Let us continue the example from Figure 1.5 and assume that every individual also has a variable LTV. We define a new variable "HighLTV", which takes the value 1 if the LTV is high, as shown in Figure 3.1. In this example the variable HighLTV will be included in the regression.

Figure 3.1: Example data

In order to estimate the coefficient β_1, which explains the influence of the variable HighLTV on the cause specific hazard rate for cure, we model cured individuals as individuals who experienced the event of interest; all other individuals are considered censored. In the same way the coefficient for the cause specific hazard rate for liquidation, β_2, is estimated. For the estimation of the parameters β_1 and β_2 we used the Python Statsmodels package. This gave the following result for cure,

$$\beta_1 = 0.205,$$

and for liquidation,

$$\beta_2 = -0.110.$$
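A cause-specific fit of this kind can be set up with the Statsmodels package mentioned above by treating the competing event as censoring. The sketch below is a minimal illustration with made-up data (the numbers are not those of Figure 3.1), it is not the thesis's actual estimation script, and the exact PHReg options may differ between statsmodels versions.

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: time in default, event type (0 censored, 1 cure, 2 liquidation)
# and a HighLTV indicator. The values do not reproduce Figure 3.1.
time    = np.array([1, 2, 2, 3, 3, 4, 4, 5, 6, 6], dtype=float)
event   = np.array([2, 1, 2, 1, 1, 2, 2, 0, 0, 0])
highltv = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0], dtype=float)[:, None]

# Cause-specific fit for cure: cures are events, everything else is treated as censored.
res_cure = sm.PHReg(time, highltv, status=(event == 1).astype(float), ties="breslow").fit()
print(res_cure.params)   # estimate of beta_1

# Cause-specific fit for liquidation: liquidations are events.
res_liq = sm.PHReg(time, highltv, status=(event == 2).astype(float), ties="breslow").fit()
print(res_liq.params)    # estimate of beta_2
```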

Since no cures occurred in the interval (0, 1], the Breslow estimator of the cause specific baseline function for cure gives the following value for t ∈ (0, 1]:

$$\lambda_{01}(t) = \frac{0}{\sum_{j \in \{1,2,3,4,5,6,7,8,9,10\}} \exp(\beta_1 X_j)} = 0.$$

Since one cure happened in the interval (1, 2], the cause specific baseline function for t ∈ (1, 2] is equal to

$$\lambda_{01}(t) = \frac{1}{\sum_{j \in \{1,2,4,5,6,7,8,9,10\}} \exp(\beta_1 X_j)} = 0.103.$$

In a similar fashion we get

$$\lambda_{01}(t) = \begin{cases} \tfrac{2}{7.455} = 0.268 & t \in (2, 3] \\ 0 & t \in (3, \infty). \end{cases}$$

For the baseline function for liquidation we get the following values:

$$\lambda_{02}(t) = \begin{cases} 0.103 & t \in (0, 1] \\ 0.115 & t \in (1, 2] \\ 0 & t \in (2, 3] \\ 0.408 & t \in (3, 4] \\ 0 & t \in (4, \infty). \end{cases}$$

Once the cause specific baseline hazard rates for cure are calculated, we can multiply them by exp(β_1 · HighLTV), HighLTV ∈ {0, 1}, in order to get the cause specific hazard rates for cure of individuals with LTV higher and lower than 1.2. For individuals with high LTV we get the following values:

$$\lambda_1(t \mid 1) = \begin{cases} 0 & t \in (0, 1] \\ 0.127 & t \in (1, 2] \\ 0.329 & t \in (2, 3] \\ 0 & t \in (3, 4] \\ 0 & t \in (4, \infty). \end{cases}$$

Since exp(β_1 · 0) = 1, the following identity holds:

λ1(t|0) = λ01(t).

By taking the same steps as above the cause specific hazard rate for loss can be calculated and the following results are obtained:

Figure 3.2: Cause specific hazard rate for cure

$$\lambda_2(t \mid 1) = \begin{cases} 0.093 & t \in (0, 1] \\ 0.103 & t \in (1, 2] \\ 0 & t \in (2, 3] \\ 0.366 & t \in (3, 4] \\ 0 & t \in (4, \infty), \end{cases}$$

and $\lambda_2(t \mid 0) = \lambda_{02}(t)$.

The hazard rate for liquidation for HighLTV = 1 is almost equal to the baseline, because of the small size of β_2. In order to calculate the survival function, equation (1.7) is used. Since the time steps are of length 1 and the cause specific hazard rate is a step function, the following identity holds for t ∈ (t_{k−1}, t_k]:

$$\Lambda_i(t \mid X) = \Lambda_i(t_{k-1} \mid X) + (t - t_{k-1})\,\lambda_i(t_k \mid X).$$

The identity above gives us the following functions of t for the cause specific cumulative hazard function for cure:

$$\Lambda_1(t \mid 0) = \begin{cases} 0 & t \in (0, 1] \\ (t - 1) \cdot 0.103 & t \in (1, 2] \\ 0.103 + (t - 2) \cdot 0.268 & t \in (2, 3] \\ 0.371 & t \in (3, \infty) \end{cases}$$


Figure 3.3: Cause specific cumulative hazard function for cure

Figure 3.4: Cause specific cumulative hazard rate function for liquidation

and for HighLTV = 1,

$$\Lambda_1(t \mid 1) = \begin{cases} 0 & t \in (0, 1] \\ (t - 1) \cdot 0.127 & t \in (1, 2] \\ 0.127 + (t - 2) \cdot 0.329 & t \in (2, 3] \\ 0.456 & t \in (3, \infty). \end{cases}$$

The cause specific cumulative hazard functions for loss are

$$\Lambda_2(t \mid 0) = \begin{cases} t \cdot 0.103 & t \in (0, 1] \\ 0.103 + (t - 1) \cdot 0.115 & t \in (1, 2] \\ 0.218 & t \in (2, 3] \\ 0.218 + (t - 3) \cdot 0.408 & t \in (3, 4] \\ 0.626 & t \in (4, \infty) \end{cases}$$

and for HighLTV = 1,

$$\Lambda_2(t \mid 1) = \begin{cases} t \cdot 0.093 & t \in (0, 1] \\ 0.093 + (t - 1) \cdot 0.103 & t \in (1, 2] \\ 0.196 & t \in (2, 3] \\ 0.196 + (t - 3) \cdot 0.366 & t \in (3, 4] \\ 0.562 & t \in (4, \infty). \end{cases}$$

The cause specific cumulative hazard functions for cure and loss can be seen in Figures 3.3 and 3.4.

After the cause specific cumulative hazard functions for cure and liquidation are determined, the identity (1.7) can be used for the estimation of the survival function. We get

$$S(t \mid 0) = \begin{cases} \exp(-t \cdot 0.103) & t \in (0, 1] \\ \exp(-(0.103 + (t - 1) \cdot (0.115 + 0.103))) & t \in (1, 2] \\ \exp(-(0.321 + (t - 2) \cdot 0.268)) & t \in (2, 3] \\ \exp(-(0.589 + (t - 3) \cdot 0.408)) & t \in (3, 4] \\ \exp(-0.997) & t \in (4, \infty) \end{cases}$$

and for HighLTV = 1 we get

$$S(t \mid 1) = \begin{cases} \exp(-t \cdot 0.093) & t \in (0, 1] \\ \exp(-(0.093 + (t - 1) \cdot (0.127 + 0.103))) & t \in (1, 2] \\ \exp(-(0.323 + (t - 2) \cdot 0.329)) & t \in (2, 3] \\ \exp(-(0.652 + (t - 3) \cdot 0.366)) & t \in (3, 4] \\ \exp(-1.018) & t \in (4, \infty). \end{cases}$$

The survival curve estimates can be seen in Figure 3.5.

After the survival curves and hazard rates are calculated, we can estimate the cumulative incidence functions using equation (3.12). The calculated cumulative incidence functions can be seen in Figure 3.6.


Figure 3.5: Survival function calculated with the Cox model

Figure 3.6: Cumulative incidence function calculated with the Cox proportional model
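Because the fitted cause specific hazards are piecewise constant on monthly intervals, the integral in (3.12) has the closed form derived above and can be evaluated interval by interval, as in the sketch below; the hazard values passed in are placeholders, not the estimates of this section.

```python
import numpy as np

def cox_cumulative_incidence(haz_cure, haz_liq):
    """Cumulative incidence of cure from piecewise-constant cause specific hazards
    on unit-length (monthly) intervals, i.e. a direct evaluation of equation (3.12)."""
    F1, Lam1, Lam2 = 0.0, 0.0, 0.0
    values = []
    for l1, l2 in zip(haz_cure, haz_liq):
        total = l1 + l2
        if total > 0:
            # increment: l1 * (1 - exp(-total)) / (exp(Lam1 + Lam2) * total)
            F1 += l1 * (1.0 - np.exp(-total)) * np.exp(-(Lam1 + Lam2)) / total
        Lam1 += l1        # update the cause specific cumulative hazards
        Lam2 += l2
        values.append(F1)
    return np.array(values)

# Placeholder monthly hazards (not the estimates of this section).
print(np.round(cox_cumulative_incidence([0.0, 0.10, 0.27, 0.0], [0.10, 0.12, 0.0, 0.41]), 3))
```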

3.3 Model implementation

For the estimation of CCR with the Cox model the same data will be used as in the previous chapter. Rabobank is still interested in the behavior of clients with Bridge loans, clients with non-Bridge loans and low LTV, clients with non-Bridge loans with high LTV and without NHG, and clients with non-Bridge mortgages with high LTV and with NHG.

Firstly, it needs to be understood that the coefficient estimate of one risk factor depends on all variables that are included in the regression. For instance, if we compare the parameter $\hat{\beta}^{\,hLTV}_{HighLTV}$, obtained when the Cox regression is performed with the variable HighLTV only, with $\hat{\beta}^{\,hLTV,Br}_{HighLTV}$, obtained when both HighLTV and the variable Bridge are included in the regression, it can be seen that $\hat{\beta}^{\,hLTV}_{HighLTV} \neq \hat{\beta}^{\,hLTV,Br}_{HighLTV}$.

Secondly, if one is interested in the behavior of clients with bridge loans, then a regression with only the variable Bridge needs to be run. If we included the variable HighLTV as well, we would not be able to calculate a single CCR curve for Bridge loans; we would only be able to calculate CCR curves for clients with Bridge loans and high LTV, or for clients with Bridge loans and low LTV, since X_HighLTV can only take the values 0 and 1.

It follows that the Cox regression needs to be done three times with three different variable combinations, as can be seen in Figure 3.7.

Figure 3.7: Input variables and output coefficients

Once the coefficients are obtained, a matching covariate vector needs to be used for the hazard rate calculation of the desired segment, as can be observed from Figure 3.8.


3.4 Results

In this chapter we look into all quantities needed to obtain CCR estimates for each segment. The first step is the coefficient estimation for all combinations of variables. The coefficients of the cause specific hazard rates for cure estimated with the Cox regression can be found in the table in Figure 3.9, and those for loss in Figure 3.10.

Figure 3.9: Output coefficients for cure

Figure 3.10: Output coefficients for liquidation

From the figures it can be seen that the indicator variable for bridge loans and the indicator variable for high LTV have a negative effect on the cause specific hazard rate for cure. In other words, individuals who have an LTV of more than 1.2 or a Bridge loan have a smaller hazard rate and consequently a smaller probability of being cured. Individuals with NHG will be cured slightly faster than individuals without it. On the other hand, it can be seen that individuals who have high LTV, a non-Bridge loan and NHG will be liquidated faster than individuals without NHG. Once the coefficient estimates are obtained we are able to model the cause specific hazard rates for cure and liquidation, which can be seen in Figures 3.11 and 3.12.

It is seen that the segment which is cured at the fastest rate is the segment of clients with low LTV, while the other three segments have similar hazard rates, which become almost the same after the 30th month in default. The segments behave differently when the cause specific hazard rate for liquidation is modeled. We can conclude that liquidation happens at the lowest rate for clients with low LTV. This probably happens because clients with low LTV want to keep their properties. The segment with the highest rate of liquidation is the segment with bridge loans.


Figure 3.11: Cause specific hazard rate for cure

Figure 3.12: Cause specific hazard rate for liquidation

For the estimation of the survival functions the identity (1.7) is used. Since the sizes of the cause specific hazard rates and the size of the survival function are negatively related, the segment with the largest survival function is that of clients who have high LTV and do not have NHG. If we look at Figure 3.12 it is visible that this segment has the second smallest cause specific hazard rate for liquidation, while Figure 3.11 shows that its cause specific hazard rate for cure is almost the same as the smallest hazard rates. With the same reasoning the behaviour of the other groups can also be explained.

An intuitive explanation for the shapes of the survival curves is that clients with high cause specific hazard rates will be resolved faster and consequently the probability of being unresolved at time t becomes smaller.

Before CCR is modeled we need to look into the estimation of the cumulative incidence function, for which equation (3.12) is needed. From Figure 3.14 it can be concluded that the highest probabilities of being cured are achieved by individuals with low LTV. Since the other segments have similar estimates for the cause specific hazard rates for cure, their cumulative incidence functions are ordered in the same way as the estimates of the cause specific hazard rates for liquidation.

Figure 3.13: Survival functions calculated with the Cox model

Figure 3.14: Cumulative incidence functions for cure calculated with the Cox model

Finally all estimates from above can be used to estimate the CCRs, which can be seen in Figure 3.15.

3.5 Performance of the method

In Section 2.3 it was seen that the nonparametric CCR estimator has a large variance and wide confidence intervals at the end of the observation period. That happens because of the jumps of the hazard rates estimated with the Nelson-Aalen estimator. In this chapter the methodology described in Section 2.3.1 is used again in order to estimate the confidence intervals, variance and bias of CCR estimated with the Cox regression.

Figure 3.15: CCR estimated with the Cox model

From Figure 3.16 it can be seen that the Cox regression estimators have narrower confidence intervals than the nonparametric estimator, but on the other hand, the estimates for each segment are the same at the end of the observation period and consequently not significantly different.

Figure 3.16: Confidence intervals estimated with the bootstrap

In Figure 3.17 it is visible that the variance is almost 10 times smaller than the variance of the nonparametric estimator and that the small number of observations at the end of the observation period has almost no effect on the variance. Figure 3.18 shows that the bias is smaller as well and that time also has almost no influence on the bias.

3.6 Discussion

From Section 3.5 it can be concluded that the Cox model gives us lower variance, less bias and narrower confidence intervals than the nonparametric estimators. All these properties are definitely desirable features of an estimator.


Figure 3.17: Variance of CCR estimated with the Cox regression

Figure 3.18: Estimation of bias with the bootstrap

At the same time there exist systematic techniques for variable selection in the Cox model. Deciding whether a variable should be included in or excluded from a segmentation is much easier with these techniques than with the method based on 95% confidence intervals, which is used in the case of the nonparametric estimators. The β's which are output of the regression also have some explanatory power. When the coefficient of a boolean variable is negative, it is clear from equation (3.1) that a client with that property has a smaller hazard rate, and consequently a higher survival function and a smaller probability of resolution, than a client without it. The opposite holds when β is positive. Consequently, the tables in Figures 3.9 and 3.10 could be a helpful instrument when deciding whether a client should get a mortgage, or when setting the size of the interest rate for a client. Such a tool is not available when we operate with nonparametric estimators.
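As a purely hypothetical illustration of this interpretation: if a boolean covariate has a coefficient of $\beta = -0.3$, then by equation (3.1) the hazard rate of a client with that property is multiplied by $\exp(-0.3) \approx 0.74$, so it is roughly 26% lower than the hazard rate of an otherwise identical client without the property.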

Understandability is definitely one of the desired properties of an estimator in Risk Management, and it is not a strong point of the Cox model.


All nonparametric estimators, which are part of the CCR estimator, are intuitively easier to understand than the estimators used in the Cox model, especially if we compare the explanations of the cumulative incidence functions. Hazard rates estimated with the Nelson-Aalen estimator are closely related to empirical probabilities obtained from a simple counting process, while formula (3.1) is closely related to the exponential distribution, which makes sense only to a reader with a stronger background in probability.

If time and computational power are important factors in deciding which estimator to use, then the nonparametric estimators are preferable, since they work faster and use less memory.


4 Proportional hazards model in an interval censored setting

Up until now CCR was modeled assuming that we have continuous time data and that the events of interest happen exactly at the observed times. In reality however, an event which is recorded at time $t_i$ actually happened in the interval $(t_{i-1}, t_i]$. Consequently, we are dealing with another type of censoring, interval censoring. In order to model CCR in an interval censoring setting we have to use a different approach and a different likelihood function. For a review of the proportional hazards model in an interval censored setting see Corrente, Chalita, and Moreira (2003).

In Section 4.1 we will become familiar with the theory behind survival analysis in the interval censoring setting. In Section 4.1.1 the generalized linear models, which are needed for the modeling of CCR, will be presented. In Section 4.2 we will see how binning is used to construct the time intervals and to make the computations with the generalized linear models feasible. In Section 4.3 we review the estimated quantities in the interval censoring setting. In Section 4.4 the performance of the method will be analyzed with the bootstrap. Which of the three estimators is the best will be discussed in Section 4.5.

4.1 Theoretical background

When we have interval censored data, the time axis is divided into smaller subintervals $I_i = [a_{i-1}, a_i)$, where $0 = a_0 < a_1 < \cdots < a_k = \infty$. The set of subjects that defaulted in interval $I_i$ will be denoted by $D_i$, and $N_i$ will denote the set of subjects which are at risk at the beginning of the interval $I_i$. The indicator variable $\delta_{ji}$, $j \in \{1, 2, \ldots, n\}$, $i \in \{1, 2, \ldots, k\}$, takes the value 1 if subject $j$ failed in interval $i$ and the value 0 if subject $j$ was still alive at the end of interval $I_i$ or was censored in interval $I_i$. For instance, if subject $j$ died in the interval $I_3$, the following equality will hold
$$(\delta_{j1}, \delta_{j2}, \delta_{j3}) = (0, 0, 1).$$

The value $p(a_i|X_j)$ equals the conditional probability that the subject with covariate $X_j$ has experienced an event by time $a_i$, given that the individual has not experienced one before time $a_{i-1}$. In order to derive the likelihood function in the interval censoring setting the following two identities are needed:
$$
\begin{aligned}
P(T_j \in I_i \mid X_j) &= P(a_{i-1} \leq T_j < a_i \mid X_j) \\
&= S(a_{i-1}|X_j) - S(a_i|X_j) \\
&= \big[(1 - p(a_1|X_j)) \cdots (1 - p(a_{i-1}|X_j))\big] - \big[(1 - p(a_1|X_j)) \cdots (1 - p(a_i|X_j))\big] \\
&= \big[(1 - p(a_1|X_j)) \cdots (1 - p(a_{i-1}|X_j))\big]\big(1 - (1 - p(a_i|X_j))\big) \\
&= \big[(1 - p(a_1|X_j)) \cdots (1 - p(a_{i-1}|X_j))\big]\, p(a_i|X_j)
\end{aligned}
\tag{4.1}
$$
and
$$
P(T_j > a_k \mid X_j) = S(a_k|X_j) = (1 - p(a_1|X_j)) \cdots (1 - p(a_k|X_j)). \tag{4.2}
$$

Combining equations (4.1) and (4.2), the likelihood function for interval censored data becomes
$$L = \prod_{i=1}^{k} \prod_{j \in N_i} p(a_i|X_j)^{\delta_{ji}} \big(1 - p(a_i|X_j)\big)^{1-\delta_{ji}}, \tag{4.3}$$
see Corrente, Chalita, and Moreira (2003).

Equation (4.3) is the likelihood function for observations with a Bernoulli distribution, where $\delta_{ji}$ is a binary response variable with success probability $p(a_i|X_j)$.
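As a small worked illustration of this equivalence: a subject $j$ who fails in interval $I_3$ appears in the risk sets $N_1$, $N_2$ and $N_3$ with responses $(\delta_{j1}, \delta_{j2}, \delta_{j3}) = (0, 0, 1)$, so his contribution to (4.3) is
$$(1 - p(a_1|X_j))(1 - p(a_2|X_j))\, p(a_3|X_j),$$
which is exactly $P(T_j \in I_3 \mid X_j)$ from equation (4.1). Each factor is a Bernoulli likelihood term.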

When a variable has a Bernoulli distribution generalized linear models can be used for the modeling.

4.1.1 Generalized Linear Models

In the first part of this section generalized linear models will be presented. In the second part we review the modeling of $p(a_i|X_j)$ with GLM. For a review of Generalized Linear Models see Fox (2016).

Generalized Linear Models (GLM) are a tool for the estimation of a number of distinct statistical models for which we would usually need separate regression models, for instance logit and probit. GLM were first presented by John Nelder and R.W.M. Wedderburn in 1972.

In order to use GLM three components are needed:

• Firstly, we need a random component $Y_i$ for the $i$-th observation, which is conditioned on the explanatory variables of the model. In the original formulation of GLM, $Y_i$ had to be a member of an exponential family. One of the members of the exponential family is the binomial distribution and consequently the Bernoulli distribution.

• Secondly, a linear predictor is needed, which is a linear function of the regressors.


• Thirdly, we need a smooth invertible linearizing link function, $g(\cdot)$. The link function transforms the expectation of the response variable, $\mu_i = E(Y_i)$, into the linear predictor
$$g(\mu_i) = \eta_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}.$$
One of the functions that can be used in combination with the binomial distribution is the complementary log-log function (clog-log), where
$$\eta_i = g(\mu_i) = \ln(-\ln(1 - \mu_i))$$
or
$$\mu_i = g^{-1}(\eta_i) = 1 - \exp(-\exp(\eta_i)).$$
A small numerical check of this link follows after the list.
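As announced above, the link and its inverse can be verified numerically (illustration only, with arbitrary probabilities):

```python
import numpy as np

mu = np.array([0.05, 0.20, 0.50, 0.90])   # example values of mu_i
eta = np.log(-np.log(1 - mu))             # clog-log link g(mu_i)
mu_back = 1 - np.exp(-np.exp(eta))        # inverse link g^{-1}(eta_i)
print(np.allclose(mu, mu_back))           # True
```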

From identity (4.3) it is seen that $\delta_{ji}$ will be the response variable with a Bernoulli distribution with parameter $p(a_i|X_j)$ and consequently $E(\delta_{ji}) = p(a_i|X_j) = \mu_i$.

Once we know $\mu_i$, we have to find the link function $g$ for which it holds that $\eta_i = g(p(a_i|X_j))$. Using the Cox proportional hazards model in order to model $p(a_i|X_j)$, equations (3.3) and (3.2) give us
$$p(a_i|X_j) = 1 - \left( \frac{S_0(a_i)}{S_0(a_{i-1})} \right)^{\exp(\beta^T X_j)}. \tag{4.4}$$

When the complementary log-log transformation is applied to equation (4.4) we get
$$\ln(-\ln(1 - p(a_i|X_j))) = \beta^T X_j + \ln\left(-\ln\left(\frac{S_0(a_i)}{S_0(a_{i-1})}\right)\right) = \beta^T X_j + \gamma_i = \eta_i, \tag{4.5}$$
where $\gamma_i = \ln\left(-\ln\left(\frac{S_0(a_i)}{S_0(a_{i-1})}\right)\right)$.
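The identity (4.5) can be checked numerically with made-up values for the baseline survival function and the coefficient (a sketch, not thesis code):

```python
import numpy as np

# hypothetical baseline survival values at a_{i-1} and a_i, and a single covariate
S0_prev, S0_curr = 0.80, 0.65
beta, x = 0.4, 1.0

# equation (4.4): discrete-time probability under the proportional hazards model
p = 1 - (S0_curr / S0_prev) ** np.exp(beta * x)

# equation (4.5): the complementary log-log transform is linear in beta * x
lhs = np.log(-np.log(1 - p))
gamma_i = np.log(-np.log(S0_curr / S0_prev))
print(np.isclose(lhs, beta * x + gamma_i))  # True
```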

From the above it follows that we can use a GLM with a binomial distribution and complementary log-log link function in order to model $\delta_{ji}$, or in other words, we can model the Cox proportional hazards model in the interval censoring setting with a GLM.

After the values $p(a_i|X_j)$ are obtained, the survival function at each time point can be calculated using equation (4.2).

In the competing risk setting a GLM can be used in order to model the probabilities of failing due to cause $k$ in the interval $[a_{i-1}, a_i)$, that is $p_k(a_i|X_j)$, $k = 1, 2$. In the competing risk setting the indicator variable $\delta_{ji}^k$ tells us whether individual $j$ failed in interval $i$ due to cause $k = 1, 2$. As soon as the estimators $\hat p_k(a_i|X_j)$ are calculated, the following identity can be used in order to estimate the survival function
$$\hat S(t|X_j) = \prod_{i:\, a_i \leq t} \big(1 - (\hat p_1(a_i|X_j) + \hat p_2(a_i|X_j))\big). \tag{4.6}$$


In order to estimate the cumulative incidence function for cure, the following identity is used
$$\hat F_1(t|X_j) = \sum_{i:\, a_i \leq t} \hat S(a_{i-1}|X_j)\, \hat p_1(a_i|X_j). \tag{4.7}$$

Once the cumulative incidence functions are estimated, equation (2.1) is used in order to estimate CCR.
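As a sketch (with made-up probabilities rather than Rabobank estimates), equations (4.6) and (4.7) translate into a cumulative product and a cumulative sum, from which the CCR is then obtained via equation (2.1):

```python
import numpy as np

# hypothetical estimated interval probabilities for i = 1, ..., k (not Rabobank output)
p_cure = np.array([0.05, 0.10, 0.20, 0.15])   # \hat p_1(a_i | X), cause 1 = cure
p_liq  = np.array([0.02, 0.05, 0.10, 0.10])   # \hat p_2(a_i | X), cause 2 = liquidation

# equation (4.6): S(a_i|X) = prod over a_l <= a_i of (1 - p_1(a_l|X) - p_2(a_l|X))
surv = np.cumprod(1.0 - p_cure - p_liq)

# equation (4.7): F_1(a_i|X) = sum over a_l <= a_i of S(a_{l-1}|X) * p_1(a_l|X), with S(a_0|X) = 1
surv_prev = np.concatenate(([1.0], surv[:-1]))
cif_cure = np.cumsum(surv_prev * p_cure)

print(surv)      # estimated survival function at a_1, ..., a_k
print(cif_cure)  # estimated cumulative incidence of cure at a_1, ..., a_k
```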

As shown above, a GLM can be used for the modeling of proportional hazards in the interval censoring setting. Since the indicator variable $\delta_{ji}^k$ is defined for individual $j$ for each time interval until he is censored or one of the events $k$ occurs to him, we need to input a different type of data frame into the regression than the one used for the nonparametric and Cox estimators. The easiest way to understand how the data needs to be transformed is by continuing the example from Figure 3.1.

From the figure it can be seen that we follow individual 1 until time 6, when he is censored, and for every time interval we have to define $\delta_{1i}^k$, $k = 1, 2$, $i = 1, 2, \ldots, 6$, which tells us whether individual 1 experienced cure or loss in the interval $(i-1, i]$. In a similar fashion the other rows have to be duplicated, but for observations which are not censored, $\delta_{ji}^1$ takes the value 1 if cure happened, as can be seen for individual 4 in Figure 4.1.

After we have the data in the right format, we need to make dummy variables out of the variable observed time. How to create dummy variables for individual 1 can be seen in Figure 4.2. Dummies have to be created for each time from 1 to the largest time that can be found in the column observed time.
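A minimal pandas sketch of this expansion (not the bank's actual code; the column names and the two example loans are hypothetical, and censored spells simply get zeros in both indicator columns):

```python
import pandas as pd

# one row per defaulted loan: time observed in default and how the spell ended
loans = pd.DataFrame({
    "id": [1, 2],
    "observed_time": [6, 4],
    "cured": [0, 1],        # 1 if the spell ended in cure
    "liquidated": [0, 0],   # 1 if the spell ended in liquidation (0, 0 means censored)
    "HighLTV": [1, 0],
})

# expand to one row per loan per time interval, with the indicators delta^k_{ji}
rows = []
for _, r in loans.iterrows():
    for i in range(1, int(r["observed_time"]) + 1):
        last = (i == r["observed_time"])
        rows.append({
            "id": r["id"],
            "interval": i,
            "delta_cure": int(last and r["cured"] == 1),
            "delta_liq": int(last and r["liquidated"] == 1),
            "HighLTV": r["HighLTV"],
        })
person_period = pd.DataFrame(rows)

# dummy variables for the time intervals, one column t_1, t_2, ... per interval
person_period = pd.get_dummies(person_period, columns=["interval"], prefix="t")
```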

Once the dummies are created, we can start modeling $p_k(a_i|X)$. In the GLM we choose the clog-log as the link function and the binomial distribution. For the response variable we choose $\delta_{ji}^k$, and as independent variables we have to choose the dummy variables and the variables we want to include in the regression, in our case the variable HighLTV. The coefficients which we get as output for the dummies $i$ represent the values $\gamma_i$, while the coefficient $\beta_{\mathrm{HighLTV}}$ explains the influence of the variable HighLTV on $p_k(a_i|\mathrm{HighLTV})$, as can be seen in equation (4.4).
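Assuming a person-period data frame with time dummies such as the one sketched above, the fit could be done with statsmodels along the following lines (a sketch, not the thesis implementation; the column names are placeholders and the same call is repeated with delta_liq for the liquidation model):

```python
import statsmodels.api as sm

# design matrix: all time dummies (no intercept, so each dummy yields a gamma_i) plus HighLTV
time_dummies = [c for c in person_period.columns if c.startswith("t_")]
X = person_period[time_dummies + ["HighLTV"]].astype(float)
y = person_period["delta_cure"].astype(float)

# binomial family with complementary log-log link; on older statsmodels versions
# the link class is sm.families.links.cloglog instead of CLogLog
model = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.CLogLog()))
result = model.fit()
print(result.params)  # gamma_1, ..., gamma_6 and beta_HighLTV for the cure model
```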

The output coefficients for cure are

$\beta^1_{\mathrm{HighLTV}} = 0.243$, $\gamma^1_1 = -23.553$, $\gamma^1_2 = -2.232$, $\gamma^1_3 = -1.154$, $\gamma^1_4 = -23.528$, $\gamma^1_5 = -23.562$, $\gamma^1_6 = -23.476$,

and for liquidation


$\beta^2_{\mathrm{HighLTV}} = -0.156$, $\gamma^2_1 = -2.204$, $\gamma^2_2 = -2.096$, $\gamma^2_3 = -22.430$, $\gamma^2_4 = -0.633$, $\gamma^2_5 = -22.423$, $\gamma^2_6 = -22.476$.

Figure 4.1: Transformed data frame for GLM


Figure 4.2: Dummies for GLM

In order to estimate $p_k(a_i|\mathrm{HighLTV})$ we have to use the inverse of equation (4.5), that is
$$p_k(a_i|\mathrm{HighLTV}) = 1 - \exp\big(-\exp(\gamma^k_i + \beta^k_{\mathrm{HighLTV}} \cdot \mathrm{HighLTV})\big). \tag{4.8}$$
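As a check on these numbers, substituting the cure coefficients above into (4.8) for a client with HighLTV = 1 in the second interval gives
$$\hat p_1(a_2 \mid 1) = 1 - \exp\big(-\exp(\gamma^1_2 + \beta^1_{\mathrm{HighLTV}})\big) = 1 - \exp\big(-\exp(-2.232 + 0.243)\big) \approx 0.128,$$
which agrees with the value reported below.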

After doing so we get the following values for the probabilities of cure:

$\hat p_1(a_i|1)$, $i = 1, \ldots, 6$:  7.524·10⁻¹¹, 0.128, 0.331, 7.717·10⁻¹¹, 7.462·10⁻¹¹, 8.125·10⁻¹¹
$\hat p_1(a_i|0)$, $i = 1, \ldots, 6$:  0.105, 0.116, 1.813·10⁻¹⁰, 0.411, 1.827·10⁻¹⁰, 1.733·10⁻¹⁰

For the probabilities of liquidation we get the following values


$\hat p_2(a_i|0)$, $i = 1, \ldots, 6$:  0.143, 1.082·10⁻¹⁰, 9.915·10⁻¹¹, 0.500, 1.082·10⁻¹⁰, 6.374·10⁻¹¹
$\hat p_2(a_i|1)$, $i = 1, \ldots, 6$:  0.009, 0.010, 1.552·10⁻¹⁰, 0.365, 1.564·10⁻¹⁰, 1.483·10⁻¹⁰

Applying equation (4.6) to the results above gives us

$\hat S(a_i|0)$, $i = 1, \ldots, 6$:  1.000, 0.895, 0.511, 0.301, 0.301, 0.301
$\hat S(a_i|1)$, $i = 1, \ldots, 6$:  0.909, 0.703, 0.470, 0.298, 0.298, 0.298
