MSc STOCHASTICS AND FINANCIAL MATHEMATICS

MASTER THESIS

Advanced Backtesting Probability of Default Predictions

Author: Congyi Dong
Supervisor: dr. Arnoud den Boer
Daily Supervisor: dr. Sjoerd C. de Vries
Second Reader: dr.ir. Erik Winands
Examination Date: 27 November


Abstract

Measuring the performance of Probability of Default (PD) models is a major task for banks. The predictions of PD models are regularly tested against actual observations; this activity is called backtesting. In practice, banks do not use the PDs directly to describe the credit quality of clients, but map clients to a bucket of an internal rating system according to their PDs. However, these PD ratings are not produced on an evenly spaced schedule, and the backtest reliability suffers from the incorrect assumption that the most recent credit rating predictions, which may have been generated 12 months ago or even longer ago, are still valid at the backtest starting date. This problem could be solved by establishing a migration matrix of credit grades to estimate the rating at the start of a backtest period. This thesis investigates whether a Hidden Markov Model (HMM) can be used to obtain a good estimate of this migration matrix. In our research, the ‘true’ credit grades are taken as the hidden Markov states, while the credit grades predicted by banks are the observation states. This leads to a large observation and hidden state space. To reduce the amount of data required to fit this high-dimensional HMM, we propose a technique to estimate the migration matrix block by block. We then estimate the migration matrix in two scenarios: rating the clients monthly or irregularly. In the former, ideal case, the bank does not lose much information about the credit quality transitions of clients, so the estimated migration matrix is in line with the given ‘true’ matrix; in the latter, more realistic case, by introducing a new observation state ‘non-rated’, the credit quality migration sequences of clients are also put on a monthly grid. Due to a lack of information, in the latter case we can only estimate the transition probabilities of clients in the low credit rating grade blocks. Thus, we conclude that when banks rerate clients irregularly, the HMM can be applied to specific portfolios whose clients have low PDs, such as a mortgage portfolio.

Title: Advanced Backtesting Probability of Default Predictions

Keywords: Credit Risk Management, Credit Model Validation, Backtesting, Probability of Default, Hidden Markov Model

Author: Congyi Dong, 12101788
Email: Congyi.dong@student.uva.nl
Supervisor: dr. Arnoud den Boer
Daily supervisor: dr. Sjoerd C. de Vries
Second reader: dr.ir. Erik Winands
Examination date: 27 November

Korteweg-de Vries Institute for Mathematics, University of Amsterdam
Science Park 105-107, 1098 XG Amsterdam
http://kdvi.uva.nl


Preface

After eight months, I am finishing this thesis and graduating from the University of Amsterdam. During these two years, I grew rapidly and found my career goal. My master's project started on 8 March. A few weeks later, the Netherlands went into lockdown because of the COVID-19 outbreak. It was a tough time for us interns, since we could only work from home. I am very grateful to Rabobank, not only for offering me the chance to do a thesis internship, but also for arranging online sessions that gave me access to the banking industry even though I was not allowed to go to the office.

I would like to thank my supervisors, Sjoerd de Vries and Arnoud den Boer, for guiding me and supporting me in my master's project. I learned from them about how to solve problems and how to write a well-organized thesis. I would also like to thank my other colleagues in the Credit Model Validation team. Last, many thanks to my roommate Ziyu Zhou for her company during this quarantine and her kind suggestions.

I hope you enjoy reading this thesis.

Congyi Dong,


Table of Contents

1 Introduction ...6

2 Regulatory Background ...9

2.1 A Brief History of Basel Accords ... 9

2.2 Some Definitions and Regulations from CRR ... 10

3 Literature Review ... 12

3.1 Literature Related to Model Validation ... 12

3.2 Literature Related to HMM ... 13

4 Model Validation Methodology ... 15

4.1 Calibration Quality ... 15

4.1.1 Binomial Test ... 15

4.1.2 Poisson Binomial Test ... 17

4.1.3 Traffic Light Approach ... 18

4.2 Discriminatory Power ... 18

4.2.1 Cumulative Accuracy Profile (CAP) ... 19

4.2.2 Accuracy Ratio (AR) ... 22

4.2.3 Receiver Operating Characteristic (ROC) ... 23

4.3 Chapter Summary ... 23

5 Dataset Simulation and PD model Validation ... 24

5.1 Dataset Simulation ... 24

5.1.1 Factor Values and Credit Rating System Setup ... 25

5.1.2 ‘True’ Credit Rating Model ... 26

5.1.3 Drift Functions ... 28

5.1.4 Simulation of Credit Rating Migration ... 30

5.2 Validation Methods Implementation ... 34

5.2.1 Logistic Regression Model with Full Information ... 34

5.2.2 Logistic Regression Model with Partial Information ... 38

5.3 Chapter Summary ... 40

6 Hidden Markov Model Methodology (HMM) ... 41

6.1 Setup of Hidden Markov Model ... 41

6.1.1 One unit delay Hidden Markov Model... 42

6.1.2 Zero delay Hidden Markov Model ... 44


6.2.1 Probability of obtaining a certain observation sequence ... 46

6.2.2 Estimation of HMM Parameters ... 49

6.2.3 Decoding the observation sequence ... 52

6.3 A Simple Example of HMM Application ... 53

6.3.1 When the Bank Rerates Clients on Evenly Spaced Schedule ... 53

6.3.2 When the Bank Rerates Clients on Unevenly Spaced Schedule ... 54

6.4 Estimation of 15-dimensional HMM Transition Matrix ... 56

6.5 Chapter Summary ... 57

7 The Implementation of the Hidden Markov Model ... 59

7.1 Data Pre-processing ... 59

7.1.1 Data Pre-processing for Full Information... 60

7.1.2 Data Pre-processing for Partial Information HMM ... 60

7.2 When Clients Are Rated Monthly ... 61

7.3 When Clients Are Not Rated Monthly ... 66

7.4 Attempts to Improve Obtained Results ... 69

7.4.1 Modifying the Original Settings of Simulated Artificial Bank... 69

7.4.2 Modifying the Data Pre-processing technique ... 72

7.5 Chapter Summary ... 73

8 Conclusions and Discussions ... 75

8.1 Conclusions ... 75

8.2 Discussions ... 76

9 Further Research ... 79

Popular Summary ... 82

Reference ... 84

Appendix I. The bucket plotting based on a declining number of factors ... 86


1 Introduction

In the real world, banks are required to calculate the Probability of Default (PD) as part of risk management. PD is defined as ‘the probability of default of a counterparty over a one-year period’ [1]. There are several reasons why banks estimate PDs:

1) Regulatory Capital (RC) calculation. Banks are required to hold capital for unexpected losses. PDs are used as part of this calculation; they are used to calculate Risk Weights [1].
2) RAROC. Calculating the Risk-Adjusted Return On Capital. This is used to check whether a loan would make sufficient returns to compensate for the risk that a bank runs on the loan and for the cost of capital. This may be part of the next point.
3) Client acceptance. The bank can choose to reject clients that create a risk that is deemed too high and that do not have a RAROC higher than the minimum hurdle rate.
4) Provisioning. PDs are used to calculate expected losses. Currently, the IFRS 9 standard is used by most banks for provisioning.
5) Pricing. With a good PD model, it is easier to accept bad clients if banks can calculate the correct pricing for such clients. Bad clients may have to pay more for their loans, so that even if some of them default, on average the bank still makes a profit. This is related to the RAROC above.
6) Client monitoring. A rapidly changing PD can be a signal to put a client under close watch by account management or the Special Assets Management (SAM) department.

Instead of directly applying the exact PD of each client, banks usually assign clients to one of the buckets of an internal rating system, which is defined on a set of PD intervals. In this internal rating system, the clients in the same credit grade are assumed to have the same bucket PD. The PD predictions are regularly tested against the actual Observed Default Rate (ODR) to check the predictive ability of a PD model; this activity is called backtesting. The ideal backtesting procedure, where all clients are rated at the starting date, yields the most reliable assessment results, as shown in Figure 1. This is because the time horizon of the PD (see Section 2.2) is one year and the backtest period is also a year.

(7)

7

In Figure 1, ‘R1’ to ‘R10’ represent the buckets of the internal credit rating system; ‘P’ and ‘D’ represent the performing and default states, respectively. By comparing the default frequency observed between the vertical lines in Figure 1 with the PDs predicted one year earlier, we can properly assess the performance of the tested PD prediction models.

However, the real-world backtest is not optimal, due to the incorrect assumption that the most recent ratings are still valid at the start of a backtest period. Banks do not rerate all clients at the starting date but directly compare the ODR with the most recent credit rating grades, which can be quite old: up to 12 months if regulatory requirements are fully complied with, but in practice sometimes even older, as shown in Figure 2. During the time between the starting date of a backtest period and the most recent rating date, the credit quality of the clients will invisibly change for better or worse, reducing the backtest reliability.

Figure 2. The real-world backtesting procedure when banks rerate all clients irregularly and assume that the latest credit grades are still valid, which might not be the case

Building a good migration matrix of client ratings would help to solve the problem mentioned above. Based on knowledge of the transition probabilities, banks would be able to predict the likely credit quality migration during this time interval, so that the backtest reliability (and also the predictions themselves) will be improved.

However, the migration matrix estimation that banks use now is slightly wrong. Banks build the migration matrices under the assumption that they are one-year transitions. This is illustrated in Figure 3. A, B, C, and D represent the time points of the vertical lines, with a one-year interval in between. The one-year transition matrices are computed based on the latest credit rating grades around the time points of the vertical lines. If a client is not rated during a year, for example the fourth client from A to B in Figure 3, then his or her credit grade is taken to remain at 𝑅2. However, this is not really reasonable, since we do not know whether the migration from 𝑅2 to 𝑅5 of the fourth client happened during the AB period or the BC period.


Figure 3. How banks compute the credit rating grades migration matrix in the real world

The assumption of the backtest implies an identity migration matrix, as the most recent ratings are considered as the current ratings. Although a slightly inaccurate migration matrix, as shown in Figure 3, will at least provide some useful information about possible transitions, a better estimate of the migration matrix would be more helpful.

Thus, this research aims to check whether a Hidden Markov Model can help to obtain a good estimate of the credit grades migration matrix, enabling the prediction of credit quality migration at the starting date of a backtest period.

In Chapter 2, the related regulations and definitions regarding model validation will be discussed, and Chapter 3 covers the literature on both model validation and previous research on predicting credit rating grade migration. In Chapter 4, the model validation methodologies in use will be explained in detail. Then, after simulating an artificial bank in Chapter 5, these validation methods will be put into practice to see whether they can distinguish the good model from the bad ones. In Chapter 6 the zero-delay Hidden Markov Model will be presented theoretically, and in Chapter 7 we will check whether a Hidden Markov Model can be used to estimate the migration matrix of the credit quality of clients. Chapter 8 draws the conclusions resulting from this research and discusses their implications, and Chapter 9 provides four possible directions for further research.



2 Regulatory Background

In this chapter, the history of the Basel Accords and their EU translation, the Capital Requirements Regulation (CRR), will be discussed, and some definitions and regulations related to our research will be introduced. This chapter is based on the Basel Committee documents BCBS (2004) [3], BCBS (2005a) [4], BCBS (2005b) [5], BCBS (2010) [6], the BIS (Bank for International Settlements) official website (www.bis.org), and BIS (2014) [7].

2.1 A Brief History of Basel Accords

As recorded in the history document published by BIS, after the breakdown of the Bretton Woods system of managed exchange rates [7] in 1973, many banks suffered from large foreign currency losses. On 26 June 1974, because the foreign exchange exposures of Bankhaus Herstatt were three times its capital, the Federal Banking Supervisory Office of West Germany withdrew its banking license. Banks outside Germany took heavy losses on their unsettled trades with Herstatt, adding an international dimension to the turmoil. In October the same year, the Franklin National Bank of New York also closed its doors after incurring large foreign exchange losses [7]. Following bank failures in both Germany and the United States in 1974, the central bank governors of the G10 countries set up a committee on Banking Regulations and Supervision. This committee was renamed the Basel Committee on Banking Supervision. It provides a forum for regular cooperation on banking supervisory matters, and its objective is to enhance understanding of key supervisory issues and improve the quality of banking supervision worldwide [7].

As mentioned in the official history document [7], in the 1980s bank failures in the United States were increasing at an alarming rate. At the same time, the external debt of many countries had been growing at an unsustainable rate, and the probability of major international banks going bankrupt was alarmingly high. Backed by the G10 Governors, the Basel Committee on Banking Supervision met in 1987 in Basel, Switzerland to discuss possible ways of preventing the situation from spinning out of control [7]. This meeting reached an agreement to use a weighted approach to measure the risk banks run on their exposures. Following a consultative paper published in December 1987, a capital measurement system, commonly known as the Basel Capital Accord (Basel I), was approved by the G10 Governors and issued to banks in July 1988 [7]. Basel I called for a minimum ratio of capital to risk-weighted assets of 8% to be implemented by the end of 1992. This was the beginning of the Basel Accords.

In June 1999, the Committee issued a proposal for a new capital adequacy framework to replace the 1988 Accord. This led to the release of the Revised Capital Framework in June 2004, generally known as ‘Basel II’ [7]. In Basel II, the BCBS recommends taking ‘rating and scoring as the basis for determining risk-sensitive regulatory capital requirements for credit risks’ [3]. Compared to Basel I, where capital requirements are uniformly 8%, in particular for corporate borrowers irrespective of their creditworthiness, Tasche states that this is major progress [12]. Basel II also gives two approaches for capital calculation: the Standardized Approach (SA) and the Internal Rating-Based Approach (IRB). Credit institutions that apply the Basel II Standardized Approach (SA) can base the calculation of capital requirements on agency ratings [3]. In the Standardized Approach (SA), Basel II also gives fixed PD percentages for certain business types (retail, residential real estate, commercial real estate, overdue loans). Credit institutions that are allowed to apply the Internal Rating-Based (IRB) approach have to derive PDs from ratings they have determined themselves [12]. Note that in the IRB approach, capital requirements depend not only on PD estimates but also on estimates of the Loss Given Default (LGD) and Exposure At Default (EAD) parameters [12]. Rabobank currently uses the IRB approach to calculate the capital for most, but not all, of its portfolios.

As stated in the official history document [7], the need for a fundamental strengthening of the Basel II framework became apparent even before Lehman Brothers collapsed in September 2008. The banking sector had entered the financial crisis with too much leverage and inadequate liquidity buffers [7]. In July 2009, the Committee issued a further package of documents to strengthen the Basel II capital framework. These documents strengthen the regulation and supervision of internationally active banks. In September 2010, the Group of Governors and Heads of Supervision announced higher global minimum capital standards for commercial banks. This followed an agreement reached in July regarding the overall design of the capital and liquidity reform package, now referred to as ‘Basel III’.

However, the Basel III regulations could not be applied directly in the EU. From the official documents [8], it follows that this is because Basel III itself is not a law for banks worldwide, but a set of internationally accepted standards agreed by regulators and central banks. Thus, the Basel III regulations had to be translated into an EU-adapted version, which could be put under democratic control. The High-Level Group on Financial Supervision in the EU, chaired by Jacques de Larosière, invited the Union to develop a more harmonized set of financial regulations. In the context of the future European supervisory architecture, the European Council of 18 and 19 June 2009 also stressed the need to establish a ‘European Single Rule Book’ applicable to all credit institutions and investment firms in the internal market [1]. The Capital Requirements Regulation (CRR) was designed for this purpose, and it is now the EU law that aims to decrease the likelihood that banks become insolvent [8].

2.2 Some Definitions and Regulations from CRR

This section is based on the Capital Requirements Regulation (CRR) document [1]. As stated in Chapter 1, this research aims to check whether a Hidden Markov Model (HMM) can help to estimate a better credit rating grade migration matrix, so that banks would be able to predict the credit grades at the start of a backtest period. The definitions of obligor grade, Probability of Default (PD), and Observed Default Rate (ODR) are introduced below.

In CRR art.3 (54), the Probability of Default (PD) is defined as ‘the probability of default of a counterparty over a one-year period’;

In CRR art.3 (78), 'one-year default rate' means the ratio between the number of defaults occurred during a period that starts from one year prior to a date T and the number of obligors assigned to this grade or pool one year prior to that date;

In CRR art.143 (6), 'obligor grade' means a risk category within the obligor rating scale of a rating system, to which obligors are assigned on the basis of a specified and distinct set of rating criteria, from which estimates of probability of default (PD) are derived;


The following five regulations relate to the requirements on an internal rating system and PD model validation. The third regulation states that the Observed Default Rate (ODR) is seen as an estimate of the PD, while the fourth regulation points out the way of estimating the PDs of obligors in a given grade. These give us a clue for determining the bucket PDs of the simulated bucketing system in Chapter 5. The last regulation states that model validation must be conducted at both model level and grade level.

According to CRR art.170 (3c), the process of assigning exposures to grades or pools shall provide for a meaningful differentiation of risk, for a grouping of sufficiently homogenous exposures and shall allow for accurate and consistent estimation of loss characteristics at grade or pool level;

According to CRR art.170 (2), an institution shall take all relevant information into account in assigning obligors and facilities to grades or pools. Information shall be current and shall enable the institution to forecast the future performance of the exposure;

According to CRR art.180 (1a), institutions shall estimate PDs by obligor grade from long run averages of one-year default rates;

According to CRR art.180 (1g), to the extent that an institution uses statistical default prediction models, it is allowed to estimate PDs as the simple average of default-probability estimates for individual obligors in a given grade;

According to CRR art.185 (b), institutions shall regularly compare realized default rates with estimated PDs for each grade;

As mentioned above, from a model validation perspective, when estimating the credit grade migration matrix we are not allowed to reduce the dimension of the internal rating system by directly folding the credit grades, as Malgorzata Wiktoria did in her research [27]. This is because model validation is required to be conducted on both model level and bucket level. If we fold the grades, we are not able to backtest PDs on bucket level.


3 Literature Review

The literature on how to apply a Hidden Markov Model to credit quality is sparse. This chapter will discuss three articles related to model validation [9][12][13], one article on an HMM applied to credit quality [27], and two books about the theory of HMMs [28][29].

3.1 Literature Related to Model Validation

Gerd Castermans and David Martens [9] give a structured introduction to commonly used quantitative validation methods and mainly focus on backtesting and benchmarking, which are key quantitative tools. They state that there are generally three parts to model validation: calibration, discrimination, and stability. Calibration refers to the mapping of a rating to a quantitative risk measure. A rating system is considered well-calibrated if the estimated risk measures deviate only marginally from what has been observed ex post. Discrimination measures how well the rating system provides an ordinal ranking of the risk measure considered. Stability measures to what extent the population that was used to construct the rating system is similar to the population on which it is currently used.

The authors analyze both advantages and disadvantages of the methods discussed in their article. In terms of calibration, the well-known binomial test is mentioned. Its estimations are TTC (through-the-cycle) but the outcomes are PIT (point-in-time). A TTC estimate is supposed to be a long-term average of ODFs, that is, unconditional on the business cycle we are in at any time. By contrast, a PIT estimate is representative of the current business cycle. This means that the binomial test does not take the economic situation into account. In terms of discriminatory power, the ROC test and the DeLong test are introduced. Confidence intervals and tests are available for the AUC measure of the ROC. However, it is hard for a researcher to use the ROC to define a minimum value that determines acceptable discriminatory power. The DeLong test accounts for sample variability but is complex to calculate. Tasche [12] elaborates on the validation requirements for rating systems and probabilities of default that were introduced in Basel II. He puts the main emphasis on the issues with quantitative validation. The techniques discussed in his article could be used to meet the quantitative regulatory requirements. However, their appropriateness will depend on the specific conditions under which they are applied. He introduces a theoretical framework by defining two random variables, 𝑆 and 𝑍. The former denotes a score on a continuous scale that the institution has assigned to the borrower, while the latter denotes the state the borrower will be in at the end of a fixed period, default or non-default. The institution's intention with the score variable 𝑆 is to forecast the borrower's future state 𝑍, by relying on the information on the borrower's creditworthiness that is summarized in 𝑆. He mentions that in this sense, scoring and rating are related to binary classification, and that scoring can be called binary classification with a one-dimensional covariate.

Intuitively, a good rating system should be able to distinguish the creditworthy obligors from the potential defaulters, by assigning good obligors to low credit rating grades and clients with a higher PD to high credit rating grades. Therefore, Dirk Tasche also discusses when and how this monotonicity can be guaranteed under this theoretical framework. The problem is considered in the context of a hypothetical decision problem. He introduces some techniques to find a reasonable threshold of scores, below which the borrower would be predicted to default. After that, he studies the question of how discriminatory power can be measured and tested. He elaborates on the Cumulative Accuracy Profile (CAP) and its summary statistic the Accuracy Ratio (AR), the Receiver Operating Characteristic (ROC) and its summary measure the Area Under the Curve (AUC), and the error rates as measures of discriminatory power. Various calibration techniques are also included, with content similar to the article of Castermans and Martens [9].

Dirk Tasche concludes that the AR and AUC seem promising tools to check discriminatory power, as their statistical properties are well investigated and they are available, together with many auxiliary features, in most of the more popular statistical software packages. With regard to testing calibration, powerful tests for conditional PD estimates, such as the binomial test and the Hosmer-Lemeshow test, are available. However, their appropriateness strongly depends on the assumption that default events are all independent. This independence assumption needs to be justified on a case-by-case basis.

Similarly, Engelmann [13] also introduces the CAP and ROC as commonly used techniques to test the discriminatory power of a PD prediction model. He gives the relationship between AR and AUC, that is,

𝐴𝑅 = 2𝐴𝑈𝐶 − 1.

3.2 Literature Related to HMM

Elliott’s book [28] is mainly about the theory of HMM. The book includes theorems about both discrete and continuous states and observations. Chapter 2, which describes the discrete HMM, is related to our research and will be our main focus. HMM assumes that there is a Markov process which is unobservable, and that there is another process whose behavior depends on the hidden Markov process. Elliott’s book is based on a one-unit delay discrete HMM, which means the observed value at time 𝑡 only depends on the value of the hidden state at time 𝑡 − 1. The details of the one unit delay HMM can be found in Chapter 6.

Elliott's book assumes that the noises of these two processes are independent. If the transition matrix and emission matrix are denoted by $A$ and $C$, the HMM can be written as

$$X_{k+1} = A X_k + V_{k+1} \ \text{(hidden states)}, \qquad Y_{k+1} = C X_k + W_{k+1} \ \text{(observation states)},$$

where $V_{k+1}$ and $W_{k+1}$ are the noise terms at time $k+1$. In his book, these two noise processes are independent of each other. He also proposes another form of HMM in which the two noise processes are dependent. In that case, the observation state $Y_{k+1}$ depends on both $X_{k+1}$ and $X_k$. In Chapter 6, we explain the theoretical HMM in detail; there, $Y_t$ denotes the 'true' credit quality of clients at time $t$, while $X_t$ denotes the credit ratings calculated by banks.

Malgorzata Wiktoria's research [27] is based on Elliott's book [28]. She gives a brief introduction to the theory of both the general HMM and the dependent HMM, and states that in the dependent HMM the hidden true credit quality state $X_{k+1}$ and the observation $Y_{k+1}$ jointly depend on $X_k$, which means that, in addition to the previous period's credit quality, knowledge of the current credit rating carries information about the current credit quality. She conducts a numerical experiment to test whether the HMM can be used to estimate the transition matrix of the hidden states, which in her research are the true credit qualities. Instead of estimating the transition matrix for all credit ratings, she roughly divides all the credit grades into two groups, investment grade (IG) and speculative grade (SG), which reduces the dimensions of both the hidden state space and the signal state space. However, due to the requirements of the CRR, from a validation perspective folding credit rating grades makes us unable to backtest models on bucket level. Thus, in our research, one of the challenges is how to reduce the dimension of the state space without folding credit ratings.

Different from Malgorzata Wiktoria's research and Elliott's book, Rogemar presents a zero-delay HMM, which is slightly different from the one-unit delay version. The zero-delay HMM assumes that the observation signal at time 𝑡 depends on the hidden state at time 𝑡 instead of the hidden state at the previous step:

$$X_{k+1} = A X_k + V_{k+1}, \qquad Y_k = C^* X_k + W_k^*,$$

where $A = (a_{ji})$ represents the transition matrix and $C^* = (c_{ji}^*)$ represents the emission probability matrix; $V_k$ and $W_k^*$ are the noise terms.

Our research is based on the zero-delay HMM. We apply the zero-delay HMM because, intuitively, the observation signals depend on the current hidden true credit quality rather than on the previous hidden state. The parameter estimation methods also differ: Malgorzata Wiktoria and Elliott apply the filter-based cohort approach [28] to estimate the migration matrix, while we use the Baum-Welch algorithm to obtain the estimate of the transition probabilities. The specific steps of the Baum-Welch algorithm can be found in Zheng Rong [30] and Jeff Bilmes [31], and they are covered in Chapter 6.
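As a minimal illustration of this zero-delay structure, the Python sketch below simulates a hidden chain and its observation signals. It is only a toy example under stated assumptions: the matrices, their dimensions, and the row-stochastic convention (rows correspond to the current state and sum to one) are invented for illustration and are not the 15-dimensional setup used later in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_zero_delay_hmm(A, C, n_steps, x0=0):
    """Simulate a zero-delay HMM: the observation at time k depends on the
    hidden state at the same time k, not on the state one step earlier.

    A : (n_hidden, n_hidden) transition matrix, row i = current hidden state.
    C : (n_hidden, n_obs) emission matrix, row i = hidden state."""
    hidden = [x0]
    observed = [rng.choice(C.shape[1], p=C[x0])]
    for _ in range(n_steps - 1):
        x = rng.choice(A.shape[1], p=A[hidden[-1]])  # draw X_{k+1} from a row of A
        y = rng.choice(C.shape[1], p=C[x])           # Y_{k+1} depends on X_{k+1}
        hidden.append(x)
        observed.append(y)
    return np.array(hidden), np.array(observed)

# Toy example: 3 hidden 'true' credit-quality states, 3 observed rating signals.
A = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.90, 0.05],
              [0.02, 0.08, 0.90]])
C = np.array([[0.85, 0.10, 0.05],
              [0.10, 0.80, 0.10],
              [0.05, 0.10, 0.85]])
X, Y = simulate_zero_delay_hmm(A, C, n_steps=12)
print(X)
print(Y)
```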


4 Model Validation Methodology

In the previous chapter, the regulatory background and previous research were introduced. In this chapter, the model validation methodologies will be demonstrated, and in Chapter 5 these validation methodologies will be tested to see whether they can help in distinguishing the good PD prediction models from the bad ones.

The focus of the following validation activities is to check whether or not models are fit for purpose and conceptually sound, by effectively challenging the owner, modeling teams, and users of the developed model, the model documentation, and the test results. Moreover, it is emphasized by the Basel Committee that both quantitative and qualitative components should be considered during the validation process. For more specific qualitative validation, please consult the Basel document [4]. This research deals with quantitative validation only. As BCBS (2005a) states [4], “validation is fundamentally about assessing the predictive ability of a bank’s risk estimates and the use of ratings in credit processes”. Here the term “predictive ability” is not a statistical term with a specific mathematical meaning but, in the financial industry, can be understood as the correctness of the calibration of PD models and the discriminatory power of the entire internal rating system [12]. The testing methods for these two aspects are introduced in the following sections.

4.1 Calibration Quality

Checking the calibration quality of PD models means testing whether or not the observed default rates are in line with the predicted PDs. For the calibration quality of PD models, BCBS (2004) [3] states that “banks must regularly compare realized default rates with estimated PDs for each grade”. Therefore, credit institutions need to test the accuracy of prediction models on both grade level and model level. The binomial test can be used for the bucket-level testing, while the Poisson binomial test is applied to describe the prediction ability on model level. Based on the traffic light approach, which is proposed in [4], the calibration quality of the target model can be monitored.

4.1.1 Binomial Test

In some processes, observed values can only be divided into two categories, such as qualified/unqualified, yes/no, life/death, etc. The binomial distribution is the probability distribution of the number of successes in 𝑛 independent trials, each of which yields one of two mutually exclusive outcomes with a fixed probability. The binomial test is a method used to test whether the samples follow the binomial distribution with parameters (𝑛, 𝑝), where 𝑛 is the number of samples and 𝑝 is the probability of obtaining a ‘success’ event instead of a ‘no success’ event. Note that the binomial test can only be used for models that predict a dichotomous variable (in this case, default or performing).

In the binomial test, the observation events are all assumed to be independent, which means that the observed results for the clients in the same observation window are parallel and do not interact. Suppose that among 𝑛 samples, 𝑘 samples show a success. The probability mass function of the binomially distributed random variable 𝑋 can be written as [9]

$$\mathbb{P}(X = k) = \binom{n}{k} p^k (1-p)^{n-k}. \tag{4.1}$$


If 𝑛 is large enough, for instance, 𝑛 > 1000, and 𝑛 ∙ 𝑝 ∙ (1 − 𝑝) ≥ 9, we can apply a normal approximation, that is, the binomially distributed random variable 𝑋 is approximately normally distributed, which can be expressed as

𝑋 ∼ 𝑁(𝑛𝑝, 𝑛𝑝(1 − 𝑝)). (4.2)

As a hypothesis test, the binomial test can be either one-sided or two-sided, which are shown in Table 1.

Table 1. The null and alternative hypotheses of the binomial test

𝑯𝟎 (one-sided and two-sided): The Observed Default Rates (ODRs) are in line with the predicted PDs, which means the PD model can be considered accurate.

𝑯𝟏 (one-sided): 1) The ODRs are lower than the predicted PDs, or 2) the ODRs are larger than the predicted PDs, which means the PD prediction model is not accurate.

𝑯𝟏 (two-sided): The ODRs are not equal to the predicted PDs, which means the PD prediction model is not accurate.

Comparing the output p-value of the binomial test with the chosen significance level 𝛼, one can reject or not reject the null hypothesis. How to choose the significance level depends on how conservative one would like to be. Theoretically, for the right-sided test, one would reject the null hypothesis if the following inequality holds [10]:

$$\mathbb{P}(X \ge k) = 1 - F(k-1) = 1 - \sum_{i=0}^{k-1} \binom{n}{i} p^i (1-p)^{n-i} \le \alpha, \tag{4.3}$$

where 𝑛 is the total number of observations and 𝑘 is the number of success events, which in this case should be understood as the number of default events. Similarly, for the left-sided test, the null hypothesis would be rejected when

$$\mathbb{P}(X \le k) = F(k) = \sum_{i=0}^{k} \binom{n}{i} p^i (1-p)^{n-i} \le \alpha. \tag{4.4}$$

In both equations (4.3) and (4.4), $F(\cdot)$ represents the cumulative distribution function of the binomially distributed random variable 𝑋, where $X \sim \mathrm{Bin}(n, p)$. For the two-sided binomial test, both probabilities of equations (4.3) and (4.4) should be computed, but compared with $\alpha/2$ instead. As long as one of the one-sided tests leads to a rejection, the null hypothesis of the two-sided test is rejected at significance level 𝛼; otherwise, it is not rejected.

Also, based on the assumptions mentioned above, the normal approximation can be considered. In this case, the null hypothesis 𝐻0 is rejected when

$$\mathbb{P}(Z \ge z) = 1 - \Phi(z) \le \alpha \tag{4.5}$$

for the right-sided test, and

$$\mathbb{P}(Z \le z) = \Phi(z) \le \alpha \tag{4.6}$$

for the left-sided test, where

$$z = \frac{k - np}{\sqrt{np(1-p)}}$$

and $\Phi(\cdot)$ represents the cumulative distribution function of the standard normal distribution. Similarly, the null hypothesis 𝐻0 of a two-sided test is rejected at significance level 𝛼 as long as one of the one-sided tests leads to a rejection.
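As a sketch of how equations (4.3) to (4.6) translate into a bucket-level check in practice, the snippet below runs the two-sided test with the exact binomial CDF instead of the normal approximation; the bucket size, bucket PD, default count, and significance level are made-up illustrative numbers, not values from this thesis.

```python
from scipy.stats import binom

def binomial_backtest(n_obligors, n_defaults, bucket_pd, alpha=0.10):
    """Two-sided binomial test for one rating bucket, following eqs. (4.3)-(4.4):
    reject H0 ('the ODR is in line with the predicted bucket PD') if either
    one-sided tail probability is at most alpha / 2."""
    p_right = 1.0 - binom.cdf(n_defaults - 1, n_obligors, bucket_pd)  # P(X >= k)
    p_left = binom.cdf(n_defaults, n_obligors, bucket_pd)             # P(X <= k)
    reject = (p_right <= alpha / 2) or (p_left <= alpha / 2)
    return reject, p_left, p_right

# Hypothetical bucket: 1,200 obligors, predicted bucket PD of 2%, 35 observed defaults.
print(binomial_backtest(1200, 35, 0.02))
```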

In the real world, one-sided and two-sided tests are conducted under different circumstances. A bank tends to use the right-sided test to see whether or not the model is too optimistic, to avoid the risk of having more defaulters than expected. In contrast, the two-sided binomial test is applied when one wants the model to be neither too conservative nor too optimistic. From a risk management perspective, monitoring both conservative and optimistic predicted PDs is important. Therefore, in this research a two-sided test is used, which means both conservatism and optimism are unacceptable.

4.1.2 Poisson Binomial Test

Different from the binomial test, the Poisson binomial test is conducted to test the calibration of the whole rating system rather than a single grade. It is an exact test, since no approximations or shortcuts are necessary. The Poisson binomial distribution is the distribution of the sum of independent and non-identically distributed random indicators, where each indicator is a Bernoulli random variable and each probability of default may vary. The Poisson binomial test reduces to the binomial test if the probabilities of default are equal on bucket level. The null and alternative hypotheses are shown in Table 2.

Table 2. The null and alternative hypotheses of the Poisson binomial test

𝑯𝟎 (one-sided and two-sided): The ODR is in line with the predicted PDs, which means the PD model can be considered accurate.

𝑯𝟏 (one-sided): 1) The ODR is lower than the predicted PDs, or 2) the ODR is larger than the predicted PDs, which means the PD model is not accurate.

𝑯𝟏 (two-sided): The ODR is not in line with the predicted PDs, which means the PD model is not accurate.

Similarly, for the Poisson binomial test the resulting p-values are compared with the chosen significance level 𝛼, which depends on how conservative one wants to be. For the Poisson binomial test, the null hypothesis is rejected when

$$\mathbb{P}(X \ge k) = 1 - F(k-1) \le \alpha \tag{4.7}$$

for the right-sided test, and

$$\mathbb{P}(X \le k) = F(k) \le \alpha \tag{4.8}$$

for the left-sided test, where $F(\cdot)$ is the cumulative distribution function of the Poisson binomial distribution, written as

$$F(k) = \sum_{m=0}^{k} \sum_{A \in \mathcal{F}_m} \prod_{j \in A} p_j \prod_{j \in A^c} (1 - p_j), \tag{4.9}$$

where $\mathcal{F}_m$ is the set of all subsets of 𝑚 integers selected from $\{1, 2, 3, \ldots, n\}$ and $A^c$ is the complement of the set $A$. As above, the null hypothesis of the two-sided Poisson binomial test is rejected at significance level 𝛼 as long as one of the one-sided tests leads to a rejection at significance level $\alpha/2$.

However, the computation of the CDF of the Poisson binomial distribution is not as straightforward as directly applying equation (4.9) above [18]. Instead of using approximation approaches, Rabobank applies an exact closed-form expression of the Poisson binomial CDF, which involves the Fourier transform of the characteristic function of the distribution. Since improving computational efficiency is not the focus of this thesis, we will not explain the details further; details of the algorithms can be found in [18][19].
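As a minimal sketch of one way to evaluate this CDF exactly without enumerating the subsets of equation (4.9), the snippet below builds the distribution by a simple convolution over obligors. This is not the Fourier-transform method referenced above, and the portfolio size, per-obligor PDs, and observed default count are invented for illustration.

```python
import numpy as np

def poisson_binomial_cdf(k, pds):
    """CDF of the Poisson binomial distribution via dynamic programming:
    add obligors one by one and convolve the default-count distribution."""
    dist = np.array([1.0])                        # P(0 defaults) before any obligor
    for p in pds:
        dist = np.append(dist, 0.0) * (1 - p) + np.append(0.0, dist) * p
    return dist[: k + 1].sum()

# Model-level two-sided test on a hypothetical portfolio of per-obligor PDs.
rng = np.random.default_rng(1)
pds = rng.uniform(0.01, 0.10, size=500)           # illustrative predicted PDs
k = 40                                            # illustrative observed defaults
p_right = 1.0 - poisson_binomial_cdf(k - 1, pds)  # P(X >= k), cf. eq. (4.7)
p_left = poisson_binomial_cdf(k, pds)             # P(X <= k), cf. eq. (4.8)
print(p_left, p_right)
```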

4.1.3 Traffic Light Approach

The traffic light approach is applied when reporting the quality of the PD model [2]. There are three colors, red, orange, and green, with thresholds set in advance. The model validator can see the potential issues of a model and then perform a further investigation according to the output warning signals. Note that the results of this traffic light approach should not be seen as the definitive conclusion about the calibration of the internal rating system, but as a direction for further investigation. The traffic light indicators are shown in Table 3.

Table 3. The traffic light indicator of both the binomial test and Poisson binomial test

Green: PD predictions and ODF for the bucket are not significantly different at an alpha of 10%.
Orange: PD predictions and ODF for the bucket are significantly different at an alpha of 10%, but not at an alpha of 0.2%.
Red: PD predictions and ODF for the bucket are significantly different at an alpha of 0.2%.

An alpha value of 0.05 is commonly used for a hypothesis test. In the previous version of the binomial test in the backtest, two one-sided tests were conducted, both with a significance level of 0.05, one for optimism and the other for conservatism. Thus, the combined alpha value is 0.1, which is taken as the threshold for the orange traffic light. A warning signal whenever the predicted PDs fall outside the 90% confidence interval would sometimes come too early. Therefore, the red traffic light is introduced with an alpha value of 0.1% for the one-sided tests, yielding a combined alpha value of 0.2%, which is very conservative. The red traffic light indicates serious errors found during validation.
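A small sketch of how these thresholds could be applied to the p-value of a two-sided backtest in practice; this is a simplified mapping for illustration, not Rabobank's actual reporting logic.

```python
def traffic_light(p_value):
    """Map a two-sided backtest p-value to the traffic-light colors of Table 3
    (thresholds as described in the text: 10% for orange, 0.2% for red)."""
    if p_value <= 0.002:
        return "red"       # significantly different at an alpha of 0.2%
    elif p_value <= 0.10:
        return "orange"    # significant at 10%, but not at 0.2%
    return "green"         # not significantly different at an alpha of 10%

print(traffic_light(0.03))   # -> "orange"
```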

4.2 Discriminatory Power

The discriminatory power of a PD model indicates whether or not the model can distinguish defaulters from non-defaulters. In BCBS (2005a) [4], discriminatory power is defined as the ‘ability to discriminate ex-ante between defaulting and non-defaulting borrowers’, where the term ‘ex ante’ means ‘in advance’. The discriminatory power indicates the performance of the whole internal rating system, while the correctness of calibration can be used to see whether a PD prediction model has correctly assigned clients to credit rating grades.

There are multiple methods for discriminatory power testing. The most commonly used validation method is the Cumulative Accuracy Profile (CAP) and its summary statistic Accuracy Ratio (AR) which is also known as Gini Index or Powerstat. Another method with a similar idea is the Receiver Operating Characteristic (ROC). This section is mainly based on Dirk Tasche [12] and Bernd and Evelyn [13].

4.2.1 Cumulative Accuracy Profile (CAP)

In the form of a figure, the Cumulative Accuracy Profile (CAP) intuitively and concisely describes the quality of an internal rating system. A perfect internal rating system properly assigns defaulters to the lower credit quality rating classes. Suppose that there are 10 performing credit buckets in an internal rating system {𝑅0, 𝑅1, … , 𝑅9}1 with decreasing credit quality. The performance of all clients in the next period (one year) is monitored. At the end of that period, the defaulted clients are called defaulters (D), and the clients that stay performing are the non-defaulters (ND). Let $p_D^i$, $i \in \mathbb{N}$, $0 \le i \le 9$, denote the proportion of defaulters in credit rating class $R_i$. These proportions sum up to 1, that is,

$$\sum_{i=0}^{9} p_D^i = 1.$$

Similarly, we can define the other two proportions $p_{ND}^i$ and $p_T^i$, $i \in \mathbb{N}$, $0 \le i \le 9$, in the same way, where $p_T^i$ denotes the proportion of all clients (defaulters and non-defaulters) in class $R_i$. Note that the proportions $p_D^i$, $p_T^i$ and $p_{ND}^i$ are not the ‘true’ proportions; they are all computed based on the predictions of the chosen PD prediction model.

Given the average observed default rate $\pi$, the fraction of the total number of defaulters over the total number of clients, we can easily deduce [13]

$$p_T^i = \pi\, p_D^i + (1 - \pi)\, p_{ND}^i. \tag{4.10}$$

The discrete empirical cumulative distribution functions $F_T(\cdot)$, $F_D(\cdot)$ and $F_{ND}(\cdot)$ can be defined by summing up the proportions above, that is,

$$F_T(k) = \sum_{j=0}^{k} p_T^j, \quad k = 0, 1, \ldots, 9, \tag{4.11}$$

$$F_D(k) = \sum_{j=0}^{k} p_D^j, \quad k = 0, 1, \ldots, 9, \tag{4.12}$$

$$F_{ND}(k) = \sum_{j=0}^{k} p_{ND}^j, \quad k = 0, 1, \ldots, 9, \tag{4.13}$$

where $F_T(k) = \mathbb{P}(S_T \le k)$ is the probability that a client has a credit grade no greater than $R_k$.

Then, the CAP function [12] is

$$CAP(u) = F_D(F_T^{-1}(u)), \quad u \in (0,1). \tag{4.14}$$

The Cumulative Accuracy Profile (CAP) is defined as the curve connecting all points $(F_T(k), F_D(k))$, $k \in \mathbb{N}$, $0 \le k \le 9$, or $(u, CAP(u))$, $u \in (0,1)$, by linear interpolation [13]. The former expression in terms of CAP points can still be used when the cumulative distribution function $F_T(\cdot)$ is invertible. An example of a CAP is shown in Figure 4.

Figure 4. An example of Cumulative Accuracy Profile

As shown in Figure 4, the blue curve describes the performance of a perfect model that correctly predicts whether each performing client defaults or not; the dashed diagonal line indicates the performance of a random model with no forecasting power; the middle orange curve shows the performance of the model that we want to evaluate. Common sense says that a PD model applied by a bank cannot make perfect predictions but will provide more or less helpful information, so its performance curve normally lies between the worst and the best.

The process of plotting the CAP shown in Figure 4 can be elaborated with a simple case. Suppose that there are 5 clients, 3 of whom default (red) while the remaining 2 clients are still performing at the end of the next period: clients 1, 3, and 4 are the defaulters (D), and clients 2 and 5 are the non-defaulters (ND).

NO.              1    2    3    4    5
Default or not   D    ND   D    D    ND

We apply three PD prediction models with different quality and then obtain the forecasted default probabilities in Table 4.

Table 4. The predicted PDs in the example of CAP plotting

NO.                                          1     2     3     4     5
Default or not                               D     ND    D     D     ND
Predicted PD, perfect model (best quality)   0.9   0.4   0.6   0.8   0.1
Predicted PD, random model (worst)           0.9   0.7   0.6   0.1   0.2
Predicted PD, developed model                0.9   0.6   0.8   0.3   0.2

After sorting these predicted default probabilities, we obtain the ordered client sequences in Table 5.

Table 5. The sorted predicted PDs in the example of CAP plotting

Client order, perfect model (best quality):   1   4   3   2   5
Client order, random model (worst):           1   2   3   5   4
Client order, developed model:                1   3   2   4   5

As shown in Table 5, a perfect model assigns the actual defaulters to the lowest rating classes, putting the red cells (defaulters) on the left side and the green cells (non-defaulters) on the right side, while a model with zero information cannot distinguish the actual defaulters from the non-defaulters, so the red and green cells are spread randomly. The PD prediction models applied in the financial industry only give limited predictions, placing some cells on the wrong side, as in the third row of Table 5.

In order to draw the CAP lines, we set a threshold on the probabilities in Table 4, above which clients are considered positive. The value on the x-axis is then defined as the fraction of positive samples among all clients. The value on the y-axis is correspondingly defined as the fraction of defaulters that are classified as positive among all defaulters. In this way, the points can be calculated for decreasing thresholds.

Table 6. Computing the coordinate points of CAP.

Perfect model:
  Threshold   0.85   0.70   0.50   0.30   0
  x-axis      0.20   0.40   0.60   0.80   1
  y-axis      0.33   0.67   1.00   1.00   1.00

Developed model:
  Threshold   0.85   0.70   0.50   0.25   0
  x-axis      0.20   0.40   0.60   0.80   1
  y-axis      0.33   0.67   0.67   1.00   1.00

Random model:
  Threshold   0.80   0.65   0.50   0.15   0
  x-axis      0.20   0.40   0.60   0.80   1
  y-axis      0.33   0.33   0.67   0.67   1.00

Connecting all the (x, y) points in Table 6, we can see that the line for the perfect model rises to 1 quickly, like the blue line in Figure 4. Since the worst model assigns clients to credit buckets randomly, its CAP line is almost diagonal, like the dashed line in Figure 4. By contrast, the CAP line of the real-life model also tends towards 1, but at a lower speed than that of the perfect model, so the developed model's CAP curve lies between the two extremes.
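To make the construction of Table 6, and the Accuracy Ratio discussed in the next subsection, concrete, here is a small Python sketch. It uses the five clients of the example above and the area-based definition of the AR; it is only an illustration, not the validation code used at the bank.

```python
import numpy as np

def cap_curve(pd_pred, defaulted):
    """CAP points for one model: sort clients by predicted PD (descending);
    x = cumulative fraction of all clients, y = cumulative fraction of the
    defaulters captured so far, as in Table 6."""
    order = np.argsort(-np.asarray(pd_pred))
    d = np.asarray(defaulted)[order]
    x = np.concatenate(([0.0], np.arange(1, len(d) + 1) / len(d)))
    y = np.concatenate(([0.0], np.cumsum(d) / d.sum()))
    return x, y

def accuracy_ratio(pd_pred, defaulted):
    """AR = a_r / a_p: area between the model CAP and the diagonal, divided by
    the area between the perfect CAP and the diagonal (cf. eq. (4.15))."""
    pi = np.mean(defaulted)                                      # default fraction
    x, y = cap_curve(pd_pred, defaulted)
    a_model = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2) - 0.5    # trapezoid rule
    a_perfect = 0.5 * (1.0 - pi)
    return a_model / a_perfect

# Worked example with the five clients of Table 4 (defaulters: 1, 3, 4).
defaulted = [1, 0, 1, 1, 0]
print(accuracy_ratio([0.9, 0.6, 0.8, 0.3, 0.2], defaulted))   # developed model
print(accuracy_ratio([0.9, 0.4, 0.6, 0.8, 0.1], defaulted))   # perfect model -> 1.0
```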

4.2.2 Accuracy Ratio (AR)

The information given by the CAP figure can be summarized by the Accuracy Ratio (AR), which is also known as the Gini coefficient or Powerstat. In Figure 4, the notation $a_p$ represents the area between the CAP of the perfect model and the CAP of the random model, and $a_r$ is the area between the CAP of the model under evaluation and the CAP of the random model. The ratio of $a_r$ and $a_p$ is defined as the Accuracy Ratio [13], that is,

$$AR = \frac{a_r}{a_p}. \tag{4.15}$$

It can also be written analytically in terms of the CAP function [12]:

$$AR = \frac{2 \int_0^1 CAP(u)\,du - 1}{1 - p}, \tag{4.16}$$

where $p$ represents the fraction of defaulters.

The AR of a random model is 0, while the AR of the perfect model is 1. Since the developed model has a discriminatory power between these two extremes, its AR is a fraction in the range of 0 to 1. From Figure 4 we can conclude that the larger the AR score, the closer the CAP of the developed model is to the CAP of the perfect model, which means the discriminatory power of the developed internal rating system is higher.

In Table 4, we can see that some of the predicted PDs are very large, such as 0.9 and 0.8. However, in the real world, through risk management the bank can partly avoid the losses produced by defaults. For instance, the bank assesses the credit risk of obligors before signing the contract and refuses those assigned to the higher credit grades. As a result, the predicted PDs of clients who are already in the portfolios are below a given threshold, such as 0.4. Moreover, the default of an obligor is assumed to be a Bernoulli random variable, since the credit quality that banks are trying to monitor takes the form of a probability instead of a specific outcome. Therefore, the AR of a PD prediction model is stochastic. We can run a Monte Carlo experiment based on the modeled PDs, repeatedly drawing defaults and calculating the AR each time. In this way, we obtain a predicted AR and its confidence interval.
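A possible sketch of this Monte Carlo experiment, reusing the accuracy_ratio helper from the CAP sketch above; the portfolio size, the log-normal shape of the PD distribution, and the 0.4 cap are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_ar(pd_pred, n_sim=1000):
    """Treat each default as a Bernoulli draw with the modeled PD, recompute
    the AR for every simulated default vector, and summarize the distribution."""
    pd_pred = np.asarray(pd_pred)
    ars = []
    for _ in range(n_sim):
        defaults = rng.random(pd_pred.size) < pd_pred        # Bernoulli draws
        if 0 < defaults.sum() < defaults.size:                # AR undefined otherwise
            ars.append(accuracy_ratio(pd_pred, defaults))     # helper defined above
    ars = np.array(ars)
    return ars.mean(), np.percentile(ars, [5, 95])            # mean and 90% interval

# Illustrative portfolio: 2,000 clients with predicted PDs capped at 0.4.
pds = np.clip(rng.lognormal(mean=-3.0, sigma=1.0, size=2000), 0.001, 0.4)
print(bootstrap_ar(pds))
```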

In addition, the AR is portfolio dependent [14][15][16]. Due to stochastic factors, we will hardly ever find an actual model with an AR extremely close to 1. Thus, the ‘perfect’ AR (Gini)2, which is the Gini obtained when the fraction of defaulters in every bucket of the internal rating system is exactly that bucket's expected PD, can be seen as the benchmark for the perfect model once stochastic factors are taken into account. Based on the bucket PDs of the internal rating system, we can repeat the Monte Carlo experiments and obtain bootstrapped ARs, whose average will be close, but not always equal, to the ‘perfect’ AR. This is because the AR function is non-linear.

2 Note that this is not the model quality indicator used at Rabobank. At Rabobank, a traffic light approach with an orange light threshold of 40% is used to determine the discriminatory power of a model. However, for some portfolios even the perfect AR cannot reach 40%, so setting a ‘perfect’ AR benchmark is more reasonable.


In our simulation, we apply the bootstrapped average AR as the benchmark, because it takes the non-linearity of the Gini function into account.

4.2.3 Receiver Operating Characteristic (ROC)

The Receiver Operating Characteristic (ROC) is another technique to investigate the discriminatory power of an internal rating system. Its concept is similar to that of the CAP. Given the assumptions in Section 4.2.1, the ROC function [13] is defined as

$$ROC(u) = F_D(F_{ND}^{-1}(u)), \quad u \in (0,1). \tag{4.17}$$

By connecting all points $(u, ROC(u))$, $u \in (0,1)$, we obtain the ROC plot. If $F_{ND}(\cdot)$ is invertible, one can also plot all points $(F_{ND}(k), F_D(k))$, $k \in \mathbb{N}$, $0 \le k \le 9$, to obtain the ROC. In contrast to the CAP, plotting the ROC does not require PD estimates for all clients. The Area Under the Curve (AUC) is the associated summary measure of the Receiver Operating Characteristic; it describes the discriminatory power of the rating system by a single number. The AUC has a strong connection with the AR, which can be expressed as

$$AUC = \int_0^1 ROC(u)\,du = \frac{AR + 1}{2}. \tag{4.18}$$

Equation (4.18) is proved by Engelmann [17].

Since the ROC is better known in the artificial intelligence field and plays no role in this thesis, it will not be explained in further detail.

4.3 Chapter Summary

In this chapter, some quantitative validation methods were introduced. For the calibration quality test, we apply the binomial test on bucket level and the Poisson binomial test on model level. For the discriminatory power test, the Cumulative Accuracy Profile (CAP) and its summary measure AR were elaborated with an example. The traffic light approach was introduced as a standard benchmark for determining good models with these techniques. In addition, the concepts of the ‘perfect’ AR and the bootstrapped average AR were introduced to set a benchmark for the discriminatory power of a PD prediction model. In the next chapter, a simulated artificial bank will be set up, on which these validation methods will be implemented to see whether they are effective and efficient in distinguishing the good models from the bad ones.


5 Dataset Simulation and PD model Validation

In the previous chapters, we discussed the credit modeling regulations and introduced some validation methodologies. In this chapter, an artificial bank will be generated to check whether the validation approaches stated in Chapter 4 work or not. The procedure of dataset simulation and validation method implementation is as follows.

Step 1: Artificial internal rating system setup. First, we build an internal rating system that consists of 10 performing credit rating buckets. Each client can be mapped to one of the buckets in this internal rating system according to his or her PD. The bucketing system is defined by stating the PD boundaries. Then, a ‘true’ PD model is set up with 5 factors whose initial values are generated from normal distributions with different parameters. We also introduce 5 states that describe the situation of defaulted clients: first year in default (DY1), second year in default (DY2), third year in default (DY3), Cured (C), and Liquidated (Liq). The whole state space thus consists of 15 states. Sections 5.1.1 and 5.1.2 explain this step in detail.

Step 2: Credit grade migration rules setup. By adding drifts to the factors, the credit grades will change depending on where the ‘true’ PDs fall in the bucketing system. The ‘true’ credit rating grades are updated monthly, so the migration matrices also describe the credit rating transitions between two consecutive months. After 250 consecutive months, we obtain 249 migration matrices, and the convergence of the ‘true’ migration matrices is tested using the standard deviation of each cell. In this way, the ‘true’ information dataset of the artificial bank is established. Sections 5.1.3 and 5.1.4 explain this in detail.

Step 3: PD prediction models setup. Since in the real world it is hard for a financial institution to identify all the factors that somehow affect credit quality, the predicted credit grades will inevitably contain errors. To mimic the real-world situation, we assume that all the ‘true’ information is hidden, such as the ‘true’ PD model, the ‘true’ migration matrix, and the ‘true’ credit quality migration rules. Then, we build 5 PD models in the form of sigmoid functions based on different numbers of factors. These 5 PD models have different levels of predictive ability. Section 5.2 explains this in detail.

Step 4: Validation methods implementation. Intuitively, the fewer the factors, the worse the quality of the PD prediction model. By rerating clients using the PD models built in Step 3, we obtain 5 groups of forecasted PDs. By applying the validation methods introduced in Chapter 4 to test the prediction performance of these models, we can see whether these validation methods recognize the decreasing prediction power of the PD prediction models. Section 5.2 explains this in detail.

5.1 Dataset Simulation

In this section, we aim to build up the artificial bank with ‘true’ information, such as ‘true’ PDs and ‘true’ credit grade migration matrix.

In the real world, clients are assigned to credit buckets according to forecasted PDs, which are computed based on various factors, for instance, liquidity ratio, solvency, debt service coverage ratio, quality of management, business segment, history of defaults, etc. All factor values drift over time. Before considering the drifts, we need to simulate the initial factor values at the beginning of the first month. The initial factor values are generated from normal distributions with different parameters. By building the ‘true’ PD model and defining the PD boundaries of the bucketing system, we can label the credit grades of all clients. From a risk modeling and validation perspective, it is effective and efficient to assume that all clients in the same credit grade have the same PD, which is called the bucket PD. Then, drift functions are introduced to make the factor values change over time, making the credit rating grades drift as well. The ‘true’ PD model is in the form of a sigmoid function of a score based on these factors. This process is illustrated in Figure 5.

Figure 5. Simulate migrating PD and credit grades

In Figure 5, 𝑡 represents time; the times 𝑡 = 𝑖 and 𝑡 = 𝑖 + 1 are one month apart.

In section 5.1.1, the defined bucketing system and the approaches used to generate initial factor values are introduced. In section 5.1.2, the ‘true’ PD model for the artificial bank and the data pre-processing techniques for the ‘true’ PD model are determined. In section 5.1.3, the idea of selecting drift functions is elaborated and in section 5.1.4, the simulation of credit quality migration is introduced, and a ‘true’ migration matrix is shown at the end of this section.

5.1.1 Factor Values and Credit Rating System Setup

As part of the basic setup of this artificial bank, the performing buckets in the internal rating system are based on the PD. There are 10 performing buckets in the system.³ Suppose that there is a total of 5 factors that affect the PD, namely ‘Factor 1’ to ‘Factor 5’. The initial values of these factors are assumed to follow normal distributions with different parameters, which are shown in Table 7. Note that these distributions apply only to the initial values and not to the drifted factor values, which are calculated by the given drift functions: every factor value follows its normal distribution at the beginning, but once it starts drifting, its distribution changes. At the beginning of the first month, all clients are performing; the first defaults are observed at the end of the first month.

³ Note that Rabobank has its own proprietary bucketing system, which differs from the bucketing system in this thesis.

Table 7. The distributions of initial values (𝑡 = 0) of all factors

Factor    Distribution
1         N(1, 1)
2         N(2, 2.5)
3         N(3, 4)
4         N(4, 5.5)
5         N(5, 7)
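For illustration, the initial factor values of Table 7 could be drawn as in the following sketch (assuming the second parameter of N(·,·) is the standard deviation; the portfolio size and random seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(seed=42)    # seed fixed only for reproducibility
n_clients = 10_000                      # illustrative portfolio size

# Table 7, read as N(mean, standard deviation), one column per factor
factor_params = [(1, 1), (2, 2.5), (3, 4), (4, 5.5), (5, 7)]
X0 = np.column_stack([rng.normal(mu, sd, n_clients) for mu, sd in factor_params])
# X0[c, i] is the initial value of Factor i+1 for client c at t = 0
```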

Then, a ‘true’ PD model is applied to compute the PD of every client. The resulting PDs range from 0 to 1. To control default risk, the artificial bank refuses clients whose PD is higher than 0.4 to join a portfolio. The bucketing system is defined by a set of PD intervals. From a risk management perspective, it is more concise to assume that all clients in the same bucket have the same bucket PD, which we simply set to the midpoint of each PD interval. These bucket PDs are used to simulate the defaulters and non-defaulters at the end of every month. Note that the bucketing system only denotes the credit quality of performing clients, not of defaulters; the credit states of defaulters are added to the state space in section 5.1.4. The bucketing system is shown in Table 8.

Table 8. The bucketing system of simulated artificial bank

Credit bucket   𝑅0          𝑅1           𝑅2           𝑅3           𝑅4
PD interval     [0, 0.04)   [0.04, 0.08) [0.08, 0.12) [0.12, 0.16) [0.16, 0.20)
Bucket PD       0.02        0.06         0.10         0.14         0.18

Credit bucket   𝑅5           𝑅6           𝑅7           𝑅8           𝑅9
PD interval     [0.20, 0.24) [0.24, 0.28) [0.28, 0.32) [0.32, 0.36) [0.36, 0.40)
Bucket PD       0.22         0.26         0.30         0.34         0.38

In this bucketing system, the higher the credit rating grade, the lower the credit quality. The numbering of the buckets follows the value of the PDs rather than the intuitive notion of credit quality: for instance, the highest grade 𝑅9 contains the clients with a bucket PD of 0.38, the largest PD accepted by the simulated artificial bank. In the Rabobank credit bucketing system there exists a riskless bucket 𝑅0, containing clients who will not default in the period after joining the portfolio. This is not the case in this artificial bank: in Table 8, clients in 𝑅0 can still default.
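A minimal sketch of the bucketing rule in Table 8 (names are illustrative, and clients with a ‘true’ PD of 0.4 or higher are assumed to have been refused already):

```python
import numpy as np

# Bucket boundaries of Table 8: R0 = [0, 0.04), ..., R9 = [0.36, 0.40)
BOUNDARIES = np.linspace(0.0, 0.40, 11)               # 0.00, 0.04, ..., 0.40
BUCKET_PD = (BOUNDARIES[:-1] + BOUNDARIES[1:]) / 2    # 0.02, 0.06, ..., 0.38

def to_bucket(true_pd):
    """Map a 'true' PD in [0, 0.4) to its bucket index: 0 for R0, ..., 9 for R9."""
    return int(np.digitize(true_pd, BOUNDARIES[1:-1]))

bucket = to_bucket(0.17)                               # -> 4, i.e. bucket R4
print(bucket, round(float(BUCKET_PD[bucket]), 2))      # 4 0.18
```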

The bucket PD increases in increments of 0.04, which is large enough to enable the HMM to distinguish these states.

5.1.2 ‘True’ Credit Rating Model

This ‘true’ PD model describes the real probability of default in this artificial world. Such a model is invisible in the real world, since there are many factors that banks do not know or have not taken into account. As illustrated in Figure 5, for this artificial bank the ‘true’ PD model is set to be a sigmoid model:

\[ PD = \frac{1}{1 + e^{-(\beta \cdot X + \beta_0)}}, \tag{5.1} \]

where $X$ is the matrix of all factor values, $\beta$ is the sigmoid parameter vector and $\beta_0$ is the intercept. Since the artificial bank refuses clients with a PD over 0.4, the score must satisfy $1/(1+e^{-(\beta \cdot X + \beta_0)}) \le 0.4$, i.e. $e^{-(\beta \cdot X + \beta_0)} \ge 1.5$, so that

\[ \beta \cdot X + \beta_0 \in (-\infty, -\log 1.5]. \tag{5.2} \]

We set $\beta_0 = -\log 1.5$ and $\beta = [0.1, 0.5, 1, 0.5, 0.25]$. To scale all factor values, we apply the transformation

\[ \mathrm{Transform}(\text{factor } i) = \frac{ub - lb}{1 + e^{-s(\text{factor } i + sh)}} + lb \in (lb, ub), \tag{5.3} \]

where ‘𝑢𝑏’ and ‘𝑙𝑏’ are the upper and lower bounds of the target interval in which the transformed factor values lie, and ‘𝑠’ and ‘𝑠ℎ’ are the steepness and shift, both of which are parameters set by ourselves. The parameter values used in this case are given in Table 9.

Table 9. The parameters chosen for scaling factor values

       Factor 1   Factor 2   Factor 3   Factor 4   Factor 5
ub                     0.13 (all factors)
lb                     -3 (all factors)
s      4          0.5        0.4        0.2        1.5
sh     -1         2          3          6          -1.5

The parameter values in Table 9 are selected to make the distribution of the output ‘true’ PDs fit reality: most clients are in the middle credit rating classes, such as 𝑅3 to 𝑅5, and the number of clients decreases as the credit rating moves up or down from there. Based on the initial factor values, the histogram of initial credit rating grades is shown in Figure 6; it looks similar to a normal distribution.
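Putting Eq. (5.3) and Eq. (5.1) together, the ‘true’ PD could be computed as in the following sketch (a simplified reading under our own assumptions: the factor values are first scaled by the transform and then fed into the sigmoid, and ub = 0.13 and lb = −3 are taken to hold for all five factors):

```python
import numpy as np

BETA  = np.array([0.1, 0.5, 1.0, 0.5, 0.25])   # sigmoid weights beta
BETA0 = -np.log(1.5)                            # intercept beta_0
UB, LB = 0.13, -3.0                             # scaling bounds (Table 9)
S  = np.array([4.0, 0.5, 0.4, 0.2, 1.5])        # steepness per factor (Table 9)
SH = np.array([-1.0, 2.0, 3.0, 6.0, -1.5])      # shift per factor (Table 9)

def transform(X):
    """Eq. (5.3): squash raw factor values (columns of X) into (LB, UB)."""
    return (UB - LB) / (1.0 + np.exp(-S * (X + SH))) + LB

def true_pd(X):
    """Eq. (5.1): sigmoid of the weighted score of the scaled factor values."""
    score = transform(X) @ BETA + BETA0
    return 1.0 / (1.0 + np.exp(-score))

# true_pd(X0) would give the 'true' PD of every client at t = 0
```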


Whether a client defaults or not is described by a Bernoulli-distributed random variable whose parameter is the corresponding bucket PD (Table 8). We can now obtain the dataset of the artificial bank for the first month; part of this dataset is shown in Table 10.

Table 10. Part of the dataset of the artificial bank for the first month

In this dataset (Table 10), we can see the difference between the ‘true’ PDs and the bucket PDs. All defaulters are denoted by 1 and all non-defaulters by 0, as shown in the column named ‘𝑑𝑒𝑓𝑎𝑢𝑙𝑡_𝑜𝑟_𝑛𝑜𝑡’; ‘𝑜𝑙𝑑_𝑐𝑟𝑒𝑑𝑖𝑡_𝑟𝑎𝑡𝑖𝑛𝑔’ describes a client’s credit grade in the previous month, and for the dataset of the first month the number 1000 means that the client is new to the artificial bank. All clients are numbered by ‘𝑐𝑙𝑖𝑒𝑛𝑡𝑠_𝑖𝑑’ so that the migration histories of their credit quality can be traced.

Note that based on factor values we can only give a probability of default rather than an exact prediction of default or not. Defaults already occur during the first month. The credit transitions of defaulters are further explained in section 5.1.4.
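The monthly default indicator is thus a single Bernoulli draw per client with the bucket PD as success probability, for example (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
# bucket_pd holds each performing client's bucket PD from Table 8
bucket_pd = np.array([0.02, 0.14, 0.38])          # three illustrative clients
default_or_not = rng.binomial(1, bucket_pd)       # 1 = defaulted this month, 0 = performing
```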

5.1.3 Drift Functions

After generating all the factor values for the very first month, drift functions are introduced to create movements in those factor values, based on which clients start to migrate between credit grades. We interpret the factor values as 5 different stochastic processes $\{X_t^{(i)}\}$, $i = 1,\dots,5$, where $i$ denotes the factor number and $t$ denotes time. The increments of the factor values with respect to time are determined by the stochastic differential equations shown in Table 11. Here $\theta^{(i)}$, $i = 2,\dots,5$, are self-chosen constants that differ per factor; they are the means of the respective stochastic processes. $K^{(i)}$, $i = 1,\dots,5$, is a scaling parameter controlling the speed of centralization; $\sigma^{(i)}$, $i = 1,\dots,5$, is the volatility of each factor; $W$ is a Brownian motion with $dW_t \sim N(0, dt)$, where $dt$ represents the increment of time. In this case $dt = 1$, meaning that clients are rerated every month.

Table 11. The stochastic differential equation in terms of every factor

Factor     SDE                         Equation
Factor 1   Dothan                      $dX_t^{(1)} = \sigma^{(1)} X_t^{(1)}\, dW_t$
Factor 2   CIR                         $dX_t^{(2)} = K^{(2)}(\theta^{(2)} - X_t^{(2)})\, dt + \sigma^{(2)} \sqrt{|X_t^{(2)}|}\, dW_t$
Factor 3   Vasicek                     $dX_t^{(3)} = K^{(3)}(\theta^{(3)} - X_t^{(3)})\, dt + \sigma^{(3)}\, dW_t$
Factor 4   Longstaff                   $dX_t^{(4)} = K^{(4)}(\theta^{(4)} - \sqrt{|X_t^{(4)}|})\, dt + \sigma^{(4)} \sqrt{|X_t^{(4)}|}\, dW_t$
Factor 5   Geometric Brownian Motion   $dX_t^{(5)} = \theta^{(5)} X_t^{(5)}\, dt + \sigma^{(5)} X_t^{(5)}\, dW_t$

Real-life credit quality migration matrices show that the rating grade of most clients in a portfolio is stable: the largest probability is that of remaining in a grade rather than migrating. Among the migrations that do occur, clients with relatively extreme grades, such as 𝑅0 and 𝑅9, are more likely to move towards the center, such as 𝑅5. Given this, we determine the parameters of the SDEs in Table 11 by treating the factors of low-risk clients and the factors of high-risk clients separately. The chosen values are given in Table 12.

Table 12. The parameters chosen for stochastic differential equations of every factor

            𝑅0–𝑅4 (Factor 1 to 5)          𝑅5–𝑅9 (Factor 1 to 5)
K^(i)       [NaN  0.01  0.05  0.5   0.05]   [NaN  0.1   0.1   0.15  0.05]
θ^(i)       [NaN  3.35  1.6   3     2   ]   [NaN  3.35  1.2   2.5   2   ]
σ^(i)       [0.05 0.3   0.3   0.2   0.1 ]   [0.05 0.3   0.3   0.2   0.1 ]

As Table 12 shows, the speed of centralization is restricted by keeping the $K^{(i)}$, $i = 1,\dots,5$, small, and the means of the factor processes are chosen such that clients tend to move towards the center. Figures 7 to 11 show example paths of stochastic processes, on a monthly grid, whose increments follow the respective stochastic differential equations of Table 11; a small simulation sketch is given after these figures.

Figure 7. The stochastic process based on Dothan SDE

Figure 8. The stochastic process based on Vasicek SDE
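To illustrate how the drifts of Table 11 with the parameters of Table 12 might be simulated on the monthly grid (dt = 1), the sketch below uses a simple Euler–Maruyama discretisation with the low-risk (𝑅0–𝑅4) parameter set; the discretisation scheme and all names are our own choices, not necessarily those used to produce Figures 7 to 11.

```python
import numpy as np

rng = np.random.default_rng(1)
DT = 1.0                                            # one month per step

# Parameter set for clients currently in R0-R4 (Table 12); Factor 1 has no drift parameters
K     = np.array([np.nan, 0.01, 0.05, 0.5, 0.05])
THETA = np.array([np.nan, 3.35, 1.6, 3.0, 2.0])
SIGMA = np.array([0.05, 0.3, 0.3, 0.2, 0.1])

def drift(X):
    """Drift terms of Table 11, factor by factor (columns of X)."""
    d = np.zeros_like(X)
    d[:, 1] = K[1] * (THETA[1] - X[:, 1])                       # CIR
    d[:, 2] = K[2] * (THETA[2] - X[:, 2])                       # Vasicek
    d[:, 3] = K[3] * (THETA[3] - np.sqrt(np.abs(X[:, 3])))      # Longstaff
    d[:, 4] = THETA[4] * X[:, 4]                                # geometric Brownian motion
    return d                                                    # Dothan (Factor 1): no drift

def diffusion(X):
    """Diffusion coefficients of Table 11, multiplied elementwise with dW below."""
    g = np.empty_like(X)
    g[:, 0] = SIGMA[0] * X[:, 0]                                # Dothan
    g[:, 1] = SIGMA[1] * np.sqrt(np.abs(X[:, 1]))               # CIR
    g[:, 2] = SIGMA[2]                                          # Vasicek
    g[:, 3] = SIGMA[3] * np.sqrt(np.abs(X[:, 3]))               # Longstaff
    g[:, 4] = SIGMA[4] * X[:, 4]                                # GBM
    return g

def monthly_step(X):
    """One Euler-Maruyama step: X_{t+1} = X_t + drift*dt + diffusion*dW, dW ~ N(0, dt)."""
    dW = rng.normal(0.0, np.sqrt(DT), size=X.shape)
    return X + drift(X) * DT + diffusion(X) * dW
```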
