
A Multiple Testing Correction Approach to PD Validation

Master’s Thesis in Economics

Author: Romans Daukuls (s1658026)
Supervisors: dr. Christoph H. Hanck, prof. dr. Ruud H. Koning

July 2012, Groningen


A Multiple Testing Correction Approach to PD Validation

By Romans Daukuls

The validation of probability of default models has become increasingly important for banks and supervisory authorities with the introduction of the revised international capital framework, also known as Basel II. Several PD validation tests have been proposed and discussed in the literature, but there is no consensus about a universal methodology that could be used to validate PD models.

In this paper, we set out to investigate whether using several PD validation tests in combination with a multiple testing correction procedure proposed by Godfrey (2005) could lead to a better control of type I errors and to lower type II errors. Based on the results of our simulation study, we conclude that the type I error control of the Godfrey approach is comparable to that of the individual tests based on the bootstrap method. In terms of power, the Godfrey approach performs somewhat worse than the test with the best power characteristic for a given power study scenario, but provides the benefit of diversification: the Godfrey approach is less sensitive to the misalignments in the underlying individual tests than the Holm-Bonferroni multiple testing correction procedure, and may well be the preferred validation method when the individual tests perform differently under different circumstances.

JEL: C63, G17, G28

Keywords: probability of default, risk model validation, multiple testing, overall significance, Basel II risk parameters

I. Introduction

With the introduction of the Basel II capital adequacy accord, banks have gained more flexibility in designing their own credit risk management framework. In particular, if a bank follows the internal ratings-based approach outlined in the Basel II framework, it is allowed to use its own credit risk estimates to calculate the minimum capital requirements (BCBS, 2005). Consequently, it is important for banks and supervisory authorities to be able to assess the validity of banks' credit risk estimates correctly in order to ensure an efficient allocation of capital.

In the Basel II regulatory framework, the three key risk parameters are: probability of default (PD), loss given default (LGD) and exposure at default (EAD). We will focus on the validation of the probability of default estimates.*

*I wish to thank my supervisors for helpful suggestions and patience, my parents for their support and understanding, and all those people who post their questions and solutions in forums for sharing their knowledge.

Typically, a PD estimate is an output of a statistical model employed by a bank; a PD estimate can be obtained directly by using binary choice models (for an example, see Medema, Koning and Lensink, 2009).

PD model validity can be assessed along two lines: discrimination and calibration. Discrimination is the ability of a model to identify the defaulting borrowers ex-ante. Calibration measures the accuracy of the forecast PDs: for example, if k borrowers share the same PD π, the ex-post number of defaulters should be close to k × π. To validate a PD model, a number of statistical tests have been proposed in the literature. They often rely on the restrictive assumption of independence and are valid only asymptotically. If these assumptions are relaxed, it can be challenging, if not impossible, to derive the distribution of a test statistic analytically. Resampling methods can be a useful tool to address these problems, as demonstrated in Engelmann and Rauhmeier (2006).

Another problem arises when using several tests to determine the validity of a model: the multiple testing issue. This situation seems to be quite common in the practice of risk management: one would like to assess both the discrimination and the calibration property of a model, and validate each rating grade produced by the rating system (there must be at least eight rating grades under the Basel II framework requirements). Moreover, Engelmann and Rauhmeier (2006) suggest using several test statistics to assess a PD model, since these tests perform differently in terms of power under different circumstances, and there appears to be no single "best" test. However, using several tests to evaluate a model requires a multiple testing correction.

To illustrate the point, consider two independent tests τ_1 and τ_2, both with the true significance level α = 0.05. The probability that at least one test rejects the true model is 1 − 0.95² = 0.0975, which is considerably higher than α. Hence, if the decision maker rejects the model whenever at least one of the tests rejects it, she may reject a good model too often. This type I error "distortion" increases if more tests are added, and decreases if the correlation between tests increases. By applying a multiple testing correction, such as the Bonferroni or Holm-Bonferroni correction (see Holm, 1979), we make sure that the actual overall significance level is less than or equal to the desired overall significance level, with equality holding when the tests are independent. However, if the tests are dependent, the actual overall type I error is unknown. Since the estimated PDs are used to compute the capital requirements, banks and supervisory authorities are interested in a precise control of the type I error.
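
To make the arithmetic concrete, the following sketch (illustrative code of our own, not part of the thesis) computes the family-wise error rate for m independent tests and applies the Holm-Bonferroni step-down rule to a vector of p-values:

```python
# An illustrative sketch: family-wise error rate of m independent
# level-alpha tests, and the Holm-Bonferroni step-down rule.
import numpy as np

def fwer(alpha, m):
    """P(at least one of m independent level-alpha tests rejects a true null)."""
    return 1.0 - (1.0 - alpha) ** m

def holm_reject(pvals, alpha=0.05):
    """Boolean array: which hypotheses the Holm step-down procedure rejects."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] < alpha / (m - rank):
            reject[idx] = True
        else:
            break                        # stop at the first non-rejection
    return reject

print(fwer(0.05, 2))                       # 0.0975, as in the text
print(holm_reject([0.010, 0.030, 0.200]))  # [ True False False]
```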

Multiple testing correction procedures can be costly in terms of power: controlling the overall type I error may come at the cost of an increased type II error (Benjamini and Hochberg, 1994). Thus, if the null hypothesis is "the model is correct", and our criterion for rejecting the null is too stringent, the risk is that we may fail to detect a misspecified model too often. This is, of course, an undesirable situation for banks, and even more so for supervisory authorities. Therefore, it is important to consider both type I and type II errors when deciding which methodology to use to validate a model.

Godfrey (2005) proposed a methodology to control the overall significance level of tests by using the double bootstrap, and applied it to least squares diagnostic tests. In this paper, we use this methodology to control the overall significance level of the PD validation tests. We set up a simulation study to investigate how well the approach proposed by Godfrey (2005) (we will call it the Godfrey approach from now on) performs in validating a PD model. In particular, we will compare how the Godfrey approach, the Holm-Bonferroni method, and the individual tests suggested in the literature perform in terms of type I and type II errors.

Our simulation study consists of the following elements. First, we need a statistical model for data generation and fitting. The model used in the data generating process (DGP) and the fitted model may differ, depending on the scenario used in the simulation. A general latent-variable PD model is described in Section II. Second, we use several common PD model validation tests to assess the validity of a given model. These tests are reviewed in Section III. Third, we compute the multiple test statistic and its confidence interval using the Godfrey approach, which is described in Section IV. We present the simulation study scenarios and the results in Section V. Finally, conclusions are drawn in Section VI.

II. A Latent-Variable PD Model

In this section, we present the latent-variable model that will be used to assess the size of the tests. From now on, we will call it the "null" model. Let Y_it denote the default indicator variable for an obligor i at time t; i = 1, ..., N and t = 1, ..., T. Y_it is equal to 1 if an obligor defaults and 0 otherwise. Assume that a default occurs when the latent variable X_it falls below a certain threshold (one can think of the latent variable as the difference between the assets and the liabilities of an obligor; without loss of generality, let the threshold be equal to zero).

Further, let Z_it be a risk factor that affects the latent variable. For simplicity, we will make the following assumptions regarding this risk factor:

1) for a given observation i, Z_i follows the multivariate normal distribution with a T-mean vector 0 and an identity variance-covariance matrix I_T;

2) for a given time period t, Z_t follows the multivariate normal distribution with an N-mean vector 0 and an identity variance-covariance matrix I_N;

3) the covariance between any Z_it and Z_jt′ is zero.

While these assumptions are not strictly necessary for model fitting, they are not restrictive and can be used for our purposes to simplify the simulation setup.

We assume that the variation in the latent variable X_it is caused partly by the aforementioned risk factor, and partly by an unobserved individual shock ε_it. Thus, we can write:

(1)   X_it = β_0 + β_1 Z_{i,t−1} + ε_it

Y_it = 1, if X_it < 0
Y_it = 0, if X_it ≥ 0

where β_0 and β_1 are population parameters.

Similar assumptions are made regarding ε_it:

1) for a given observation i, ε_i follows the multivariate normal distribution with a T-mean vector 0 and an identity variance-covariance matrix I_T;

2) for a given time period t, ε_t follows the multivariate normal distribution with an N-mean vector 0 and an identity variance-covariance matrix I_N;

3) the covariance between any ε_it and ε_jt′ is zero;

4) ε_it | Z ∼ N(0, 1).

Again, as we will see below, most of these assumptions are not necessary for model fitting, but they simplify the simulation setup.
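
As an illustration, a minimal simulation sketch of this "null" model might look as follows; the function name and seed are our own choices, and the parameter values anticipate those used in Section V.

```python
# A minimal simulation sketch of the "null" model under the stated
# assumptions: Z_it and eps_it are iid N(0,1), and a default occurs
# when the latent variable X_it falls below zero.
import numpy as np

rng = np.random.default_rng(0)

def simulate_null_model(N, T, beta0, beta1):
    Z = rng.standard_normal((N, T + 1))    # risk factor for periods 0..T
    eps = rng.standard_normal((N, T))      # idiosyncratic shocks
    X = beta0 + beta1 * Z[:, :T] + eps     # latent variable, Eq. (1), lagged Z
    Y = (X < 0).astype(int)                # default indicator
    return Y, Z

Y, Z = simulate_null_model(N=1000, T=4, beta0=1.601, beta1=0.75)
print(Y.mean())   # about Phi(-beta0 / sqrt(1 + beta1**2)) ~ 0.10
```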

Assume that a bank uses a PD model to estimate the probabilities of default, which are denoted by ŷ_it. The PD estimates are used to assign the borrowers into rating grades; each rating grade contains the borrowers whose PD estimates are within certain boundaries. Let c_{k−1} and c_k denote the PD bounds of the kth rating grade, k = 1, ..., K. We can then write

(2)   ŷ_it = P(ε_it ≤ −[β̂_0 + β̂_1 Z_{i,t−1}]) = Φ(−[β̂_0 + β̂_1 Z_{i,t−1}])

(3)   c_{k−1} < ŷ_it ≤ c_k ⇒ i ∈ G_k

where G_k denotes the kth rating grade, and β̂_0 and β̂_1 denote the estimates of the true parameters β_0 and β_1. The rating grade boundaries are also called the master scale. An example of constructing a master scale is given in Appendix A1.

Given the assumptions stated above, it is appropriate to fit a pooled probit model using the partial maximum likelihood method (see Wooldridge, 2002). For a consistent estimation of the parameters β_0 and β_1, this method only requires the following assumptions:

1) a functional form, as stated in Equation 1;

2) ε_it | Z_it ∼ N(0, 1).

The bank is also required to assign a pooled PD (PPD) to each rating grade, which we denote as π̂_{k,t}. We will assume that the PPD of a rating grade is calculated as the average of the individual PDs:

(4)   π̂_{k,t} = (1/N_{k,t}) Σ_{i ∈ G_k} ŷ_it

where N_{k,t} is the number of observations in the kth rating grade at time t.

In our paper, we will assume that a PD model together with a rating grade assignment philosophy form the internal rating system of a bank. Note that we have assumed that PDs and PPDs are assigned without consideration of a macroeconomic scenario. This corresponds to the point-in-time rating philosophy, which will be assumed throughout this paper. We are now in a position to present several common tests which can be used to assess internal rating systems.

III. PD Model Validation Tests

In this section, we review four common tests that are used to validate internal rating systems. We will focus on the one-period-ahead validation tests. Thus, if we have a dataset consisting of N obligors, T periods, and K rating grades, we fit a PD model to this N × T dataset to compute the PDs and the PPDs at T + 1. Once we observe the realized defaults at T + 1, we can compute the relevant test statistics. Since all the tests reviewed in this section are one-period-ahead tests, we will omit the time index.

In general, there are two dimensions along which internal rating systems are assessed (see BCBS, 2005): discrimination and calibration. Discrimination is the ability of a rating system to classify borrowers into defaulting and non-defaulting categories. A rating system differentiates well between defaulters and non-defaulters if more defaulters relative to non-defaulters are assigned to lower rating grades. Calibration measures the accuracy of the system: if a pooled probability of default is ex-ante assigned to a rating grade, the ex-post realized share of defaulters should match the probability of default of the rating grade.

Typically, the null hypothesis for the tests described below is that the fitted model accurately predicts the PDs and the PPDs: P(Y_i = 1) = ŷ_i ∀i and π_k = π̂_k ∀k. We can express these hypotheses in terms of the population parameters as follows:

(5)   P(Y_i = 1) = E(Y_i | β_0, β_1; Z_i) = Φ(−β_0 − β_1 Z_i) ∀i

(6)   ŷ_i = E(Y_i | β̂_0, β̂_1; Z_i) = Φ(−β̂_0 − β̂_1 Z_i) ∀i

(7)   β_0 = β̂_0 and β_1 = β̂_1 ⇐⇒ P(Y_i = 1) = ŷ_i ∀i

Similarly, for the PPDs we have:

(8)   π_k = (1/N_k) Σ_i P(Y_{i,k} = 1) ∀k

(9)   π̂_k = (1/N_k) Σ_i ŷ_{i,k} ∀k

(10)   β_0 = β̂_0 and β_1 = β̂_1 ⇐⇒ π_k = π̂_k ∀k

It is somewhat harder to see the link in Equation 10, but consider the following heuristic argument. If π_k = π̂_k ∀k, we have:

(11)   Σ_i [Φ(−β_0 − β_1 Z_{i,k}) − Φ(−β̂_0 − β̂_1 Z_{i,k})] = 0

(12)   Σ_i ∫_{a_i}^{b_i} e^(−t²/2) dt = 0

where a_i = −β_0 − β_1 Z_{i,k} and b_i = −β̂_0 − β̂_1 Z_{i,k}. If a_i ≠ b_i, the difference between a_i and b_i would be nonrandom, and hence, in general, the sum in Equation 12 would not be equal to zero.

Thus, the null hypothesis

(13)   H_0: β_0 = β̂_0, β_1 = β̂_1 and P(Y_i = 1 | Z_i) = Φ(−β_0 − β_1 Z_i)

is going to be used in the tests described below. Note that this hypothesis implies "perfect estimation": the fitted model is indistinguishable from the true model.

We are now ready to discuss several common PD model validation tests.

A. The Brier Score

The Brier score (BS) (also known as the Spiegelhalter test and the mean squared error) was originally proposed by Brier (1950) as a weather forecast verification method. BCBS (2005) suggested that the Brier score could be applied to evaluate the rating systems of banks. We attempt to extend this idea by using the fact that the Brier score can be decomposed into a calibration component and a discrimination component.

The Brier score is defined as

(14)   BS = (1/N) Σ_{k=1}^{K} Σ_{i=1}^{N_k} (π̂_k − y_{i,k})²

where K is the number of rating grades and N_k is the number of observations in the kth rating grade. As before, π̂_k denotes the pooled PD, and y_{i,k} takes on values 0 and 1 in the case of no default and in the case of default, respectively.

The Brier score can be decomposed as (see Appendix A2 for details):

(15)   BS = Σ_{k=1}^{K} ω_k [ (π̂_k − ȳ_k)² + ȳ_k (1 − ȳ_k) ]

where ω_k is the relative weight of the observations in the kth category, and ȳ_k is the actual share of defaulters in a given rating grade. The first term in brackets in Equation 15, (π̂_k − ȳ_k)², measures the calibration of the system: the larger the difference between the predicted and the actual frequency of defaults in a given risk bucket, the higher the score. The second element, ȳ_k (1 − ȳ_k), is related to the discriminatory power of the system: BS is lower if the variance of y_{i,k} within a given category is lower. Note that a lower Brier score means better performance of the rating system. It is also possible to decompose the Brier score into three components, measuring calibration, discrimination, and the minimum attainable Brier score for a given sample; see Engelmann and Rauhmeier (2006) for a detailed description and an analysis in the context of credit risk model validation.

The confidence interval of the Brier score can be computed using Spiegelhalter's method. We follow Redelmeier (1991) in explaining the basic properties of the Brier score test. Consider the Brier score component for the kth category in Equation A6 and, as before, let π_k be the true PPD in the kth category, as stated in Equation 8. Under the null hypothesis of a correct prediction (Equation 13), the expected value of the Brier score in the kth category is

(16)   μ_{BS,k} = (N_k/N) π_k (1 − π_k)

The expected value of the Brier score under the null hypothesis applied to each category is

(17)   μ_BS = Σ_{k=1}^{K} ω_k π_k (1 − π_k)

It can be shown that under the null hypothesis, the variance of BS_k is given by

(18)   σ²_{BS,k} = (N_k/N²) π_k (1 − π_k) (1 − 2π_k)²

Since the covariance between the Brier scores for different rating grades is zero under the null hypothesis (because the default realizations y_i are independent conditional on Z_i), the variance of BS is

(19)   σ²_BS = (1/N) Σ_{k=1}^{K} ω_k π_k (1 − π_k) (1 − 2π_k)²

See Bradley, Schwartz and Hashino (2007) for the derivation steps.

Under the assumption that y_1, ..., y_N are iid conditional on Z_1, ..., Z_N, the distribution of BS approximates the normal distribution as N → ∞:

(20)   Z_BS = (BS − μ_BS) / √(σ²_BS)
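
Putting the moments together, a sketch of the resulting test might look as follows. It reuses ppd, w, N and bs_direct from the decomposition sketch above, and treats the PPDs as the true π_k — an assumption of the sketch, not of the text.

```python
# A sketch of the Spiegelhalter test under the null of Equation (13),
# using the moments of Equations (17) and (19) and the statistic (20).
import numpy as np
from scipy.stats import norm

def spiegelhalter_test(bs, pi, w, N):
    mu = np.sum(w * pi * (1 - pi))                            # Eq. (17)
    var = np.sum(w * pi * (1 - pi) * (1 - 2 * pi) ** 2) / N   # Eq. (19)
    z = (bs - mu) / np.sqrt(var)                              # Eq. (20)
    return z, norm.sf(z)             # one-sided p-value 1 - Phi(Z_BS)

z_bs, p_bs = spiegelhalter_test(bs_direct, ppd, w, N)
```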

B. The Binomial Test

The binomial test can be used to assess the calibration property of a single rating grade. We follow Engelmann and Rauhmeier (2006) in explaining the basic properties of this test. Again, consider the kth rating grade and assume that the true probability of default in that grade is π_k. If we assume that the defaults are independent, the probability of observing d defaulters out of N_k borrowers is

(21)   P(N_{D,k} = d | N_k, π_k) = C(N_k, d) π_k^d (1 − π_k)^(N_k − d)

Our test statistic, N_{D,k} = Σ_{i=1}^{N_k} y_i, is the sum of independent Bernoulli trials and thus follows the binomial distribution. The confidence interval of N_{D,k} can be computed using the quantiles of the binomial distribution B(N_k, π_k); this is the exact binomial test. Alternatively, one could use the normal approximation to the binomial test. To normalize the test statistic, we use the following expressions:

(22)   μ_bin = π_k N_k

(23)   σ²_bin = N_k π_k (1 − π_k)

The standardized test statistic Z_bin approximates the normal distribution as N_k → ∞:

(24)   Z_bin = (N_{D,k} − μ_bin) / √(σ²_bin) ∼ N(0, 1)
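
For a single rating grade, both variants can be sketched as follows; the inputs below are hypothetical.

```python
# A sketch of the exact binomial test and its normal approximation.
import numpy as np
from scipy.stats import binom, norm

def binomial_test_grade(n_def, n_k, pi_k):
    """One-sided p-values for observing n_def defaults among n_k obligors."""
    p_exact = binom.sf(n_def - 1, n_k, pi_k)   # P(N_D >= n_def) under B(n_k, pi_k)
    z = (n_def - n_k * pi_k) / np.sqrt(n_k * pi_k * (1 - pi_k))   # Eq. (24)
    return p_exact, norm.sf(z)

print(binomial_test_grade(n_def=18, n_k=250, pi_k=0.05))
```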

Engelmann and Rauhmeier (2006) show that the standardized test statistic of the binomial test, Z_bin, is equal to the standardized test statistic of the Brier score for one category, Z_{BS,k}. Since the Brier score BS is the average of the category Brier scores (see Equation 15), the Brier score and the binomial test are related:

(25)   Z_BS = Σ_{k=1}^{K} Z_{bin,k} / √K

It is important to realize that the assumption regarding π_k being the true PPD of the kth rating grade is different from the null hypothesis in Equation 13. In the "null" model, each obligor has a different probability of default P(Y_i = 1), while in the case of the binomial test we assume that all the obligors in the kth rating grade share the same probability of default π_k. Hence, we expect some size distortion when we validate rating grades using the binomial test. A possible cure for this problem is considering the Poisson-binomial distribution instead of the binomial distribution to validate individual rating grades (for details, see Chen and Liu, 1997).

The Simultaneous Binomial Test

Since we would like to validate all the rating grades simultaneously, we have to consider the multiple testing issue. If we have K independent rating grades, and test the null hypothesis H_0: π_k = π̂_k ∀k at the individual nominal significance level α_1, the number of grade-wise rejections N_rej follows the binomial distribution:

(26)   P(N_rej = n_r | K, α_1) = C(K, n_r) α_1^(n_r) (1 − α_1)^(K − n_r)

The confidence interval for N_rej can be obtained in a similar fashion as for N_{D,k} described above.
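
A sketch of the simultaneous binomial test: count the grade-wise rejections at level α_1 and refer the count to B(K, α_1), per Equation 26. The p-values below are hypothetical.

```python
# A sketch of the simultaneous binomial (SBin) test.
import numpy as np
from scipy.stats import binom

def sbin_test(grade_pvals, alpha1=0.05):
    K = len(grade_pvals)
    n_rej = int(np.sum(np.asarray(grade_pvals) < alpha1))
    # one-sided p-value: probability of at least n_rej rejections by chance
    return n_rej, binom.sf(n_rej - 1, K, alpha1)

print(sbin_test([0.30, 0.01, 0.64, 0.02, 0.55, 0.71, 0.20, 0.44]))
```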

C. The Hosmer-Lemeshow Test

The Hosmer-Lemeshow test (also known as the χ² test) can be used to assess the calibration of all the rating grades simultaneously. The Hosmer-Lemeshow test statistic HL is the sum of the squared Z-statistics of the binomial test (see Equation 24):

(27)   HL = Σ_{k=1}^{K} (N_{D,k} − μ_{bin,k})² / σ²_{bin,k}

As before, we would like to test whether we have made an accurate prediction of the default rate for all rating grades; therefore, the null hypothesis is H_0: π_k = π̂_k ∀k. Under the assumption that the numbers of defaults across the rating grades are independent, HL is asymptotically chi-square distributed with K degrees of freedom. Hence, the confidence interval for the HL statistic can be found using asymptotic theory.

Clearly, the HL statistic is also related to the Brier score: if we look at the decomposition of the Brier score in Equation 15, we can see that HL is similar to the calibration element in the Brier score decomposition. Again, by assuming that the PD of each obligor in the kth rating grade is π_k, we deviate from the "null" model, just like in the case of the binomial test. Hence, a distortion in size is expected for the HL test as well.
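
A sketch of the HL statistic with its asymptotic chi-square p-value follows; the per-grade inputs are hypothetical.

```python
# A sketch of the Hosmer-Lemeshow statistic of Equation (27).
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(n_def, n_k, pi_k):
    """n_def, n_k, pi_k: per-grade defaults, obligor counts, and PPDs."""
    mu = n_k * pi_k                         # Eq. (22) per grade
    var = n_k * pi_k * (1 - pi_k)           # Eq. (23) per grade
    hl = np.sum((n_def - mu) ** 2 / var)    # Eq. (27)
    return hl, chi2.sf(hl, df=len(n_k))     # K degrees of freedom

n_def = np.array([4, 9, 14, 22, 30, 41, 55, 70])
n_k = np.full(8, 500)
pi_k = np.array([.01, .02, .03, .05, .06, .08, .11, .14])
print(hosmer_lemeshow(n_def, n_k, pi_k))
```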

D. The Area under the ROC Curve or the MWW Test

The receiver operating characteristic (ROC) is an instrument to assess the discriminatory power of a classifier. ROC analysis has many applications; for example, it is used in medicine to evaluate diagnostic tests and in machine learning to evaluate classification algorithms. For an introduction to ROC analysis, see Fawcett (2006).

In the case of risk modelling, we can think of a rating system as a classifier. Information about a borrower is collected, and on the basis of that information, the rating system assigns the borrower to the defaulting or the non-defaulting class. Hence, the classification problem is as follows: should the system assign the borrower to the lowest rating grade (defaulting) or to the higher rating grades (non-defaulting)?

As before, ŷ_i is the PD estimate of a risk model and Y_i is the actual state of nature (0 = no default and 1 = default). Next, we set a threshold c for the model output ŷ_i such that if ŷ_i > c, we classify the borrower as a defaulter:

(28)   ŷ_{c,i} = 1, if ŷ_i > c
       ŷ_{c,i} = 0, if ŷ_i ≤ c

where ŷ_{c,i} is the class assigned by the model with a given threshold c.

Now, if ŷ_{c,i} = y_i = 1, we have made a correct prediction and the borrower which we had expected to default actually did so. The share of correct default predictions, also called the true positive rate (TPR), is defined as

(29)   TPR = Σ_i I(ŷ_{c,i} = 1 | y_i = 1) / N_d

where N_d is the total number of actual defaults and I(·) is the indicator function.

If, however, we had predicted that a borrower would default, but in reality this default did not occur, we have rung a false alarm or, equivalently, made a false positive prediction. The false positive rate (FPR) is defined as

(30)   FPR = Σ_i I(ŷ_{c,i} = 1 | y_i = 0) / N_nd

where N_nd is the number of non-defaulters. If we plot TPR against FPR for all possible values of the threshold c, we get the ROC curve.

Clearly, if a classifier has a higher true positive rate given the same false positive rate, it performs better at this point. Using this reasoning, we can look at the TPR for every FPR. A perfect model would immediately identify all the defaulters without any false alarms. A random model would make as many correct predictions as incorrect ones; consequently, TPR and FPR would be equal for any threshold value. In reality, the ROC curve lies somewhere between these two extremes.

Hence, we can use the area under the ROC curve (AUC) to determine the discriminatory performance of a rating system. We can write the AUC test statistic as:

(31)   AUC = ∫₀¹ TPR(FPR) d(FPR)

Bamber (1975) showed that AU C is related to the Mann-Whitney U statistic.


Let ŷ_{d,i} be a PD score from the defaulter set y ∈ D, and let ŷ_{nd,j} be a PD score from the non-defaulter set y ∈ ND. We can write

(32)   AUC = ∫₀¹ P(ŷ_{d,i} > c) dP(ŷ_{nd,j} > c) = P(ŷ_{nd,j} < ŷ_{d,i})

It can be seen that this corresponds to the Mann-Whitney-Wilcoxon (MWW) rank sum test U-statistic with a kernel function h = I(ŷ_{nd,j} − ŷ_{d,i} < 0):

(33)   MWW = [ Σ_{i=1}^{N_d} Σ_{j=1}^{N_nd} I(ŷ_{nd,j} − ŷ_{d,i} < 0) ] / (N_d N_nd)

A detailed description of U-statistics in general and the MWW test in particular can be found in Kowalski and Tu (2008).

Typically, the null hypothesis of the MWW test is "no difference between populations", meaning that our PD model performs randomly. Let ψ_{nd,j} and ψ_{d,i} denote the true PD of an obligor from the non-defaulter set and an obligor from the defaulter set, respectively. The null hypothesis of no discriminatory power can be written as P(ψ_{nd,j} < ψ_{d,i}) = 0.5. However, we will need to consider this test under the null hypothesis in Equation 13 to be able to incorporate it in the Godfrey approach. The expressions for the mean and the variance of MWW under the null hypothesis (Equation 13) are provided in Appendix A3.

It is known that as N → ∞,

(34)   Z_MWW = (MWW − μ_MWW) / √(σ²_MWW) ∼ N(0, 1)
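
The MWW statistic of Equation 33, i.e. the AUC, can be computed directly from PD scores and realized defaults; the following sketch uses simulated, illustrative inputs and ignores ties for brevity.

```python
# A sketch computing the MWW statistic of Eq. (33) / the AUC of Eq. (32).
import numpy as np

def mww_auc(pd_hat, y):
    d = pd_hat[y == 1]                     # defaulter scores
    nd = pd_hat[y == 0]                    # non-defaulter scores
    # share of (defaulter, non-defaulter) pairs that are ordered correctly
    return np.mean(nd[None, :] < d[:, None])

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.1, 1000)
pd_hat = np.clip(0.10 + 0.05 * (2 * y - 1)
                 + 0.05 * rng.standard_normal(1000), 0.0, 1.0)
print(mww_auc(pd_hat, y))                  # > 0.5 for an informative model
```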

IV. An Overview of the Godfrey Approach

In this section, we review the multiple testing correction procedure proposed by Godfrey (2005). First, we briefly introduce the motivation behind using the Godfrey approach, and then we review the method itself in the light of the PD model and the validation tests described above.

To begin with, it is useful to introduce the concept of errors in rejection probability (ERP). Beran (1988) defines ERP as the difference between the actual probability that a test rejects a true null hypothesis and the desired probability that a true null is rejected. Hence, ERP is a criterion to assess the precision of type I error control.

There are several factors that can affect the ERP of a test (Beran, 1988): the sample size, the rate of convergence of the rejection probabilities, whether the asymptotic distribution of the test statistic in question depends on the unknown parameters, the method of confidence interval computation (whether asymptotic theory or bootstrap methods are used), and the use of asymptotic refinements. There are two conclusions in the paper by Beran (1988) that are useful in our paper. They can be summarized as follows. The order of the ERP can be reduced: (1) by using the bootstrap, if the asymptotic distribution of the test statistic in question is independent of the unknown parameters, and (2) by transforming a test statistic by its bootstrap cdf (prepivoting).

In the case of a multiple testing problem, the ERP of the family-wise error rate may also be affected by the dependence between the tests. Although Bonferroni-type procedures work under dependent tests, they only provide bounds for the overall type I error; consequently, the actual type I error may differ from the desired type I error. Godfrey (2005) takes a different approach to multiple testing: he uses resampling methods to find the distribution of the minimum p-value of the tests, utilizing the results presented in Beran (1988), and reports a reasonably precise control of the overall type I errors.

Godfrey (2005) presents two methods of the overall type I error control. The difference between the two methods is that the first method (Method 1) does not rely on asymptotic theory and the second method (Method 2) does. Given the fact that all the tests reviewed in Section III have known asymptotic distributions, both methods are applicable in our case. Godfrey (2005) suggests using Method 2 whenever both methods are available, since it produces a lower ERP; therefore, we will focus on this method. The following steps describe the Godfrey approach in more detail.

Step 1. Let N be the sample size. Assume that a PD model has been fitted to the data and consider the test statistics described in Section III: MWW, BS, SBin (the simultaneous binomial test), and HL. In the first step, we obtain the test statistics from the sample and calculate their asymptotic p-values. An overview of the test statistics and the p-values is given in Table 1. We then calculate the minimum p-value mp_0 based on the asymptotic p-values:

(35)   mp_0 = min(p_MWW, p_SBin, p_HL, p_BS)

Table 1—Asymptotic p-values

Test                       Statistic    P-value
Mann-Whitney-Wilcoxon      MWW_N        2 min[Φ(Z_MWW), 1 − Φ(Z_MWW)]
Simultaneous Binomial      N_{rej,K}    1 − Φ(Z_SBin)
Hosmer-Lemeshow            HL_N         1 − χ²_K(HL)
Brier Score                BS_N         1 − Φ(Z_BS)

Step 2. In this step, we prepivot the minimum p-value obtained in Step 1. In other words, we calculate the bootstrap p-value of the minimum asymptotic p-value computed in Step 1. The bootstrap scheme is set up as follows. Consider the binary choice model described in Section II. The PD of the ith obligor, ŷ_it, means that we expect this obligor to default with probability ŷ_it and to survive with probability 1 − ŷ_it. We simulate N × (T + 1) realizations of the standard uniform distribution, and we form an artificial sample y_it by comparing the ith realization u_it with the ith PD ŷ_it:

(36)   y_it = 1, if u_it ≤ ŷ_it
       y_it = 0, if u_it > ŷ_it

Thus, we have obtained an artificial dataset consisting of y_it and Z_it; t = 1, ..., T + 1. We fit the same PD model to the artificial dataset, excluding the observations at T + 1. Again, we compute the predictions of the model at T + 1, and all the p-values of the test statistics described in Step 1. Finally, we compute the minimum p-value from the artificial sample. If we repeat this procedure by generating B samples, we will obtain B artificial minimum p-values, which are denoted as mp_b. We can now compute the bootstrap p-value of mp_0:

(37)   p_0 = (1/B) Σ_{b=1}^{B} I(mp_b < mp_0)

We could finalize the procedure at this point and compare the bootstrap p-value with the desired significance level α. This would save computational cost, but at the expense of lost precision.

Step 3. In this step, we assess the statistical significance of p_0. To accomplish this task, we can use a second stage of bootstrapping: C artificial samples are generated for each of the B artificial samples. We repeat the procedure described in Step 2 for each artificial sample b, and thereby obtain B bootstrap p-values of mp_b:

(38)   p**_b = (1/C) Σ_{c=1}^{C} I(mp**_{c,b} < mp_b)

Since we now have B values of p**_b, the bootstrap p-value of p_0 can be assessed as follows:

(39)   p**_0 = (1/B) Σ_{b=1}^{B} I(p**_b < p_0)

If p**_0 is small enough (i.e. if it falls below the desired level α), we reject the null hypothesis that the model is specified correctly.

A. The Fast Double Bootstrap P-Value

The Godfrey approach described above requires computing B × C + B + 1 mp test statistics, which is a rather computationally intensive task. Fortunately, it is possible to decrease the number of computations by employing the fast double bootstrap (FDB) procedure proposed by Davidson and MacKinnon (2007). The details are as follows.

First, we compute mp_0 as in Step 1 and the B values mp_b as in Step 2. For each of the B artificial samples, we compute only one second-level artificial sample and hence one mp**_b. As before, we compute p_0 as:

(40)   p_0 = (1/B) Σ_{b=1}^{B} I(mp_b < mp_0)

Next, we find the p_0 quantile of mp**_b, which we denote as Q**_mp(p_0). Finally, we compute the FDB p-value:

(41)   p**_0 = (1/B) Σ_{b=1}^{B} I(mp_b ≤ Q**_mp(p_0))

The FDB approach only requires 2 × B + 1 test statistics to be computed, which is considerably faster than the double bootstrap approach.
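
Given the first-level minimum p-values and one second-level value per first-level sample, the FDB p-value reduces to a few lines; the following is a sketch with our own function name.

```python
# A sketch of the FDB p-value of Equations (40)-(41): mp0 is the sample
# minimum p-value, mp_b the B first-level minima, and mp_bb the single
# second-level minimum per first-level sample.
import numpy as np

def fdb_pvalue(mp0, mp_b, mp_bb):
    p0 = np.mean(mp_b < mp0)               # Eq. (40)
    q = np.quantile(mp_bb, p0)             # the p0-quantile Q**_mp(p0)
    return np.mean(mp_b <= q)              # Eq. (41)
```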

V. Simulation Study

In order to evaluate the performance of the Godfrey approach, we set up a simulation study. We compare the Godfrey approach to the well-known Holm-Bonferroni MCP (multiple testing correction procedure; see Holm, 1979), the individual test statistics, and the "no-correction" approach, under which a model is rejected whenever at least one test rejects it. This simulation study can be viewed as an extension of the simulation study provided in Engelmann and Rauhmeier (2006). More details about our simulation study can be found in Appendix B.

We consider three portfolio sizes: a small portfolio (200 obligors), a medium portfolio (1000 obligors), and a large portfolio (5000 obligors). Obligors are classified into eight rating grades. The rejection probabilities are computed based on A = 5000 simulation rounds, and the bootstrap p-values are computed based on B = 1000 bootstrap rounds.

A. Size Study

A Perfect Forecast Scenario

The general algorithm of the perfect forecast size study can be summarized in the following steps:

1) Generate an N × 1 dataset from the "null" model (for one time period only, at t = T + 1). We use β_0 = 1.601 and β_1 = 0.75.

2) Compute the PD predictions of the "perfect model": ψ_{i,T+1} = Φ(−β_0 − β_1 Z_{i,T+1}).

3) Assign the borrowers into the rating grades based on ψ_{i,T+1} and compute the rating grade PPDs.

4) Compute the test statistics and their asymptotic p-values.

5) Compute the mp test statistic as in Equation 35.

6) Generate B = 1000 bootstrap samples using the bootstrap scheme described in Section IV. For each bootstrap sample:

a) repeat steps 4 and 5;

b) compute a second-level bootstrap sample and repeat steps 4 and 5 to compute mp**_b.

7) Compute the bootstrap p-values of the various tests, as well as the p-values of the Godfrey approach. Let τ be a test statistic of interest. We use the asymptotic p-values to compute the bootstrap p-values:

(42)   p_τ = (1/B) Σ_{b=1}^{B} I[p(τ_b) < p(τ_0)]

where τ_b is the test statistic τ computed from the bth bootstrap sample, and τ_0 is the test statistic τ computed from the original sample.

8) Repeat steps 1-7 A times.

Based on this algorithm, we generate an A × 10 matrix of p-values, which we denote as P. It consists of the following elements:

• A × asymptotic p-values of MWW, BS, HL, and SBin

• A × bootstrap p-values of MWW, BS, HL, and SBin

• A × Godfrey Method 2 p-values

• A × Godfrey Method 2 fast double bootstrap p-values

Now, for each element p_ij of P we can compute a decision δ_ij at a given nominal significance level α:

(43)   δ_ij = I(p_ij < α)

Thus, if δ_ij = 1, the jth test rejects the null hypothesis at the nominal significance level α in the ith simulation round.

In addition to the decisions of the individual tests and the Godfrey approach, we also compute the decisions of two procedures: the Holm-Bonferroni (HB) procedure and the procedure when no MCP is used (noMCP). These procedures can be based either on the asymptotic p-values or on the bootstrap p-values of the individual tests.

The decision δ_{i,HB} of the HB procedure is computed as follows. Let p^a_{i,j} be the subset of the asymptotic p-values of the individual tests in the ith simulation round; since we have four individual tests, j = 1, ..., 4. We first sort the asymptotic p-values in ascending order: p^a_{i,(1)}, ..., p^a_{i,(4)}. The decision of the HB procedure in the ith simulation round is computed as:

(44)   δ_{i,HB} = I( Σ_{j=1}^{4} I[ p^a_{i,(j)} < α/(5 − j) ] > 0 )

The same procedure can be applied using the bootstrap p-values p^b_{i,j}. The decision δ_{i,noMCP} of the noMCP procedure is computed as:

(45)   δ_{i,noMCP} = I( Σ_{j=1}^{4} I[ p^a_{i,j} < α ] > 0 )

Similarly, the noM CP procedure can be based either on the asymptotic or the bootstrap p-values.
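
Both decision rules are straightforward to implement; the following sketch (our own, for the four-test case) mirrors Equations 44 and 45.

```python
# A sketch of the decision rules in Equations (44) and (45) for one
# simulation round with four individual p-values.
import numpy as np

def delta_hb(pvals, alpha):
    """Eq. (44): reject if any ordered p-value beats its Holm bound."""
    p_sorted = np.sort(np.asarray(pvals))       # p_(1) <= ... <= p_(4)
    bounds = alpha / (5 - np.arange(1, 5))      # alpha/4, alpha/3, alpha/2, alpha
    return int(np.any(p_sorted < bounds))

def delta_nomcp(pvals, alpha):
    """Eq. (45): reject if any individual test rejects at level alpha."""
    return int(np.any(np.asarray(pvals) < alpha))

print(delta_hb([0.020, 0.240, 0.300, 0.800], 0.05),      # 0: 0.020 > 0.05/4
      delta_nomcp([0.020, 0.240, 0.300, 0.800], 0.05))   # 1
```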

Next, we form a matrix with the decisions of the individual tests, the HB procedure based on the asymptotic and the bootstrap p-values, the noMCP procedure based on the asymptotic and the bootstrap p-values, and the Godfrey approach (both with and without the prepivoting step). Hence it is an A × 14 matrix, which we denote as ∆_α, where the index α means that the decision matrix ∆ is computed using the nominal significance level α.

Finally, the rejection probabilities of each test and MCP at the nominal significance level α can be computed by summing the columns of ∆_α and dividing these sums by the number of simulation rounds A:

(46)   RP_{j,α} = (1/A) Σ_{i=1}^{A} δ_{α,ij}

Figure 1 plots the difference between the nominal size and the actual test size (size discrepancy) against the nominal test size for each test and MCP. The upper three plots show the RPs of the individual tests, and the lower plots show the RPs of the MCPs. Each plot column corresponds to a different sample size. Ideally, an RP graph of a test or an MCP should lie close to the zero line, meaning that the actual size corresponds to the nominal size. Each test and MCP is represented by a different color; thicker lines are used for the asymptotic tests and thinner lines are used for the bootstrap tests. For the Godfrey approach, the thicker line corresponds to the Godfrey Method 2 without the prepivoting step (denoted as "Godfrey 2" in the legend), and the thinner line corresponds to the Godfrey Method 2 with the prepivoting step (denoted as "Godfrey 2 FDB" in the legend).

Figure 1. Size Discrepancy Plot

[Figure: size discrepancy against nominal size, with panels for N = 200, N = 1000, and N = 5000; series shown for the asymptotic and bootstrap versions of BS, MWW, HL, and SBin, the Godfrey 2 and Godfrey 2 FDB methods, and the asymptotic and bootstrap HB and noMCP procedures.]

If we look at the graphs of the individual tests, we can see that the bootstrap tests generally outperform the asymptotic tests. The difference between using asymptotic theory versus using the bootstrap method is slightly less obvious for the Brier score. This can be explained by the fact that the "null" model assumptions are satisfied only for this test. Still, if we look at the BS plots for all sample sizes, we can see that bootstrapping reduces ERP in the smaller sample.

The null hypothesis of the SBin test and the HL test does not correspond to the "null" model, as discussed in Section III. Of course, we could have aligned the "null" model with the null hypotheses of these two tests, but the implication would be that the realizations of the risk factor are identical for all the borrowers in a given rating grade, which seems unlikely in practice. Hence, the result of this misalignment is higher errors in rejection probability in the case of the asymptotic versions of HL and SBin. The latter produces even higher ERP due to its discreteness; this problem could become less severe if more rating grades were defined. The sampling variance of MWW is understated due to the fact that we do not have two samples of defaulters and non-defaulters of a fixed size; rather, N_D and N_ND vary per sample. We can see that bootstrapping these test statistics reduces ERP, because we approximate the pdf of the test statistics under the assumptions of the "null" model.

From the results in Figure 1 we may be inclined to conclude that using SBin and HL to assess PD models is undesirable due to large errors in rejection probability. This, however, may not be the case, since large ERP for these tests may be a signal of good performance in terms of power due to the null hypothesis misalignments. Bootstrapping the SBin test did not reduce ERP because of the stringent assumptions that are made when computing the p-value of SBin: the size of each individual binomial test is exactly α_1. Again, we would like to suggest using the Poisson-binomial distribution in the individual rating grade tests (see Section III). Finally, the SBin test itself entails a multiple testing problem, and hence multiple testing procedures are applicable to this test alone. This could be an interesting area for future research.

If we look at the lower three graphs in Figure 1, we can see that the Godfrey approach outperforms the noMCP procedure and the HB procedure in terms of lower ERP. As expected, the noMCP procedure leads to considerable overrejection relative to the nominal size level. The Holm-Bonferroni procedure is an improvement compared to the noMCP procedure, but since the size discrepancy is large for HL and SBin, this problem is "inherited" by the HB procedure.

The Godfrey approach, on the other hand, is based on the approximation to the pdf of the minimum p-value; consequently, it is less prone to misalignments in individual tests. Finally, we can see that the performance of the individual tests based on the bootstrap method (with the exception of SBin) and the Godfrey approach is comparable.

An Imperfect Forecast Scenario

In the previous subsection, we have assumed that the parameters β_0 and β_1 were always estimated precisely. In reality, a perfect forecast is hardly attainable due to the fact that the dataset that is observable by the researcher is limited. Hence, in general, β_0 ≠ β̂_0 and β_1 ≠ β̂_1. However, by using an appropriate estimation method, we can obtain estimates close to the true β_0 and β_1. We have seen in Section II that given the assumptions of the "null" model, it is appropriate to fit a pooled probit model using the partial maximum likelihood method. In order to assess the performance of the tests and the MCPs when the coefficients are estimated imperfectly, we use the following algorithm:

1) Generate an N × (T + 1) dataset from the "null" model. We set T = 4, β_0 = 1.601 and β_1 = 0.75.

2) Fit a pooled probit model to the N × T part of the dataset. Use the estimated coefficients β̂_0 and β̂_1 to compute the predictions of the model at T + 1: ŷ_{i,T+1} = Φ(−β̂_0 − β̂_1 Z_{i,T+1}).

3) Assign the borrowers into the rating grades based on ŷ_{i,T+1} and compute the rating grade PPDs π̂_k.

4) Compute the test statistics and their asymptotic p-values.

5) Compute the mp test statistic as in Equation 35.

6) Generate B = 1000 bootstrap samples, using the bootstrap scheme described in Section IV. We use the entire N × (T + 1) dataset with predictions to generate the bootstrap samples. For each bootstrap sample:

a) repeat step 2;

b) recompute the PPDs (the rating grade assignment remains the same);

c) repeat steps 4 and 5;

d) compute a second-level bootstrap sample and repeat steps 6a to 6c to compute mp**_b.

7) Compute the bootstrap p-values of the various tests, as well as the p-values of the Godfrey approach. Again, we use the asymptotic p-values to compute the bootstrap p-values:

(47)   p_τ = (1/B) Σ_{b=1}^{B} I[p(τ_b) < p(τ_0)]

8) Repeat steps 1-7 A times.

We follow the same procedure as described in the previous subsection to produce the decision matrix ∆_α and the rejection probabilities RP_{j,α} (Equation 46).

Figure 2 shows the plots of the difference between the actual rejection probability and the nominal α against the nominal α. As before, each column corresponds to a different sample size. Consider the upper row of plots, which shows the RPs of the individual tests. We can see that for the asymptotic tests, the rejection probability increases compared to the perfect forecast size study results discussed above. This is explained by the fact that the null hypothesis of a perfect forecast is not satisfied; the estimated coefficients β̂_0 and β̂_1 are different in each simulation round. This additional variance is not accounted for under the null hypothesis. On the other hand, by using the bootstrap, we take the variance of the coefficient estimates into account, and, as a consequence, the bootstrap test series are closer to the zero line.

At this point, we may consider the following question: which size study is more relevant in risk management practice? The answer is not straightforward. A perfect forecast is hardly attainable in practice; hence, setting up the null hypothesis this way would result in rejecting an appropriate estimation method too often. This can be seen in the upper three graphs of Figure 2: three out of four asymptotic tests reject too often. On the other hand, in the imperfect forecast size study, we actually use a set of different DGPs when computing the bootstrap p-values (each realization of the DGP corresponds to the coefficients of the fitted model). However, if the variance of the coefficient estimates is large (which can happen if the sample size is too small, for example), we would still get rejection probabilities close to the nominal size when using the bootstrap tests. Thus, we would accept a model that is potentially poorly calibrated, because we take the variance of the coefficients into account when approximating the pdfs of the test statistics.

A possible solution to this problem is to specify a maximum variance of the coefficients that would be tolerated by the supervisory authority. If the coefficient variance estimate supplied by a bank is acceptable, we could use that variance estimate in the "null" model, which would now consist of a set of DGPs rather than one fixed DGP. We could then approximate the distribution of the various test statistics by using S Monte Carlo replications. The next step would be conducting the size study as described in this subsection, but referring the test statistics to the newly obtained approximate distribution. Again, this is an interesting issue for future research.

The discussion above also applies to the MCPs; if we look at the lower three graphs in Figure 2, we see that the Godfrey 2 method produces lower ERP than the other two procedures. This is because the Godfrey 2 method is based on bootstrapping, meaning that it takes the variance of the coefficients into account. The HB procedure based on the bootstrap tests "inherits" the misalignments of the bootstrap SBin test, and hence produces higher ERP than the Godfrey approach. As in the perfect forecast size study, we can see that the individual bootstrap tests (excluding SBin) and the Godfrey approach perform comparably in terms of ERP.

Figure 2. Size Discrepancy Plot: Imperfect Forecast

[Figure: size discrepancy against nominal size under the imperfect forecast scenario, with panels for N = 200, N = 1000, and N = 5000; same tests and MCPs as in Figure 1.]

B. Power Study: An Omitted Variable Scenario

In this subsection, we will examine how the test statistics and the MCPs perform in terms of power. We will limit ourselves to comparing the power of the tests and the MCPs under a scenario in which a risk factor is not observed by the researcher. In order to accomplish this task, we need an alternative DGP. Consider the following modification of the "null" model:

(48)   X_it = β_0 + β_1 Z_{i,t−1} + β_2 M_t + ε_it

Y_it = 1, if X_it < 0
Y_it = 0, if X_it ≥ 0

where M_t ∼ N(0, 1) can be interpreted as a macroeconomic shock that affects all the borrowers simultaneously in a given time period. All the assumptions regarding Z_it and ε_it remain unchanged (see Section II). It can be shown that β²_2 can also be interpreted as the covariance between X_it and X_jt; thus, we introduce default dependence in this model.

We assume that M_t is not observable by the researcher; hence, the assumption of conditional default independence is not satisfied. The power study algorithm is similar to the imperfect forecast size study algorithm, the only difference being that the alternative model described above is used in the first step to generate the data. In the second step, a model is fitted to the observable data: Y_it and Z_{i,t−1}. The rejection probabilities at the nominal significance level α are computed in a similar fashion as in the size studies. However, in order to compare the power of the tests and the MCPs properly, we need to take the size discrepancies into account. To correct for the size discrepancies, we use the RPs produced in the imperfect forecast size study, and find the corresponding RPs of the power study via the common nominal significance level α. Figures 3, 4, and 5 show the plots of the power study RPs against the RPs of the imperfect forecast size study (labelled as "Actual size" on the X-axis). Each figure corresponds to a different sample size, and the plot columns within each figure correspond to a certain value of β_2. As before, the upper row of the plots shows the RPs of the individual tests, and the lower row of the plots shows the RPs of the MCPs.

If we look at Figures 3, 4, and 5, there are two general tendencies. First, it is "easier" to detect a misspecified model if the omitted variable causes a larger variation in the dependent variable. We can see how the graphs of almost all tests rotate counter-clockwise relative to the origin as β_2 increases. Second, the power of the tests increases as the sample size becomes larger.

Consider the performance of the individual tests in the upper rows of the three figures. In general, we can see that the BS test performs better in terms of power than the other tests, followed by the SBin test. The graphs of the MWW test, both asymptotic and bootstrap, stay close to the 45-degree line. However, it does not mean that this test has a low power characteristic. We have assumed in the alternative model that the risk factor M_t affects all the borrowers simultaneously; hence, it does not change the ordering of the probability forecasts. As a consequence, we can discriminate between the defaulting and the non-defaulting borrowers equally well, even if we do not observe the factor that impacts all the borrowers simultaneously.

Figure 3. Power Study: N = 200

[Figure: power against actual size, with panel columns for β_2 = 0.01, 0.05, and 0.1; same tests and MCPs as in Figure 1.]

Figure 4. Power Study: N = 1000

[Figure: power against actual size, with panel columns for β_2 = 0.01, 0.05, and 0.1; same tests and MCPs as in Figure 1.]

Figure 5. Power Study: N = 5000

[Figure: power against actual size, with panel columns for β_2 = 0.01, 0.05, and 0.1; same tests and MCPs as in Figure 1.]

The power of the MCPs can be seen in the lower three plots in Figures 3, 4, and 5. It can be seen that the noMCP and HB procedures, based either on bootstrapping or on asymptotic theory, perform quite similarly in terms of power. From the evidence presented in the three graphs, it is hard to conclude whether the Godfrey 2 method or the other two MCPs perform better in terms of power; the rejection probabilities are quite close to each other.

VI. Conclusion

In this paper, we have set out to investigate the performance of several PD model validation methods. We have reviewed several common PD validation tests (the Brier score, the simultaneous binomial test, the Hosmer-Lemeshow test and the Mann-Whitney-Wilcoxon rank sum test), and we have recognized that there is a need for a multiple testing correction procedure if several tests are to be used simultaneously. We have reviewed a relatively recent multiple testing correction procedure due to Godfrey (2005), as it could potentially reduce the errors in rejection probability because of its reliance on the bootstrap method. We have set up a simulation study to investigate how well this multiple testing correction procedure performs in terms of type I and type II errors compared to other MCPs (the Holm-Bonferroni method and the no-correction approach) and the individual tests.

Before we proceed to the conclusions, it is important to understand what type I and type II errors committed by model validation methods could mean in practice. A type I error means that we reject a well-specified model. Now, if the supervisory authority is too stringent and rejects an appropriate model too often, it may create adverse incentives for banks to produce "spurious" models that fit the data too well. On the other hand, if the supervisory authority detects misspecified models too seldom (a type II error), banks may have an incentive to underestimate risks.

While quantifying the costs of type I and type II errors in the context of PD validation seems to be a difficult task and could be an interesting research topic on its own, it is important to understand that ultimately, these misclassification costs determine the desirability of a given validation method. For now, we will assume that the costs of misclassification are equal for both types of errors.

In the study of type I errors, we have considered two scenarios: a perfect forecast scenario and an imperfect forecast scenario. Under the perfect forecast scenario, we assumed that the estimated coefficients were always equal to the population parameters. Even under this stringent assumption, the bootstrap
