15 years of backtesting expected shortfall : what do we know?

(1)

what do we know?

Sander Nijland

Supervisor: dr. S.A. Broda Second Reader: dr. K.J. van Garderen

Supervisor: J.R. Boog AAG

April 18, 2016

Abstract

In this thesis we compare the different backtesting procedures researched in the last 15 years and evaluate them in terms of size, power and ease of use. Moreover, the backtests will be evaluated under different scenarios, including conditional on the number of violations and estimation errors in the parameters. The last is mostly of interest for practitioners; how do the results change when we fit a certain distribution to the data. The results show that the methods of Wong and Van Straelen, both based on saddle point approximations, provide the best results in terms of test size, high power and computational ease. From a practitioners point of view the Wong test is preferred as due to the integral transformation, the results are indistinguishable from the Van Straelen test, whereas the Van Straelen test is most appealing for theorists because of its sophisticated theory.

(2)

(3)

Statement of Originality

This document is written by Sander Nijland who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

(4)

Sander Nijland CONTENTS

1 Introduction

In times where financial distress is more likely then ever, regulators of financial institutions face the important task of determining an appropriate level of capital such that the institution can survive a negative period of the market but is not limited in it’s day to day business. In order to do so, different frameworks are developed as a guidance to determine the right level of capital. Examples are the well known Basel and Solvency frameworks, but also the less known Swiss Solvency Test which is used since 2006 in Switzerland. The first two frameworks rely on Value at Risk (V aR) as the leading risk measure to determine the capital requirement whereas the latter one uses Expected Shortfall (ES).

Since the adoption of V aR by the Basel Committee in 1996 a lot has been written about this subject and V aR became the industry standard in terms of measuring risk and to determine the capital needed as a reserve. V aR can be described as the maximum loss given a certain confidence level, say 99%. Although it is a very intuitive and simple concept, it lacks two fundamental properties a risk measure would ideally have. First, it does not account for so called ”tail risk”; risk emerging from high losses far in the tail of the underlying distributions. If a tail event does occur, we do not know what loss we can expect, only that it is more than V aR. In other words, V aR itself does not give any indication of how much that loss might be. Second, it lacks the property of subadditivity (see Artzner et al. 1999), which means that diversification of a portfolio does not necessarily lower the risk and thus the capital required. As a result ES was proposed by (Acerbi & Tasche 2001) to overcome these shortcomings of V aR.

”Expected Shortfall is defined as the expected loss conditional on the loss being above the V aR level” (Wong 2008). Clearly, when we have an extreme event, ES tells us what loss we can expect on average. Although ES is superior compared to V aR it has only been recently accepted as the leading risk measure in regulatory frameworks; the Basel Committee explicitly raised the prospect of phasing out V aR and replacing it with ES in its consultative document on the third Basel Accord, dated May 2012 (Basel Committee 2012). Basel III, which is scheduled to be implemented per 31 March 2019, will use ES for calculating the capital requirements. Moreover, the Swiss Solvency Test as introduced in 2006 uses ES for determining the capital needed and in that same year the European Insurance and Occupational Pensions Authority (EIOPA) generally acknowledged the theoretical advantages of using ES instead of V aR. The main reason for this rather late transition from V aR to ES is due to the fact that backtesting of ES is in general harder then V aR. Since the introduction of ES a lot of research has been

(7)

done about backtesting ES in the hope of finding an easy yet effective way to backtest ES.

In this thesis we compare the different backtesting procedures researched in the last 15 years and evaluate them in terms of size, power, and ease of use. To the best of our knowledge, our thesis is the first to consider a very extensive comparison of the different backtesting methods and extend them to accommodate different setups such as unconditional versus conditional tests and estimation errors.

The remainder of this thesis is organized as follows. Section 2 gives an overview of the history of risk measures and introduces some notation about risk measures and their desired properties. Section 3 discusses the different backtest methods that will be compared and extends them to accommodate different setups. In Section 4 the Monte Carlo setup is discussed and we investigate the finite-sample performance of the difference backtests trough a set of Monte Carlo experiments. In Section 5 we will conclude.

(8)

Sander Nijland 2 HISTORY OF RISK MEASURES AND THEIR DESIRED PROPERTIES

2 History of risk measures and their desired properties

Risk measures are used to estimate the amount of capital required as a reserve to protect the financial institution from a potential future downfall of the market. A risk measure is a function ρpXq of the random variable X, where X is defined as the gain on a given financial position (i.e. a portfolio of stocks). In other words, a risk measure quantifies the risk. In this section a brief summery of the development of risk measures is given.

The first signs of financial risk management date from the early 1970s. According to Jorion (2007), the rise of the risk management industry was a result of the increased volatility of financial markets in those years. This increase of volatility was an effect due to deregulation and globalization of the financial market. ”Deregulation forced financial institutions to be more competitive and to become acutely aware of the need to address financial risk. Barriers to international trade and investment also were lowered. This globalization forced firms to recognize the truly global nature of competition. In the process, firms have become exposed to a greater variety of financial risks.”(see Jorion 2007, pp. 7).

To protect financial institutions from financial distress as a result of this increased market risk, the 1988 Basel I Accord was introduced as a security measure. This accord sets minimum capital requirements for banks equal to 8% of their risk-weighted assets (RWA) that must be met to guard against market risks. The weights were set by the Basel committee itself. The Basel 1 Accord has been heavily criticised: ”The original rules were too simplistic and rigid and did not align regulatory capital sufficiently with economic risk-based capital. This led to regulatory arbitrage, which generally can be defined as a transaction that exploits inconsistencies in regulatory requirements”(see Jorion 2007, pp. 56).

In response to the criticism of the standardized method, a revision was made in 1995 allowing banks the option of using their own risk measurement models to determine capital charges for market risk. This approach was based on Value at Risk (V aR) analysis; the first implementation of V aR after it was introduced in July 1993 by the Group of Thirty (G-30). V aR can be described as the maximum loss given a certain confidence level, say 99 % (we define V aR more formally in the next paragraph). In practice the capital charge set by the internal V aR-based model was much lower than the standardized Basel I charge, resulting in a widespread movement towards V aR. In June 2004 the Basel II Accord was published keeping the V aR based internal model approach unaltered.

(9)

2.1 Value at Risk

Value at Risk (V aR) is defined as the the maximum loss of a financial position given a certain confidence level α, i.e., a quantile. More formally:

P rX ď ´V aRαs “ α,

where X is the gain of the portfolio. Hence, V aRα is the minus α quantile of the distribution

F p¨q of the gain of the portfolio:

V aRα“ ´F´1pαq “ inftx P R : F p´xq ď αu.

In Figure 2.1 V aR is graphically shown. A major disadvantage of V aR is that it is not informa-tive about the risk in the tail, i.e. when we do have an so called ”tail event” we cannot deduce from V aR the magnitude of the loss; all we know is that it is at least as great as V aR itself.

Figure 2.1: Value at Risk and Expected Shortfall

2.2 Coherent risk measures

Due to the introduction of V aR in the regulatory Basel framework a lot of research has been done about testing the quality of V aR forecasts and possible other risk measures. Artzner et al. (1999) introduced the concept of a coherent risk measure, arguing coherence is important because it

(10)

will guarantee realistic properties of the risk measure. They define a coherent risk measure as follows: ”A risk measure satisfying the four axioms of translation invariance, subadditivity, positive homogeneity and monotonicity is called coherent” (see Artzner et al. 1999, pp. 7, Definition 2.4.). The four axioms will be described below.

Translation invariance

For all X and all real numbers a, we have:

ρpX ` aq “ ρpXq ´ a,

where ρ is a risk meausre function, i.e. V aR. The value of a can be conceptualized as adding cash (or a risk-free investment with value a) to the portfolio and thus lowering the risk of the portfolio with a. Translation invariance ensures that, for each X, ρpX ` ρpXqq “ 0, which means that if a sufficient amount of cash is added to the portfolio (equal to the risk X) no extra capital is required (see Remark 1).

Subadditivity

For all X1 and X2,

ρpX1` X2q ď ρpX1q ` ρpX2q.

This properties ensures that diversification is possible: a merger of two portfolios does not increase risk and and as a result, does not increase capital requirements. If this property would be violated an investor willing to take risk X “ X1` X2 can split his financial position X into

two positions X1 and X2 and have the lower capital requirement of ρpX1q ` ρpX2q.

Positive homogeneity

For all λ ě 0 and all X,

ρpλXq “ λρpXq.

When the portfolio is changed by a factor of λ, the associated risk is changed with the same factor λ.

(11)

Monotonicity

for all X and Y with X ď Y , we have:

ρpY q ď ρpXq.

This means that if the payoff of a portfolio Y is at least that of portfolio X, the risk of Y cannot be higher than the risk of X.

A key insight of Artzner et al. is that the widely used V aR is in general not a coherent risk measure (only in special cases) because it lacks the property of subadditivity. This led to the introduction of the Expected Shortfall by Acerbi & Tasche (2001).

2.3 Expected Shortfall

Expected Shortfall (ES) is defined as the expected loss of a financial position, given that the loss X of the financial position is greater than V aR. This means that ES does account for ”tail risk” and is by definition greater than V aR. Formally, ES is defined as follows:

ESα “ ´ErX|X ă ´V aRαs “ ´

ż´V aRα

´8

x f pxq dx.

In Figure 2.1 above ES is graphically shown.

2.4 Elicitability

In 2011, Gneiting (2011) showed that ES is not elicitable as opposed to V aR. A statistic ψpY q of the random variable Y is said to be elicitable if it minimizes the expected value of a scoring function S:

ψ “ arg min

x ErSpx, Y qs.

An example of a scoring function is the squared error, Spx, yq “ px ´ Y q2, which is used for forecasting the mean. We can show that:

ErY s “ arg min

x Erpx ´ Y q 2

s,

by taking the derivative with respect to x of the right hand side and setting it to zero. Gneiting showed that ES lacks a scoring function; it does not exist. In other words, it is not possible to find a scoring function Spx, Y q such that ES is defined as the forecast x given a distribution Y

(12)

that minimises the scoring function. The lack of elicitability of ES led many to conclude that ES can not be backtested.

In a recent article by Acerbi & Szekely (2014), the authors discuss that elicitability is not necessary for backtesting ES and point out that successful approaches are developed without the need of elicitability in the papers of Kerkhof & Melenberg (2004) and Wong (2008). Moreover, they argue that elicitability has in fact nothing to do with backtesting. They conclude that this property is mainly a way to rank the forecasting performance of different models: if we can find a backtest that does not use the property of elicitability there is no reason why we cannot find a backtest that does work.

2.5 The distribution of financial risk

A central consideration in all the articles discussed is the distribution of the data. In case the data are normal distributed, V aR and ES are simply multiples of the standard deviation. Yamai & Yoshiba (2005, pp. 999-1000) state the following: ”When the profit–loss distribution is normal, V aR and ES give essentially the same information. Both V aR and ES are scalar multiples of the standard deviation. Therefore, V aR provides the same information on tail loss as does ES. For example, V aR at the 99% confidence level is 2.33 times the standard deviation, while ES at the same confidence level is 2.67 times the standard deviation”

It is generally known that returns have non-Gaussian distributions. A (possible asymmetric) distribution with heavy tails (excess kurtosis) is more appropriate to describe financial returns. A backtest that is flexible in terms of the underlying distribution is preferred. Although some backtest approaches depend on a Normal distribution, the authors argue that by using the inverse probability transformation, the data can be transformed to Normal distributed data and still have the same interpretation as the original untransformed data. In other words, inaccuracies of the true distribution remain intact when the data is transformed. This argument will be tested in Section 4.

(13)

3 The different backtests

A lot of backtest approaches have been developed in the literature to correctly backtest the forecasted ES. A correct forecast of ES is needed in order to determine the right amount of capital needed as a reserve. In this section we will summarize the different backtest approaches that will be compared in Section 4. If needed, the methods below will be extended to work as an conditional and an unconditional backtest (see Section 4.1 for a more detailed explanation about this subject).

3.1 Berkowitz (2001) - The Log-likelihood method

Berkowitz (2001) proposes a test on a censored variable that is censored at the V aR-level. Berkowitz argues that when the data are transformed to a normal distribution, a likelihood-ratio test on the censored variable can be used to check the tail for normality and thus a correct ES forecast.

3.1.1 Transforming the data

Berkowitz uses the transformation by Rosenblatt (1952) to transform the market realisations xt in the following way:

yt“

żxt

´8

ppuq du “ P pxtq,

where xt is the market realisation at time t with probability density function (PDF) f pxtq

and cumulative function (CDF) F pxtq, pp¨q the ex ante forecasted loss density and P p¨q the ex

ante forecasted distribution. Rosenblatt showed that yt i.i.d.„ U p0, 1q if pp¨q, P p¨q are correctly

specified.

In Proposition 1 (Berkowitz 2001, pp. 467) the author proposes the use of the probability integral transformation method to transform the data to normality:

zt“ Φ´1 ”żxt ´8 ppuq du ı ,

where Φ´1_{p¨q is the inverse of the standard normal CDF and z} t

i.i.d.

„ N p0, 1q. The main ar-gument put forward by the author for transforming to a Normal distribution is that, besides being computationally trivial, it is straightforward to calculate the Gaussian likelihood and construct likelihood-ratio (LR), Lagrange multiplier (LM), or Wald statistics after the data are transformed to a normal distribution.

(14)

Sander Nijland 3 THE DIFFERENT BACKTESTS

He then shows in Proposition 2 (Berkowitz 2001, pp. 467) that the transformed data have, in some respect, the same interpretation as the original untransformed data. More formally:

log « f pxtq ppxtq ff “ log « hpztq φpztq ff ,

where φp¨q is the standard normal PDF and hpztq the density of zt. The above result shows that

inaccuracies in the density forecast pp¨q will be preserved in the transformed data. Berkowitz then proposes a likelihood-ratio (LR) test to test for independence across observations and a mean and variance equal to 0 and 1 respectively.

3.1.2 Constructing the LR test

To test ES, an LR test based on a censored likelihood is proposed. Berkowitz notes that only the shape of the tail needs to be correctly specified; by using a censored likelihood approach any misspecification of the interior of the distribution is ignored. The following censored variable is defined to construct a censored Normal distribution:

z˚_t “ $ ’ & ’ % ´V aRα if ztě ´V aRα, zt if ztă ´V aRα,

where V aRα is the α-percentile of the transformed distribution zt (see Section 2.1). This yields

the following log-likelihood:

Lpµ, σ|z˚q “ ÿ z˚tă´V aRα ´ ´1 2lnp2πσ 2 q ´ 1 2σpz ˚ t ´ µq2 ¯ ` ÿ z˚ t“´V aRα ln ´ 1 ´ Φ ´_{´V aR} α´ µ σ ¯¯ .

An LR test is used on the censored Normal distribution to test for a mean and variance of 0 and 1 respectively. More formally, the following hypotheses are defined:

H0: µ “ 0, σ2“ 1,

Ha: µ ă 0, σ2ą 1.

If either µ ă 0 or σ2 ą 1, than the density places greater probability mass in the tail region than an i.i.d. N p0, 1q does, signaling an incorrect ES.

(15)

3.2 Kerkhof & Melenberg (2004) - The Functional Delta Method

Instead of using the Gaussian probability integral transform method as in Berkowitz (2001), Kerkhof & Melenberg (2004) use a more general probability integral transformation method to obtain an i.i.d. sequence called the Functional Delta Method.

3.2.1 The Functional Delta Method

The Functional Delta Method is based on the expression as (see Van der Vaart 1998, Theorem 20.8): ? N rρpPtq ´ ρpP qs “ ? N 1 N N ÿ t“1 ψtpP q ` opp1q, ErψtpP qs “ 0, Erψt2pP qs ă 8,

where ρp¨q is the risk measurement function, Ptdenotes the empirical distribution of the random

sample xt of market realisations and ψtpP q denotes the influence function of the risk

measure-ment method ρ at observation t. The Functional Delta Method is a way of determining the distribution of a nonparametric estimator, i.e. the sample ES.

The main difference compared to Berkowitz, which uses a Gaussian probability integral trans-form, is that the Functional Delta Method allows a different distribution each period. The null hypothesis H0 : Pt“ F can be tested against numerous alternatives. We can use the following

test statistic (see Kerkhof & Melenberg 2004, pp. 1854):

SN “ ? NpρpPtq ´ ρpP qq? V d ÝÑ H0 N p0, 1q,

with V “ Erψt2pP qs and ρpP q evaluated under the null hypothesis Pt “ F . Because this test

statistic relies on asymptotic theory, the sample size used will have an effect on the performance of the statistic.

3.2.2 The Expected Shortfall case

In case for backtesting ES we have the following definitions for Erψ2ESpP qs and ρpP q (see

Kerkhof & Melenberg 2004, pp. 1856):

V aRpP q “ ´P´1 pαq, ESpP q “ ´1 α « żP´1pαq ´8 xdP pxq ` P´1pαq ˜ α ´ żP´1pαq ´8 dP pxq ¸ff , Erψ2ESpP qs “ 1 αErX 2 |X ď P´1pαqs ´ ESpP q2` 2 ´ 1 ´ 1 α ¯ ESpP qV aRpP q ´ ´ 1 ´ 1 α ¯ V aRpP q2,

(16)

where the term α ´şP_´8´1pαqdP pxq is zero for continuous distributions. By changing P we can accommodate different distributions of the underlying data. In case of a Normal distribution the above expressions will simplify as:

V aRpP q “ ´qα, ESpP q “ φpqαq{α, Erψ_ES2 pP qs “ 1 α ˜ 1 ´ qα φpqαq α ¸ ´ ˜ φpqαq α ¸2 ` 2 ˜ 1 ´ 1 α ¸ φpqαq α qα´ ˜ 1 ´ 1 α ¸ z2_α, where qα is the α-quantile of the distribution P. Backtesting is done, in this case a left sided

test, by the following hypotheses:

H0 : ρpPtq “ ρpF q,

Ha: ρpPtq ě ρpF q,

meaning an underestimation of the ES.

3.3 Wong (2008) - Saddlepoint approximations

Wong points out that because the number of V aR-violations n will be very small in practice the approach of both Kerkhof & Melenberg (2004) and Berkowitz (2001) will perform poorly when carried out on small samples because they depend heavily on asymptotic theory. The proposed saddlepoint technique makes use of a small sample asymptotic distribution that works reasonably well, even for n “ 1.

3.3.1 The Expected Shortfall random variable and its moments

Wong starts by using the simple average of the returns X1, . . . , XN with n ą 0 exceptions

(where Xi ă ´V aRα, so n refers to the number of V aR violations) as an estimator of the true

ES: ESn“ ´ 1 n n ÿ i“1 X_piq, (3.1)

where for i “ 1, . . . , N ´ 1, X_piq is the order statistic such that X_piqď Xpi`1q.

Let X be the portfolio return which has a standard Normal distribution with CDF and PDF denoted by Φ and φ, respectively and q the α-quantile of the standard Normal distribution. Define the random variable Y “ IXă´V aRα ¨ X, where IXă´V aRα is an indicator function that

is 1 if the return X is below V aR and 0 otherwise, with distribution function:

(17)

where y ă q. The random variable Y is defined as the portfolio return conditional on X ă q. Given a sample Y1, . . . , Ynof exceedances above V aR, the sample version of (3.1) can be written

as: ESn“ ´ sY “ ´ 1 n n ÿ i“1 Yi.

The PDF of Y is given as f pyq “ α´1_{φpyq and the CDF by F pyq “ α}´1_{Φpyq, where y P p´8, qq.}

In other words Y is censored normal distributed. The saddlepoint technique approximates the CDF of the mean of a sequence of truncated observations. This is done using the Cumulant Generating Function (CGF) of the truncated random variable, which is defined as Kptq “ ln M ptq, where M ptq is defined as the Moment Generating Function (MGF) of the variable Y . As can be seen in Proposition 1, the MGF of Y is defined as MYptq “ α´1 exppt2{2q ¨ Φpq ´ tq,

which can be used to calculated the different moments of Y and as a corollary the mean and variance of Y .

3.3.2 The Lugannani and Rice formula

Asymptotically ?nσ_Y´1p sY ´ µyq is standard Normal, but this result is hardly valid for small

sample sizes one may encounter in practice. As a remedy the author uses the saddlepoint technique provided by Lugannani & Rice (1980) to accurately calculate the tail probability of the sample mean of an independently and identically distributed (i.i.d.) random sample Y1, . . . , Ynfrom a continuous distribution F with density f . This yields the following probability

(see Wong 2008, pp. 1407, Proposition 2):

P p sY ď ¯yq “ $ ’ & ’ % Φpςq ´ φpςq ¨ ´ 1 η ´ 1 ς ` Opn´3{2q ¯ for ¯x ă q, 1 for ¯x ě q,

where η “ ¯ωanK2_p¯_{ωq and ς “ sgnp¯}_ωqa_2np¯_{ω ¯}_{y ´ Kp¯}_{ωqq, where sgn(s) equals zero when s “ 0,}

or takes the value of 1(-1) when s ą 0ps ă 0q. Lastly, the saddlepoint ¯ω is defined as the solution of K1

p¯ωq “ ¯y.

Given n realized exceptions we can calculate the sample Expected Shortfall ESn “ ´ sY and

test whether the sample ES is too large, i.e.:

H0: ESn“ ES0,

Ha: ESną ES0,

(18)

simply given by the Lugannani and Rice formula as:

p-value “ P p sY ď ¯yq.

3.3.3 Change of CGF when number of violations is random

The method described above is valid in case the number of violations n above the V aR-level is known. When the number of violations is random, n becomes a random variable as well and will follow a Poisson process. To accommodate this new setup, the method described by Wong will be extended by changing the test statistic S “ _n1 řn_i“1Yito S1 “

řn

i“1Yi, were n is assumed

to be independent of Yi. This yields the following CGF:

KS1ptq “ K_PpK_Yptqq,

where KP is the CGF of the Poisson distribution, i.e. KPptq “ πpet´ 1q and π is the expected

number of exceedances T ¨ α. This yields the following expression for the CGF of S1_:

KS1ptq “ K_PpK_Yptqq, “ πpeKYptq ´ 1q, “ πpelnpMYptqq ´ 1q, “ πpMYptq ´ 1q,

where MYptq is the truncated CGF as defined above. The first and second derivatives of this

CGF are easily calculated:

K 1 S1ptq “ πM 1 Yptq, K 2 S1ptq “ πM 2 Yptq.

Because the test statistic has changed, the new expressions for η and ς will become: η “ ¯ωaK2S1p¯ωq and ς “ sgnp¯ωq

a

2p¯ω ¯y ´ KS1p¯ωqq, where the saddlepoint ¯ω is the solution of

K1S1p¯ωq “ S1. We will refer to this approach as the compound version of Wong because it uses

a compound CGF.

3.4 Van Straelen (2014) - Saddlepoint approximations

Whereas the approach by Wong (2008) only works under the assumption of normally distributed returns, Van Straelen (2014) improves on this idea by incorporating non-normal distributions. Under non-normality, the truncated MGF is not in a closed form expression and needs to be

(19)

estimated. Van Straelen uses a saddlepoint technique to approximate the truncated MGF. After the truncated MGF is approximated the approach set out by Wong to approximate the CDF can be used.

3.4.1 The truncated Cumulant Generating Function

Van Straelen starts by defining the MGF of a right-truncated variable Y in the following way:

MYptq “ M0ptq

Ftpqq

F0pqq

, (3.2)

where M0ptq is the MGF of the untruncated random variable and Ftpxq is the exponentially

tilted CDF of X, with density ftpxq “ etxf0pxq{M0ptq. Taking the logarithm of (3.2) gives the

tilted representation of the CGF:

KXptq “ K0ptq ` logrFtpqq{F0pqqs, (3.3)

where K0ptq is the CGF of the untruncated random variable.

3.4.2 Robinson-Hauschildt approximation

Now Van Straelen is concerned with estimating Ftpxq which can be done in two ways: using

again the Lugannani-Rice (LR) approximation or the Robinson-Hauschildt (RH) approximation. The author shows that the RH method is superior compared to the LR method, especially the second order RH (RH2) proves to be the best method. The RH2 is given as:

ˆ F_tRH2pxq “ Hpxq ` ept´ ¯ω0qx`K0p ¯ω0q´K0ptq`1₂uˆ2t " rΦpûtq ´ Hpxqs „ 1 ´λ3uˆ 3 t 6 ` λ4uˆ4t 24 ` λ2₃uˆ6_t 72  ´φpûtq „ λ3 6 pû 2 t ´ 1q ´ λ4 24` û 3 t´ ût ˘ ´λ 2 3 72` û 5 t ´ û3t ` 3ût ˘ * , (3.4)

where Hptq is the Heaviside function, which is 0 for t ă 0, 1 for t ą 0 and 1₂ for t “ 0.

λprq“ λprqp ¯ω0q “ K prq 0 p ¯ω0q rK20p ¯ω0qs r 2 , uˆt“ p ¯ω0´ tq b

K20p ¯ω0q, and ¯ω0 such that K

1

0p ¯ω0q “ x, (3.5)

where ¯ω0 is called the saddlepoint (note the subscript 0, meaning the saddlepoint of the

non-truncated CGF K0). The first and second derivative of ˆFtRH2pxq are given in Appendix A.

3.4.3 Approximating the truncated CGF

To approximate the CGF of the truncated variable Y given in (3.3) we replace the CDF’s in (3.3) with the RH2 approximation given in (3.4), resulting in:

ˆ

(20)

The first and second derivatives of ˆKYptq are given by:

ˆ K 1 Yptq “ K 1 0ptq ` ” ˆ F_tRH2pqq ı´1 _{B ˆ}_FRH2 t pqq Bt , (3.7) and ˆ K 2 Yptq “ K 2 0ptq ` ” ˆ F_tRH2pqq ı´1_B2_FˆRH2 t pqq Bt2 ´ # ” ˆ F_tRH2pqq ı´1 _{B ˆ}_FRH2 t pqq Bt +2 .

3.4.4 Approximating the CDF of the test statistic

Just as in the Wong paper, we are interested in the test statistic S “ 1_nřn

i“1Yi. Again, a

saddlepoint approximation is done by looking for ¯ω such that ˆK1Yp¯ωq “ s. After finding the

saddlepoint ¯ω the LR approximation can be used to approximate the CDF of S using the following formula: P pS ď sq “ $ ’ & ’ % Φpςq ´ φpςq ¨ ´ 1 η ´ 1 ς ` Opn ´3{2 q ¯ for s ă q, 1 for s ě q, where η “ ¯ω b n ˆK”_Xp¯ωq and ς “ sgnp¯ωq b

2np¯ωs ´ ˆKXp¯ωqq, where sgn(s) equals zero when s “ 0

, or takes the value of 1(-1) when s ă 0ps ą 0q.

3.4.5 Change of CGF when number of violations is random

Again, just as in Section 3.3.3, the CGF needs to be changed when the number of violations above the V aR-level is random. This means the test statistic is changed to S1

“řn_i“1Yi. The

CGF of the truncated variable Y and his first and second derivatives remain unchanged, the same holds for the expressions for η and ς. Using (3.2), the first and second derivative of MYptq

are: M 1 Yptq “ 1 F0pqq ” M 1 0ptqFtpqq ` M0ptqF 1 tpqq ı M”Yptq “ 1 F0pqq ” M”0ptqFtpqq ` 2M 1 0ptqF 1 tpqq ` M0ptqFt”pqq ı .

We will refer to this approach as the compound version of Van Straelen because it uses a compound CGF.

(21)

3.5 Acerbi & Szekely (2014) - Non-parametric test

Acerbi & Szekely (2014) adopt a standard hypothesis testing framework for ES analogous to the standard Basel V aR setting. By using a non-parametric approach, the authors are not concerned with specifying a certain underlying distribution.

We assume that for the returns Xt a real (unknowable) distribution Ft exists and is forcasted

by a model predictive distribution Pt. Moreover, Xt is assumed to be independent, but not

identically distributed. We will denote by V aRF

α,t and ESα,tF the value of the risk measures

when X „ F , i.e. the value of ES and V aR under the true distribution F . The following expression for ES is used:

ESα,t “ ´ErXt|Xt` V aRα,tă 0s. (3.8)

The null hypothesis generally assumes that the prediction is correct, while the alternative hy-potheses are chosen to be only in the direction of risk underestimation. Acerbi & Szekely propose three different tests.

3.5.1 Test 1: testing Expected Shortfall after Value at Risk

The first test is inspired by (3.8). Rewriting (3.8) gives the following result:

E ” X_t ESα,t ` 1 ˇ ˇ ˇXt` V aRα,t ă 0 ı “ 0.

If the V aRα,t has been tested already we can separately test the magnitude of the realized

exceptions against the model predictions. Define It“ pXt` V aRα,tă 0q, the indicator function

of an α-exception. The following test statistic is defined:

Z1pXq “ řT t“1 XtIt ESα,t NT ` 1, if NT “ řT

t“1It ą 0. This test is based on the assumption that the predictive distribution Pt

is the true distribution, i.e. H0 : Ptrαs “ F rαs

t , @t, where P rαs

t pxq “ minp1, Ptpxq{αq is the

distribution tail for x ă ´V aRα,t. The alternative hypothesis is based on the fact that Pt is

underestimating the risk, meaning that at least one predicted ES is smaller then the true ES, i.e.:

H1: ESα,tF ě ESα,t, @t and ą for some t,

V aRF_α,t“ V aRα,t, @t.

Under these conditions EH0rZ1|NT ą 0s “ 0 and EH1rZ1|NT ą 0s ă 0 meaning that the realized

(22)

3.5.2 Test 2: testing Expected Shortfall directly

A second test follows from the unconditional expectation.

ESα,t “ ´E « XtIt α ff .

This suggest defining the following test statistic:

Z2pXq “ T ÿ t“1 XtIt T α ESα,t ` 1.

Again, the test is based on the assumption that the predictive distribution Ptis the true

distribu-tion and the alternative hypothesis that Ptis underestimating the risk. Appropriate hypotheses

for backtesting the ES are:

H0: Ptrαs“ F rαs t , @t,

H1: ESα,tF ě ESα,t, @t and ą for some t,

V aRF_α,tě V aRα,t, @t.

In Test 2 the assumption of an already tested V aRα,t is dropped as can be seen by H1.

3.5.3 Test 3: estimating Expected Shortfall from realized ranks

Test 3 is based on the idea of backtesting the tails of a model by checking if the observed ranks Ut“ PtpXtq are i.i.d. U p0, 1q as they should if the model distribution is correct. They propose

the following (see Acerbi & Szekely 2014, pp. 5):

y ESpN q_α pÝÑY q “ ´ 1 rN αs rN αs ÿ i“1 Yi:N,

where rN αs is the integer part of N α and Yi:N denotes the increasing order statistic of Xt. This

ES is based on a vector of N i.i.d. drawsÝÑY “ tYiu. The test statistic is defined as:

Z3pXq “ ´ 1 T T ÿ t“1 y ESpT q_α pP_t´1pÝÑU qq EV ” y ESpT q_α pP_t´1pÝÑV qq ı ` 1,

where ÝÑV are i.i.d. U p0, 1q. The idea is that the entire vector of ranks ÝÑU “ tUtu is reused to

estimate ES in every past day t and the result is then averaged over the entire period. The denominator can be expressed in closed form:

EV ” y ESpT q_α pP_t´1pÝÑV qq ı “ ´ T rT αs ż1 0 I1´ppT ´ rT αs, rT αsqPt´1ppq dp,

(23)

where Ixpa, bq is a regularized incomplete beta function. Because in some cases the quantile

function P´1

t will have to be numerically solved I propose the following change of variable to

obtain an expression that does not depend on the quantile function:

x “ P_t´1ppq, EV ” y ESpT q_α pPt´1p Ý Ñ V qq ı “ ´ T rT αs ż8 ´8 I1´PtpxqpT ´ rT αs, rT αsqx ptpxq dx,

where ptpxq is the predictive density and Ptpxq the predictive distribution.

Again EH0rZ1s “ 0 and EH1rZ1s ă 0, but hypotheses involve the entire distributions this time:

H0 : Pt“ Ft, @t,

H1 : Ptľ Ft, for all t and ą for some t.

3.5.4 Significance of the tests

Finding the p-values of the three test described above is done by bootstrapping. Acerbi & Szekely (2014) proposed the following steps:

1. Simulate independent Xi

t from Pt, @t, @i “ 1, . . . , B

2. Using the simulations from previous step, compute for i P t1, 2, 3u Zi “ ZpÝÑXiq

3. Estimate the p-value as p “řB_i“1pZi ă ZpÝxÝobsÑqq{B, where B is the number of bootstrap

samples and ZpÝÝÑxobsq denotes the observed value on Zi.

For the number of bootstrap samples B we use B “ 399, which is generally used in the literature. Although the number is rather small, the idea is that any sampling variation due to a small B will cancel out across the Monte Carlo simulations.

(24)

3.6 Summary

In Table 3.1 the different backtests discussed in Chapter 3 are summarized according to their fundamental properties, i.e., if they rely on parametric assumptions or not and if simulation is needed to test the null hypothesis.

Parametric assumptions

Yes No

Sim

ulation

Yes Acerbi and Szekely

No

Berkowitz Kerkhof & Melenberg

Wong Van Straelen

(25)

4 Monte Carlo simulation

In this chapter the Monte Carlo setup will be discussed and the results presented. Both the size and power of the different tests are considered as criteria for selecting the most appropriate backtest available. Ideally the tests would be size correct, meaning that the approximations matches the theoretical significance set beforehand. At the same time a high power is preferred, meaning that the probability of a Type-II error (accepting a false null hypothesis) is low.

4.1 Monte Carlo setup

Because not all tests are based on the same distributional assumptions, we start with the assumption set under which all backtests are correct. All the backtests considered are applicable to the Normal distribution. Based on the Normal assumption a Monte Carlo simulation is done and we compare size and power based on 10.000 Monte Carlo replications (Normal case). Next, we change the assumption set to a more realistic set that better describes real market data (heavy tails etc.). We consider the generalized hyperbolic (GHyp) distribution. The GHyp is a very flexible distribution which nests many other distributions, including the Normal (as a limit), standardized Student’s-t, and Hyperbolic distributions. In practice we will use a standardized Generalized Hyperbolic distribution when we consider non-estimation errors (see Section 4.3 below for a detailed explanation why). Again, we run a Monte Carlo experiment with 10.000 replications (GHype case). In Appendix B the main characteristics of the Ghype distribution are summarized: the Probability density function (PDF), Cumulant Generating Function (CGF) and the first four derivatives of the CGF. For the models that are designed to work with the Normal distribution (Berkowitz and Wong) we use the inverse probability transformation to transform the data to normality as proposed by the authors of the backtests in question. For a graphical comparison between the Normal distribution and the GHyp distribution see Figure 4.1.

When backtesting ES, the number of violations above the V aR-level are of interest. Because of this we can consider two different types of Monte Carlo setups, namely an unconditional and a conditional setup:

1. unconditional: the number of violations above the V aR-level is random for every gen-erated set

2. conditional the number of violations above the V aR-level is fixed for every generated set.

(26)

Sander Nijland 4 MONTE CARLO SIMULATION x -6 -4 -2 0 2 4 6 P(X=x) 0 0.1 0.2 0.3 0.4 0.5

0.6 Normal vs Generalized Hyperbolic

Normal Ghype Std GHype

Figure 4.1: N(0,1), GHyp(1, 0.91, 0.0036, 1,0) and standardized GHyp(1, 0.91, 0.0036, 1,0)

The conditional setup is the most natural one to use in the backtesting framework. In prac-tice one uses a sample of N observations which includes n violations above the V aR level. Conditional on the number of violations n the ES is tested, resulting in a conditional test.

In the unconditional case we consider sample sizes of 125, 250 and 500 observations. This corresponds to half a year, a year and two years of data, respectively. In the conditional case the most natural way to generate conditional sets is to only keep the sets where the number of violations matches the expected number of violations given the α-percentile used for the ES:

expected number of violations “ N ¨ α, (4.1)

where N is the sample size. In this way we do not alter the underlying distribution of the data. We consider 1, 3, and 5 violations, meaning (using α “ 0.01) we have close to half a year, a year and two years of observations. In case of the Acerbi tests where the results are bootstrapped we keep the bootstrapping setup identical to the one proposed by Acerbi. This means that for the bootstrap we use an unconditional approach.

4.2 Estimation errors

Next, estimation errors in the parameters are considered for both distributions. For each new sample that is generated from the distribution under the null, parameter estimation is done via Maximum Likelihood. The new parameters are then used to calculate the new V aR and

(27)

ES-level and the different backtests are computed. The goal is to see how the size and power change when estimation errors are taken into account.

4.3 Power

To investigate the power of the different tests we draw samples from a different distribution then the distribution under the null and treat this data at it was generated under the null. We consider a standardized Student’s-t distribution and an stationary GARCH(1,1) process with Gaussian innovations. The degrees of freedom used depends on the true distribution: when the standardized Student’s-t distribution given a certain number of degrees of freedom is to extreme/limited compared to the V aR-level of the true distribution, it is impossible to generate sample sizes as described by (4.1).

The standardization approach for the Student’s-t is used to make sure that the distributions are in some way comparable (having mean and variance of 0 respectively 1); in other words, we try to come up with a fair (false) alternative distribution that is not to far away from the true distribution. This also means that in the GHype case with non-estimation errors we standardize the GHype distribution to make sure it is again a fair comparison. For the estimation case this is not needed because the very flexible GHype distribution is first fit to the standardized Student’s-t data, standardizing it automatically.

4.4 Hypothesis testing

Two types of test will be considered: one-sided (left) and two-sided. The one-sided test will test if the predicted Expected Shortfall ES0 underestimate the risk in terms of realised Expected

Shortfall yES :

H0 : yES “ ES0 and Hα: yES ą ES0.

If the null hypothesis is rejected, the predicted ES0 is said to underestimate the risk. A

one-sided test is more of interest from a regulatory point of view because it tests if the financial institution holds enough capital as a safeguard for financial distress.

A two-sided test will test if the predicted Expected Shortfall ES0 correctly estimates the risk

in terms of realised Expected Shortfall yES. This means that the prediction is not to optimistic (underestimating the risk), nor to conservative (overestimate the risk):

(28)

Sander Nijland 4 MONTE CARLO SIMULATION

H0 : yES “ ES0 and Hα: yES ‰ ES0.

A two-sided test is more of interest for the financial institution, meaning they have the right amount of capital as a reserve and thus the most capital available for their investors (for example to pay out dividends).

4.5 Normal distribution, unconditional

Table 4.1 summarizes the results for the Normal case with a random number of violations and no estimation error. Note that the Van Straelen test is left out of the comparison in case of a Normal distribution; it can be shown that the results of the RH2 approximation converge to the true truncated CGF when the Normal distribution is used, resulting in the Wong results. For completeness the Wong test is shown, which is not designed to work in a unconditional setup.

In Panel A of Table 4.1 empirical test sizes are shown for different sample sizes N in case of a left-sided and two-sided test. In terms of size the Wong compound and the Acerbi tests outperform the Berkowitz and Kerkhof tests for all sample sizes N in case of a left-sided test. Especially the Kerkhof test is significantly size incorrect and needs a bigger sample size to be closer to the theoretical significance level. When N “ 500 the Berkowitz test performs more or less the same as the Wong compound and the Acerbi tests. The Wong test performs to conservative for smaller samples, for N “ 500 the results are close to the compound Wong test. If a two sided test is considered the Acerbi Z3 test has the best empirical size for all samples size. Berkowitz is a good second candidate for sample sizes N “ 125, 250 while the Wong compound test is more favourable when N “ 500. The Wong test shows to be size correct when a two-sided test is used for all samples sizes.

Panel B, C and D show the power of the different tests when the data is drawn from a standard-ized Student’s-t distribution with 3, 5 or 7 degrees of freedom (see Figure D.1 for a graphical comparison). Overall, the Acerbi Z1 and Z3 tests perform best in terms of highest power when a left-sided test is considered. In some cases the Berkowitz test is close to the Acerbi Z1 and Z3 tests. The Wong test has higher power than the Wong compound test. When a two-sided test is considered the Acerbi tests and the Wong test have the highest power.

In Panel E, the power is shown when the data are simulated from a stationary GARCH(1,1) model. The overall power is lower compared to Panel B, C and D. For all left-sided cases the Berkowitz, Acerbi Z2, and Wong compound tests perform best, although the power is still very

(29)

low. When N “ 125, 250 the Wong and Acerbi Z2 model perform equally well. In the two-sided the Acerbi Z3 test and Wong compound test has the highest power.

4.5.1 Estimation error

In Table 4.2 we summarize the results when estimation errors are considered. We can see clearly that the overall test sizes drop compared to Panel A of Table 4.1. The only test that still produces sizes close to the theoretical ones for all sample sizes N is the Acerbi Z1 test. For N “ 125, 250 the Berkowitz test is close to the theoretical significance level, although it gets too conservative when N “ 500. Again the Kerkhof test is too conservative. The main reason why the Wong compound test loses size is due to the fact that because of the estimation procedure, the number of violations becomes more concentrated around the expected number of violations. In other words, the sample becomes more conditional on the number of violations. The Wong compound test is designed to work in an unconditional setting, which is in some way violated by estimating the distribution first. Simulations results shown (see Table 4.3) that if we consider an conditional test and use the Wong compound test, empirical test sizes are close to zero. This supports the reason for loss of size. Something similar happens for the Acerbi Z2 test because we do not estimate within the bootstrap. The Wong test shows higher sizes as was expected due to the semi conditional setup due to estimation. If a two-sided test is considered the Berkowitz test performs close to the theoretical significance level for all sample sizes N , the same holds for the Wong test.

In terms of power when a standardized Student’s-t is used (see Panel B, C and D) we see that the Acerbi Z3 test performs best, closely followed by the Z1 test. When N increases the Berkowitz test performs somewhat in the range of the Acerbi tests. For a two-sided test we see that the Berkowitz, Wong and Azerbi Z3 tests perform best. In Panel E we see rather low powers for the GARCH(1,1) model, left-sided tests all perform somewhat the same and have power close to the theoretical sizes set beforehand. For a two-sided test all tests perform more or less the same.

(30)

Sander Nijland 4 M ONTE CARLO SIMULA TION

Table 4.1: Size and power without estimation error: Normal(0,1)-sample. Unconditional Distribution

Method Left-sided Two sided

Panel A: standard Normal (theoretical significance 5%)

N Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 125 0,0599* 0,0187* 0,0349* 0,0437 0,0533 0,0515 0,0504 0,0555 0,2952* 0,0554 0,3128* 0,0356* 0,0402* 0,0504 250 0,0682* 0,0312* 0,0427 0,0449 0,0492 0,0484 0,0505 0,0524 0,0968* 0,0589 0,1003* 0,0349* 0,0376* 0,0493 500 0,0539 0,0315* 0,0481 0,0516 0,0505 0,0523 0,0487 0,0653* 0,0247* 0,0565 0,0605* 0,0492 0,0491 0,0460

Panel B: standardized Student’s-t with 3 degree of freedom

N Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 125 0,3481 0,3548 0,4470 0,2462 0,4524 0,2570 0,4668 0,3186 0,4874 0,5913 0,1857 0,3259 0,1553 0,4220 250 0,5596 0,5756 0,6619 0,3442 0,6301 0,3488 0,6609 0,5202 0,5500 0,6516 0,2648 0,4674 0,2242 0,5867 500 0,8035 0,8133 0,8753 0,4926 0,8519 0,4912 0,8432 0,7758 0,7632 0,8457 0,4107 0,7173 0,3376 0,7835

Panel C: standardized Student’s-t with 5 degree of freedom

Panel D: standardized Student’s-t with 7 degree of freedom

Panel E: GARCH(1,1) with ω “ 0.01, α “ 0.04 and β “ 0.95

* Empirical test size differs significantly from theoretical test size at 1% level

(31)

Nijland 4 M ONTE CARLO SIMULA TION

N Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 125 0,0510 0,0184* 0,0308* 0,0149* 0,0537 0,0192* 0,0383* 0,0537 0,2632* 0,0541 0,2608* 0,0421 0,0298* 0,0402* 250 0,0585 0,0316* 0,0397* 0,0191* 0,0504 0,0198* 0,0372* 0,0483 0,0778* 0,0545 0,0627* 0,0399* 0,0307* 0,0384* 500 0,0357* 0,0298* 0,0470 0,0208* 0,0496 0,0218* 0,034* 0,0569 0,0198* 0,0526 0,0305* 0,0485 0,038* 0,0413

(32)

4.6 Normal distribution, conditional

In Table 4.3 the results are summarized for the conditional backtests. In Panel A we can see clearly that the Wong test is superior compared to the other test if we consider left-sided test sizes. The Wong test is size correct even for n “ 1 due to the small sample asymptotic distribution used. The other backtests break down when conditional tests are considered. As mentioned before, the Wong compound test does not work in a conditional setup. The same holds for a two-sided test; the Wong test performs best. A second candidate is the Acerbi Z3 test when bigger sample sizes are considered. The Acerbi Z2 test shows sizes of zero percent. The main reason for this is that the observed set is a conditional set but the bootstrap set is a unconditional set. Because the Acerbi Z2 test is concerned with testing the ES directly, the bootstrap ES (based on a unconditional set) is greater than the observed ES under the conditional assumption.

Panel B, C and D show the power when the data are drawn from a standardized Student’s-t distribution. Overall, the Wong test performs best by showing the highest power compared to the other tests. Good second candidates are the Acerbi Z1 and Z3 tests which perform almost the same as the Wong test in terms of power but lack correct test sizes. The Acerbi Z2 test shows very poor performance. If two-sided test are considered we see that the Wong test is most favourable.

In Panel E the power is shown when data are drawn from a stationary GARCH(1,1) model. We see very low power; some results are close to the theoretical size. Again, the Wong test proves to be the most powerful, although the differences are rather small. The Berkowitz test performs best when a two-sided test is used.

The results when estimation errors are considered are shown in Table 4.4. The same conclusions hold as in the case when estimation errors are not considered; the Wong test is superior compared to the other tests. When n “ 3 or greater the Acerbi Z1 is a good second candidate with empirical test sizes close to the theoretical ones, but lacks some power compared to the Wong test. When a two-sided test is considered the Wong test is superior compared to the other tests.

(33)

n Berkowitz Kerkhof Wong Wong Compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Berkowitz Kerkhof Wong Wong Compound Acerbi Z1 Acerbi Z2 Acerbi Z3 1 0,0445 0,0291* 0,0500 0,0000* 0,0993* 0,0000 0,0535 0,1972 0,0166* 0,1050* 0,0001* 0,0728* 0,0711* 0,0595* 3 0,0222* 0,0251* 0,0509 0,0001* 0,0440 0,0000 0,0344* 0,1807 0,0126* 0,0499 0,0001* 0,0321* 0,1109* 0,0452 5 0,0158* 0,0197* 0,0486 0,0002* 0,039* 0,0000 0,0278* 0,1776 0,0097* 0,0518 0,0002* 0,0347* 0,1413 0,0444

n Berkowitz Kerkhof Wong Wong Compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Berkowitz Kerkhof Wong Wong Compound Acerbi Z1 Acerbi Z2 Acerbi Z3 1 0,2353 0,3473 0,3866 0,0326 0,3887 0,3887 0,3887 0,3042 0,3098 0,3652 0,0193 0,3030 0,3030 0,3030 3 0,5152 0,6201 0,6815 0,0544 0,6802 0,6802 0,6802 0,5242 0,5654 0,6280 0,0337 0,5766 0,5766 0,5766 5 0,7031 0,7752 0,8334 0,0768 0,8167 0,0787 0,7940 0,6887 0,7270 0,7921 0,0456 0,6860 0,0400 0,6901

(34)

Table 4.4: Size and power with estimation error: Normal(0,1)-sample. Conditional Distribution

n Berkowitz Kerkhof Wong Wong Compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Berkowitz Kerkhof Wong Wong Compound Acerbi Z1 Acerbi Z2 Acerbi Z3 1 0,0266* 0,0280* 0,0460 0,0058* 0,1023* 0,0055* 0,0549 0,1716* 0,1316* 0,1724* 0,0009* 0,0595* 0,0510 0,0483 3 0,0287* 0,0317* 0,0520 0,0035* 0,0523 0,0037* 0,0377* 0,1346* 0,0213* 0,0476 0,0011* 0,0368* 0,0675* 0,0447 5 0,0182* 0,0235* 0,0517 0,0031* 0,0459 0,0026* 0,0315* 0,1145* 0,0116* 0,0471 0,0032* 0,0353* 0,0723* 0,0367*

Panel E: GARCH with ω “ 0.01, α “ 0.04 and β “ 0.95

(35)

4.7 Generalized hyperbolic distribution, unconditional

Table 4.5 summarizes the results for the Generalized Hyperbolic case with random violations and no estimation error. Because we do not consider estimation errors the data are standardized (see section 4.3 for a detailed explanation why). For the models that are designed to work with a Normal distribution (Berkowitz and Wong) the inverse probability transformation is used as suggested by the authors (see Section 3.1.1 for more details on this).

In Panel A empirical test sizes are shown for different sample sizes N . In general the results are the same as in the Normal case when we consider left-sided tests (see Tabel 4.1). Although never shown in the articles, the inverse probability transformation is proven to be very powerful; the results of the Normal distribution are indistinguishable of the GHype distribution results for the Berkowitz, and Wong models. The Van Straelen test shows similar results as the Wong Compound test, although for lower values of N the results are a bit closer to the theoretical test size. The Wong test is only size correct when bigger samples are considered. The Kerkhof test needs a bigger sample size to be closer to the theoretical significance level.

Panel B, C and D show the power of the different tests when the data are drawn from a standardized Student’s-t distribution. This time we consider 3, 5 and 7 degrees of freedom. A graphical comparison of the standardized Student’s-t distributions and the standardized GHyp can be found in Figure E.1. Overall, the Wong compound and Acerbi Z1 and Z3 tests have the highest power.

In Table 4.6 we summarize the results when estimation errors are considered. We can see clearly that the overall test sizes change compared to Panel A of Table 4.5. For N “ 125, 250 the Berkowitz test shows size correct results, but gets to conservative when N “ 500. Kerkhof shows similar results. The sudden drop of size in the compound saddlepoint approximation models is discussed in Section 4.5.1. The Wong test is still size correct for samples sizes n “ 250, 500. Lastly, the Acerbi tests become to optimist when N increases, although for N “ 125, 250 the Z2 test performs close to the theoretical test size.

In terms of power the Wong test and the Acerbi Z1 and Z3 tests performs best by showing the highest power for different degrees of freedom (see Panel B and C). If a two-sided test is considered the Wong test and Acerbi Z3 test are the most powerful. See Figure F.1 for a graphical comparison of the Ghype and standardized Student’s-t distributions.

(36)

Table 4.5: Size and power without estimation error: standardized GHype(1,0.91,0.0036,1,0)-sample. Unconditional Distribution

Panel A: standardized GHyp (theoretical significance 5%)

N Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Straelen compound Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Straelen compound 125 0,0527 0,0192* 0,0321* 0,0423 0,0458 0,0482 0,0467 0,0468 0,3437* 0,3056* 0,3456* 0,0275* 0,0356* 0,0386* 0,0481 0,0243* 250 0,0655* 0,0363* 0,0454 0,0461 0,0509 0,0480 0,0498 0,0472 0,1309* 0,1034* 0,1343* 0,0226* 0,0378* 0,0369* 0,0495 0,0216* 500 0,0578 0,0350* 0,0492 0,0499 0,0509 0,0515 0,0503 0,0492 0,0689* 0,0301* 0,0607* 0,0628* 0,0496 0,0538 0,0534 0,0597*

N Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Straelen compound Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Straelen compound

125 0,1384 0,1573 0,2067 0,0796 0,2274 0,1133 0,2248 0,0719 0,4032 0,4163 0,4712 0,0453 0,1378 0,0706 0,2018 0,0429

250 0,2236 0,2826 0,3354 0,0945 0,3237 0,1419 0,3244 0,0854 0,2682 0,3207 0,3705 0,0565 0,1972 0,0958 0,2675 0,0536

500 0,3525 0,4427 0,4944 0,1004 0,4842 0,1628 0,4263 0,0888 0,3166 0,3867 0,4380 0,0862 0,3307 0,1078 0,3467 0,0840

125 0,0746 0,0633 0,0949 0,0504 0,1130 0,0664 0,1177 0,0354 0,3634 0,3491 0,3959 0,0319 0,0659 0,0416 0,1011 0,0159

250 0,1019 0,1100 0,1354 0,0519 0,1372 0,0669 0,1442 0,0306 0,1659 0,1705 0,2099 0,0289 0,0740 0,0505 0,1113 0,0143

500 0,1138 0,1487 0,1857 0,0559 0,1847 0,0701 0,1670 0,0268 0,1097 0,1198 0,1568 0,0680 0,1119 0,0560 0,1242 0,0513

125 0,0563 0,0289 0,0424 0,0276 0,0584 0,0355 0,0562 0,0146 0,3932 0,3629 0,4097 0,0162 0,0399 0,0304 0,0642 0,0053

250 0,0787 0,0567 0,0606 0,0275 0,0734 0,0313 0,0587 0,0111 0,1840 0,1610 0,1942 0,0126 0,0471 0,0332 0,0689 0,0037

500 0,0640 0,0712 0,0782 0,0216 0,0890 0,0235 0,0575 0,0077 0,0766 0,0608 0,0882 0,0769 0,0750 0,0538 0,0702 0,0690

(37)

Panel A: GHyp (theoretical significance 5%)

N Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Straelen compound Berkowitz Kerkhof Wong Wong compound Acerbi Z1 Acerbi Z2 Acerbi Z3 Straelen compound 125 0,0450 0,0212* 0,0291* 0,0108* 0,0754* 0,0433 0,0583 0,0122* 0,0488 0,2195* 0,0534 0,2098* 0,0562 0,0485 0,0497 0,0029* 250 0,0533 0,0407* 0,0428 0,0128* 0,0767* 0,0553 0,0693* 0,0129* 0,0439 0,074* 0,0495 0,0493 0,0546 0,0634* 0,0622* 0,0025* 500 0,0318* 0,0336* 0,0444 0,014* 0,0759* 0,0862* 0,0891* 0,0133* 0,0577 0,0218* 0,0485 0,0184* 0,0704* 0,0941* 0,0853* 0,0172*

125 0,1308 0,1669 0,2257 0,0760 0,2650 0,1514 0,2794 0,1179 0,1107 0,2646 0,3266 0,0368 0,1631 0,0773 0,1847 0,0657

250 0,2282 0,2804 0,3590 0,1064 0,3333 0,1905 0,3913 0,1691 0,1818 0,2523 0,3192 0,0515 0,1934 0,1008 0,2716 0,0973

500 0,3746 0,4243 0,5236 0,1680 0,4725 0,2796 0,5508 0,2659 0,3197 0,3533 0,4510 0,0914 0,2852 0,1310 0,4086 0,1708

125 0,0542 0,0666 0,0874 0,0172 0,1454 0,0636 0,1308 0,0241 0,0533 0,2235 0,2599 0,0063 0,0921 0,0560 0,0877 0,0074

250 0,0714 0,1154 0,1433 0,0233 0,1661 0,0864 0,1759 0,0302 0,0533 0,1248 0,1495 0,0068 0,1034 0,0808 0,1170 0,0088

500 0,0938 0,1610 0,2029 0,0316 0,2135 0,1267 0,2320 0,0385 0,0885 0,1194 0,1526 0,0183 0,1207 0,1011 0,1523 0,0209

(38)

4.8 Generalized hyperbolic distribution, conditional

Lastly we consider the conditional approach. In Panel A of Table 4.7, results are shown for the empirical test sizes. Again, we use a standardized GHype because we do not consider estimation errors. Clearly the Wong and Van Straelen test perform best when left-sided tests are considered; for all values of n the sizes are close to the theoretical sizes set beforehand. Even in the two-sided case when n “ 3, 5 the sizes are close to the theoretical one. The Wong compound test, which is designed to work in an unconditional setup, is clearly incorrect in terms of test sizes. We can conclude that the Wong and Van Straelen test are the best of both worlds: the estimated ES is not to optimistic nor too conservative.

In Panel B, C and D the power is shown for different degrees of freedom when a standardized Student’s-t distribution is used. The Van Straelen test shows the highest power. A second candidate in terms of power are the Acerbi Z1 and Z3 tests, although they are too conservative in terms of size.

In Table 4.8 the results are shown when we include estimation errors. In general the empirical test sizes increases compared to Panel A of Table 4.7. Again, the Wong and Van Straelen test perform best in terms of left sizes test sizes. If two-sided test are considered Wong and Van Straelen perform again size correct for n “ 3, 5.

In Panel B and C we see that all tests show high power when the incorrect standardized Student’s-t distributions needs to be rejected. Again, the Van Straelen test shows the highest power; both left and two-sided.

15 years of backtesting expected shortfall : what do we know?

what do we know?

Sander Nijland

Contents

1

Introduction

2

History of risk measures and their desired properties

3

The different backtests

4

Monte Carlo simulation