

i.e. a function that assigns a numerical value S(P, y) to a combination of a distribution P ∈ P and observed outcome y ∈ Ω.

Intuitively, an appropriate scoring rule should be expected to give the highest score to the true underlying distribution of the observed outcomes. The latter can be formalized by the notion of a proper scoring rule S, which must satisfy

\[
S(P, P) := \mathbb{E}_{Y \sim P}\, S(P, Y) \;\geq\; \mathbb{E}_{Y \sim P}\, S(Q, Y) =: S(Q, P), \quad \text{for all } P, Q \in \mathcal{P}. \tag{5.2}
\]

That is, a proper scoring rule attains its maximal expected value at the true distribution.

The divergence function DS associated with a scoring rule S is given by

\[
D_S(P, Q) := S(P, P) - S(Q, P), \quad \text{for } P, Q \in \mathcal{P}. \tag{5.3}
\]

Note that a scoring rule S is proper if and only if the divergence function D_S is non-negative, since D_S(P, P) = 0 for all P ∈ P.

Logarithmic Scoring Rule

It is assumed from now on that y ∈ Ω = R and that distributions P, Q ∈ P can be represented by density functions p, q with support over R.

The logarithmic scoring rule Sl is given by

\[
S_l(P, y) = S_l(p, y) := \log p(y). \tag{5.4}
\]

Diks et al. (2011) show that the logarithmic scoring rule is proper by noting that its divergence function D_{S_l} equals the non-negative Kullback-Leibler divergence D_{KL}:

\[
0 \leq D_{KL}(P, Q) := \mathbb{E}_P\!\left[\log \frac{dP}{dQ}\right]
= \int_{-\infty}^{\infty} \log\!\left(\frac{p(y)}{q(y)}\right) p(y)\, dy
= \int_{-\infty}^{\infty} \log p(y)\, p(y)\, dy - \int_{-\infty}^{\infty} \log q(y)\, p(y)\, dy
= D_{S_l}(P, Q). \tag{5.5}
\]
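As a quick numerical check of (5.5) (a sketch, not from the source: the two Gaussian distributions below are purely illustrative), the divergence of the logarithmic scoring rule computed by numerical integration can be compared against the closed-form Gaussian Kullback-Leibler divergence:

```python
# Sketch: verify numerically that D_{S_l}(P, Q) equals D_{KL}(P, Q) for two
# illustrative Gaussian distributions P and Q.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p = norm(loc=0.0, scale=1.0)   # "true" distribution P (hypothetical)
q = norm(loc=0.5, scale=1.5)   # competing distribution Q (hypothetical)

# D_{S_l}(P, Q) = E_P[log p(Y)] - E_P[log q(Y)], by numerical integration
d_sl, _ = quad(lambda y: (p.logpdf(y) - q.logpdf(y)) * p.pdf(y), -np.inf, np.inf)

# Closed-form KL divergence between two univariate Gaussians
mu_p, s_p, mu_q, s_q = 0.0, 1.0, 0.5, 1.5
d_kl = np.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2) - 0.5

print(np.isclose(d_sl, d_kl))  # True, and both are non-negative
```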

Moreover, they argue that log-likelihood based scoring rules such as the logarithmic scoring rule have desirable properties: the relative scores are invariant to smooth transformations of the outcome space Ω, and tests based on these scoring rules lead to Likelihood Ratio test statistics, which are known for their optimal power.

The property of invariance under smooth transformations is especially useful when modelling IS. For example, changing IS from percentages to basis points or monetary units does not affect relative logarithmic scores. The invariance property is illustrated in Appendix Section (A.2).
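As a rough numerical illustration of this invariance (a sketch, not from the source: the two Gaussian density forecasts, the observed value, and the percentage-to-basis-point rescaling are all hypothetical), the snippet below checks that the difference in logarithmic scores between two models is unchanged when the outcome is rescaled, because the Jacobian term of the transformation is common to both models and cancels:

```python
# Sketch: invariance of relative logarithmic scores under a smooth (here linear)
# transformation of the outcome. All values below are purely illustrative.
import numpy as np
from scipy.stats import norm

y = 1.3                         # observed outcome, in percentages (hypothetical)
p1 = norm(loc=1.0, scale=0.5)   # density forecast of model 1
p2 = norm(loc=2.0, scale=1.0)   # density forecast of model 2

# Relative logarithmic score in the original units
score_diff_y = p1.logpdf(y) - p2.logpdf(y)

# Rescale the outcome: z = g(y) = 100 * y (percentages -> basis points).
# The transformed densities pick up a Jacobian factor 1/|g'(y)| = 1/100,
# i.e. log q_k(z) = log p_k(z / 100) - log(100).
z = 100.0 * y
score_diff_z = (p1.logpdf(z / 100.0) - np.log(100.0)) - (p2.logpdf(z / 100.0) - np.log(100.0))

print(np.isclose(score_diff_y, score_diff_z))  # True: the Jacobian terms cancel
```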

Weighted Logarithmic Scoring Rules

Depending on the practical application, one might want to consider scoring rules that give more weight to certain regions of the density according to some weight function w(y).

A naive approach would be to simply scale the logarithmic scoring rule S_l with some weight function w to obtain a weighted scoring rule S_{wl}:

\[
S_{wl}(p, y) := w(y) \log p(y). \tag{5.6}
\]

This scoring rule is generally not proper, however. Diks et al. (2011) therefore consider two alternatives: the conditional likelihood scoring rule S_{cl} and the censored likelihood scoring rule S_{csl}:

\[
S_{cl}(p, y) := w(y) \log\!\left(\frac{p(y)}{\int_{-\infty}^{\infty} w(s)\, p(s)\, ds}\right), \tag{5.7}
\]

\[
S_{csl}(p, y) := w(y) \log p(y) + \big(1 - w(y)\big) \log\!\left(1 - \int_{-\infty}^{\infty} w(s)\, p(s)\, ds\right). \tag{5.8}
\]

Both S_{cl} and S_{csl} are proper under mild conditions⁴ on the weight function w. In particular, the conditions are satisfied for the weight functions w_{middle} and w_{tails} given by

\[
\begin{cases}
w_{middle}(y) := \mathbf{1}\{y_l \leq y \leq y_u\} \\
w_{tails}(y) := 1 - w_{middle}(y)
\end{cases}
\quad \text{where } y_l, y_u \in \mathbb{R},\; y_l < y_u. \tag{5.9}
\]

Weight function w_{tails} is an indicator for the tails of the distribution, which might be of interest for risk management purposes. Similarly, w_{middle} is an indicator for the middle part of the distribution, which might be of interest for producing prediction intervals. These weight functions are used for the model assessment in Chapter 6.
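As a minimal sketch of how (5.7)–(5.9) could be evaluated, the snippet below computes the conditional and censored likelihood scores for a single observation under an assumed Gaussian density forecast and the tails weight w_{tails}; the thresholds y_l, y_u, the forecast parameters and the function names are hypothetical, not taken from the source. For an indicator weight, the integral ∫ w_{tails}(s) p(s) ds reduces to F(y_l) + 1 − F(y_u), with F the forecast CDF.

```python
# Sketch: conditional (5.7) and censored (5.8) likelihood scores with the tails
# weight function of (5.9), for an assumed Gaussian density forecast.
import numpy as np
from scipy.stats import norm

def w_tails(y, y_l, y_u):
    """Indicator weight for the tails: 1 outside [y_l, y_u], 0 inside."""
    return 1.0 - ((y_l <= y) & (y <= y_u)).astype(float)

def weighted_scores_gaussian(y, mu, sigma, y_l, y_u):
    """Return (S_cl, S_csl) for a N(mu, sigma^2) density forecast at outcome y."""
    p = norm(loc=mu, scale=sigma)
    w = w_tails(np.asarray(y), y_l, y_u)
    # For the indicator weight, int w_tails(s) p(s) ds = F(y_l) + 1 - F(y_u)
    mass = p.cdf(y_l) + 1.0 - p.cdf(y_u)
    s_cl = w * (p.logpdf(y) - np.log(mass))
    s_csl = w * p.logpdf(y) + (1.0 - w) * np.log(1.0 - mass)
    return s_cl, s_csl

# Example: tail thresholds at +/- 2 and an observation in the right tail
print(weighted_scores_gaussian(y=2.5, mu=0.0, sigma=1.0, y_l=-2.0, y_u=2.0))
```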

5.1.2. Comparative Assessment with Scoring Rules

Suppose one has a set of observed outcomes $\{y_i\}_{i=1}^N$ and associated density forecasts generated by two models, i.e. $\{\hat{p}_{ki}\}_{i=1}^N$ for k = 1, 2. Also, suppose that S is a likelihood-based scoring rule, for example S_{cl} or S_{csl}.

Intuitively then, model k is assigned a high score if the observed y_i lies in a region of high predictive density $\hat{p}_{ki}$ and a low score otherwise. Different modelled densities can therefore be ranked according to their average scores $\frac{1}{N} \sum_{i=1}^{N} S(\hat{p}_{ki}, y_i)$.

As with point forecast assessment measures, one should consider whether an observed difference in the scores of two models is statistically significant. For this, we introduce the i-th score differential d_i between the two considered models, given by

\[
d_i := S(\hat{p}_{1i}, y_i) - S(\hat{p}_{2i}, y_i). \tag{5.10}
\]

Differences in the average scores can be tested for statistical significance by testing the hypothesis H_0: E[d_i] = 0 with a t-test. In the presence of autocorrelation in the density forecasts, e.g. in the case of dynamic forecasts in a time-series setting, the test statistic should be computed using HAC standard errors. However, given the cross-sectional nature of the forecasts for IS, this is considered not to be an issue.
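A minimal sketch of this test, assuming the per-observation scores of both models are already available as arrays (all names and stand-in values below are hypothetical), could look as follows; a plain t-test is used, matching the cross-sectional setting described above:

```python
# Sketch: test H0: E[d_i] = 0 for the score differentials of two models.
import numpy as np
from scipy.stats import ttest_1samp

def compare_models(scores_model_1, scores_model_2):
    """t-test on the score differentials of two models evaluated on the same outcomes."""
    d = np.asarray(scores_model_1) - np.asarray(scores_model_2)
    res = ttest_1samp(d, popmean=0.0)
    return d.mean(), res.statistic, res.pvalue

# Usage with hypothetical per-observation log scores for two models
rng = np.random.default_rng(0)
s1 = rng.normal(-1.2, 0.3, size=500)   # stand-in scores, model 1
s2 = rng.normal(-1.3, 0.3, size=500)   # stand-in scores, model 2
print(compare_models(s1, s2))          # (mean differential, t-statistic, p-value)
```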

⁴ See Diks et al. (2011) for these conditions.

5.2. Absolute Assessment

Comparative assessment is useful for comparing and ranking models, but gives no indication of how close the best-ranking model is to the true distribution. Ideally, one would like to compare a modelled distribution with the true distribution, but the latter is not observed. Still, a method for absolute assessment is discussed in this section.

5.2.1. Probability Integral Transformed (PIT) Observations

Bao et al. (2007) state that it has become common practice to evaluate a probabilistic forecast model using the probability integral transform (PIT). The latter refers to the result that for a continuous random variable Y with distribution function F, it holds that F(Y) has a uniform distribution over [0, 1]⁵.

The result can be used to test the null hypothesis that each modelled distribution $\hat{F}_i$ equals the true distribution F_i of Y_i. Under that hypothesis, $\{\hat{F}_i(y_i)\}_{i=1}^N$ is a sample from a uniform distribution over [0, 1]. The transformed observations are referred to as PIT observations.

Statistical Test

Applying the inverse of the CDF Φ of a standard Gaussian distribution to the PIT observations gives a sample $\{\Phi^{-1}(\hat{F}_i(y_i))\}_{i=1}^N$ that has a standard Gaussian distribution under the null hypothesis. The null hypothesis can thus be tested by testing whether $\{\Phi^{-1}(\hat{F}_i(y_i))\}_{i=1}^N$ comes from a standard Gaussian distribution. Popular tests for a Gaussian distribution include the Jarque-Bera test and the Kolmogorov-Smirnov test.

Although the step of applying Φ−1 is not strictly necessary, Mitchell and Hall (2005) argue that testing normality is convenient as normality tests are widely seen to be more powerful than uniformity tests.
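A minimal sketch of this procedure, assuming Gaussian density forecasts $\hat{F}_i$ and using synthetic data generated under the null hypothesis (all parameters below are illustrative, not from the source), could look as follows:

```python
# Sketch: PIT-based absolute assessment under assumed Gaussian density forecasts.
# The PIT observations are mapped through the inverse standard-normal CDF and the
# resulting sample is tested for normality.
import numpy as np
from scipy.stats import norm, jarque_bera, kstest

rng = np.random.default_rng(1)

# Hypothetical data: outcomes drawn from the forecast distributions themselves,
# so the null hypothesis holds by construction.
mu_hat = rng.normal(0.0, 1.0, size=1000)     # per-observation forecast means
sigma_hat = np.full(1000, 0.8)               # per-observation forecast st. devs
y = rng.normal(mu_hat, sigma_hat)            # observed outcomes

pit = norm.cdf(y, loc=mu_hat, scale=sigma_hat)   # PIT observations, ~ U[0, 1] under H0
z = norm.ppf(pit)                                # ~ N(0, 1) under H0

print(jarque_bera(z))        # tests skewness/kurtosis of z
print(kstest(z, 'norm'))     # compares the empirical CDF of z with the standard normal
```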

Visual Assessment

It is also possible to visually assess how close the distribution of the PIT observations is to a uniform distribution. More specifically, the normalized histogram of PIT observations can be compared to the density of a uniform distribution over [0,1].

The latter may help identify what parts of the modelled distribution do not fit the observations. The visual assessment can also be used for comparative assessment of the models, where a model is considered better if the histogram is visually closer to the uniform distribution.
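A short sketch of this visual check (the PIT values below are hypothetical stand-ins; in practice one would use the PIT observations computed from the fitted model, as above):

```python
# Sketch: normalized PIT histogram compared with the U[0, 1] density.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
pit = rng.uniform(size=1000)   # stand-in for model-based PIT observations

plt.hist(pit, bins=20, density=True, edgecolor='black')          # normalized histogram
plt.axhline(1.0, color='red', linestyle='--', label='U[0, 1] density')
plt.xlabel('PIT value')
plt.ylabel('Density')
plt.legend()
plt.show()
```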

⁵ This result dates back to at least Rosenblatt (1952).

5.3. Conclusion

This chapter addresses sub-question (Q.3), which asks how probabilistic models of IS can be appropriately assessed and compared. The considered methodology is mainly based on literature on the assessment of density forecasts of IS.

Numerical scoring rules are considered for comparative assessment of probabilistic models. They provide a systematic and quantitative method of assessing and statistically testing performance differences of probabilistic models. More specifically, log-likelihood based scoring rules are considered, as they have some attractive properties. For example, they are invariant to smooth transformations of the outcome variable and lead to powerful Likelihood Ratio tests.

Weighted scoring rules are also considered, in case an application needs a probabilistic model of IS that fits especially well in certain parts of the distribution. More specifically, the conditional likelihood and censored likelihood scoring rules proposed by Diks et al. (2011) are considered, as they are also based on the log-likelihood.

Although scoring rules are useful for comparative assessment of probabilistic models, they are not useful for assessing whether the modelled distributions are close to the true distribution, i.e. absolute assessment. A methodology based on probability integral transformed (PIT) observations is considered for the latter. The PIT observations can be used to visually assess and statistically test probabilistic model specifications.

The methodology described in this chapter is used for comparing and assessing probabilistic models of IS in Chapter 6.