
Solvency capital requirement and data scarcity


Solvency Capital Requirement and Data Scarcity

G.L. Wiersma

[Cover figure: probability density p(loss) of the annual loss, with the 99.5% loss quantile and the SCR indicated.]

Master's Thesis to obtain the degree in Actuarial Science and Mathematical Finance
University of Amsterdam
Faculty of Economics and Business
Amsterdam School of Economics

Author: Dr. G.L. Wiersma

Student nr: 10327495

Email: gwiersma@portage.nl

Date: July 28, 2016

Supervisor: Dr. S.U. Can
Second reader: Dr. T.J. Boonen


This document is written by Gerlof Wiersma who declares to take full responsibility for the contents of this document.

“I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it.”

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

In this thesis we consider potential errors and uncertainties involved with statistical modelling on the basis of small numbers of observations.

Under Solvency II, the Solvency Capital Requirement (SCR) for insurance risks is defined in terms of extreme (99.5%) quantiles of the associated loss distributions.

If one only has a limited number of sample points for the calibration of the loss distribution, the reliability of the SCR estimates obtained from these data will be limited too.

We perform numerical experiments to assess the small sample consequences for the bias and the uncertainty of the SCR estimates. In the simulations random samples are drawn from a lognormal distribution, and from these samples the SCR and other population properties are estimated with three different estimators. The experiments are repeated for a number of lognormal distributions with different skewness values and with sample sizes ranging from 5 to 100.

From the simulations we conclude that if the SCR has to be estimated from a small number of observations, the estimator should be selected with great care. In fact, if one wants to obtain a prudent value for the SCR, one should not use the estimator with the best overall small sample performance, but rather use a less well performing estimator for which the SCR estimation bias is positive and thus on average overestimates rather than underestimates the real SCR value.

Keywords Small sample modelling, parameter uncertainty, estimator, maximum likelihood method, method of moments, uniform minimum variance unbiased estimator, Solvency II, SCR, lognormal distribution, estimator bias, estimator variance.

Contents

Preface
1 Introduction
2 Solvency II
  2.1 Risk or uncertainty
3 Estimation and estimators
  3.1 Sampling and estimation
  3.2 Bias and variance
    3.2.1 Asymptotic properties
  3.3 Normal distribution estimators
  3.4 UMVU Estimators
4 Lognormal estimators
  4.1 The lognormal distribution
  4.2 Estimating lognormal parameters
    4.2.1 Method of moments (MOM) estimators
    4.2.2 Maximum likelihood estimator
    4.2.3 Uniform minimum variance unbiased estimator
  4.3 SCR estimators
5 Simulations
  5.1 Theory versus simulation
    5.1.1 Bias of M-estimators
    5.1.2 MSE of M-estimators
    5.1.3 Bias and MSE of V-estimators
    5.1.4 MOM estimator for $\hat V$
  5.2 Estimators of CV: $\widehat{CV}$
    5.2.1 Bias of $\widehat{CV}$
    5.2.2 MSE of $\widehat{CV}$
  5.3 Estimators of the Mean: $\hat M$
    5.3.1 Bias of $\hat M$
    5.3.2 MSE of $\hat M$
  5.4 Estimators of the variance: $\hat V$
    5.4.1 Bias of $\hat V$
    5.4.2 MSE of $\hat V$
  5.5 Estimators of the SCR: $\widehat{SCR}$
    5.5.1 Bias of $\widehat{SCR}$
    5.5.2 MSE of $\widehat{SCR}$
    5.5.3 Efficiency of $\widehat{SCR}$
  5.6 Simulation conclusions
A Lognormal UMVU Estimator
  A.1 Bias of MLE estimators
    A.1.1 The expectation $E[e^{\hat\mu}]$
  A.2 Constructing the UMVUE for M
    A.2.1 Moments of the $\chi^2$-distribution
B Cramér-Rao variance bounds
C SCR estimator asymptotics
D Sample variance: expectation and variance
  D.1 Expectation of the sample variance
  D.2 Variance of the sample variance
E R-code
  E.1 Simulation Main
  E.2 Auxiliary functions

Preface

In my regulatory activities at De Nederlandsche Bank (DNB) I came across several topics that would have made suitable subjects for a thesis project. General indecisiveness, the fact that new and more exciting topics kept popping up, and being occupied with family and work, delayed the finishing of my thesis considerably. But in the end, selecting an interesting topic and pursuing the investigations for a sufficient length of time, finally led to this product. Writing this thesis would not have been possible without the stimulating support of Anneke, Joke and Johan.

1 Introduction

This thesis considers some of the issues one has to deal with if one seeks to determine statistical properties from a mere handful of observations. One should, of course, always be cautious about statistical modelling based on small samples, but if the modelling is conducted to determine values for extreme quantiles of skewed distributions, one should be even more careful about errors that might be introduced by sample size limitations. Our interest in these modelling problems has to do with the Solvency II requirement that insurers have to determine the 99.5th percentile values of the probability distributions for their expected losses. The Solvency II directive, which came into effect on January 1st 2016, requires insurers to reserve capital buffers, called the Solvency Capital Requirement (SCR), that will be sufficient for the undertaking to survive catastrophic loss events that on average occur only once in a period of 200 years. This 1-in-200 year limit for the undertaking's ruin rate translates into an amount of money, the SCR, that is equivalent to the 99.5% quantile of the probability distribution for the losses the insurer will be exposed to in the coming year.

If one only has a small number of data points for the calibration and estimation, the associated SCR will be highly uncertain and might even be plagued by systematic errors induced by data limitations. The estimation problems associated with small numbers of observations might actually lead to incorrect capital buffers that might not suffice for the undertaking to survive the occasional 1-in-200 loss event.

To assess the errors and uncertainties associated with the SCR estimation for small data samples, we performed a simulation study in which we estimated the SCRs for random samples with varying sizes. The smallest samples that we used contained 5 points and the largest sample in the simulations consisted of 100 points. To assess the impact of the skewness of the distribution, we repeated the sampling experiments for lognormal distributions with 4 different skewness values. For each data sample the parameters for the underlying distribution were estimated and using these parameters we then determined the sample SCR.
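To make the set-up concrete, the core of such a sampling experiment can be sketched in a few lines of R (the language used for the thesis code in appendix E). The sketch below is purely illustrative and is not the thesis code itself: the population settings and helper names are assumptions for the example. It draws repeated samples of size 10 from a lognormal population with mean 10 and variance 100 (the CV = 1 case), estimates the SCR, here taken as the 99.5% loss quantile in excess of the mean, with the maximum likelihood estimator, and reports the average estimate as a percentage of the true SCR.

```r
# Illustrative sketch of one sampling experiment: estimate the SCR of a lognormal
# population from small samples and compare the estimates with the true SCR.
set.seed(2016)
M_pop <- 10; V_pop <- 100                 # population mean and variance (CV = 1)
sigma2 <- log(1 + V_pop / M_pop^2)        # lognormal shape parameter
mu     <- log(M_pop) - sigma2 / 2         # lognormal scale parameter
scr_true <- qlnorm(0.995, mu, sqrt(sigma2)) - M_pop

scr_mle <- function(x) {                  # SCR estimate based on the MLE of (mu, sigma2)
  m <- mean(log(x)); s2 <- mean((log(x) - m)^2)
  qlnorm(0.995, m, sqrt(s2)) - exp(m + s2 / 2)
}

n   <- 10                                 # small sample size
est <- replicate(1e4, scr_mle(rlnorm(n, mu, sqrt(sigma2))))
100 * mean(est) / scr_true                # average estimate as a percentage of the true SCR
```

Repeating this for other sample sizes, skewness values and estimators gives results of the kind summarized in the tables below.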

In the simulation experiments we compared the performance of a number of estimation techniques for the lognormal distribution:

• Method Of Moments (MOM)
A straightforward approach in which the distribution parameters are found from the moments of the calibration data.

• Maximum Likelihood Estimation (MLE)
A well-known and frequently used technique in which the parameter estimates are defined by the values for which the likelihood function attains its maximum.

• Uniformly Minimum Variance Unbiased Estimator (UMVUE)
An extension of the MLE estimator. The MLE expressions for population properties are multiplied with correction terms that remove the estimator bias.


The first two parameter estimation methods, MOM and MLE, are widely used and need no further introduction, but the UMVUE for the lognormal distribution is less familiar, and we hence pay some extra attention to this estimator. In chapter 3 we consider the theoretical background of the UMVUE and discuss some of the properties of this special type of estimator. For now it suffices to say that the UMVUE is an unbiased estimator in the sense that the expectation of the estimator is equal to the true value of the population property that is being estimated. The uniform minimum variance in the name means that the UMVUE is the unbiased estimator with the smallest noise contribution for all possible values of the parameter. If the sample size becomes very large, the biases of MOM and MLE also diminish, and the same holds true for the MOM and MLE noise levels in the limit of large samples. The UMVUE, however, is unbiased for all sample sizes, and UMVUE results are also plagued less by noise than the results obtained with other estimators such as MOM or MLE.

If statistical modelling has to be based on small samples, one thus ideally would like to use the UMVUE for the estimation of specific population properties. The fact that theory promises the existence of a unique UMVUE for each population property that one might want to estimate does not, however, imply that one always knows an explicit expression for the UMVUE. In fact, for the lognormal distribution the literature only contains UMVUEs for the population average and the population variance. For derived properties such as the SCR, we do not have closed-form expressions for the UMVUE. But by constructing an SCR estimator from quantities such as the population average and variance for which the UMVUE is known, we obtained an SCR estimator that performed remarkably well compared to the SCR estimators based on the MLE and MOM approaches. The UMVUE-based estimator for the SCR is not unbiased, but the bias is relatively small and the noise amplitude is also smaller than for the MLE and MOM estimators.

One would thus expect that the UMVUE-based SCR estimator, being the most precise and the most accurate of our candidates, would also be the best choice for the estimation of the SCR for lognormal distributions. From our simulations we conclude, however, that if one takes into account the bias of the estimators, one actually should use the MLE method. This estimator has the worst accuracy and precision figures for small sample sizes, but our reasons for favouring the MLE have to do with the sign of the estimation bias. The simulations show that the MLE-estimated SCR has a positive bias whilst the biases for the MOM and UMVUE methods are negative. The differences between the estimators are significant for small sample sizes and become even more noticeable for lognormal distributions with large skewness values. If the skewness increases, the small sample biases for the estimators also increase, and so does the threshold sample size above which the different estimators are again comparable.

The SCR is a statistical risk measure, which means that the estimated SCR in general will not be equal to the real SCR value. But one would still like estimated and true SCR values to agree on average. If one, however, does not have an unbiased estimator for the SCR, the prudent choice would be to use an estimator that on average does not underestimate the real SCR value.

So, if one wants to determine the SCR for a skewed lognormal distribution from a small number of observations, one should not use the UMVUE, the estimator with the best overall performance, but rather use the MLE, the estimator that yields more prudent estimates for the Solvency II SCR.

The data in table 1.1 summarizes the results of the sampling experiments to determine the SCR biases for two lognormal distributions with different skewness values. The samples were obtained from distributions with the same average (M = 10) and with variances given by Var = 100 (table 1.2) and Var = 1600 (table 1.3). The coefficient of variation,
$$\mathrm{CV} = \frac{\sqrt{\mathrm{Var}}}{M},$$
can be considered as a measure of the skewness of the lognormal distribution; the two variance values correspond to coefficients of variation CV = 1 and CV = 4, respectively. The numerical SCR values in the tables are expressed as a percentage of the associated population value for the SCR.

The tables show that the SCR bias is significant for small sample sizes and that the bias decreases if the sample size goes up. From the tables it is also clear that the performances of MLE and UMVUE estimators are more or less the same for CV = 1, but that the estimator biases differ significantly for the CV = 4 distribution. The tables suggest that, in order to obtain prudent SCR estimates for skewed lognormal distributions, one should use the MLE approach if the calibration data contains fewer than 100 observations.

Table 1.1: Bias of SCR estimates as a function of the sample size for two lognormal populations with different skewness values.

n      MOM     MLE      UMVUE
5      73.89   129.54   77.15
6      76.30   123.10   79.94
7      78.05   118.79   81.99
8      79.65   116.13   83.80
9      81.04   114.06   85.23
10     82.04   112.29   86.30
11     82.96   110.97   87.28
12     83.82   110.00   88.18
13     84.54   109.20   88.96
14     85.26   108.44   89.59
15     85.70   107.77   90.13
20     87.96   105.63   92.23
30     90.78   103.68   94.60
40     92.37   102.77   95.88
50     93.48   102.19   96.64
100    96.02   101.07   98.25

Table 1.2: SCR estimates for CV = 1, expressed as a percentage of the true SCR

n      MOM     MLE      UMVUE
5      45.32   382.67   52.85
6      47.83   291.18   56.84
7      49.75   241.86   59.92
8      51.39   211.12   62.39
9      52.95   192.10   64.81
10     54.58   179.12   67.02
11     55.44   168.28   68.63
12     56.60   160.68   70.27
13     57.49   154.59   71.76
14     58.31   149.23   72.96
15     59.02   144.85   74.04
20     62.34   131.36   78.54
30     66.26   119.49   83.88
40     68.94   114.13   87.05
50     70.92   111.10   89.16
100    76.20   105.35   94.00

Table 1.3: SCR estimates for CV = 4, expressed as a percentage of the true SCR

The remainder of the thesis describes the way we came to the conclusion that the MLE method is to be preferred if one is after a prudent SCR value for a skewed lognormal population and one only has a small set of observations available for the statistical modelling.

This thesis continues in chapter 2 with a discussion of the European Solvency II regulations that, among other things, require insurers to determine the 99.5th percentiles of the loss distributions for their risks. The chapter on Solvency II concludes with a brief discussion on the distinction between risk and uncertainty. Solvency II seems to focus on risk quantification and seems to be less concerned about the potential impact of the uncertainties surrounding the obtained SCR results. We argue that risk quantification and the associated uncertainties should be dealt with in an integral manner and that the modelling output, i.e. the SCR, should explicitly account for the fact that risk and uncertainty, in the absence of any a priori information about the underlying population, cannot be separated.

We then proceed in chapter 3 with a discussion on general data sampling aspects. In this we consider the concept of the UMVUE in slightly more detail.

The next part, chapter 4, deals with the lognormal distribution. We briefly review the basic properties of the lognormal model and discuss in slightly more detail the different techniques one can use to estimate the values for specific population properties from a limited number of observations.


In this we pay some special attention to the estimators of the SCR for lognormal populations. We discuss the SCR estimators for our three estimation methods and consider the bias and other performance figures for the estimators. This section concludes with the derivation of the expression for the efficiency of the SCR estimator for the lognormal distribution.

The last part, chapter 5, describes the numerical experiments. We begin with the set-up of the simulation study. Next, we compare the results of numerical simulations with theoretical predictions for the estimators for the population mean and variance. The remaining sections of chapter 5 contain the graphs with the results of the numerical simulations for the SCR and for some other population quantities. The chapter concludes with a summary of our findings and the conclusions that we draw from our results.

The thesis appendices contain some tedious calculations that are ancillary to the main line of reasoning. We included these technical sections to make the thesis more self-contained, but also because we were not able to retrieve some of the proofs and derivations in the general literature. In appendix A we consider the way the UMVUE for the population mean is obtained, and in appendix B we derive a generalized form of the Cramér-Rao lower bound that we use for the numerical comparison of the efficiencies of the SCR estimators. In appendix C we discuss the asymptotic properties of the SCR estimator, and in appendix D we deal with the expressions for the expectation and variance of the sample variance. The last section, appendix E, contains the R-code that we used in the simulation study.

2 Solvency II

The financial crisis of 2008 initiated a paradigm shift in the assessment and the management of financial risks of banks and other financial institutions. For the regulation of the insurance industry in Europe this implied an additional incentive for the transition from the Solvency I regime, with simple balance sheet-based methods to determine solvency capital buffers, to the Solvency II world in which insurers have to hold specific capital buffers that depend on risk characteristics and portfolio properties. The report Lessons learned from the crisis (Solvency II and beyond), see CEIOPS (2009), describes the concerns about potential insurance industry instabilities that the new Solvency II regime intends to address.

The capital buffer or Solvency Capital Requirement (SCR) is the amount of money the insurer should set aside to protect the insurance undertaking against any adverse financial developments in the coming year. In fact, Solvency II regulations stipulate that the SCR buffers should be sufficient to survive catastrophic events occurring only once in a period of 200 years. In order to come up with a figure for the SCR, one first somehow has to devise a probability distribution for the financial losses one will be exposed to during the next calendar year. After selecting an appropriate statistical model, one then has to use the available observational data to parametrize the selected loss distribution. And then in the last step one uses the fully defined probability distribution to calculate the financial loss associated with the 99.5 percentile.

In Solvency II the SCR is defined as the excess of the 99.5th percentile over the best estimate (BE) of the distribution of the losses the insurer will be exposed to in the coming year. If the statistical properties of the insurance losses are described by the probability density function (PDF) $f_{\mathrm{loss}}$ or the associated cumulative distribution function (CDF) $F_{\mathrm{loss}}$, the SCR can be expressed as
$$\mathrm{SCR} = F_{\mathrm{loss}}^{-1}(0.995) - E[X_{\mathrm{loss}}], \qquad (2.1)$$
in which $E[X_{\mathrm{loss}}]$ is the average loss for the coming year and $F_{\mathrm{loss}}^{-1}(0.995)$ is the loss amount associated with the 99.5% quantile of the cumulative loss distribution.
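As a small numerical illustration of equation (2.1), not taken from the thesis, the R snippet below evaluates the SCR for a loss distribution with a fully specified parametrization, here an assumed gamma distribution with mean 10 and variance 25:

```r
# SCR per equation (2.1): 99.5% quantile of the loss distribution minus the expected loss.
shape <- 4; rate <- 0.4                    # gamma parameters: mean = shape/rate = 10, var = shape/rate^2 = 25
expected_loss <- shape / rate
scr <- qgamma(0.995, shape = shape, rate = rate) - expected_loss
scr                                        # about 17.4 for this parametrization
```

In practice, of course, the distribution and its parameters are not given but have to be selected and estimated from data, which is exactly where the issues discussed in this thesis arise.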

Solvency II comes with a so-called standard model or standard formula (SF), intended to simplify matters for the run-of-the-mill insurer with standard risks and standard portfolios. Using the SF, insurers readily calculate the SCR for market and underwriting risks by applying simple SF shocks to risk-specific volume measures such as the portfolio exposure or the amounts invested in the various asset classes. The SF calculations are based on EIOPA-monitored¹ calibrations of the European insurance industry and should hence suffice for the average insurer.

Solvency II allows insurers with niche types of risk or with non-standard portfolios to use their own risk models for the calculation of the SCR. In this thesis we will not consider the SF, but rather restrict ourselves to insurer-specific calculations of the SCR on the basis of in-house developed and validated Internal Models (IMs).

¹ The European Insurance and Occupational Pensions Authority (EIOPA) is the European Supervisory Authority for the insurance and pension industry.

Solvency II is principle-based rather than rule-based. This means that Solvency II regulations do not stipulate which model or method insurers have to use for the modelling of their risks and for the SCR calculations. Solvency II restricts itself to specifying general statistical quality standards and principles for the modelling. When it comes to selecting a probability density function (PDF) for the modelling of a particular risk, Solvency II does not specify the PDF one should use, but rather describes some general principles for the assessment of the model appropriateness.

Article 121, Statistical quality standards, of the Solvency II Directive, see Solvency II (2014), says the following about modelling principles:

The methods used to calculate the probability distribution forecast shall be based on adequate, applicable and relevant actuarial and statistical tech-niques and shall be consistent with the methods used to calculate technical provisions.

The methods used to calculate the probability distribution forecast shall be based upon current and credible information and realistic assumptions. Insurance and reinsurance undertakings shall be able to justify the assumptions underlying their internal model to the supervisory authorities.

This means insurers are free to use their favourite model and to select statistical methods they prefer. Solvency II does, however, require insurers to be able to explain and substantiate their modelling choices.

Solvency II is also not very specific about the sample data that is used in the modelling of a particular risk. The Solvency II directive restricts itself to the following general data requirements, see article 121 in Solvency II (2014):

Data used for the internal model shall be accurate, complete and appropriate.

The Solvency II regulations thus do not set clear standards for the data or the model that are used in the SCR calculations. The general principles that Solvency II defines seem at first sight obvious and simple to interpret, but the real world implementation of these principles might well meet with quite some discussion and also with a fair amount of confusion.

One of the issues Solvency II does not address explicitly has to do with the distinction between risk and uncertainty. Risk usually is associated with the output of a risk model, whilst uncertainty covers our remaining ignorance that is not captured by the model. As such one might argue that there is no a priori distinction between risk and uncertainty. If one has specific pre-knowledge about the model that should be used to describe a particular type of risk, one might be able to distinguish risk and uncertainty in a clear and unequivocal manner. The next section considers the problems involved with the distinction between risk and uncertainty in the Solvency II SCR determination.

2.1 Risk or uncertainty

The Solvency II SCR is intended as a risk measure. The risk-specific SCR is calculated from a specific model that is assumed to capture the essential characteristics of the risk at hand. And although model selection and parametrization can be shrouded in uncertainty, the uncertainty does not show up explicitly in the final SCR.

This seems rather counter-intuitive. The intention of the whole SCR determination is to ensure, with the rather high probability of 99.5%, that the capital buffers will be sufficient for the insurer, come what may, to meet his financial obligations in the coming year. The final SCR, however, does not take into account that the methods and numbers used in the SCR calculation can be highly uncertain in themselves. To the uninitiated it might seem that the potentially huge uncertainties involved with the modelling might well jeopardize the 199-in-200 financial survival guarantee of the SCR-approach.

The safe and formal answer to this critique is that the SCR exclusively deals with the quantifiable aspects of risk and as such does not have to take into account any uncertainties, be it in the method or be it in the data. In this formal approach risk is considered as that part of the insurance business that can be calculated and uncertainty is viewed as a kind of error bound around the numbers and methods that are used. This notion of uncertainty is akin to the way uncertainties are treated in the measurements of lengths or weights in real world circumstances or in engineering. But this analogy is somewhat flawed in the sense that the uncertainty involved with the measurement of the length of a table is a concept that is well understood. This kind of uncertainty will depend on the accuracy of the instrument that one uses and on the number of times the measurements are repeated. The outcome of the measurement is reported as value = measurement ± uncertainty, in which the error bound is also used to determine the error bound for any derived quantities.

But the supposedly clear-cut distinction between risk and uncertainty becomes slightly problematic if we are not certain about the method we should use for the risk quantification.

If one, as is often the case in risk modelling, has to work with some historical observations and, of course, some broad notion about the risk, it is not at all clear which distribution function one should use in the modelling. And if the risk modelling has to be performed on the basis of a rather limited number of historical observations, the question which distribution one should use cannot be decided conclusively by the data. And even if one knows the correct model, the sample that is used will yield only approximate values for the model parameters. The errors involved with the parameter misrepresentation are, however, usually taken into account, but the errors or, for that matter, the uncertainties caused by the fact that most of the time one does not know the statistical distribution of the underlying population are hardly ever accounted for in SCR calculations.

To illustrate the modelling issues one has to deal with in the SCR calculation we consider a simple example. Suppose we want to calculate the SCR for a particular type of risk and that the only information we have consists of a set with 76 historical loss events. The loss data is shown in figure 2.1.

Figure 2.1: Sample data (histogram of the historical losses, loss in Euro mln.)

Using the maximum likelihood estimation method to fit the loss data to the usual statistical suspects: the normal, gamma and lognormal distributions, leads to the data fits depicted in figure 2.2.

Figure 2.2: Normal, lognormal and gamma data fits (histogram of the losses in Euro mln. with the fitted densities)

As such there are, of course, no compelling reasons to prefer one of the candidate distributions to the other two. The fitting error for the gamma function is slightly smaller than for the normal and lognormal distributions, but these differences are negligible. The SCR calculations for the three distributions, however, do lead to significant discrepancies between the distributions. In our example the normal distribution has the lowest SCR value of €39 mln., the lognormal yields the highest SCR of €61 mln., and the gamma-fitted SCR of €50 mln. is somewhere in between the two extremes.

The determination of the SCR value from a given set of historical loss observations is not a simple and straightforward exercise. The process from data to final SCR involves various choices that can each have a significant impact on the final SCR result. Since Solvency II is about the calculation of appropriate capital buffers, we cannot just ignore the SCR differences for our three distributions. But we still somehow have to deal with the annoying problem that we really do not know the genuine distribution function for the population from which the loss data derive. And we hence also do not know whether or not the real distribution is actually close to one of the candidate distributions. If the selection is to be restricted to the above distributions, the SCR outcome might be as low as €39 mln., but one might also argue that €61 mln., being the most prudent outcome, is the correct SCR.
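A calculation along the lines of this example can be sketched in R as follows. The loss data here is simulated rather than the actual 76 observations, so the resulting SCR values will not reproduce the €39/€50/€61 mln. figures, and MASS::fitdistr is only one of several ways to perform the maximum likelihood fits; the whole snippet is an illustrative assumption rather than the thesis's own calculation.

```r
library(MASS)                                    # for fitdistr()

set.seed(7)
losses <- rlnorm(76, meanlog = 3, sdlog = 0.6)   # stand-in for the 76 historical losses (Euro mln.)

fit_n  <- fitdistr(losses, "normal")
fit_ln <- fitdistr(losses, "lognormal")
fit_g  <- fitdistr(losses, "gamma")

# SCR per equation (2.1): 99.5% quantile minus the mean of each fitted distribution.
scr <- c(
  normal    = qnorm(0.995, fit_n$estimate[["mean"]], fit_n$estimate[["sd"]]) -
              fit_n$estimate[["mean"]],
  lognormal = qlnorm(0.995, fit_ln$estimate[["meanlog"]], fit_ln$estimate[["sdlog"]]) -
              exp(fit_ln$estimate[["meanlog"]] + fit_ln$estimate[["sdlog"]]^2 / 2),
  gamma     = qgamma(0.995, fit_g$estimate[["shape"]], fit_g$estimate[["rate"]]) -
              fit_g$estimate[["shape"]] / fit_g$estimate[["rate"]]
)
round(scr, 1)                                    # three candidate SCR values for the same data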

The common practice in the modelling of insurance risks is that one selects the most appropriate or, for that matter, the least inappropriate of the statistical distributions that are readily available in the statistical toolbox that one uses. As such it might seem akin to the situation in which the police requests an eyewitness to identify the perpetrator of a crime by selecting the photo with the greatest resemblance from a short list of celebrity photos.

A question that is beyond the scope of this thesis is whether or not the SCR figures should explicitly incorporate the uncertainties involved with the loss data modelling in general and with the SCR calculation in particular. The short answer, I think, should be yes. If one really wants to determine reliable figures for financial buffer sizes that have to be able to withstand the losses for a 1-in-200 year catastrophe, one simply has to incorporate the impact of modelling and estimation uncertainties. It seems to me that the interests of policy holders would be served poorly if the SCR calculation were based on inadequate statistical modelling leading to seemingly precise results that are, however, highly uncertain.

3 Estimation and estimators

3.1 Sampling and estimation

Sampling-based statistical statements are, of course, true only in specific, infinity-approaching settings in which one either has a sufficiently large sample to work with or, in case of more modest sample sizes, one can repeat the whole sampling exercise a sufficiently large number of times.

Real world statistical modelling, however, is hardly ever based on sufficiently large samples or on the sufficient number of repetitions of the sampling that the statistical theory assumes. Nevertheless, making use of established methods from sampling and estimation theory, statistical modelling with scarce resources can still come up with meaningful results. The main difference between the value for a particular population property obtained from a small number of observations and the value for that property obtained from a sufficiently large number of observations has to do with the reliability of the result. The key message of sampling and estimation theory can be summarized in a truism stating that if one infers the value for population properties from a small number of observations, one has to be aware that the uncertainties of the outcomes will increase if the sample sizes decrease.

Suppose that $\theta$ is a population property and that the estimator $\hat\theta$ can be used to obtain approximate values for $\theta$ on the basis of a number of observations from the population. If one does not know the real value of the population property, one can, of course, not claim that the estimation yields a reliable value for this quantity. The only thing one can infer from the estimation is the precision of the sampling and estimation. The errors in statistical estimation can be noise-like, but the estimation can also be plagued by systematic deviations. These two types of errors are associated with the estimator precision and the estimator accuracy.

• Precision
The scatter of the estimator around the average value of the estimator. The precision can be improved by repeating the sampling and estimation a sufficient number of times.

• Accuracy
The systematic error between the true population value and the average output of the estimator. The systematic error cannot be diminished or indeed be removed by repeating the estimation for more data samples.

The distinction between estimator precision and accuracy can readily be explained by considering the shooting practice results shown in figure 3.1. The results (a) and (b) are both produced by accurate marksmen, but the precision of shooter (a) is much higher than the precision of (b). The second row also shows a high precision (c) and a low precision (d) result, but these bullet patterns both exhibit a significant systematic deviation from the bullseye.

Figure 3.1: Accuracy and precision. Shooting practice results: (a) accurate and precise, (b) accurate and imprecise, (c) inaccurate and precise, (d) inaccurate and imprecise.

3.2 Bias and variance

The objective of parameter estimation is to determine approximate values for population parameters from a limited number of observations from this population. In the sequel we will use $\theta$ for a general population property and $\hat\theta$ for an estimator of this property. We assume that $\theta$ has to be estimated from population samples that each consist of $n$ data points $x = (x_1, \ldots, x_n)$. In estimation theory such samples are considered as concrete realizations of $n$ independent and identically distributed (i.i.d.) random variables $X = (X_1, \ldots, X_n)$, in which it is understood that the $n$ random variables all follow the distribution of the population.

For an estimator to be useful, its value $\hat\theta$ should provide a reliable approximation of the true population value $\theta$. Potential differences between actual and estimated values can be due to random fluctuations, but there can also exist systematic discrepancies between the value of the population parameter and the estimator outputs. Such a systematic deviation between estimator $\hat\theta$ and population parameter $\theta$ is called the bias of the estimator. This bias corresponds to the difference between the true value $\theta$ and the expectation of the estimator $\hat\theta$. The expression $E[\hat\theta]$ is used for the expectation of $\hat\theta$, i.e. the average value one obtains if the whole sampling and subsequent estimation would be repeated a large number, in fact: an infinite number, of times. The bias of the estimator $\hat\theta$ is the systematic discrepancy that cannot be averaged away. It is the difference between the estimator expectation and the true population value, or
$$\mathrm{BIAS}[\hat\theta] = E[\hat\theta] - \theta. \qquad (3.1)$$

The bias of the estimator measures the accuracy of the estimator. It is clear that, all other things being equal, an unbiased estimator with $\mathrm{BIAS}[\hat\theta] = 0$ is to be preferred to biased estimators. But next to the accuracy one also has to take into account the precision of the estimator. In real world estimation it is not always possible to repeat the sampling process over and over again, and in such situations it would be of little use to rely on an estimator that is 100% correct on average but for which the distance between estimated and population parameter values is highly volatile. The standard measure for estimator precision is the variance that one obtains by repeating the sampling and estimation $M$ times (in which it is again understood that $M$ is a large number). If the estimation results are given by the sequence $t_1, \ldots, t_M$, the estimator variance $\hat\sigma^2_M$ is defined as
$$\hat\sigma^2_M = \frac{1}{M-1}\sum_{i=1}^{M} (t_i - \bar t)^2, \qquad (3.2)$$
with $\bar t = \frac{1}{M}\sum_{i=1}^{M} t_i$ as the average of the estimator outcomes. In the limit of $M \to \infty$ this average operator is replaced by the population average $E$, and in this limit the estimator variance becomes
$$\mathrm{var}[\hat\theta] = E[(\hat\theta - E[\hat\theta])^2]. \qquad (3.3)$$

The relations between the true population value $\theta$ and the estimator properties, such as the probability distribution $p_{\hat\theta}$, the accuracy ($\mathrm{BIAS}[\hat\theta]$) and the precision ($\mathrm{var}[\hat\theta]$), are depicted in fig. 3.2.

Figure 3.2: The true population value $\theta$ and the characteristics of its estimator $\hat\theta$: the probability distribution ($p_{\hat\theta}$), the bias ($\mathrm{BIAS}[\hat\theta]$) and the variance ($\mathrm{var}[\hat\theta]$).

The two estimator performance measures can be combined into one number, the so-called Mean Squared Error (MSE). This quantity is defined in terms of the fluctuations of the estimator values around the population value: $\mathrm{MSE}[\hat\theta] \equiv E[(\hat\theta - \theta)^2]$. The mean squared error of $\hat\theta$ can also be written as the sum of the variance and the square of the estimator bias,
$$\mathrm{MSE}[\hat\theta] = E[(\hat\theta - \theta)^2] = E\big[(\hat\theta - E[\hat\theta])^2 + (E[\hat\theta] - \theta)^2\big] = \mathrm{var}[\hat\theta] + \mathrm{BIAS}[\hat\theta]^2. \qquad (3.4)$$
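The decomposition in equation (3.4) can be checked numerically by brute force. The R sketch below is an illustration only (the function name and the example population are assumptions, and it is not part of the thesis code in appendix E): it approximates the bias, variance and MSE of an arbitrary estimator by repeated sampling.

```r
# Approximate the bias, variance and MSE of an estimator by Monte Carlo.
# 'estimator' maps a sample vector to one estimate; 'rsample' draws one
# sample of size n from the population; 'theta' is the true value.
estimator_performance <- function(estimator, rsample, n, theta, M = 1e4) {
  estimates <- replicate(M, estimator(rsample(n)))
  bias      <- mean(estimates) - theta
  variance  <- var(estimates)
  mse       <- mean((estimates - theta)^2)  # equals variance + bias^2 up to Monte Carlo error
  c(bias = bias, variance = variance, mse = mse)
}

# Example: the 1/n variance estimator for a normal population with variance 4.
set.seed(1)
estimator_performance(
  estimator = function(x) mean((x - mean(x))^2),
  rsample   = function(n) rnorm(n, mean = 0, sd = 2),
  n = 10, theta = 4
)
```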

3.2.1 Asymptotic properties

The bias and variance of an estimator usually depend on the size $n$ of the sample that is used in the estimation. To account explicitly for the sample size dependency one might use the notation $\hat\theta_n$ for the estimator. If the sample size increases, one would expect the estimator outcome to approach the true value of the population property.

An estimator is called asymptotically unbiased if the estimator in the limit of large sample sizes tends to the population value. For such estimators one thus has the limiting behaviour
$$\lim_{n \to \infty} E[\hat\theta_n] = \theta.$$
The estimator $\hat\theta_n$ of $\theta$ is called consistent if the distance between $\hat\theta_n$ and $\theta$ can be made smaller than any given number by increasing the size of the samples. The estimator $\hat\theta_n$ of $\theta$ is consistent if for any $\epsilon > 0$ the probability that the distance between estimator and true value is larger than this $\epsilon$ vanishes in the limit of large sample sizes, that is, if $P(|\hat\theta_n - \theta| > \epsilon) \to 0$ as $n \to \infty$. Applying the Chebyshev inequality, see Johnson et al. (1994), it follows that $\hat\theta_n$ is consistent if the mean squared error $\mathrm{MSE}[\hat\theta_n]$, see equation (3.4), goes to zero for $n \to \infty$.

A considerable part of sampling and estimation theory is devoted to the asymptotic properties of estimators. For applications in which one has a sufficiently large number of observations, the asymptotic statements can be relied upon as an appropriate approximation. In real world estimation problems one often has to work with rather finite numbers of observations, and in these situations one really has to take into account the estimation bias and variance introduced by the limited size of the samples that are used.

3.3 Normal distribution estimators

We illustrate the estimation bias and variance issues of the previous section with a simple example. In this we estimate the parameters for a population described by the normal distribution $N(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$. The standard estimators for the parameters are the sample mean $\hat\mu = \frac{1}{n}\sum X_i$ for $\mu$, and the sample variance $\hat\sigma^2_{n-1} = \frac{S^2}{n-1}$ with $S^2 = \sum (X_i - \hat\mu)^2$ to estimate $\sigma^2$.

The estimator $\hat\mu$ is unbiased since
$$E[\hat\mu] = E\Big[\frac{1}{n}\sum_{i=1}^{n} X_i\Big] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = E[X] = \mu.$$
In rewriting the above expression we used the fact that a sample with $n$ observations is assumed to be a realization of $n$ i.i.d. random variables $X_1, \ldots, X_n$.

The estimator variance for $\hat\mu$ is also readily evaluated,
$$\mathrm{var}[\hat\mu] = \frac{1}{n^2}\,\mathrm{var}\Big[\sum_{i=1}^{n} X_i\Big] = \frac{n}{n^2}\,\mathrm{var}[X] = \frac{\sigma^2}{n},$$
and from this it follows that the MSE of $\hat\mu$ is given by
$$\mathrm{MSE}[\hat\mu] = \frac{\sigma^2}{n}.$$
From this expression it is clear that the MSE diminishes in the limit of large sample sizes ($n \to \infty$). In the previous section we mentioned that this implies that $\hat\mu$ is a consistent estimator.

Next, we consider the estimation of the variance of the population, $\sigma^2$. In this we use the shorthand notation $S^2$ for the sum of the squared deviations from the sample mean,
$$S^2 = \sum_{i=1}^{n} (X_i - \hat\mu)^2. \qquad (3.5)$$

To explain the role played by estimator bias and MSE, we consider two different estimators for the variance. The first one, the one that e.g. occurs in the maximum likelihood estimation, uses the factor $\frac{1}{n}$ to normalize $S^2$ for the number of observations in the sample,
$$\hat\sigma^2_n = \frac{1}{n} S^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \hat\mu)^2. \qquad (3.6)$$
The second variance estimator that we consider is also defined in terms of $S^2$, but uses the number of degrees of freedom $(n-1)$ to account for the size of the sample,
$$\hat\sigma^2_{n-1} = \frac{1}{n-1} S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \hat\mu)^2. \qquad (3.7)$$

The only difference between these two estimators is a different factor for the sample size normalization. This trivial distinction leads, however, to different outcomes for the bias and MSE of the estimators.

Expressions involving sample variances for normal distributions can readily be evaluated by using the relation between sample variances and $\chi^2$ distributions. In fact, for samples with $n$ data points from a normal distribution the quotient $S^2/\sigma^2$ follows a $\chi^2$ distribution with $n-1$ degrees of freedom, see e.g. Johnson et al. (1994). For our normal distribution sample with $n$ data points we thus have
$$\frac{S^2}{\sigma^2} \sim \chi^2_{n-1}. \qquad (3.8)$$

Inserting this relation, one finds that $\hat\sigma^2_n$ is a biased estimator:
$$E[\hat\sigma^2_n] = \frac{1}{n} E[S^2] = \frac{\sigma^2}{n}\, E\Big[\frac{S^2}{\sigma^2}\Big] = \frac{\sigma^2}{n}\, E[\chi^2_{n-1}] = \sigma^2\,\frac{n-1}{n}.$$
In the last step we used the identity $E[\chi^2_{n-1}] = n-1$ for the mean of the $\chi^2$ distribution with $n-1$ degrees of freedom.

The estimator bias is the difference between the estimator expectation and the population value,
$$\mathrm{BIAS}[\hat\sigma^2_n] = E[\hat\sigma^2_n] - \sigma^2 = \sigma^2\,\frac{n-1}{n} - \sigma^2 = -\sigma^2\,\frac{1}{n}.$$
Repeating these steps for the estimator $\hat\sigma^2_{n-1}$, or by directly rescaling the result for $E[\hat\sigma^2_n]$, one finds¹
$$E[\hat\sigma^2_{n-1}] = \sigma^2.$$
The above results show that $\hat\sigma^2_{n-1}$ is an unbiased and $\hat\sigma^2_n$ is a biased estimator for the population variance.

¹ This result can also be found without using the relation between $S^2$ and the $\chi^2$-distribution. In fact, in appendix D we show that the expectation of the sample variance, for samples defined in terms of i.i.d. random variables, is equal to the population variance.

If the estimator selection were merely based on the estimator bias, the choice would be a simple task: just take the candidate estimator with the smallest bias. In this case, with the two estimators for the population variance, this would imply that the unbiased estimator $\hat\sigma^2_{n-1}$ would be favoured. But, as it is, other factors apart from the bias might also play a role in deciding on the most appropriate estimator. Next to the bias one could e.g. also use the estimator MSE as an additional criterion in the estimator selection.

To illustrate the potential problems involved with estimator ranking, we also calculate the MSE for the candidate estimators $\hat\sigma^2_n$ and $\hat\sigma^2_{n-1}$.

The variance of $\hat\sigma^2_{n-1}$ is given by
$$\mathrm{var}[\hat\sigma^2_{n-1}] = \sigma^4\,\frac{1}{(n-1)^2}\,\mathrm{var}\Big[\frac{S^2}{\sigma^2}\Big] = \frac{\sigma^4}{(n-1)^2}\,\mathrm{var}[\chi^2_{n-1}] = \sigma^4\,\frac{2}{n-1}.$$
In this we used the identity $\mathrm{var}[\chi^2_{n-1}] = 2(n-1)$ for the variance of the chi-square distribution. Using the above result for $\hat\sigma^2_{n-1}$, the variance for the second estimator quickly follows:
$$\mathrm{var}[\hat\sigma^2_n] = \Big(\frac{n-1}{n}\Big)^2 \mathrm{var}[\hat\sigma^2_{n-1}] = \Big(\frac{n-1}{n}\Big)^2 \sigma^4\,\frac{2}{n-1} = \sigma^4\,\frac{2(n-1)}{n^2}.$$

Combining the expressions for the variances and biases leads to the MSE values for the estimators:
$$\mathrm{MSE}[\hat\sigma^2_n] = \mathrm{var}[\hat\sigma^2_n] + \mathrm{BIAS}[\hat\sigma^2_n]^2 = \frac{2n-1}{n^2}\,\sigma^4 \quad\text{and}\quad \mathrm{MSE}[\hat\sigma^2_{n-1}] = \frac{2}{n-1}\,\sigma^4.$$
So the MSE for the biased estimator $\hat\sigma^2_n$ turns out to be smaller than the MSE for the unbiased estimator $\hat\sigma^2_{n-1}$:
$$\mathrm{MSE}[\hat\sigma^2_n] < \mathrm{MSE}[\hat\sigma^2_{n-1}].$$

This shows that depending on the ranking metric one uses, the biased estimator can actually outperform the unbiased estimator. For specific applications it might hence well be that one picks a biased estimator due to the fact that its bias is adequately compensated by e.g. a significantly smaller volatility.
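This ranking is easy to check by simulation. The R fragment below (illustrative only, not the thesis code) draws repeated normal samples and compares the simulated MSEs with the theoretical values derived above:

```r
# Compare the MSE of the 1/n and 1/(n-1) variance estimators for a normal population.
set.seed(42)
n <- 10; sigma <- 1
M <- 1e5                                    # number of repeated samples

samples <- matrix(rnorm(M * n, mean = 0, sd = sigma), nrow = M)
s2      <- apply(samples, 1, function(x) sum((x - mean(x))^2))   # S^2 per sample
var_n   <- s2 / n                           # biased estimator (3.6)
var_n1  <- s2 / (n - 1)                     # unbiased estimator (3.7)

c(mse_n  = mean((var_n  - sigma^2)^2),      # theory: (2n-1)/n^2 * sigma^4 = 0.19
  mse_n1 = mean((var_n1 - sigma^2)^2))      # theory: 2/(n-1) * sigma^4, approx. 0.22
```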

3.4 UMVU Estimators

The normal distribution example in section 3.3 showed that the seemingly simple question "What is the best estimator?" cannot be answered in a simple and unambiguous way. If we assume that the estimator performance can be measured by its MSE, the best estimator for $\theta$ would be the estimator with the smallest MSE value for all values of the parameter. We will show that such a minimum MSE estimator is ill-defined if one does not somehow constrain the set of estimators for $\theta$. In fact, suppose that $\hat\theta$ is the MSE-optimal estimator for $\theta$. If one then selects an arbitrary value $\theta_0$ for the parameter, one can introduce a new and rather trivial estimator $\hat\theta_0$ for $\theta$ defined by $\hat\theta_0 \equiv \theta_0$. Since $\hat\theta$ is MSE-optimal, one has $E[(\hat\theta - \theta)^2] \le E[(\hat\theta_0 - \theta)^2] = E[(\theta_0 - \theta)^2]$ for all values of $\theta$. But in the point $\theta = \theta_0$ this would imply that the MSE of our optimal estimator would be zero, $E[(\hat\theta - \theta_0)^2] = 0$, and this, of course, can only be true if the optimal estimator would satisfy $\hat\theta = \theta_0$. Since $\theta_0$ was picked arbitrarily, it follows that the optimal estimator $\hat\theta$ has to be identical to the parameter that it should be estimating.

However, if one restricts the search for the MSE-optimal estimator to the class of unbiased estimators, it turns out that such an optimal estimator exists and actually can be found for large numbers of parameters of various probability distributions.

The unbiased estimator $\hat\theta$ with the lowest MSE value is, of course, also the minimum variance estimator, and is thus accordingly known as the minimum-variance unbiased estimator (MVUE) for $\theta$. And if the MVUE is the minimum variance estimator for all values of $\theta$, it also is the unique uniform minimum-variance unbiased estimator (UMVUE) for $\theta$. The formal definition of the UMVUE $\hat\theta$ reads as follows:

Definition: UMVU estimator or UMVUE

Consider the family of probability densities $p_\theta$ defined in terms of the parameter $\theta$, in which the domain of the parameter is denoted by $\Omega$. If the parameter $\theta$ is estimated from samples $X = (X_1, \ldots, X_n)$ consisting of $n$ i.i.d. random variables, we define the unbiased estimator $\hat\theta(X)$ of $\theta$ to be the UMVUE if the inequality
$$\mathrm{var}[\hat\theta(X)] < \mathrm{var}[\hat\theta^*(X)]$$
holds for all other unbiased estimators $\hat\theta^*$ and also holds uniformly for all parameter values $\theta \in \Omega$.

There actually exists an algorithm to construct the unique UMVUE. It only requires that one has an unbiased estimator $\hat\theta$ for $\theta$. From this one can prove that there exists a unique UMVUE for $\theta$, and furthermore that the UMVUE can be found or approached by (repeatedly) applying the transformation implied by the Rao-Blackwell theorem. This theorem provides an effective method to transform any unbiased estimator into another estimator that is closer to the UMVUE than the input estimator. For this to work one merely has to find a complete sufficient statistic for the parameter that is being estimated and then condition the initial unbiased estimator on this statistic.

Rao-Blackwell theorem

The Rao-Blackwell theorem states that if $\hat\theta(X)$ is an estimator of a property $\theta$, then the conditional expectation of $\hat\theta(X)$ given $S(X)$, where $S$ is a sufficient statistic, will be an estimator with an MSE that is equal to or lower than the MSE of the input estimator. One can thus simply start from a crude unbiased estimator $\hat\theta_0(X)$, and then determine the conditional expectation of this estimator given the sufficient statistic $S(X)$, to obtain the estimator $\hat\theta_1(X) = E[\hat\theta_0(X)\,|\,S(X)]$ with an MSE value that is smaller than or equal to the MSE of the initial estimator.

Sufficient statistic

In sampling and estimation applications a statistic is a function of the sample data $X = (X_1, \ldots, X_n)$. A simple example of a statistic is provided by the sum of the data points, $S(X) = \sum_{i=1}^{n} X_i$, but another example would be the statistic defined by the single value of the first observation, $F(X) = X_1$. These examples differ in the sense that the first statistic is based on the whole sample while the second statistic ignores a significant part of the information that might be contained in the sample. A statistic is called a sufficient statistic for a population property if there are no other statistics for this sample that contain any additional information about the value of the property that is being estimated. For the estimation of the parameters of the lognormal distribution one can use the statistics defined by the first and second moments of the sample data. And since these moments contain all the information in the sample data about the values of the parameters, the first and second moments of the sample are indeed sufficient statistics for the estimation of the parameters of the lognormal distribution.

So to summarize, if the sample $X$ defined in terms of i.i.d. data is used to estimate the property $\theta$, a sufficient statistic is a function of the sample data $S(X)$ whose values contain all the information in the sample that can be used to obtain an estimate of the property. Based on the factorization theorem, see Johnson et al. (1994), the joint distribution for the sample data $p(X, \theta)$ can be expressed as a product of two terms, in which the first term depends on the sample data $X$ but not on the parameter $\theta$, and in which the second term depends both on $\theta$ and on the sample data, but in which the sample data dependence is defined in terms of sufficient statistics $S(X)$. The factorization theorem thus tells us that the joint distribution for the sample can be written in the form
$$p(X, \theta) = h(X)\, g(\theta, S(X)).$$
From this factorized form it is clear that the values for the parameters that are obtained from the sample data depend on this data only through one or more sufficient statistics.

In dealing with the UMVUE we will also need the concept of a complete sufficient statistic. A statistic $S(X)$ is complete w.r.t. the quantity $\theta$ that is being estimated if, for any function $g$, the fact that $E[g(S(X))] = 0$ for all values of $\theta$ implies that $g(S(X)) = 0$ almost everywhere. So for a complete sufficient statistic it is impossible to find any non-trivial function of the statistic whose expected value is equal to zero.

Lehmann-Scheffé theorem

The quest for the UMVUE for a particular quantity is made slightly easier by the theoretical result that it suffices to find an estimator that is unbiased and for which the sample dependence is exclusively defined in terms of complete sufficient statistics. If one is able to find such an estimator, it follows from the Lehmann-Scheffé theorem that this estimator is the unique UMVUE for the population property that is being estimated. The Lehmann-Scheffé theorem, see Johnson et al. (1994), states that, if one has an unbiased estimator that only depends on complete sufficient statistics, this estimator is the unique UMVUE.

Normal distribution sufficient statistic

As a simple example we consider the normal distribution $N(\mu, \sigma^2)$ and we assume that the distribution mean $\mu$ has to be estimated from a sample with $n$ points $X = (X_1, \ldots, X_n)$. The sample average $\hat\mu = \frac{1}{n}\sum X_i$ is a sufficient statistic, since the joint distribution for the sample can be written in the form
$$p(x; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left(\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\right) e^{\frac{-\sum x_i^2 + 2\mu\sum x_i - n\mu^2}{2\sigma^2}}.$$
The last form of the joint distribution, or for that matter the likelihood function, shows that the parameters $\mu$ and $\sigma^2$ depend on the sample data only via the complete sufficient statistics $S_1(x) = \sum_{i=1}^{n} x_i$ and $S_2(x) = \sum_{i=1}^{n} x_i^2$.

Cramér-Rao lower bound

The quality of an estimator is a function of the systematic bias or accuracy, and the estimation variance or precision. These two estimator quality criteria are not completely independent. In fact, the minimum value for the variance of an estimator is defined in terms of the bias characteristics of the estimator. This minimum value for the estimator variance is called the Cramér-Rao lower bound, see Johnson et al. (1994). The Cramér-Rao bound is primarily used to compare the accuracy of unbiased estimators but, if the estimator bias can be evaluated explicitly, the bound can also be used for the assessment of biased estimators. The Cramér-Rao bound defines the minimum value for the variance of an estimator. The efficiency of an estimator is a relative measure of the estimator accuracy. It is defined by the distance between the estimator variance and the Cramér-Rao bound for the estimator.

First, we consider the Cramér-Rao lower bound for the unbiased estimator $\hat\theta$ that estimates the value of the population parameter $\theta$. The estimation samples consist of $n$ i.i.d. observations $X = (X_1, \ldots, X_n)$, and the joint probability density function of the sample for a given value of the parameter, $p(X|\theta)$, can also be interpreted as the probability function for $\theta$ conditional on the observed data $X$. This probability density for $\theta$ is the so-called likelihood function $L(\theta|X)$. The values of the likelihood function are obtained from the sample distribution function for the given parameter value: $L(\theta|X) \equiv p(X|\theta)$.

The Cramér-Rao bound for this unbiased estimator $\hat\theta$ is the lower bound for the variance defined by the expression
$$\mathrm{var}(\hat\theta) \ge \frac{1}{I(\theta)}, \qquad (3.9)$$
in which $I(\theta)$ is the Fisher information, defined in terms of the expectation of the derivative of the log-likelihood function,
$$I(\theta) = E\left[\left(\partial_\theta \log L(\theta|X)\right)^2\right]. \qquad (3.10)$$
In this $\partial_\theta$ denotes the partial derivative with respect to $\theta$, that is, $\partial_\theta f(\theta) = \frac{\partial f(\theta)}{\partial\theta}$. The Fisher information can be interpreted as the local curvature in the point $\theta$ on the parameter manifold.

The Fisher information for a sample with $n$ observations is defined in terms of the joint probability function for the sample. If one assumes that the sample consists of $n$ i.i.d. elements, the sample probability function reduces to a product of probability densities for the $n$ components, $p(X|\theta) = \prod_{i=1}^{n} p(X_i|\theta)$. And for samples consisting of i.i.d. elements, the Fisher information for the whole sample is the sum of the Fisher informations for the sample elements:
$$I(\theta) = E\left[(\partial_\theta \log L(\theta|X))^2\right] = \sum_{i=1}^{n} E\left[(\partial_\theta \log L(\theta|X_i))^2\right] = n\, E\left[(\partial_\theta \log L(\theta|X_1))^2\right]. \qquad (3.11)$$
This expression for the sample information readily follows if one inserts the likelihood $L(\theta|X_1, \ldots, X_n) = \prod_{i=1}^{n} L(\theta|X_i)$ and uses the property $E[\partial_\theta \log L(\theta|X_i)] = 0$.
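As a concrete illustration of equations (3.9) to (3.11), consider the textbook case of estimating the mean $\mu$ of a normal population with known variance $\sigma^2$; this worked example is added here for clarity and is not part of the thesis itself:
$$\partial_\mu \log L(\mu|X_1) = \frac{X_1 - \mu}{\sigma^2}, \qquad I(\mu) = n\, E\!\left[\left(\frac{X_1 - \mu}{\sigma^2}\right)^2\right] = \frac{n}{\sigma^2},$$
so the Cramér-Rao bound for any unbiased estimator of $\mu$ equals $\sigma^2/n$. This is exactly the variance of the sample mean $\hat\mu$ found in section 3.3, so the sample mean attains the bound and is an efficient estimator of $\mu$.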

The Cramér-Rao bound of equation (3.9) defines the variance lower bound for an unbiased estimator. The Cramér-Rao bound for a biased estimator $\hat\theta_{\mathrm{biased}}$ of $\theta$ extends the expression for the unbiased estimator. The right hand side is defined in terms of the expectation of the estimator, $E[\hat\theta_{\mathrm{biased}}] = \theta + b(\theta)$, in which $b(\theta)$ denotes the estimator bias. For such a biased estimator the Cramér-Rao lower bound takes on the following form:
$$\mathrm{var}(\hat\theta) \ge \frac{(1 + \partial_\theta b(\theta))^2}{I(\theta)}. \qquad (3.12)$$
It is clear that if one removes the bias contribution, the more general Cramér-Rao bound of equation (3.12) reduces to the form of equation (3.9) for unbiased estimators.

Since we are interested in the estimation of general functions of the distribution parameters, such as the SCR, we need a more general form of the Cramér-Rao lower bound. In appendix B we derive a more general expression for the Cramér-Rao variance bound of the estimator $\hat\phi(X)$ that is used to obtain sample-based values for the function $\phi(\theta)$ of the distribution parameter:
$$\mathrm{var}(\hat\phi(X)) \ge \frac{1}{I(\theta)} \left(\frac{\partial E[\hat\phi(X)]}{\partial\theta}\right)^2. \qquad (3.13)$$
In dealing with two-parameter distributions, such as the normal and the lognormal distributions, the general Cramér-Rao bound becomes a relation between matrix quantities. In this thesis we restrict ourselves to parameter variations confined to one-dimensional sub-manifolds, and the above scalar versions will hence suffice if the derivative operators properly account for the constraints imposed on the parameters. This issue will be elucidated when we consider the special case of equation (3.13) that we use for our lognormal populations.

4 Lognormal estimators

In this chapter we consider a number of estimators that one can use to estimate the parameters of the lognormal distribution. We introduce three different estimators for the lognormal distribution parameters and we consider the small sample bias and MSE if the estimators are used to estimate the population mean and variance. To assess the reliability of our numerical simulations we also compare the numerical results with the analytical expressions for the estimator bias and MSE. But before discussing lognormal estimator matters we first consider some general properties of the lognormal distribution.

4.1 The lognormal distribution

The lognormal distribution is closely related to the normal distribution. In fact, if the random variable $X$ follows the lognormal distribution, $X \sim \mathrm{logN}(\mu, \sigma^2)$, and if $Y$ is the logarithm of $X$, i.e. $Y = \log(X)$, we know that this $Y$ is normally distributed with mean $\mu$ and variance $\sigma^2$, or $Y \sim N(\mu, \sigma^2)$.

The scale parameter $\mu$ and the shape parameter¹ $\sigma^2$ of the lognormal probability density
$$p(x; \mu, \sigma^2) = \frac{1}{x\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\log(x)-\mu)^2}{2\sigma^2}} \qquad (4.1)$$
are easier to understand in terms of characteristics of the associated normal distribution than in terms of the properties of the lognormal population for $X$ itself. The lognormal distribution thus merely is the normal distribution in disguise. It is but the normal distribution that has been transformed to the logarithmic domain.

¹ If one interprets the probability density as a function of $\log(x)$ rather than as a function of $x$, the parameters $\mu$ and $\sigma^2$ become the location and scale parameters of the normal distribution.

The lognormal and the normal distribution also both appear as limiting distribu-tions for combinadistribu-tions of large numbers of i.i.d. variables. From the central limit theorem we know that the sum of a sufficient number of i.i.d. variables approaches the normal distribution. And analogously there is a product version of the central limit theorem that says that the lognormal distribution is obtained as the product of a sufficiently large number of i.i.d. random variables. The similarity is also apparent if one consid-ers the broad ranges of the applications of the two distributions. The ubiquity of the normal distribution is mirrored by a wide range of applications of the lognormal distri-bution in different fields of science and technology. The lognormal model is used e.g. in chemistry in reaction modelling where reaction kinetics depend on the product of the concentrations of the reactants, in geochemistry in abundance models for trace elements, in biology to describe stochastic growth processes, and in technology to model failure rates in reliability analysis. We refer to the article by Limpert (Limpert et al, 2001) for an overview of applications of the lognormal models in bioscience and technology. Apart from the lognormal applications proper, the article also provides some informa-tion on the typical ranges for the populainforma-tion parameters for these different types of

¹ If one interprets the probability density as a function of log(x) rather than as a function of x, the parameters µ and σ² become the location and scale parameters of the normal distribution.
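The multiplicative version of the central limit theorem is easy to illustrate numerically (a sketch with an arbitrarily chosen factor distribution, not taken from the thesis): the logarithm of a product of i.i.d. positive factors is a sum of i.i.d. terms and hence approximately normal, so the product itself is approximately lognormal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

# product of k i.i.d. positive factors (uniform on [0.8, 1.2], an arbitrary choice)
k, n_products = 50, 50_000
products = rng.uniform(0.8, 1.2, size=(n_products, k)).prod(axis=1)

# log(product) is a sum of i.i.d. terms and hence approximately normal (CLT),
# so the product itself is approximately lognormal
log_p = np.log(products)
print(stats.skew(log_p), stats.kurtosis(log_p))   # both close to 0 for large k
```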


The lognormal distribution furthermore appears in the social sciences and in financial modelling, e.g. to describe income distributions or to model the statistics of stock price fluctuations. The lognormal distribution is rather versatile: it can be used to describe populations that are nearly normal, but it can also model populations exhibiting significant tail behaviour. In this thesis we consider the application of the lognormal distribution to the modelling of insurance losses. In ideal situations the lognormal distribution captures in a single equation both the bulk and the tail characteristics of the historical loss data, where the bulk consists of the high frequency, low impact losses and the tail contains the extreme and less frequent loss events.

The events in the tail are rare by definition, but proper modelling requires that the tail properties of the risk are taken into account adequately. If an insurer wants to stay in business, he simply has to be able to assess the probabilities of extreme loss events for a particular risk. The tail of the risk plays an important role if the insurer wants to buy reinsurance coverage for extreme loss events: the premium for an excess of loss treaty is determined primarily by the tail characteristics of the losses. And the Solvency II regime, with its 99.5% quantile definition of the SCR, also requires the insurer to have a proper understanding of the tail behaviour of his loss distribution.

If the lognormal distribution is close to the normal distribution, tail effects might be ignored, but if the skewness of the distribution increases, the tail properties of the distribution become more and more important. For the general random variable X the skewness of its distribution is defined in terms of the central moments

$$\text{skewness} \;\equiv\; \frac{E\!\left[(X - E[X])^{3}\right]}{\left(E\!\left[(X - E[X])^{2}\right]\right)^{3/2}}.$$

If X follows the lognormal distribution, this skewness becomes an expression that depends only on the shape parameter σ²:

$$\text{skewness} = \left(e^{\sigma^2} + 2\right)\sqrt{e^{\sigma^2} - 1}. \tag{4.2}$$

In the sequel we will frequently use the coefficient of variation (CV) to compare lognormal distributions with different skewness values. The CV is a simple and accessible measure of the skewness of the population. It is defined as the ratio of the standard deviation to the mean of the population

$$\text{CV} \;\equiv\; \frac{\left(E\!\left[(X - E[X])^{2}\right]\right)^{1/2}}{E[X]}.$$

For the lognormal distribution this CV definition evaluates to

$$\text{CV} = \sqrt{e^{\sigma^2} - 1}. \tag{4.3}$$

From equations 4.3 and 4.2 it is clear that these two tail measures are more or less equivalent.
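The equivalence is in fact exact: substituting e^{σ²} = CV² + 1 into equation 4.2 gives skewness = CV³ + 3 CV, so each CV value corresponds to exactly one skewness value. The small helper functions below (a sketch, not thesis code) evaluate both measures and check this identity numerically.

```python
import numpy as np

def lognormal_skewness(sigma2):
    """Skewness of logN(mu, sigma2), equation 4.2 (independent of mu)."""
    w = np.exp(sigma2)
    return (w + 2.0) * np.sqrt(w - 1.0)

def lognormal_cv(sigma2):
    """Coefficient of variation of logN(mu, sigma2), equation 4.3."""
    return np.sqrt(np.exp(sigma2) - 1.0)

# the two tail measures are linked one-to-one: skewness = CV^3 + 3*CV
for sigma2 in (0.1, 0.5, 1.0, 2.0):
    cv = lognormal_cv(sigma2)
    assert np.isclose(lognormal_skewness(sigma2), cv**3 + 3.0 * cv)
```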

In figure 4.1 we plot a number of lognormal probability densities for different skewness settings. The densities have the same population average, mean = 10, but the lognormal curves have different standard deviations, SD = {5, 10, 20, 40}. The coefficients of variation for the densities are thus given by CV = {0.5, 1, 2, 4}. The graphs clearly illustrate the relation between the skewness and the CV of the lognormal distribution.

[Figure 4.1 (Lognormal population variance dependence): lognormal densities p(x) versus x, panel title "Skewness of lognormal distributions", for CV = 0.5, 1.0, 2.0 and 4.0.]

The estimation of population properties is relatively easy and straightforward if the distribution is symmetric and nicely concentrated around its centre. But as the skewness of the lognormal distribution increases, the parameter estimation task becomes more and more involved. And finally, if the distribution is highly skewed and, at the same time, the estimation has to be based on a small number of observations, it becomes almost impossible to come up with meaningful values for the parameters of the lognormal distribution. The problems associated with the sampling of skewed (lognormal) distributions are discussed in a lucid and stimulating way by Fleming, who summarises his concerns under the catchy title Yep We're Skewed!; see Fleming (2007) and Fleming (2008).
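The (µ, σ²) values behind densities such as those in figure 4.1 can be recovered from a prescribed mean and standard deviation by inverting equations 4.5 and 4.3, giving σ² = log(1 + CV²) and µ = log(mean) − σ²/2. A small sketch using the mean and SD values quoted above:

```python
import numpy as np

def lognormal_params_from_mean_sd(mean, sd):
    """Invert m1 = exp(mu + sigma2/2) and CV^2 = exp(sigma2) - 1."""
    cv = sd / mean
    sigma2 = np.log(1.0 + cv**2)
    mu = np.log(mean) - 0.5 * sigma2
    return mu, sigma2

for sd in (5.0, 10.0, 20.0, 40.0):          # the SD values used for figure 4.1
    mu, sigma2 = lognormal_params_from_mean_sd(10.0, sd)
    print(f"CV = {sd / 10:.1f}:  mu = {mu:.3f},  sigma^2 = {sigma2:.3f}")
```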

To ensure that the modelling arrives at meaningful numbers for extreme tail quantiles, the parameter estimation should, of course, be able to produce reliable values for the distribution parameters. The combination of a highly skewed distribution and a rather limited number of observations, however, can make it impossible to come up with parameter values that are both accurate and reliable at the same time. If the parameters obtained from the estimation are highly uncertain, it would be best to state the uncertainties explicitly and also to incorporate these uncertainties in any subsequent calculations in which the parameter values are used.
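One simple way to make such parameter uncertainty explicit is a parametric bootstrap: refit the distribution to many resamples generated under the fitted parameters and report the spread of the refitted values. The sketch below is only an illustration of the idea; the sample size, the parameter values and the use of log-moment fitting are assumptions made here, not choices from the thesis.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# a small "observed" sample, generated here for illustration only
mu_true, sigma2_true, n = 1.0, 1.0, 20
data = rng.lognormal(mean=mu_true, sigma=np.sqrt(sigma2_true), size=n)

def fit_lognormal(x):
    """Fit (mu, sigma2) from the log-transformed data."""
    logs = np.log(x)
    return logs.mean(), logs.var(ddof=1)

mu_hat, sigma2_hat = fit_lognormal(data)

# parametric bootstrap: resample from the fitted model and refit
boot = np.array([
    fit_lognormal(rng.lognormal(mu_hat, np.sqrt(sigma2_hat), size=n))
    for _ in range(5_000)
])
print("mu_hat     =", round(mu_hat, 3), "bootstrap 95% interval:",
      np.round(np.percentile(boot[:, 0], [2.5, 97.5]), 3))
print("sigma2_hat =", round(sigma2_hat, 3), "bootstrap 95% interval:",
      np.round(np.percentile(boot[:, 1], [2.5, 97.5]), 3))
```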

4.2 Estimating lognormal parameters

In this section we consider several methods that can be used to estimate the values of the parameters of the lognormal distribution. Suppose one knows that the statistical properties of the population of interest can be described by the lognormal distribution, but that one does not know the correct values of the lognormal parameters µ and σ². If one then receives a number of observations from the underlying population, one can use these to define estimates µ̂ and σ̂² for the population parameters µ and σ². In the estimation one starts from the n observations X = (X₁, . . . , Xₙ) and, by combining the properties of the sample data with the general form of the lognormal distribution, one uses a specific estimation technique to arrive at estimated values for the lognormal parameters µ and σ².

In this thesis we will consider three approaches that one can use for the parameter estimation for lognormal distributions:

• Method of moments (MOM), with the parameter estimators µ̂_MOM and σ̂²_MOM;

• Maximum likelihood estimation (MLE), with the estimators µ̂_MLE and σ̂²_MLE;

• Uniformly minimum variance unbiased estimators (UMVUE): µ̂_UMVUE and σ̂²_UMVUE.

The different methods each come with their specific merits and limitations. We will briefly describe these lognormal estimators and consider the bias and the variance of the estimators for the population mean (M) and the population variance (V).

4.2.1 Method of moments (MOM) estimators

The moment m_k (k = 1, 2, . . . ) of a probability distribution is defined as the expectation of the k-th power of the random variable:

$$m_k = E[X^k] \qquad \text{for } k = 1, 2, \ldots. \tag{4.4}$$

From equation A.12 one finds that the moments of the lognormal distribution can be evaluated explicitly:

$$m_k = E[X^k] = e^{k\mu + \frac{k^2\sigma^2}{2}}. \tag{4.5}$$

In the MOM approach the lognormal parameter estimates are obtained from the estimators of the first two moments. The first sample moment m̂₁ = Σ xᵢ/n estimates the population average, and the second sample moment m̂₂ = Σ xᵢ²/n is an estimator for the second moment m₂ of the population. Making use of equation 4.5, the MOM distribution parameters µ̂_MOM and σ̂²_MOM can be found from the moment estimators m̂₁ and m̂₂. Equating the first moment, expressed in the estimated parameters, to the first sample moment gives

$$e^{\hat\mu_{\text{MOM}} + \frac{\hat\sigma^2_{\text{MOM}}}{2}} = \frac{1}{n}\sum_{i=1}^{n} X_i,$$

and for the second moment one has the analogous relation

$$e^{2\left(\hat\mu_{\text{MOM}} + \hat\sigma^2_{\text{MOM}}\right)} = \frac{1}{n}\sum_{i=1}^{n} X_i^2.$$

Combining these relations, one readily obtains the MOM values for the parameters of the lognormal distribution:

$$\begin{cases} \hat\mu_{\text{MOM}} = 2\log(\hat m_1) - \tfrac{1}{2}\log(\hat m_2) \\ \hat\sigma^2_{\text{MOM}} = \log(\hat m_2) - 2\log(\hat m_1). \end{cases} \tag{4.6}$$
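Equation 4.6 translates directly into code (a minimal sketch, not the implementation used for the thesis simulations):

```python
import numpy as np

def lognormal_mom(x):
    """Method-of-moments estimates of (mu, sigma2), equation 4.6."""
    x = np.asarray(x, dtype=float)
    m1 = x.mean()                 # first sample moment
    m2 = (x**2).mean()            # second sample moment
    sigma2 = np.log(m2) - 2.0 * np.log(m1)
    mu = 2.0 * np.log(m1) - 0.5 * np.log(m2)
    return mu, sigma2

# quick check on a large simulated sample: estimates approach the true values
rng = np.random.default_rng(seed=4)
print(lognormal_mom(rng.lognormal(mean=1.0, sigma=0.8, size=200_000)))  # roughly (1.0, 0.64)
```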

The MOM provides a relatively simple and straightforward way to estimate the parameters of the lognormal distribution. One simply calculates the moments m̂₁ and m̂₂ for the sample data and inserts these moment values into the above expressions to obtain the estimates for the parameters µ̂ and σ̂².

MOM bias and variance for M and V

Since the estimates for the distribution mean (M = E[X]) and variance (V = E[(X − E[X])²]) are directly obtained from the sample data, the expressions for the bias and the MSE of the MOM estimators for these properties can be found in a straightforward manner. Evaluating the expectation of M̂ = m̂₁, one finds that the estimator of the population mean is unbiased:

$$E[\hat M_{\text{MOM}}] = E[\hat m_1] = E[X] = M. \tag{4.7}$$

Making use of equation 4.5 for the first and second moments, the variance of M̂ = m̂₁ can be written as

$$\operatorname{var}[\hat M] = \operatorname{var}\!\left[\frac{1}{n}\sum X_i\right] = \frac{1}{n^2}\operatorname{var}\!\left[\sum X_i\right] = \frac{1}{n}\operatorname{var}[X] = \frac{e^{2\mu+\sigma^2}\left(e^{\sigma^2}-1\right)}{n}. \tag{4.8}$$


Proceeding in the same way, one finds that the second and higher order sample moments are also unbiased estimators of the associated population moments:

$$E[\hat m_p] = E\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i^p\right] = m_p \qquad \text{for } p = 1, 2, \ldots.$$

In appendix D we show that the MOM estimator, or, what is the same, the sample variance

$$\hat V_{\text{MOM}} = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \hat m_1\right)^2,$$

is an unbiased estimator of the population variance V, that is

$$E[\hat V_{\text{MOM}}] = E\!\left[(X - E[X])^2\right] = V. \tag{4.9}$$

The estimation variance or MSE of m̂₂ is given by the expression

$$\operatorname{var}[\hat m_2] = \operatorname{var}\!\left[\frac{1}{n}\sum X_i^2\right] = \frac{\operatorname{var}[X^2]}{n} = \frac{E[X^4] - \left(E[X^2]\right)^2}{n} = \frac{e^{4\mu+4\sigma^2}\left(e^{4\sigma^2}-1\right)}{n},$$

and a similar, although slightly more involved, exercise (see appendix D) yields the expression for the variance or MSE of the variance estimator V̂_MOM:

$$\operatorname{var}(\hat V_{\text{MOM}}) = \operatorname{MSE}(\hat V_{\text{MOM}}) = \frac{1}{n}\left(e^{\mu+\frac{1}{2}\sigma^2}\right)^4\left\{e^{6\sigma^2} - 4e^{3\sigma^2} - e^{2\sigma^2} + 8e^{\sigma^2} - 4\right\} + \frac{2}{n(n-1)}\left(e^{\mu+\frac{1}{2}\sigma^2}\right)^4\left\{e^{2\sigma^2} - 2e^{\sigma^2} + 1\right\}. \tag{4.10}$$
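The above expressions can be checked with a small Monte Carlo experiment (a sketch; the settings n = 10, µ = 0, σ² = 0.5 and the number of repetitions are illustrative choices, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(seed=5)
mu, sigma2, n, n_rep = 0.0, 0.5, 10, 200_000
sigma = np.sqrt(sigma2)

samples = rng.lognormal(mu, sigma, size=(n_rep, n))
m_hat = samples.mean(axis=1)                 # MOM estimator of M (first sample moment)
v_hat = samples.var(axis=1, ddof=1)          # MOM estimator of V (sample variance)

M = np.exp(mu + sigma2 / 2)                                   # true mean
V = np.exp(2 * mu + sigma2) * (np.exp(sigma2) - 1)            # true variance

# equations 4.7 and 4.8: no bias, and the variance of M_hat
print(m_hat.mean(), M)                                        # close, up to Monte Carlo noise
print(m_hat.var(), np.exp(2 * mu + sigma2) * (np.exp(sigma2) - 1) / n)

# equations 4.9 and 4.10: no bias, and the variance of V_hat
a = np.exp(mu + sigma2 / 2) ** 4
var_v = ((a / n) * (np.exp(6 * sigma2) - 4 * np.exp(3 * sigma2) - np.exp(2 * sigma2)
                    + 8 * np.exp(sigma2) - 4)
         + (2 * a / (n * (n - 1))) * (np.exp(2 * sigma2) - 2 * np.exp(sigma2) + 1))
print(v_hat.mean(), V)                                        # close, up to Monte Carlo noise
print(v_hat.var(), var_v)                                     # Monte Carlo noise is larger here
```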

4.2.2 Maximum likelihood estimator

The maximum likelihood method is another standard method for the estimation of the parameters θ that occur in the distribution function p(x; θ) of the random variable X. The maximum likelihood estimator (MLE) θ̂_MLE defines appropriate values for the distribution parameters on the basis of n i.i.d. observations x = (x₁, x₂, . . . , xₙ). The values for the parameters are found by maximizing the likelihood function L(θ|x) for fixed values of the sample data x. The likelihood function is the joint probability density of the sample, regarded as a function of the parameters θ for the given observations x,

$$L(\theta|x) = p(x|\theta) = \prod_{i=1}^{n} p(x_i;\theta),$$

in which we used the i.i.d. property to rewrite the joint distribution of the sample vector X as a product of probability densities for the random variables X₁, . . . , Xₙ.

The MLE value θ̂_MLE is defined as the θ for which the likelihood function L, or equivalently the log-likelihood ℓ = log(L), attains its maximum. Differentiating L(θ|x) with respect to θ and setting the derivatives to zero leads to one or more conditions from which the optimum values of the distribution parameters can be solved.

The lognormal distribution is defined in terms of the parameters θ = (µ, σ²). The MLE values for the parameters are found by optimizing the log-likelihood

$$\ell(\theta|x) = \log(L(\theta|x)) = -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{n}\log(x_i) - \sum_{i=1}^{n}\frac{(\log(x_i)-\mu)^2}{2\sigma^2}.$$

Differentiating ℓ(µ, σ²|x) with respect to µ and setting the result to zero leads to the condition for the parameter µ in the likelihood maximum

$$\mu_{\max} = \frac{1}{n}\sum_{i=1}^{n}\log(x_i). \tag{4.11}$$


The MLE value for the parameter σ² is found in the same way. Setting the σ²-derivative of the log-likelihood to zero yields the σ²-condition for the ℓ(µ, σ²|x) maximum

$$\sigma^2_{\max} = \frac{1}{n}\sum_{i=1}^{n}\left(\log(x_i) - \mu_{\max}\right)^2. \tag{4.12}$$

The estimator σ²_max that directly follows from the likelihood optimality conditions is biased. In section 3.3 we considered different estimators for the variance of the normal distribution; the unbiased estimator was the one that used the factor 1/(n − 1) for the sample size normalization. Since the lognormal shape parameter σ² equals the variance of the associated normal distribution, we will use the unbiased variant for our MLE estimator σ̂²_MLE. In the sequel we will thus use the following MLE estimators:

$$\begin{cases} \hat\mu_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n}\log(X_i) \\ \hat\sigma^2_{\text{MLE}} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\log(X_i) - \hat\mu_{\text{MLE}}\right)^2 \end{cases} \tag{4.13}$$
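In code, equation 4.13 amounts to the sample mean and the 1/(n − 1) sample variance of the log-transformed data (a minimal sketch, not the implementation used for the thesis simulations):

```python
import numpy as np

def lognormal_mle(x):
    """Estimators of equation 4.13: mean and 1/(n-1) variance of log(x)."""
    logs = np.log(np.asarray(x, dtype=float))
    mu_hat = logs.mean()
    sigma2_hat = logs.var(ddof=1)     # ddof=1 gives the 1/(n-1) normalisation
    return mu_hat, sigma2_hat

rng = np.random.default_rng(seed=6)
print(lognormal_mle(rng.lognormal(mean=1.0, sigma=0.8, size=50)))   # rough estimates of (1.0, 0.64)
```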

The MLE approach is, no doubt, the most frequently used method for the estimation of lognormal parameters. It is soundly based, and the statistical properties of the MLE parameters, such as the error bounds, are well understood and can readily be calculated. One reason for the popularity of the MLE may be that the technique is available in virtually every statistical modelling toolbox.

MLE bias and variance for M and V

The MLE estimator for the population mean (M) is defined in terms of the MLE estimators for the distribution parameters

$$\hat M_{\text{MLE}} = e^{\hat\mu_{\text{MLE}} + \frac{\hat\sigma^2_{\text{MLE}}}{2}}, \tag{4.14}$$

and the MLE estimator for the variance (V) is given by

$$\hat V_{\text{MLE}} = e^{2\hat\mu_{\text{MLE}} + \hat\sigma^2_{\text{MLE}}}\left(e^{\hat\sigma^2_{\text{MLE}}} - 1\right). \tag{4.15}$$
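Given a sample, these plug-in estimators are a few lines of code (a sketch; the sample here is simulated for illustration only):

```python
import numpy as np

def mle_mean_variance(x):
    """Plug-in estimators of M and V from equations 4.14 and 4.15."""
    logs = np.log(np.asarray(x, dtype=float))
    mu_hat, sigma2_hat = logs.mean(), logs.var(ddof=1)                      # equation 4.13
    m_hat = np.exp(mu_hat + 0.5 * sigma2_hat)                               # equation 4.14
    v_hat = np.exp(2.0 * mu_hat + sigma2_hat) * (np.exp(sigma2_hat) - 1.0)  # equation 4.15
    return m_hat, v_hat

rng = np.random.default_rng(seed=8)
sample = rng.lognormal(mean=1.0, sigma=0.8, size=25)
print(mle_mean_variance(sample))   # rough estimates of M = exp(1.32) and the corresponding V
```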

The MLE-based estimators for M and V are biased. In section A.1 we show that the expected value of M̂_MLE for samples with n observations is given by

$$E[\hat M_{\text{MLE}}] = M\, e^{\frac{\sigma^2}{2}\left(\frac{1-n}{n}\right)} \left(1 - \frac{\sigma^2/2}{\frac{n-1}{2}}\right)^{-\frac{n-1}{2}}. \tag{4.16}$$

And in a similar way, combining equations 4.13 and 4.15, one finds that the expectation of the variance estimator V̂_MLE can be written as

$$E[\hat V_{\text{MLE}}] = V\,\frac{e^{\sigma^2\left(\frac{2}{n}-1\right)}}{e^{\sigma^2}-1}\left[\left(1 - \frac{2\sigma^2}{\frac{n-1}{2}}\right)^{-\frac{n-1}{2}} - \left(1 - \frac{\sigma^2}{\frac{n-1}{2}}\right)^{-\frac{n-1}{2}}\right]. \tag{4.17}$$

From the above expressions it follows that the MLE estimators for M and V are asymptotically unbiased, in the sense that for large values of the sample size n their expectations approach the true population values. In fact, making use of the following limit for the power factors occurring in the MLE expectations,

$$\lim_{n\to\infty}\left(1 - \frac{\kappa\sigma^2}{\frac{n-1}{2}}\right)^{-\frac{n-1}{2}} = e^{\kappa\sigma^2}, \tag{4.18}$$

one readily verifies that E[M̂_MLE] → M and E[V̂_MLE] → V as n → ∞.
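The small sample bias predicted by equation 4.16 can also be verified by simulation (again a sketch with illustrative settings n = 10, µ = 0, σ² = 0.5; note that the closed-form expectations are only finite when σ² is sufficiently small relative to n, as the power factors show):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
mu, sigma2, n, n_rep = 0.0, 0.5, 10, 200_000
sigma = np.sqrt(sigma2)

logs = np.log(rng.lognormal(mu, sigma, size=(n_rep, n)))
mu_hat = logs.mean(axis=1)
sigma2_hat = logs.var(axis=1, ddof=1)
m_hat = np.exp(mu_hat + sigma2_hat / 2)                           # equation 4.14

M = np.exp(mu + sigma2 / 2)
# equation 4.16: expected value of the MLE estimator of the mean
expected = M * np.exp(0.5 * sigma2 * (1 - n) / n) * (1 - sigma2 / (n - 1)) ** (-(n - 1) / 2)
print(m_hat.mean(), expected)     # should agree up to Monte Carlo noise
```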
