
Modeling insurance loss data — the applicability of mixture distributions using the EM-algorithm

Master's Thesis to obtain the degree in Actuarial Science and Mathematical Finance
University of Amsterdam
Faculty of Economics and Business
Amsterdam School of Economics

Author: Florian Maassen

Student nr: 11371706

Email: florianmaassen@icloud.com

Date: August 15, 2017

Supervisor: dr. K. (Katrien) Antonio


Statement of Originality

This document is written by Florian Maassen, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

In this thesis we investigate the applicability of mixture distributions to insurance loss data using the expectation-maximization algorithm. In particular, we compare the use of Erlang mixtures with a common scale parameter versus various other light- and heavy-tailed mixture distributions where all parameters are free. We consider complete and truncated data, and provide an implementation in R of the methods described. We fit the mixtures on two insurance loss datasets to compare the two approaches. In addition, we conduct a simulation study to explore the limits of the two approaches. We observe that the Erlang mixtures are not flexible enough to capture good fits for the body of the distributions when the data are multimodal and heavily skewed. The more flexible approach, where all distribution parameters are free, provides a better fit in the bodies of the distributions compared to the Erlang mixtures. However, we do observe limits to its applicability on heavy-tailed data. In addition, we elaborate on the advantages and disadvantages of both approaches.

Keywords Mixture Distributions, Erlang Mixtures, Insurance Loss Data, EM-algorithm, Heavy-tailed Data


Contents

Preface

1 Introduction

2 Mixture distributions for loss modeling
  2.1 Mixture distributions
    2.1.1 Mixture distributions on complete data
    2.1.2 Mixture distributions on truncated data
  2.2 Suitable mixture distributions for loss modeling

3 Fitting Mixture distributions
  3.1 The expectation maximization algorithm for mixture distributions
    3.1.1 Constructing the likelihood
    3.1.2 The algorithm
    3.1.3 Convergence
  3.2 Truncated Erlang mixtures
    3.2.1 Initial Step
    3.2.2 M-step solution
    3.2.3 Adjusting and Tuning
  3.3 Other truncated light- and heavy-tailed distributions
    3.3.1 Initial Step
    3.3.2 M-step solutions
    3.3.3 Adjusting and Tuning

4 Case studies
  4.1 Risk Measures for mixture distributions
  4.2 Danish Fire losses
  4.3 Secura Re Car Insurance

5 Simulation Study
  5.1 GPD-simulation
    5.1.1 Unimodal GPD data
    5.1.2 Bimodal GPD data

6 Conclusion

Appendix A: R implementation of the flexible EM-algorithm

References


Preface

In the master's programme of Actuarial Science and Mathematical Finance the focus often lies on the distributions that are used in financial and actuarial modeling. Characteristics of particular financial or insurance-related datasets are described, and it is explained why the assumptions we make in our models are incorrect given real-life data. Therefore, I had a natural interest in what kind of distributions might be a better choice for the data actuaries deal with on a regular basis. Through my supervisor dr. Katrien Antonio, I started to study the available literature on mixture distributions, something that had not been a part of the curriculum of my bachelor's and master's degrees so far. Although mathematically challenging, I was curious to explore the applications of these mixture distributions to insurance loss data. Writing this thesis has given me greater insight into the mathematics of modeling insurance loss data. In addition, it brought me right to the frontier of the research area. I am certain that at some point I will use the techniques described in this thesis in my future career. Finally, I am happy to note that my academic adventure has neared its end.


Chapter 1

Introduction

Financial losses have always been an important topic in insurance and finance. Modeling losses can help an insurance actuary with setting adequate premia for insurance contracts. In addition, risk managers can determine appropriate capital requirements through the use of risk measures to satisfy solvency regulations.

Datasets containing insurance claim sizes are called insurance loss data. Claim sizes are non-negative observations. Typically, loss data show positive skewness. They may also be multimodal. Multimodality means that a probability distribution has multiple modes, or values that appear more often than others. This usually indicates that observations in a subset of the total data have something in common. Multimodality is also present in insurance portfolios where multiple risks are pooled. Another characteristic of these data is that they are often heavy-tailed. Heavy-tailed datasets are recognized by having a few observations that are large outliers compared to the majority of the data. These few outliers can have great consequences for an insurer and therefore stress the importance of not underestimating these scarce events. Two examples of such datasets are the widely studied Danish Fire Losses (Davison, 2013) and Secura Re insurance claims (Beirlant et al., 2004), both shown in figure 1.1.

[Figure 1.1 shows histograms of claim sizes in two panels: (a) Danish Fire Losses and (b) Secura Re Claims.]

Figure 1.1: Histograms of the Danish Fire and Secura Re data

These two datasets will also be of interest in this thesis. The Danish Fire Losses consist of 2492 claims in millions of Danish Krone that were collected by Copenhagen Reinsurance between 1980 and 1990. This particular dataset is a good example of heavy-tailed data, since the three largest claims are 145, 152 and 263, which are severe outliers given the histogram in figure 1.1 (a). Insurance loss data may often also be truncated due to deductibles in insurance contracts. This is usually the case in reinsurance contracts. Truncation means that the domain of the data is bounded by some lower or upper bound. An example is the Secura Re Claims dataset in figure 1.1 (b), which contains 371 automobile claims from European insurance companies between 1988 and 2001. These data are left-truncated at 1 200 000, which means that claims below this value are not reported. When truncation is present, we speak of incomplete data.

Finding adequate distributions that mimic these types of datasets can be a challenge, and has been the topic of many research papers. Using one or more parametric distributions to model these losses is common in the literature. Usually, a distribution-based approach is followed, by which first a model is identified, and with the help of the available literature one can determine a best possible model for that specific situation (Klugman and Rioux, 2006). For example, Kreer et al. (2015) propose a goodness-of-fit procedure for left-truncated Weibull distributions. However, it would be useful to find a single approach that can be used to fit insurance losses (Klugman and Rioux, 2006). Although for a given dataset parametric distributions may be a good fit, they are often not flexible enough to capture all the characteristics of insurance loss data. Composite (or splicing) models have been used by Klugman et al. (2012) to better deal with, for example, the tail risk, by combining two distributions for the body and the tail of the distribution. However, when the risks of an insurance portfolio are aggregated, composite models are not able to account for multimodality. Mixture distributions are therefore of interest due to their flexibility, since they are not limited to two distributions. They are also attractive for modeling complex data, as they can express rather complex densities in terms of a combination of simpler distributions. Mixture models have a broad range of applications, for example in cluster analysis (McLachlan and Peel, 2001) and medical applications (Schlattmann, 2009). With respect to insurance loss modeling, limited research has been done on mixture distributions in comparison with single-distribution approaches or composite models. So far Lee and Lin (2010) and Verbelen et al. (2015, 2016) have studied Erlang mixtures. Miljkovic and Grün (2016) have considered various other mixture distributions to model insurance loss data, in which all model parameters are free. Both make use of the Expectation-Maximization algorithm.

In this thesis we will focus on the implementation of fitting mixture distributions to insurance loss data, by considering mixtures of Erlangs versus various other light- and heavy-tailed mixtures without common parameters, which we will refer to as the flexible approach. By first fitting them to the two datasets described above, we investigate the impact on risk measures such as Value-at-Risk (VaR) and Tail-Value-at-Risk (TVaR). We investigate the goodness of fit using various graphical plots. To further determine the best strategy in case of heavy-tailed data, a simulation study is conducted to observe the behaviour of these mixtures and the EM-algorithm when dealing with heavily skewed data.


Chapter 2

Mixture distributions for loss modeling

In this chapter we first introduce the concept of mixture distributions, also referred to as mixture models, on complete data. We will introduce a set of distributions which are used in Verbelen et al. (2015) and Miljkovic and Grün (2016). In addition, the consequence of truncation on mixture distributions is described.

2.1 Mixture distributions

2.1.1 Mixture distributions on complete data

In this thesis so-called variable-component mixture distributions are considered. We will formally define mixture distributions following definition 4.10 of Klugman et al. (2012).

Definition 2.1.1. A variable-component mixture distribution has a distribution function that can be written as

F(x) = \sum_{j=1}^{M} \alpha_j F_j(x),   (2.1)

where \sum_{j=1}^{M} \alpha_j = 1, all \alpha_j > 0 and M \in \mathbb{Z}_{+}.

Here \alpha = (\alpha_1, ..., \alpha_M) are mixing weights, and the F_j are component distributions. We restrict M to be a finite number, representing the number of components in the mixture. By construction of \alpha, a mixture distribution is thus a convex combination of M component distributions. For the pdf, it holds that

f(x) = \sum_{j=1}^{M} \alpha_j f_j(x),   (2.2)

where again \sum_{j=1}^{M} \alpha_j = 1 and all \alpha_j > 0. The mixture models that will be used in this thesis contain component densities f_j(x) that are all assumed to belong to the same parametric family. Let \theta_j be the parameter vector of component density f(x|\theta_j) for all j = 1, ..., M, and define \Theta = (\alpha, \theta_1, ..., \theta_M). Then we write the mixture model as

f(x|\Theta) = \sum_{j=1}^{M} \alpha_j f(x; \theta_j).   (2.3)

Note that we drop the subscript in f_j(\cdot|\theta_j) and use f(\cdot|\theta_j) instead, since each component density has the same pdf, only with unique parameters \theta_j.


2.1.2 Mixture distributions on truncated data

Truncating a distribution means to restrict its original support D = [a, b] to a subset of that support [t_l, t_u], where a ≤ t_l ≤ t_u ≤ b, without violating the properties of a probability distribution. Here t_l refers to lower truncation and t_u refers to upper truncation. Consider a random variable X that is distributed according to a continuous probability density f(x) and cdf F(x) with support D. In order to construct a new distribution, where only observations x_i ∈ [t_l, t_u] are of interest,

F(x | X \in [t_l, t_u]) = \frac{F(x) - F(t_l)}{F(t_u) - F(t_l)}.   (2.4)

The corresponding pdf is given by

f(x | X \in [t_l, t_u]) = \frac{f(x)}{F(t_u) - F(t_l)}.   (2.5)

We assume t_l and t_u to be known constants. In order to truncate a mixture density of form (2.1) one can simply truncate each individual density in the mixture. Using (2.5), the probability density function for a mixture of truncated densities with parameter vector \Theta = (\alpha, \theta_1, ..., \theta_M) becomes

f(x_i; t_l, t_u, \Theta) = \frac{f(x_i; \Theta)}{F(t_u; \Theta) - F(t_l; \Theta)} = \sum_{j=1}^{M} \alpha_j \frac{f(x_i; \theta_j)}{F(t_u; \Theta) - F(t_l; \Theta)} = \sum_{j=1}^{M} \alpha_j \frac{F(t_u; \theta_j) - F(t_l; \theta_j)}{F(t_u; \Theta) - F(t_l; \Theta)} \times \frac{f(x_i; \theta_j)}{F(t_u; \theta_j) - F(t_l; \theta_j)}.   (2.6)

Defining

\beta_j = \alpha_j \frac{F(t_u; \theta_j) - F(t_l; \theta_j)}{F(t_u; \Theta) - F(t_l; \Theta)}   (2.7)

and

f(x_i; t_l, t_u, \theta_j) = \frac{f(x_i; \theta_j)}{F(t_u; \theta_j) - F(t_l; \theta_j)},   (2.8)

we can write (2.6) as

f(x_i; t_l, t_u, \Theta) = \sum_{j=1}^{M} \beta_j f(x_i; t_l, t_u, \theta_j).   (2.9)

Equation (2.9) is in fact a mixture distribution with new "truncated" densities f(x_i; t_l, t_u, \theta_j) and mixture weights \beta_j.
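As an illustration of this construction, the following R sketch evaluates the truncated mixture density (2.9) from untruncated weights and component parameters; the gamma component family and the function name are assumptions made for illustration only, not part of the thesis implementation.

```r
# Sketch: truncated mixture density (2.9) with gamma components (illustrative choice).
# alpha: untruncated mixing weights, shape/scale: component parameters, tl/tu: truncation points.
dtrunc_mix <- function(x, alpha, shape, scale, tl = 0, tu = Inf) {
  # probability mass of each component on [tl, tu]
  mass <- pgamma(tu, shape, scale = scale) - pgamma(tl, shape, scale = scale)
  beta <- alpha * mass / sum(alpha * mass)            # truncated weights, eq. (2.7)
  dens <- sapply(seq_along(alpha), function(j)        # truncated component densities, eq. (2.8)
    dgamma(x, shape[j], scale = scale[j]) / mass[j])
  drop(dens %*% beta)                                  # mixture of truncated densities, eq. (2.9)
}

# Example: two gamma components, left-truncated at 1
dtrunc_mix(c(1.5, 3), alpha = c(0.6, 0.4), shape = c(2, 5), scale = c(1, 2), tl = 1)
```

The weights are renormalised exactly as in (2.7), so the sketch reduces to an ordinary mixture density when t_l = 0 and t_u = ∞.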

2.2 Suitable mixture distributions for loss modeling

The following probability distributions are often used as model densities in actuarial science, and more specifically in non-life insurance. We will use them here as mixing components. This list is inspired by Verbelen et al. (2015) and Miljkovic and Grün (2016). These densities will be used in the case and simulation studies further on in this thesis. Note that for all densities it holds that x ≥ 0. In figure 2.1 a graphical overview of all the densities considered below is given to illustrate their behaviour with respect to the distribution parameters. For completeness: all distributions other than the Erlang are considered by the more flexible approach in the rest of the thesis.

Erlang: X ∼ Erlang(r, θ)

The Erlang distribution is a form of the Gamma distribution where the shape parameter is restricted to integers, with scale parameter θ > 0. It can be seen as the distribution of a sum of independent exponential random variables. It was first used for modeling insurance loss data by Lee and Lin (2010), and used in more extensive research by Verbelen et al. (2015, 2016). The Erlang pdf is given by

f(x; r, \theta) = \frac{x^{r-1} e^{-x/\theta}}{\theta^r (r-1)!}.   (2.10)

In order to obtain the cumulative distribution function we need to integrate (2.10) by parts r times which yields

F(x; r, \theta) = \int_0^x \frac{s^{r-1} e^{-s/\theta}}{\theta^r (r-1)!}\, ds = 1 - \sum_{n=0}^{r-1} \frac{e^{-x/\theta} (x/\theta)^n}{n!}.   (2.11)
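Since the Erlang is a gamma distribution with integer shape, the pdf (2.10) and cdf (2.11) can be checked numerically in R against dgamma and pgamma; the snippet below is only a quick sanity check with arbitrarily chosen values.

```r
# Erlang(r, theta) is Gamma(shape = r, scale = theta) with integer r.
r <- 3; theta <- 2; x <- 4.5

# pdf (2.10) and cdf (2.11) written out explicitly
f_erlang <- x^(r - 1) * exp(-x / theta) / (theta^r * factorial(r - 1))
F_erlang <- 1 - sum(exp(-x / theta) * (x / theta)^(0:(r - 1)) / factorial(0:(r - 1)))

# agree with the built-in gamma functions
all.equal(f_erlang, dgamma(x, shape = r, scale = theta))  # TRUE
all.equal(F_erlang, pgamma(x, shape = r, scale = theta))  # TRUE
```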

The moment generating function of an Erlang distributed random variable X is obtained by evaluating

M_X(t) = E(e^{tX}) = \int_0^\infty e^{tx} \frac{x^{r-1} e^{-x/\theta}}{\theta^r (r-1)!}\, dx = \frac{(1/\theta - t)^{-r}}{\theta^r} \int_0^\infty \frac{(1/\theta - t)^r x^{r-1} e^{-x(1/\theta - t)}}{(r-1)!}\, dx.   (2.12)

Note that if t < 1/\theta, the integral in (2.12) is actually a density and thus integrates to one. Hence,

M_X(t) = \frac{(1/\theta - t)^{-r}}{\theta^r} = (1 - \theta t)^{-r}.   (2.13)

The mean, or the first moment, can be derived by taking the first order derivative of (2.13) with respect to t evaluated at zero.

E(X) = \frac{d}{dt} M_X(t) \Big|_{t=0} = r\theta (1 - t\theta)^{-r-1} \Big|_{t=0} = r\theta.   (2.14)

The variance is then given by,

Var(X) = \frac{d^2}{dt^2} M_X(t) \Big|_{t=0} - (r\theta)^2 = (r+1)\, r \theta^2 (1 - t\theta)^{-r-2} \Big|_{t=0} - (r\theta)^2 = r\theta^2,   (2.15)

using Var(X) = E(X^2) - E(X)^2.

Burr: X ∼ Burr(λ, θ, γ)

The Burr distribution was first introduced by Burr (1942) and has density

f(x; \lambda, \theta, \gamma) = \frac{\lambda \gamma (x/\theta)^{\gamma}}{x \left[ 1 + (x/\theta)^{\gamma} \right]^{\lambda + 1}},   (2.16)

with shape parameters λ > 0 and γ > 0, and scale parameter θ > 0. When shape parameter γ is equal to one, the Burr distribution reduces to a Pareto type II distribution, which is a heavy-tailed distribution with many actuarial applications. Its cumulative distribution function is given by

F(x; \lambda, \theta, \gamma) = 1 - \left[ 1 + \left( \frac{x}{\theta} \right)^{\gamma} \right]^{-\lambda}.   (2.17)

Inverse-Burr: X ∼ IBurr(τ, γ, θ)

The Inverse Burr distribution, also known as the Dagum distribution, has shape parameters τ > 0 and γ > 0 and scale parameter θ > 0. Its pdf is given by

f(x; \tau, \gamma, \theta) = \frac{\tau \gamma (x/\theta)^{\tau\gamma}}{x \left[ 1 + (x/\theta)^{\gamma} \right]^{\tau + 1}}.   (2.18)

Its cumulative distribution function is given by

F(x; \tau, \gamma, \theta) = \left[ 1 + \left( \frac{x}{\theta} \right)^{-\gamma} \right]^{-\tau}.   (2.19)

Gamma: X ∼ G(λ, θ)

The gamma density, with shape parameter λ > 0 and scale parameter θ > 0, has a probability density function of the form

f(x; \lambda, \theta) = \frac{1}{\Gamma(\lambda)\, \theta^{\lambda}}\, x^{\lambda - 1} e^{-x/\theta},   (2.20)

where \Gamma(\lambda) is the gamma function, which equals (\lambda - 1)! for integer \lambda. Its cdf is given by

F(x; \lambda, \theta) = \frac{1}{\Gamma(\lambda)}\, \gamma\!\left( \lambda, \frac{x}{\theta} \right),   (2.21)

where \gamma(\lambda, x/\theta) is the lower incomplete gamma function defined as \gamma(\lambda, x/\theta) = \int_0^{x/\theta} t^{\lambda - 1} e^{-t}\, dt.

Weibull: X ∼ W(λ, k)

The density function of a Weibull distribution is

f(x; \lambda, k) = \frac{k}{\lambda} \left( \frac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^k},   (2.22)

where k > 0 is the shape parameter and λ > 0 the scale parameter. The cdf is given by

F(x; \lambda, k) = 1 - e^{-(x/\lambda)^k}.   (2.23)

Log-normal: X ∼ LN(µ, σ2)

The log-normal distribution is the distribution of a random variable whose logarithm follows a normal distribution. Although it has fewer applications in actuarial science, it is often used in economics and finance, as seen in Black and Scholes (1973). The probability density function is

f(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\!\left( -\frac{(\log x - \mu)^2}{2\sigma^2} \right),   (2.24)

where \mu \in \mathbb{R} and \sigma > 0. The cdf is given by

F(x; \mu, \sigma) = \frac{1}{2} + \frac{1}{2} \operatorname{erf}\!\left( \frac{\ln x - \mu}{\sqrt{2}\, \sigma} \right),   (2.25)


where \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt is the Gauss error function.

Inverse Gaussian: X ∼ IG(µ, λ)

Despite what its name suggests, the inverse Gaussian distribution is not the inverse of a Gaussian distribution, nor is its cdf obtained by inverting the Gaussian cdf. Its pdf with parameters µ > 0 and λ > 0 is

f(x; \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi x^3}} \exp\!\left( \frac{-\lambda (x - \mu)^2}{2\mu^2 x} \right).   (2.26)

The cdf is given by

F(x; \mu, \lambda) = \Phi\!\left( \sqrt{\frac{\lambda}{x}} \left( \frac{x}{\mu} - 1 \right) \right) + \exp\!\left( \frac{2\lambda}{\mu} \right) \Phi\!\left( -\sqrt{\frac{\lambda}{x}} \left( \frac{x}{\mu} + 1 \right) \right),   (2.27)

where \Phi denotes the standard normal cdf.


[Figure 2.1 shows the pdfs of the distributions listed above for various parameter values, in panels (a) Erlang, (b) Burr, (c) Inverse Burr, (d) Gamma, (e) Weibull, (f) Log-normal and (g) Inverse Gaussian.]

Figure 2.1: Behaviour with respect to the distribution parameters of the considered distributions given in section 2.2.


Chapter 3

Fitting Mixture distributions

In insurance loss data, observations in the tail have a lot of influence on the distribution. Therefore it is preferred to use maximum likelihood estimation, since it considers the entire dataset. The Expectation-Maximization (EM) algorithm was first introduced by Dempster et al. (1977). They describe an iterative approach to finding maximum likelihood estimators for incomplete data and also consider various examples with mixture models. Since mixture distributions have a lot of parameters, it is often impossible to find closed-form solutions for maximum likelihood estimators. The EM-algorithm provides a good way of finding a solution to this problem, since the E and M steps in the algorithm may simplify the computation of the maximum likelihood estimators. In order to evaluate the best strategy in fitting loss models, we first introduce the general strategy when using the EM algorithm for mixture models. Then we will give a detailed description of the difference in the approach used by Verbelen et al. (2015) and Miljkovic and Grün (2016), as well as comment on the implementation.

3.1 The expectation maximization algorithm for mixture distributions

3.1.1 Constructing the likelihood

Suppose we have a mixture of the form (2.3) with parameter vector Θ = (α, θ1, ..., θM),

where the set of observations X = (x1, ..., xn) is fully observed. We define the likelihood

function as the probability that X occurs given parameter vector Θ:

L(\Theta|X) = \Pr(X|\Theta) = \prod_{i=1}^{n} \left( \sum_{j=1}^{M} \alpha_j f(x_i|\theta_j) \right).   (3.1)

The loglikelihood is defined as

l(\Theta|X) = \log L(\Theta|X) = \sum_{i=1}^{n} \log\!\left( \sum_{j=1}^{M} \alpha_j f(x_i|\theta_j) \right).   (3.2)

The maximum likelihood estimators (MLE) are found by optimising the loglikelihood given in (3.2). However, due to the presence of a sum inside the logarithm as a consequence of the mixture structure, (3.2) is difficult to optimize. The idea of the EM algorithm is to view the sample X as incomplete, since it is unknown from which component distribution an observation comes. By introducing a complete data vector, consisting of the observations X and component indicators Z, the likelihood function becomes much easier to evaluate.


First define the complete data vector as Y = (X, Z), where Z consists of M vectors z_j = (z_{1j}, ..., z_{nj}) for j = 1, ..., M, with

z_{ij} = \begin{cases} 1 & \text{if observation } x_i \text{ is generated by the } j\text{th component density } f(x_i; \theta_j), \\ 0 & \text{otherwise.} \end{cases}   (3.3)

The probability distribution of the vectors (z_1, ..., z_n) is a multinomial distribution of a single draw on M categories:

Z_1, ..., Z_n \overset{\text{i.i.d.}}{\sim} \text{Mult}_M(1, \alpha).   (3.4)

The corresponding probabilities \alpha = (\alpha_1, ..., \alpha_M) are the mixing weights for each component density, such that P(Z_i = z_i) = \alpha_1^{z_{i1}} \cdots \alpha_M^{z_{iM}}. Since every observation needs to come from exactly one of the components, this forms a partition and therefore a multinomial distribution is suitable. We can now rewrite the log-likelihood (3.2) with the new complete data vector as

l(\Theta|Y) = \sum_{i=1}^{n} \sum_{j=1}^{M} Z_{ij} \log\!\big( \alpha_j f(x_i|\theta_j) \big).   (3.5)

Note that expression (3.5) is equivalent to (3.2) by construction of Y. However, the complete-data log-likelihood is much easier to optimize.
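To make the difficulty with (3.2) concrete, the observed-data log-likelihood of a mixture can be coded directly as below; the gamma component family is an illustrative assumption.

```r
# Observed-data log-likelihood (3.2) of a gamma-component mixture (illustrative parametrization).
loglik_mix <- function(x, alpha, shape, scale) {
  # n x M matrix of weighted component densities alpha_j * f(x_i | theta_j)
  dens <- outer(x, seq_along(alpha),
                function(xi, j) alpha[j] * dgamma(xi, shape[j], scale = scale[j]))
  sum(log(rowSums(dens)))  # the sum inside the log is what blocks a closed-form maximizer
}
```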

Truncated data:

In case the set of observations X is truncated, we can simply rewrite the component densities and weights, and directly take truncation into account via (2.7) and (2.8). After rewriting we again have a mixture distribution with corresponding loglikelihood function

l(\Theta|Y, t_l, t_u) = \sum_{i=1}^{n} \sum_{j=1}^{M} Z_{ij} \log\!\big( \beta_j f(x_i|t_l, t_u, \theta_j) \big),   (3.6)

where tu and tl are the upper and lower truncation bounds respectively.

3.1.2 The algorithm

The aim of the EM-algorithm is to estimate Θ such that the observed data X are the most likely. Finding this maximum likelihood estimator can be difficult, since the latent variables Z are not observed. Therefore the problem is broken down into two steps that are evaluated in each iteration, starting from an initial estimate for Θ.

The E-step

The expectation step, or E-step, estimates the missing data Z from observed data X , using the conditional expectation of the log-likelihood function, which is often denoted as the Q-function:

E_{Z|X, \Theta^{(k-1)}}\big[ l(\Theta|Y) \big] = Q(\Theta|\Theta^{(k-1)}).   (3.7)

Here \Theta^{(k-1)} denotes the estimate of the previous iteration (k-1). Assume we are dealing with incomplete data, with known t_l and t_u. We define the Q-function as

Q(\Theta|\Theta^{(k-1)}) = E\!\left[ \sum_{i\in X} \sum_{j=1}^{M} Z_{ij} \log\!\big( \beta_j f(x_i|t_l, t_u, \theta_j) \big) \,\Big|\, X, \Theta^{(k-1)} \right] = \sum_{i\in X} \sum_{j=1}^{M} E\!\big[ Z_{ij} \,|\, X, \Theta^{(k-1)} \big] \log\!\big( \beta_j f(x_i|t_l, t_u, \theta_j) \big).   (3.8)


By the definition of the distribution of Zij, the expectation in (3.8) corresponds to the

probability that observation xi is drawn from component density j. We define these

probabilities as

z_{ij}^{(k)} = E\!\big[ Z_{ij} \,|\, X, \Theta^{(k-1)} \big] = \Pr\!\big( Z_{ij} = 1 \,|\, X, \Theta^{(k-1)} \big) = \frac{\beta_j^{(k-1)} f(x_i|t_l, t_u, \theta_j^{(k-1)})}{\sum_{m=1}^{M} \beta_m^{(k-1)} f(x_i|t_l, t_u, \theta_m^{(k-1)})}.   (3.9)

The z_{ij}^{(k)} are called the posterior probabilities. They have a closed-form solution for any density f. Using the definitions of the weights \beta_j in (2.7) and of the truncated densities f(x_i|t_l, t_u, \theta_j^{(k-1)}) in (2.8), they can be computed for any level of truncation. The Q function can now be written as

Q(\Theta|\Theta^{(k-1)}) = \sum_{i\in X} \sum_{j=1}^{M} z_{ij}^{(k)} \log\!\big( \beta_j f(x_i|t_l, t_u, \theta_j) \big) = \sum_{i\in X} \sum_{j=1}^{M} z_{ij}^{(k)} \Big( \log(\beta_j) + \log\!\big( f(x_i|t_l, t_u, \theta_j) \big) \Big) = \sum_{i\in X} \sum_{j=1}^{M} z_{ij}^{(k)} \log(\beta_j) + \sum_{i\in X} \sum_{j=1}^{M} z_{ij}^{(k)} \log\!\big( f(x_i|t_l, t_u, \theta_j) \big).   (3.10)

The M-step

The M-step, or maximization step, maximizes the Q function obtained in the E-step with respect to the model parameter Θ. Hence,

\Theta^{(k)} = \underset{\Theta}{\operatorname{argmax}}\; Q(\Theta|\Theta^{(k-1)}).   (3.11)

Looking at the form of (3.10), the maximization can be split into two parts: the mixing weights \beta^{(k)} and the density parameters \theta^{(k)}. For the mixing weights \beta, we need to maximize

\sum_{i\in X} \sum_{j=1}^{M} z_{ij}^{(k)} \log(\beta_j).   (3.12)

Note that z_{ij}^{(k)} does not depend on \beta^{(k)}, but on \beta^{(k-1)}, which is known from the previous iteration (k-1). The restriction \sum_{j=1}^{M} \beta_j = 1 is enforced by setting \beta_M = 1 - \sum_{j=1}^{M-1} \beta_j. Equation (3.12) can then be rewritten as

\sum_{i\in X} \left( \sum_{j=1}^{M-1} z_{ij}^{(k)} \log(\beta_j) + z_{iM}^{(k)} \log\!\Big( 1 - \sum_{j=1}^{M-1} \beta_j \Big) \right).   (3.13)

Taking the partial derivative with respect to βj and setting it equal to zero yields

\sum_{i\in X} \sum_{j=1}^{M-1} \left( \frac{z_{ij}^{(k)}}{\beta_j^{(k)}} - \frac{z_{iM}^{(k)}}{1 - \sum_{j=1}^{M-1} \beta_j^{(k)}} \right) = \sum_{i\in X} \sum_{j=1}^{M-1} \left( \frac{z_{ij}^{(k)}}{\beta_j^{(k)}} - \frac{z_{iM}^{(k)}}{\beta_M^{(k)}} \right) = 0.   (3.14)

Rearranging terms yields

\beta_j^{(k)} = \frac{\sum_{i\in X} z_{ij}^{(k)}}{\sum_{i\in X} z_{iM}^{(k)}}\, \beta_M^{(k)} \quad \text{for } j = 1, ..., M-1.   (3.15)


Using the restriction \beta_M = 1 - \sum_{j=1}^{M-1} \beta_j and the fact that the z_{ij} form a partition (each observation must come from one of the component densities) we obtain

1 = \sum_{j=1}^{M} \beta_j^{(k)} = \frac{\sum_{i\in X} \sum_{j=1}^{M} z_{ij}^{(k)}\; \beta_M^{(k)}}{\sum_{i\in X} z_{iM}^{(k)}} = \frac{n\, \beta_M^{(k)}}{\sum_{i\in X} z_{iM}^{(k)}}.   (3.16)

Hence

\beta_M^{(k)} = \frac{\sum_{i\in X} z_{iM}^{(k)}}{n}.   (3.17)

After substituting (3.17) in (3.16), it can be seen that (3.17) actually holds for all βj.

Intuitively, the updated weights are the component-wise means of the posterior probabilities. Hence, the optimal weights \beta^{(k)} are

\beta_j^{(k)} = \frac{\sum_{i\in X} z_{ij}^{(k)}}{n},   (3.18)

for j = 1, ..., M.
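The E-step (3.9) and the weight update (3.18) translate almost literally into R; the sketch below again uses truncated gamma components as an illustrative assumption.

```r
# E-step (3.9): posterior probabilities z_ij for a truncated mixture with gamma components
# (illustrative family), followed by the weight update (3.18).
e_step <- function(x, beta, shape, scale, tl = 0, tu = Inf) {
  mass <- pgamma(tu, shape, scale = scale) - pgamma(tl, shape, scale = scale)
  num <- outer(x, seq_along(beta), function(xi, j)
    beta[j] * dgamma(xi, shape[j], scale = scale[j]) / mass[j])
  num / rowSums(num)                         # n x M matrix of posterior probabilities
}

update_weights <- function(z) colMeans(z)    # beta_j^(k) = (1/n) sum_i z_ij, eq. (3.18)
```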

Next, we need to maximize

\sum_{i\in X} \sum_{j=1}^{M} z_{ij}^{(k)} \log\!\big( f(x_i|t_l, t_u, \theta_j) \big).   (3.19)

I will elaborate on the M-step solutions for the various densities we consider in the next sections. Often there is no closed-form solution for the parameters. The EM-algorithm stops when the difference in likelihood between \Theta^{(k)} and \Theta^{(k-1)} is sufficiently small.
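Putting the pieces together, the iteration and its stopping rule can be organised as in the following skeleton, where e_step, m_step and loglik are hypothetical helper functions standing in for the steps described above.

```r
# Skeleton of the EM iteration with a simple convergence check (hypothetical helpers).
em_fit <- function(x, theta0, tol = 1e-3, max_iter = 1000) {
  theta <- theta0
  ll_old <- -Inf
  for (k in seq_len(max_iter)) {
    z     <- e_step(x, theta)        # posterior probabilities, eq. (3.9)
    theta <- m_step(x, z, theta)     # weight update (3.18) and component estimates (3.19)
    ll    <- loglik(x, theta)        # observed-data log-likelihood (3.2)/(3.6)
    if (ll - ll_old < tol) break     # stop when the improvement is sufficiently small
    ll_old <- ll
  }
  list(theta = theta, loglik = ll, iterations = k)
}
```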

3.1.3 Convergence

A pitfall of the EM-algorithm is that it depends on the initial estimate of Θ, and it may therefore reach a local maximum of the log-likelihood function (Miljkovic and Grün, 2016). Therefore, it is important to carefully select a starting point for the algorithm. Obviously, the closer the initial value is to the maximum likelihood estimates, the faster the algorithm will converge.

3.2 Truncated Erlang mixtures

Lee & Lin (2010) first formulated the EM algorithm for a finite mixture of Erlang distributions to a complete dataset, under the assumption of a common scale parameter, fixed shape parameters r and a fixed number of mixing components M . In Verbelen et. al. (2015) this is further expanded by introducing the EM algorithm for truncated and censored data, of which also the multivariate extension is available in Verbelen et. al. (2016). I follow the approach of Verbelen et. al. (2015) for univariate truncated data without censoring. Recall density (2.10); rewritten as a truncated mixture distribution with known truncation points tl and tu we obtain

f(x_i|\Theta) = \sum_{j=1}^{M} \beta_j f(x_i|t_l, t_u, r_j, \theta),   (3.20)

where \Theta = (t_l, t_u, r, \beta, \theta). We write the cumulative mixture distribution function as

F(x_i|\Theta) = \sum_{j=1}^{M} \beta_j F(x_i|t_l, t_u, r_j, \theta).   (3.21)

Note that for the Erlang mixtures there are not M different parameter vectors but only a single common scale parameter to be updated in the EM algorithm.


3.2.1 Initial Step

The denseness property by Tijms (1994) provides a suitable starting point for the EM algorithm for Erlang mixtures. The initial estimate of θ is set to

\theta^{(0)} = \frac{\max(d)}{r_M}.   (3.22)

The initial weights α are set to

\alpha_j^{(0)} = \frac{\sum_{i=1}^{n} I\!\left( r_{j-1}\theta^{(0)} < d_i \le r_j\theta^{(0)} \right)}{n},   (3.23)

for j ∈ (1, ..., M). Note that r_M is the largest shape value and r_0 = 0. Intuitively the weights are the relative frequencies of the number of datapoints in the interval (\theta^{(0)} r_{j-1}, \theta^{(0)} r_j]. Transformation of the weights \alpha_j into \beta_j as described in (2.7) takes truncation into account.
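A small R sketch of this initialization, following (3.22) and (3.23); the function name is illustrative and d denotes the data vector.

```r
# Initial step for an Erlang mixture, following (3.22)-(3.23): d is the data vector,
# r an increasing vector of integer shapes r_1 < ... < r_M (with r_0 = 0 implicit).
init_erlang <- function(d, r) {
  theta0 <- max(d) / r[length(r)]                  # eq. (3.22)
  breaks <- c(0, r) * theta0                       # interval bounds theta0 * r_{j-1}, theta0 * r_j
  alpha0 <- as.numeric(table(cut(d, breaks, include.lowest = TRUE))) / length(d)  # eq. (3.23)
  list(theta = theta0, alpha = alpha0)
}

init_erlang(rexp(500, rate = 1), r = 1:10)   # small usage example with simulated data
```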

3.2.2 M-step solution

Since there is only one common scale parameter θ to estimate rather than M parameter vectors \theta_j, there is only one partial derivative in the M-step with respect to density parameters. This partial derivative is calculated in Appendix B of Verbelen et al. (2015), and is equal to

\frac{\partial Q(\Theta|\Theta^{(k-1)})}{\partial \theta}\bigg|_{\theta=\theta^{(k)}} = \left[ \frac{1}{\theta^2} \sum_{i\in X} x_i - \frac{n}{\theta} \sum_{j=1}^{M} \beta_j^{(k)} r_j - \frac{n}{\theta^2} \sum_{j=1}^{M} \beta_j^{(k)} \frac{t_l^{\,r_j} e^{-t_l/\theta} - t_u^{\,r_j} e^{-t_u/\theta}}{\theta^{r_j-1} (r_j-1)!\, \big( F(t_u|r_j, \theta) - F(t_l|r_j, \theta) \big)} \right]_{\theta=\theta^{(k)}}   (3.24)

for truncated data. Setting (3.24) equal to zero yields no closed-form solution for \theta^{(k)}, therefore a Newton-type algorithm is used. A Newton-type algorithm is a heuristic to iteratively approximate the root of a continuous function by starting with an initial guess x_n and setting x_{n+1} = x_n - f(x_n)/f'(x_n), until the difference is sufficiently small. Here the initial guess for \theta^{(k)} is \theta^{(k-1)}.
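For reference, a generic Newton-type root search of the kind described above can be sketched as follows, here with a numerical derivative; applying it to the score (3.24) with starting value θ^{(k-1)} is left implicit and the helper name is illustrative.

```r
# Generic Newton-type root search x_{n+1} = x_n - f(x_n)/f'(x_n), with a numerical derivative.
newton_root <- function(f, x0, tol = 1e-8, max_iter = 100, h = 1e-6) {
  x <- x0
  for (i in seq_len(max_iter)) {
    fx  <- f(x)
    dfx <- (f(x + h) - f(x - h)) / (2 * h)   # central-difference approximation of f'
    x_new <- x - fx / dfx
    if (abs(x_new - x) < tol) return(x_new)
    x <- x_new
  }
  x
}

newton_root(function(t) t^2 - 2, x0 = 1)   # converges to sqrt(2)
```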

3.2.3 Adjusting and Tuning

Shape parameters

For the Erlang mixtures we have fixed the shape parameters and used r_j = s \times j for j = 1, ..., M, for some integer s, as initial values. Here s is a spread factor which allows for a wider spread on which the initial weights are determined. However, the likelihood could still improve with a different set of shape parameters. A relatively straightforward but complete method by Verbelen et al. (2015) to update the shape parameters is as follows. After finding the best fitted mixture with the EM algorithm:

1. Repeat the EM algorithm with the final estimate for the weights and θ, but with shape parameters (r_1, ..., r_{M-1}, r_M + 1). Repeat this until the likelihood does not improve anymore. Then apply this to all other shape parameters, r_{M-1} down to r_1, until all shape parameters are updated and the likelihood does not increase anymore.

2. Do the same as in step 1, but now deducting one from each shape parameter in turn until all are updated.

3. Repeat steps 1 and 2 until the likelihood does not improve anymore.

After executing these steps for multiple spreads, the likelihood cannot be improved anymore by changing any of the parameters in the current mixture.


Reducing the number of Erlangs

After adjusting the shape parameters, it is important to investigate possible overfitting. The Akaike Information Criterion (AIC, Akaike 1974) and Bayesian Information Criterion (BIC, Schwarz, 1978) penalize an increasing complexity of the model. Using more components in the mixture will likely result in a better fit. However, model complexity increases severely, and therefore the information criteria allow us to quantify this trade-off. The AIC and BIC are defined as

AIC = 2 \times df - 2\log\!\big( L(\Theta^{(k)}|Y) \big),
BIC = \log(n) \times df - 2\log\!\big( L(\Theta^{(k)}|Y) \big),

where df is the number of parameters estimated in the model, or the degrees of freedom. The lower the values of these information criteria, the better the model. Starting off with a relatively high number M of initial Erlangs, we can determine the best estimate using the EM-algorithm and the shape updating scheme described above. Using a backward selection technique, we then use M-1 as the number of Erlangs in the mixture, by deleting the component that has the lowest weight \beta_j assigned. Backward selection is preferred over forward selection, since otherwise it is unclear which shape parameters or weight distribution to use for the expanded model.
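The two criteria are simple to compute once the maximized log-likelihood, the number of estimated parameters and the sample size are known; counting 7 shapes, 6 free weights and the common scale of the 7-component Erlang fit reported later in table 4.2 as df = 14 (an assumption about how parameters are counted) reproduces the AIC and BIC values given there.

```r
# AIC and BIC as defined above: loglik is the maximized log-likelihood,
# df the number of estimated parameters, n the number of observations.
info_criteria <- function(loglik, df, n) {
  c(AIC = 2 * df - 2 * loglik,
    BIC = log(n) * df - 2 * loglik)
}

# 7-component Erlang fit on the Danish fire data (table 4.2): NLL = 4105.278, n = 2492,
# df = 14 assuming shapes, free weights and the common scale are all counted.
info_criteria(loglik = -4105.278, df = 14, n = 2492)
```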

3.3 Other truncated light- and heavy-tailed distributions

For the remaining mixture distributions described in section 2.2, the R package FlexMix, first introduced by Leisch (2004), can be used for untruncated data. The FlexMix package is written in such a way that it generalizes the already available R environments for mixture models such as fpc for linear models (Hening, 2000) and mmlcr for mixed latent class regression (Buyske, 2003). However, FlexMix does not allow for truncation and we have thus built a manual implementation of the EM-algorithm which allows for truncation. These implementations are inspired by the R implementation by Verbelen et al. (2015) for Erlang mixtures. Again, we will consider the complete data vector Y and a parameter vector Θ consisting of weights α and the distribution-specific parameter vectors \theta_j for j ∈ (1, ..., M). Note that there are multiple different parameter vectors \theta_j, one for each of the mixture components, rather than common parameters as is the case in the mixture of Erlangs. This increases flexibility of the mixture distribution and prevents the need for additional updating schemes for fixed parameters, since there are none. However, increased flexibility comes at a price, as keeping all parameters free does mean that the computation times will increase significantly in comparison to the Erlang case.

3.3.1 Initial Step

The first step is again to determine an initial value for all the mixtures that are considered. In Miljkovic and Grün (2016) three initialization strategies are proposed. These three strategies partition the initial data X into M clusters X_j for j = 1, ..., M:

1. Euclidean distance-based, which minimizes the distance between the observations and M randomly selected cluster centers.

2. K-means, which obtains a partition that optimizes the means criterion. The k-means algorithm was introduced by Forgy (1965) and MacQueen (1967), and is an iterative algorithm. It aims to minimize the squared Euclidean distance between the observations in each cluster and their cluster means. It then reassigns each observation to the cluster with the nearest mean and recalculates the cluster means.


3. Random clustering, which simply assigns a number from the set {1, ..., M } with replacement to each observation.

Note that these strategies only determine the initial mixing weights \alpha^{(0)}, which are the relative frequencies of the cluster sizes normalised to one. Transformation of the weights \alpha_j into \beta_j as described in (2.7) takes truncation into account. The initial density parameters \theta_j^{(0)} for j = 1, ..., M are obtained by taking the maximum likelihood estimators with respect to X_j:

\theta_j^{(0)} = \underset{\theta_j}{\operatorname{argmax}}\; L(\theta_j|X_j).   (3.25)

These initial estimates are the maximum likelihood estimators for non-mixture distributions, which can be readily calculated using closed-form solutions if available (depending on the distribution), or a Newton-type algorithm.
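The k-means strategy can be sketched with base R as follows; the lognormal starting values per cluster are an illustrative choice of component family.

```r
# K-means initialization (strategy 2): partition the data into M clusters and use
# per-cluster parameter estimates as starting values. Lognormal components are illustrative.
init_kmeans <- function(x, M) {
  cl <- kmeans(x, centers = M)$cluster
  alpha0 <- as.numeric(table(cl)) / length(x)        # relative cluster frequencies
  theta0 <- lapply(split(x, cl), function(xj)        # per-cluster lognormal starting values
    c(meanlog = mean(log(xj)), sdlog = sd(log(xj))))
  list(alpha = alpha0, theta = theta0)
}
```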

3.3.2 M-step solutions

Following the approach in Miljkovic and Grün (2016), new estimates for the M different parameter vectors are obtained by solving a weighted maximum likelihood estimation problem for each of the component distributions, where the weights are the posterior probabilities found in the E-step:

\theta_j^{(k)} = \underset{\theta_j}{\operatorname{argmax}}\; Q(\Theta|\Theta^{(k-1)}) = \underset{\theta_j}{\operatorname{argmax}} \sum_{i\in X} z_{ij}^{(k)} \log\!\big( f(x_i|t_l, t_u, \theta_j) \big) \quad \text{for } j = 1, ..., M.   (3.26)

When there is no truncation, hence t_l = 0 and t_u = ∞, there may be closed-form solutions for some of the distribution parameters, as is shown in section 2.2 of Miljkovic and Grün (2016). However, when t_l > 0 or t_u < ∞, (3.26) can be rewritten using (2.8):

\theta_j^{(k)} = \underset{\theta_j}{\operatorname{argmax}} \sum_{i\in X} z_{ij}^{(k)} \log\!\left( \frac{f(x_i; \theta_j)}{F(t_u; \theta_j) - F(t_l; \theta_j)} \right) = \underset{\theta_j}{\operatorname{argmax}} \sum_{i\in X} z_{ij}^{(k)} \Big[ \log\!\big( f(x_i; \theta_j) \big) \Big] - z_{ij}^{(k)} \Big[ \log\!\big( F(t_u; \theta_j) - F(t_l; \theta_j) \big) \Big].   (3.27)

Since the last term contains the difference of the cdfs inside the logarithm, there is no closed-form solution available for any of the densities considered. Hence we solve this numerically with R's optim routine, using the estimates from the previous iteration as a starting point. In appendix A we have given one implementation for the case of a Burr mixture distribution, which is based on the publicly available R implementation for Erlang mixtures by Verbelen et al. (2015). The method can be extended to the various other distributions considered using widely available R extensions such as the actuar package first introduced by Dutang et al. (2008).
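For a single component, the weighted truncated maximum likelihood problem (3.27) can be handed to optim along the following lines; the lognormal component family and the function name are illustrative, and in practice the previous iterate is used as the starting value as described above.

```r
# M-step for one component (3.27): weighted ML estimation of a truncated lognormal
# (illustrative family). x: observations, z: posterior weights for this component.
mstep_lnorm <- function(x, z, start, tl = 0, tu = Inf) {
  nll <- function(par) {
    mu <- par[1]; sigma <- exp(par[2])          # log-parametrization keeps sigma positive
    mass <- plnorm(tu, mu, sigma) - plnorm(tl, mu, sigma)
    -sum(z * (dlnorm(x, mu, sigma, log = TRUE) - log(mass)))
  }
  fit <- optim(c(start[1], log(start[2])), nll)
  c(meanlog = fit$par[1], sdlog = exp(fit$par[2]))
}
```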

3.3.3 Adjusting and Tuning

As described before, we can end up at a local maximum, which may be dependent on the initial value. Therefore, we execute the EM algorithm multiple times with all three initialization strategies to find the best fit. To prevent overfitting we again resort to the AIC and BIC. However, we do not need to use a backwards approach as with the Erlang mixtures, since we do not assume any parameters to be fixed in the EM-algorithm. This means we start off by evaluating the EM-algorithm using only one component, and increase M by one, until the AIC and BIC show signs of overfitting when adding an additional component.

Chapter 4

Case studies

In this chapter we first introduce two risk measures. Then we conduct two case studies on the datasets described in the introduction. We fit the data using the EM algorithm described in chapter 3, both with Erlang mixtures and with the other distributions described in section 2.2, and assess several model selection tools in order to judge the results.

4.1 Risk Measures for mixture distributions

Various risk measures have been developed in order to quantify the exposure to risk of an insurer. Two of those are the Value-at-Risk (VaR) and Tail-Value-at-Risk (TVaR). The Value-at-Risk is defined as a quantile of the distribution of the underlying losses:

VaR_p = F^{-1}(p|\Theta),   (4.1)

where p is the chosen quantile. The Tail-Value-at-Risk is closely related to the Value-at-Risk but takes into account the values in the right tail rather than looking only at the value corresponding to p. It is defined as

TVaR_p = E(X \mid X > VaR_p) = VaR_p + \frac{1}{1-p} \int_{VaR_p}^{\infty} \big( 1 - F(s|\Theta) \big)\, ds.   (4.2)

When the underlying distribution F(\cdot|\Theta) is a mixture distribution, we cannot find a closed-form solution for the VaR or TVaR. However, we can resort to Newton-type algorithms to obtain results with high precision.
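Numerically, both risk measures can be obtained from the fitted mixture cdf with root finding and integration, as in the sketch below; F_mix is a hypothetical user-supplied, vectorized cdf of the fitted mixture, and the search interval is an assumption.

```r
# VaR and TVaR for a fitted mixture, given its cdf F_mix (hypothetical user-supplied function).
var_mix <- function(F_mix, p, upper = 1e9) {
  uniroot(function(x) F_mix(x) - p, lower = 0, upper = upper)$root   # VaR_p = F^{-1}(p)
}

tvar_mix <- function(F_mix, p, upper = 1e9) {
  v <- var_mix(F_mix, p, upper)
  v + integrate(function(s) 1 - F_mix(s), lower = v, upper = Inf)$value / (1 - p)
}
```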

4.2 Danish Fire losses

The Danish fire insurance set, introduced in chapter 1, is the first dataset we will fit our mixture distributions on. There are n = 2492 non-truncated observations, hence we can set tl = 0 and tu = ∞. Some summary statistics are given in table 4.1. The high

positive skewness indicates a heavy right tail in the distribution.

First we fit a mixture of Erlang distributions to the data, using various spread factors s and an initial number of 20 Erlangs. We stop the EM algorithm when the increase in likelihood is smaller than ε = 0.001. We loop through spread factors s from 1 to 10 in steps of 0.01 to find the best possible mixture model using Erlangs. As described in section 3.2, the spread factors allow for a wider spread of the initial shape parameters from which the weights are determined. We find that the best model with the BIC as a selection criterion has a spread factor of s = 9.24. The reason why we use the more conservative BIC as a model criterion is that it penalizes more heavily for model complexity than the AIC. This is often preferred in the literature, see e.g. Fraley

Minimum          0.313404
First quartile   1.157184
Mean             3.062699
Median           1.633858
Third quartile   2.645484
Maximum          263.2504
Skewness         19.88415

Table 4.1: Summary statistics of the Danish fire data.

and Raftery (2002). The best Erlang mixture model is summarized in table 4.2. We immediately observe the very skewed distribution of the weights α. The majority of the data is described by two Erlangs with weights of 80.7% and 13.3%. Three of the component weights have a value below 1.0%, with high shape parameters. The large differences in shape parameters, combined with a common scale parameter, could indicate limited flexibility when dealing with heavy-tailed data.

αj            rj
0.807357022   6.2
0.133424489   17.4
0.030236281   37.6
0.017516855   66.0
0.007438830   108.2
0.002822671   199.4
0.001203852   694.0

θ = 0.2590372, AIC = 8238.557, BIC = 8320.049, NLL = 4105.278

Table 4.2: Best fit using the mixture of Erlang distributions with a spread factor s = 7.20. NLL indicates the negative log-likelihood.

Next we fit the other mixtures of light- and heavy-tailed distributions using all three initialization methods, with the BIC as a model criterion. We used each of the initialization methods 25 times. We run the algorithm starting from a one-component mixture, every time increasing the number of components by one until the BIC does not decrease anymore. The best models for all densities are given in table 4.3. These results are in line with the estimates in table 1 of Miljkovic and Grün (2016). The best mixture with respect to the BIC is the 2-component Burr mixture, of which the parameters are given in table 4.4.

Mixture   Burr      Inv-Burr  Gamma     Log-normal  Inv-Gaussian  Weibull
M         2         3         5         5           5             7
BIC       7627.713  7647.386  7725.342  7667.419    7678.105      7779.905
AIC       7586.967  7583.357  7643.851  7585.927    7596.613      7663.488
NLL       3786.483  3780.678  3807.925  3778.964    3784.307      3811.744

Table 4.3: Best mixture models of other heavy- and light-tailed distributions and the corresponding information criteria fitted to the Danish fire losses. Bold indicates the best model.

αj          λj           γj          θj
0.6915052   0.02586789   48.766388   0.8577909
0.3084948   0.27742547   6.592776    1.2957470

Table 4.4: Parameter estimates of the best fitting 2-component Burr mixture for the Danish fire losses.


[Figure 4.1 shows the fitted density functions against the observed relative frequencies of the Danish fire data in four panels: (a) Erlang mixture, full domain; (b) Erlang mixture, zoomed on the body; (c) Burr mixture, full domain; (d) Burr mixture, zoomed on the body.]

Figure 4.1: Graphical representation of the 2-component mixture of Burrs and the histogram of the data, for the full range and zoomed on the x-axis.

We immediately observe the difference in likelihood and the information criteria, in favour of the Burr mixture. Even though the mixture of Erlangs contains more parameters and mixing components than the mixture of two Burrs, its likelihood is significantly lower. To investigate the difference more closely, a graphical comparison of the fitted mixtures and the histogram of the data is given in figure 4.1. Here we both show the full domain of the data, and we zoom in on the x-axis to observe the fit in the body of the mixtures. By body of the distribution we refer to all observations that are smaller than the 90th percentile of the full dataset. From the full density graph it is rather hard to observe the fit; however, it is clear from figure 4.1b that the mixture of Erlangs is not able to obtain a good fit for the body of the data. Especially the light left tail is overestimated. Since the first and second components try to model the majority of the observations using mainly one Erlang, the mixture is not able to fully capture the characteristics of the data. In figure 4.1d we do observe a decent fit in the whole body of the distribution for the Burr mixture, even though there are only 2 mixing components. This shows that the increase in flexibility is substantial.

To better investigate the fit of the tails we look at the QQ-plots in figure 4.2a-d. For the Erlang mixture we observe a better fit in the tail, since there are only three major deviations from the 45-degree line. The Burr mixture finds a somewhat weaker fit in the tail, since there are more deviations from the 45-degree line. The difference in tail fit arises from the minimum-weight components of the Erlang mixture, which extrapolate a fit in the tail. The simulation example using generalized Pareto distributions in Verbelen et al. (2015) shows similar behaviour.

[Figure 4.2 contains QQ-plots of the fitted model quantiles against the observed Danish fire data for (a) the 7-component Erlang mixture, (b) the 2-component Burr mixture, (c) the 3-component inverse Burr mixture and (d) the 5-component log-normal mixture.]

Figure 4.2: Graphical representation of the mixture model quantiles against the empirical quantiles of the Danish fire losses.

However, for comparison, the best inverse-Burr and lognormal mixtures from table 4.3 contain more components while still being better models than the Erlang mixture in terms of the BIC and log-likelihood. We observe that the inclusion of more mixing components certainly gives no worse fit in the tails than the Erlang mixture. On the contrary, in the Erlang mixture we observe underestimation in the tails, whereas the inverse-Burr and log-normal display overestimation. From the perspective of a risk manager it might be wiser to overestimate these extreme events than to underestimate them.

Model                VaR_0.95  TVaR_0.95  VaR_0.995  TVaR_0.995
Empirical estimate   8.41      22.16      32.43      79.24
7-M Erlang           8.88      21.74      31.07      78.71
2-M Burr             8.28      35.43      45.86      211.16
3-M Inv-Burr         8.51      22.04      35.53      88.20
5-M Gamma            8.18      22.44      41.53      55.99
5-M Log-normal       8.46      20.92      38.00      62.82
5-M Inv-Gaussian     8.35      22.30      43.12      67.20
7-M Weibull          8.75      22.28      32.90      81.56

Table 4.5: Summary of risk measures VaR_p and TVaR_p.

The results for the risk measures are summarized in table 4.5. For a quantile level of p = 0.95 the VaR_p and TVaR_p mixture estimates are relatively equal to the empirical estimates. The Gamma, log-normal and inverse-Gaussian mixtures show relatively low VaR_p compared to the empirical estimate. The fact that the Erlang VaR and TVaR lie very close to the empirical values is caused by the fact that there are multiple components with small weights that coincide with the tail observations. Notable is also the severe difference in TVaR for the Burr mixture, which indicates a rather fat tail in the fit in comparison to the other models; this might be model-dependent.

4.3 Secura Re Car Insurance

The Secura Re car insurance dataset obtained from Beirlant et al. (2004) contains n = 371 truncated automobile claims. The claims are only reported if the size is at least 1 200 000 euro, hence we set t_l = 1 200 000 and t_u = ∞. In Verbelen et al. (2015) this dataset has been used to fit a mixture of Erlangs, and in addition they show the pricing of a reinsurance contract based on the model. Here we will first replicate those results, and conduct a similar analysis for the presumably more flexible approach with the other light- and heavy-tailed mixtures. Again the BIC is used as a selection criterion for all the models. Some summary statistics of the Secura Re data are given in table 4.6. Notice the difference in skewness compared to the Danish fire losses. Recalling the histogram in chapter 1, and the summary statistics below, there is also no sign of multimodality in this dataset.

Minimum          1 208 123
First quartile   1 572 880
Mean             2 230 667
Median           1 944 358
Third quartile   2 609 292
Maximum          7 898 639
Skewness         2.421635

Table 4.6: Summary statistics of the Secura Re data.

Fitting a mixture of Erlangs yields an optimal model with 2 components. We again fitted the model starting from 20 initial components and spread factors ranging from 1 to 10. The optimal model is given in table 4.7.

αj           rj
0.96904674   5.0
0.03095326   15.5

θ = 359554, AIC = 11007.9, BIC = 11023.57, NLL = 5499.95

Table 4.7: Best fit using the mixture of Erlang distributions with a spread factor s = 1.5.

We observe that one Erlang component explains the majority of the data, with an additional component of very low weight to extract a reasonable fit in the tail of the distribution. The density plot compared to the data, as well as the QQ-plot shown in figure 4.3, show a good fit in both the body and the tail of the distribution.

Fitting all other six light- and heavy-tailed distributions yields remarkable results for the Secura Re dataset. None of the flexible mixture models improve further after adding a mixture component starting from M = 1, as can be seen in table 4.8. This means that a mixture model is certainly not improving the fit of this particular insurance loss dataset. However, one must note that in comparison with the Danish fire data, there are not as many extreme outliers, and the skewness is not as high. Also, there is no clear multimodality visible in the data. The inverse-Burr mixture provides the best fit for the Secura Re claims, and is also slightly better in terms of the BIC than the mixture of 2 Erlangs.

[Figure 4.3 shows (a) the fitted truncated density function against the observed relative frequencies of the Secura Re claims and (b) a QQ-plot of the fitted quantiles against the empirical quantiles.]

Figure 4.3: Fitted 2-component Erlang mixture compared to the Secura Re data

This is because the inverse-Burr is a heavy-tailed distribution and the Erlang is not; hence it is able to better approximate the data in the tail. This evidence would suggest that it is certainly not preferable to use an Erlang mixture or even any other mixture model, as it only increases complexity where it is not necessary. It is clear that single parametric distributions can provide a good enough fit for the Secura Re dataset. The actuary should resort to more specific approaches widely available in the literature on single parametric distributions, as already indicated in chapter 1. Investigating the best possible fit using a single-distribution approach is outside the scope of this master thesis.

Mixture   Burr      Inv-Burr  Gamma     Log-normal  Inv-Gaussian  Weibull
M         1         1         1         1           1             1
BIC       11100.64  10859.88  11024.78  11018.37    11020.58      11026.18
AIC       11088.89  10848.13  11016.95  11010.54    11012.75      11018.35
NLL       5541.444  5421.065  5506.476  5503.268    5504.376      5507.173

Table 4.8: Best mixture models of other heavy- and light-tailed distributions fitted to the Secura Re data.

Chapter 5

Simulation Study

In order to more carefully assess the behaviour of the models with respect to heavily positively skewed data such as the Danish fire losses, we conduct a simulation study. In Verbelen et al. (2015) generalized Pareto data are simulated to assess the behaviour in the tail of the mixtures. It is concluded that there are limitations to the Erlang mixture, since it will assign a lot of components with very small weights to coincide with observations in the right tail. Here we will conduct a similar analysis, but also on the other light- and heavy-tailed mixtures introduced in section 2.2. It is claimed in Miljkovic and Grün (2016) that these distributions provide a decent fit on the whole distribution including its tail; however, these conclusions are drawn from a single dataset and it is therefore wise to further inspect their behaviour when dealing with heavily skewed data. In addition, we also simulate some bimodal data to study the flexibility of both approaches under heavy-tailed data.

[Figure 5.1 shows the generalized Pareto density in two panels: (a) ξ = 0.5, µ = 1 with σ ∈ {0.5, 1, 3, 5}, and (b) σ = 1, µ = 1 with ξ ∈ {0.5, 1, 1.5, 2}.]

Figure 5.1: Generalized Pareto density for various parameters

5.1 GPD-simulation

The generalized Pareto distribution (GPD) is a generalized distribution with three parameters ξ > 0, σ > 0 and µ > 0. The pdf is given by

f(x; \mu, \sigma, \xi) = \frac{1}{\sigma} \left( 1 + \frac{\xi (x - \mu)}{\sigma} \right)^{-1/\xi - 1},   (5.1)

where x > 0. The GPD is generalized in the sense that it can be reduced to three separate known distributions for certain parameters. Namely, if ξ and µ are zero, the pdf reduces to the exponential distribution. If ξ > 0 and µ = σ/ξ then the GPD is equal to a Pareto distribution. If ξ < 0 then the GPD is a Pareto type II distribution. The parameter ξ of the GPD can be seen as a measure of how heavy the tail of the distribution is. Here only the case ξ > 0 is of interest, to study the behaviour with respect to heavy-tailed data. The higher the parameter ξ, the more skewed the distribution. Figure 5.1 shows the behaviour of the density with respect to the parameter ξ. Notice that the location parameter can be seen as a truncation parameter t_l.
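For the simulations below, the GPD of (5.1) can be evaluated and sampled by inverting its cdf; the following minimal R sketch (function names are illustrative and assume ξ > 0) avoids relying on any extreme-value package.

```r
# Generalized Pareto density (5.1) and inverse-transform sampler (assuming xi > 0).
dgpd <- function(x, mu, sigma, xi) {
  (1 / sigma) * (1 + xi * (x - mu) / sigma)^(-1 / xi - 1)
}

rgpd <- function(n, mu, sigma, xi) {
  u <- runif(n)
  mu + sigma * ((1 - u)^(-xi) - 1) / xi      # quantile function of the GPD
}

set.seed(1)
x1 <- rgpd(500, mu = 1, sigma = 1, xi = 0.5)   # unimodal sample as in section 5.1.1
```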

5.1.1 Unimodal GPD data

We start off by generating two datasets of n = 500 observations, with µ = 1, σ = 1 and ξ ∈ {0.5, 1}. For the Erlang mixture we follow the approach in example 5.4 of Verbelen et al. (2015) and try to find an optimal fit starting from all combinations of s ∈ (1, ..., 10) and M = max(x)/i for i ∈ (1, ..., 10). For all other mixtures we again loop through all the initialization methods in order to find maxima. The best fitted mixtures are given in table 5.1.

              ξ = 0.5                       ξ = 1
              M   NLL        BIC            M   NLL        BIC
Erlang        4   729.4792   1508.675       9   970.6783   2053.22
Burr          1   733.3418   1510.186       1   982.7595   1984.163
Inverse-Burr  2   734.4232   1512.349       2   987.5498   1998.251
Gamma         2   733.4018   1497.877       2   978.7216   2010.214
Log-normal    2   733.4404   1497.954       2   981.2498   1993.573
Inv-Gaussian  2   731.8991   1494.871       2   980.0977   1991.269
Weibull       1   739.6943   1499.818       4   962.2009   1992.762

Table 5.1: Best fitted mixture models for two GPD samples with µ = 1 and σ = 1.

For the Erlang mixture, we observe similar behaviour as in the GPD example by Verbelen et al. (2015): a lot of Erlang components with small weights that coincide with the extreme observations, in both cases for ξ. The other distributions provide better fits in terms of the BIC than the Erlang distribution. This is no surprise, as the Erlang is not a heavy-tailed distribution itself. The 2-component inverse-Gaussian for ξ = 0.5 and the 1-component Burr for ξ = 1 provide good fits for the body of the distribution, but are not able to extrapolate the heavy tails through additional components. Although likelihoods do increase with additional components, the trade-off in terms of model complexity is too high. The BIC as a model criterion does not allow the algorithm to add more components with minimum weights in the tail. We again need to stress the fact that the Erlang mixture deals with a common scale parameter and is therefore not as heavily penalized for an increased number of components as the other mixture distributions. In the QQ-plots in figure 5.3 we can see the severe underestimation of the tail values for the inverse-Gaussian and Burr mixtures, whereas the Erlang has presumably found a better fit for the tail, since there are fewer deviations from the 45-degree line. However, this is only achieved by coinciding with heavy-tailed observations in the tail, as discussed earlier. In general, considering heavily skewed unimodal data, the mixture models provide no preferred modeling alternative to single-distribution approaches or composite modeling approaches already widely available in the literature, as discussed in chapter 1, since they result in at most two mixture components which are not able to capture a fit as good as an extreme value distribution such as the GPD itself. The simulation here is somewhat stylized, since we know the underlying distribution of the data.


[Figure 5.3 contains QQ-plots of the fitted model quantiles against the simulated GPD data: (a) best fitted inverse-Gaussian for ξ = 0.5, (b) best fitted Burr for ξ = 1, (c) best fitted Erlang for ξ = 0.5 and (d) best fitted Erlang for ξ = 1.]

Figure 5.3: QQ-plots of the best fitted Erlang mixtures and best other mixture models for two GPD samples with µ = 1 and σ = 1.

[Figure 5.2 shows histograms of the simulated bimodal GPD data: (a) the full data and (b) the body of the data.]

Figure 5.2: Histograms of the bimodal simulated GPD data

5.1.2 Bimodal GPD data

Since the use of mixture models greatly increase flexibility in comparison to single distribution or composite modeling approaches, we further investigate the behaviour of our two strategies by simulating bimodal data with heavy skewness. In the Danish Fire losses case we have seen that the Erlang mixture was not able to obtain a good fit in

(31)

30 Florian Maassen — Modeling insurance loss data

the body compared to the other light- and heavy-tailed distributions. However, this was caused by a few observations in the left tail with small values compared to the majority of the data. To further inspect whether a mixture of Erlangs is capable of finding a good fit for multimodal and heavily skewed data, we conduct a simulation of bimodal data. We again make use of the generalized Pareto distribution, but construct a sample of the following form:

f_b(x) = 0.5 f(x; µ = 1, σ = 1, ξ = 0.5) + 0.5 f(x; µ = 1.75, σ = 0.75, ξ = 0.5).    (5.2)
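A minimal sketch of how a sample from f_b can be drawn is given below; it samples the GPD components via the inverse cdf, x = µ + σ/ξ ((1 − u)^(−ξ) − 1), and is an illustrative reconstruction rather than the thesis' actual simulation code.

# hedged sketch: drawing n = 500 observations from the bimodal GPD mixture f_b
rgpd_inv <- function(n, mu, sigma, xi) {
  u <- runif(n)
  mu + sigma / xi * ((1 - u)^(-xi) - 1)   # GPD inverse cdf
}
set.seed(1)                               # arbitrary seed, for reproducibility only
n    <- 500
comp <- rbinom(n, 1, 0.5)                 # component indicator with weights 0.5/0.5
x    <- ifelse(comp == 1,
               rgpd_inv(n, mu = 1,    sigma = 1,    xi = 0.5),
               rgpd_inv(n, mu = 1.75, sigma = 0.75, xi = 0.5))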

Next we simulate a sample of n = 500 observations from the bimodal Pareto distribution f_b. In figure 5.2 histograms of the data are given, of both the full set and the body of the data. We clearly observe the bimodality of the distribution in the histogram of the body. After fitting the Erlangs and all 6 other distributions we compare the best fit for the Erlangs with the best fit of the other distributions, which is a 2-component Burr. Optimal estimates of the two models are given in table 5.2.

  Erlang (BIC: 1603.633, NLL: 776.9581), common scale θ = 0.5449263
  Component    1              2              3              4
  r            4              12             30             137
  α            0.930687946    0.057007681    0.008730312    0.003574061

  Burr (BIC: 1568.987, NLL: 762.7425)
  Component    1              2
  λ            0.003387       0.001984
  γ            561.148        0.89013
  θ            1.743585       4937.5114
  α            0.2770925      0.7229075

Table 5.2: Parameter estimates for the best fitted Erlang and Burr model on the bimodal GPD-data.

We again observe very low-weight components in the Erlang mixture, as seen in the unimodal case. As with the Danish Fire losses, the Erlang is not capable of finding a suitable fit in the body of the distribution, as it assigns only one component with a weight of 0.93 to the majority of the data, while the data is clearly bimodal. On the other hand, the Burr mixture is flexible enough to find a good fit in the body of the distribution. This again indicates that the Erlang distribution is not a suitable choice when modeling multimodal and heavy-tailed data, and one could rather resort to a more flexible approach in which all distribution parameters are free. Although suitable fits for the body can be found using flexible mixture models, we must stress that for the tails of the distribution underestimation is again present, as already seen in the unimodal case. However, it does show the applicability to multimodal data.
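As an illustration of how the table 5.2 estimates are used, the fitted 2-component Burr mixture density can be evaluated with the pdf() function of Appendix A; the snippet below is a sketch that assumes the actuar package (for dburr/pburr) and ignores any truncation bounds applied during fitting.

library(actuar)                        # supplies dburr()/pburr() used inside pdf()
lambda <- c(0.003387, 0.001984)        # Burr shape1 estimates from table 5.2
gamma  <- c(561.148, 0.89013)          # Burr shape2 estimates
theta  <- c(1.743585, 4937.5114)       # Burr scale estimates
alpha  <- c(0.2770925, 0.7229075)      # mixing weights
pdf(c(0.5, 1, 2, 5), lambda, gamma, theta, alpha)   # mixture density at a few claim sizes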


[Figure 5.4 panels — fitted truncated density functions against the observed relative frequencies of the bimodal GPD-data (claim size versus density): (a) 4-component Erlang, (b) 2-component Burr.]

Figure 5.4: Graphical comparison of the body of the best fitted Erlang and Burr mixtures on the bimodal GPD-data.


Chapter 6

Conclusion

This master thesis is devoted to investigating the behaviour of various mixture models on insurance loss data. Firstly, a description is given of how mixture models are constructed for truncated and non-truncated univariate distributions. We introduce the mixture distributions considered in this thesis, namely the Erlang distribution and various other light- and heavy-tailed distributions. This list is inspired by Verbelen et al. (2015) and Miljkovic and Grün (2016).

Secondly, we investigate the use of the expectation maximization algorithm and its application to mixture models. We compare the implications for the M-step solutions for both the Erlang mixtures and the more flexible approach with various other light- and heavy-tailed distributions. Two direct consequences of the two different approaches are the computational simplicity and the flexibility. On the one hand, the Erlang mixture approach is computationally faster and simpler, as it is executed under the assumption of a common scale parameter; further tuning is done based on the initial step and the reduction of mixture components from there on. On the other hand, the approach introduced by Miljkovic and Grün (2016) is much more flexible, as it does not impose restrictions on the parameters. This does mean that the computations in the M-step are more complex, due to non-closed form solutions for the weighted maximum likelihood estimates.

Thirdly, we conduct a case study on the Danish Fire losses dataset and the Secura Re insurance losses. For the Danish Fire losses, we observe a limited fit in the body of the distribution for the Erlang mixture. We also observe that, due to the common scale parameter, the Erlang mixture is not punished as heavily for adding additional components to obtain a better likelihood. However, as seen in the examples and the simulated data, this does not mean it is able to obtain better fits, as the algorithm tries to coincide with the data instead of extracting a better fit in general. In the Secura Re case, we conclude that when no significant multimodality is present in the data the use of mixture models is rather cumbersome. Given the complexity of the mixture distributions and time constraints, it is not straightforward to conduct goodness-of-fit tests such as the Kolmogorov-Smirnov test; this might be of interest in further research. Finally, the study using simulated heavy-tailed GPD-data shows that the more flexible model using only free parameters is able to obtain a better fit in the body than the Erlang mixtures. However, the EM algorithm is not able to find additional components that can extrapolate the heaviness in the tails, since additional components are not beneficial in terms of the BIC. The whole procedure, for both the Erlang and the flexible mixture models, can be quite time consuming due to the reliance on Newton-type algorithms in the M-steps. Therefore, mixture modeling should only be used when clear multimodality is visible. The outcomes of risk measures


such as the VaR and TVaR often depend on the tail components of these mixtures, and since these components are of low weight and heavily dependent on the tail observations on which the mixture is fitted, the actuary should be rather careful in drawing conclusions from the fits obtained.


Appendix A: R implementation of the flexible EM-algorithm

## The functions below rely on dburr() and pburr() from the actuar package,
## which is assumed to be loaded.
library(actuar)

## pdf of the (possibly truncated) Burr mixture distribution
## note: this masks the base graphics device function pdf()
pdf <- function(x, lambda, gamma, theta, alpha, trunclower = 0, truncupper = Inf, log = FALSE) {
  # matrix of component densities: one row per observation, one column per component
  f <- outer(x[1], lambda, dburr, gamma, scale = theta)
  if (length(x) > 1) {
    for (i in 2:length(x)) {
      f <- rbind(f, outer(x[i], lambda, dburr, gamma, scale = theta))
    }
  }
  d <- rowSums(t(t(f) * alpha))   # weighted sum over the mixture components
  if (!(trunclower == 0 & truncupper == Inf)) {
    # rescale the density to the truncated support
    d <- d / (cdf(truncupper, lambda, gamma, theta, alpha) -
              cdf(trunclower, lambda, gamma, theta, alpha)) *
         ((trunclower <= x) & (x <= truncupper))
  }
  if (log) {
    d <- log(d)
  }
  d
}

## cdf of the (possibly truncated) Burr mixture distribution
cdf <- function(x, lambda, gamma, theta, alpha, trunclower = 0, truncupper = Inf,
                lower.tail = TRUE, log.p = FALSE) {
  Fcomp <- outer(x[1], lambda, pburr, gamma, scale = theta)
  if (length(x) > 1) {
    for (i in 2:length(x)) {
      Fcomp <- rbind(Fcomp, outer(x[i], lambda, pburr, gamma, scale = theta))
    }
  }
  p <- rowSums(t(t(Fcomp) * alpha))
  if (!(trunclower == 0 & truncupper == Inf)) {
    l <- cdf(trunclower, lambda, gamma, theta, alpha)
    u <- cdf(truncupper, lambda, gamma, theta, alpha)
    p <- ((p - l) / (u - l))^(x <= truncupper) * (trunclower <= x)
  }
  if (!lower.tail) {
    p <- 1 - p
  }
  if (log.p) {
    p <- log(p)
  }
  p
}

## Help function to construct a matrix containing the mixture component densities
constr_densities <- function(x, lambda, gamma, theta, trunclower = 0, truncupper = Inf) {
  f <- outer(x[1], lambda, dburr, gamma, scale = theta)
  if (length(x) > 1) {
    for (i in 2:length(x)) {
      f <- rbind(f, outer(x[i], lambda, dburr, gamma, scale = theta))
    }
  }
  f
}

## Initial cluster generation
initialize_weights <- function(x, K, type = 1) {
  cluster <- rep(NA, length(x))
  if (type == 1) {
    # Euclidean distance based initialization
    centers1 <- sample(x, K, replace = FALSE)   # construct random cluster centers
    temp_min <- rep(10e9, length(x))            # help vector of current minimum distances
    for (j in 1:K) {
      for (i in 1:length(x)) {
        d_ij <- dist(rbind(x[i], centers1[j]))
        if (temp_min[i] > d_ij) {
          cluster[i] <- j
          temp_min[i] <- d_ij
        }
      }
    }
  }
  if (type == 2) {
    # K-means based initialization
    mydata <- as.matrix(x)
    fit <- kmeans(mydata, K)
    mydata <- data.frame(mydata, fit$cluster)
    cluster[] <- mydata[, 2]
  }
  if (type == 3) {
    # Random based initialization
    pr <- rep(1 / K, K)
    cluster[] <- sample(seq(1, K, 1), length(x), TRUE, pr)
  }
  return(cluster)
}

## Initial step parameters
initial <- function(x, trunclower = 0, truncupper = Inf, M, initmeth = 1) {
  cluster <- initialize_weights(x, M, initmeth)
  alpha <- as.data.frame(table(cluster))[, 2] / length(x)   # initial mixing weights
  lambda <- rep(NA, M)
  gamma <- rep(NA, M)
  theta <- rep(NA, M)
  for (i in 1:M) {
    # Burr MLE on the full sample, starting from (1, 1); the component
    # parameters are then scaled by the cluster weight alpha[i]
    MLE <- optim(c(1, 1), burr_MLE, x = x, method = "L-BFGS-B",
                 lower = c(0, 0), upper = c(Inf, Inf))
    gamma[i] <- MLE$par[1] * alpha[i]
    theta[i] <- MLE$par[2] * alpha[i]
    lambda[i] <- length(x) / sum(log(1 + (x / theta[i])^gamma[i])) * alpha[i]
  }
  tprobabilities <- pburr(truncupper, lambda, gamma, scale = theta) -
                    pburr(trunclower, lambda, gamma, scale = theta)
  beta <- alpha * tprobabilities / sum(alpha * tprobabilities)   # truncation-adjusted weights
  list(lambda = lambda, gamma = gamma, theta = theta, alpha = alpha, beta = beta)
}

## Negative log-likelihood for the initial Burr estimates, with shape1 profiled out
burr_MLE <- function(parms, x) {
  shape1 <- length(x) / sum(log(1 + (x / parms[2])^parms[1]))
  -sum(dburr(x, shape1 = shape1, shape2 = parms[1], scale = parms[2], log = TRUE))
}
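A hedged usage sketch of these building blocks on simulated Burr data; the sample and the parameter values are illustrative only.

set.seed(1)
x  <- rburr(100, shape1 = 2, shape2 = 1.5, scale = 3)        # illustrative loss sample
cl <- initialize_weights(x, K = 3, type = 2)                 # k-means based cluster labels
f  <- constr_densities(x, lambda = c(2, 2, 2), gamma = c(1, 1.5, 2), theta = c(1, 3, 5))
dim(f)                                                       # 100 x 3 matrix of component densities
pdf(x[1:5], lambda = c(2, 2, 2), gamma = c(1, 1.5, 2),
    theta = c(1, 3, 5), alpha = rep(1/3, 3))                 # mixture density at the first 5 points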
