Dependence Modelling of Automobile Claims Data through Copula-based Regression: using R-package `CopulaRegression'



Shadee van Vlaanderen

Master's Thesis to obtain the degree in Actuarial Science and Mathematical Finance

University of Amsterdam
Faculty of Economics and Business
Amsterdam School of Economics

Author: Shadee van Vlaanderen
Student nr: 10221018
Email: shadee1993@hotmail.com
Date: August 10, 2016
Supervisor: Dr. Sami Umut Can
Second reader: Dr. Katrien Antonio


This document is written by Shadee van Vlaanderen who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Copula-based Regression on Automobile Claims Data | Shadee van Vlaanderen iii

Abstract

In this thesis a bivariate copula-based regression method is applied to automobile claims data with the use of the R-package `CopulaRegression' (Kramer & Silvestrini, 2012). The copula-based regression method makes it possible to fit one model to two dependent variables conditional on their covariates. First, a Generalized Linear Model (GLM) is specified for both claim numbers and claim sizes, in order to describe their marginal distributions in terms of relevant covariates. Second, the dependency between claim numbers and claim sizes is modelled through a copula function. Finally, both the GLMs and the copula combining the two models are fitted through Maximum Likelihood Estimation. The fitted models are compared by Akaike's Information Criterion (AIC) to determine the best fitting model, and the resulting loss estimation is analysed.

Keywords Copula Regression, Dependence Modelling, Generalized Linear Models, Automobile Claims, Claim Sizes, Claim Numbers, Policy Loss


Preface

1 Introduction to Bivariate Copula-based Regression

2 Copula-based Regression Method
2.1 Bivariate Copula Model
2.2 Generalized Linear Models
2.2.1 Model for Marginal Distributions
2.2.2 Selection of Covariates
2.3 Copula-based Regression Model
2.4 Maximum Likelihood Estimation
2.4.1 Finding the Best Fitting Model
2.4.2 Best Fitting Model Parameters
2.5 Policy Loss Estimation

3 Application

4 Results and Analysis

5 Conclusion
5.1 Further Research

References

Appendix I


Preface

The making of this thesis was, to me, the closing of an era of studying pensions, insurance and risk management. I have acquired a lot of knowledge in those fields over the past years during my bachelor and master in actuarial sciences. I am glad to have acquired this knowledge and I know I will profit from it during my whole life. However, the next step in my life is to continue studying at the University of Amsterdam, focussing on teaching mathematics; that is where my pursuit of finding the occupation I enjoy has led me. It has been quite a lengthy journey here, but it was worth it. I would like to thank my family and close friends for supporting me in my choices and also for supporting me during my studies.

I would like to thank Dr. Sami Umut Can for his kind supervision and expert guidance throughout the making of my thesis. He has always been very helpful when advice or help was needed. I also thank him for providing me with the relevant articles for my thesis.

Lastly, I am proud and happy that I will finally be able to say that I have attained a Master's degree in Actuarial Science and Mathematical Finance.


Chapter 1 Introduction to Bivariate Copula-based Regression

For insurance companies it is important to get a clear image of future claims. Insurance companies anticipate future pay-outs by estimating the total loss and adjusting, for instance, their premium amounts to the right levels to compensate for these predicted losses. In uncertain times like these it is even more important for insurance companies to be able to offer clients consistency in pay-outs as well as in the amount of premium demanded for insurance coverage. It is therefore important to model future losses properly.

In this thesis the focus is on automobile insurance companies, where an individual policy loss equals the individual's number of claims multiplied by the average size of the individual's claims. The total of these individual policy losses gives the estimate for the future total loss. In earlier loss models it was often assumed that the number of claims and the size of claims for each policy were independent, as in the compound model by Lundberg (1903). Claim numbers and claim sizes would then be modelled separately. However, this assumption is not justified. If, as a result of a wrong assumption, the total loss is overestimated, premiums will be set too high. If, on the other hand, the total loss is underestimated, insurance companies cannot fulfil their pay-outs and might go bankrupt.

Kramer, Brechmann, Silvestrini and Czado (2013) have presented a method to apply copula theory in conjunction with Generalized Linear Models (GLMs) to insurance data to obtain a close fit for the loss distribution of insurance policies. In recent years, the application of copula theory to model dependencies amongst variables has been gaining ground. A copula function represents the dependence structure of a multivariate distribution with known marginal distributions. With the use of Sklar's Theorem (1959) any multivariate probability distribution can be characterized by a copula that takes the univariate marginal distribution functions of the involved variables as its arguments. See Nelsen (2006) and Durante & Sempi (2015), for example, for more information on the fundamental principles of copula theory.

Copula theory is successfully applied in risk management and actuarial fields for building models, but copulas also serve various purposes in other academic disciplines, for instance in medical research, civil engineering, weather research and random vector analysis. Over the past years copulas have proved their use.

GLMs, developed by Nelder and Wedderburn (1972), are the generalization of linear regression through ordinary least squares (OLS), where contrary to OLS an error distribution of the dependent variable other than the normal distribution is allowed. A GLM provides the ability to model the relationship between a dependent variable and its explanatory variables, and has been widely used in the car insurance industry and other actuarial applications. The GLM thus provides a way to model a dependent variable conditional on its covariates, where the distribution of the dependent variable is from the exponential family. This is often a plausible assumption for the marginal distributions of claim numbers and claim sizes.

Kramer and Silvestrini (2012) proposed to model claim numbers and claim sizes through GLMs, and the relationship between these quantities through copulas. They have also created an R-package `CopulaRegression' that contains functions that execute the algorithms presented in their method. In this thesis the method is applied to real car insurance data to see what result this method leads to when using the provided package. The goal is to find the best fitting model for the considered car insurance company's claim numbers and claim sizes over a certain time period.

The theory behind the application of the method to this data is discussed in Chapter 2. In Chapter 3 the data and programming tools used for implementing the method are outlined. In Chapter 4 the results of the implementation of the method are presented and discussed. Finally, in Chapter 5 a summary of the findings in this thesis is given along with some ideas for further research. That chapter also states the conclusion to the main question of this thesis: how copula-based regression is applied to bivariate claims data through the R-package `CopulaRegression' (Kramer & Silvestrini, 2012) and whether it delivers a best fitting model, while taking dependency and significant covariates into account.


Chapter 2 Copula-based Regression Method

Suppose we have multivariate claims data of a car insurance company for which we wish to model the claim numbers and claim sizes. These two variables each depend on several covariates. The idea presented by Kramer et al. (2013) is to first model the marginal distributions of claim numbers and claim sizes, each dependent on certain covariates, separately through Generalized Linear Models (GLMs). Afterwards the two models for the marginal distributions are combined through a copula function. With the use of copula theory we are able to create a joint model that allows for dependency between the two variables. The parameters of the model can then be estimated through Maximum Likelihood Estimation. Finally, the policy losses and thus the total loss can be estimated with the best fitting model. In the end, the results from the best fitting dependent model are compared with the results from the classical model where independence is assumed. In this chapter, the theory behind these steps is discussed. Since Kramer et al. (2013) have presented a ready-to-use R-package to implement this theory, the functions it provides are used where they are useful for this implementation. The functions and implementations concerned are mentioned when they are used.

2.1 Bivariate Copula Model

From Sklar (1959) it is known that any multivariate distribution can be expressed in terms of a copula with the univariate marginal distribution functions as its arguments:

$$F(x_1, x_2, \ldots, x_p) = C(F_1(x_1), F_2(x_2), \ldots, F_p(x_p)), \tag{2.1}$$

where $C$ is a copula and $F_1, F_2, \ldots, F_p$ are the marginal distribution functions of the variables $X_1, X_2, \ldots, X_p$ respectively. If we translate this theorem to the bivariate case of claim numbers and claim sizes, we get the following:

$$F_{X,Y}(x, y) = C(F_X(x), F_Y(y)), \tag{2.2}$$


where $X$ represents the claim sizes and $Y$ the claim numbers. The copula representation allows us to model the marginal distributions separately from the dependency between the two variables, which is represented by the copula function. In this thesis, as in the method by Kramer et al. (2013), four copula families are taken into account, namely the Gauss, Gumbel, Clayton and Frank copulas. Furthermore, for comparison, a model in which the variables are assumed to be independent is also considered. The copula representation of the joint bivariate distribution under independence is defined as

$$F_{X,Y}(x, y) = C(F_X(x), F_Y(y)) = F_X(x)\,F_Y(y), \tag{2.3}$$

and in Table 1 (Kramer et al., 2013) the four dependent bivariate copulas and their definitions are shown.

Table 1

Bivariate Copulas and Their Parameter Range

Adapted from "Total Loss Estimation Using Copula-based Regression Models" by Kramer, N., Brechmann, E. C., Silvestrini, D., Czado, C., 2013, Insurance: Mathematics and Economics, 53, p. 837. Copyright 2013 by Elsevier B.V.

Since the copula function is a parametric function depending on a parameter $\theta$, Equation 2.2 can be written as

$$F_{X,Y}(x, y \mid \theta) = C(F_X(x), F_Y(y) \mid \theta). \tag{2.4}$$

To be able to perform Maximum Likelihood Estimation later on, the joint density of $X$ and $Y$ is required. To derive the joint density, the partial derivative $D_1$ of the copula function $C$ is required. Kramer et al. (2013) denote this partial derivative with respect to the variable $u$ as

$$D_1(u, v \mid \theta) := \frac{\partial}{\partial u} C(u, v \mid \theta), \tag{2.5}$$

with $u, v \in\, ]0, 1[$. The joint density is then derived by Kramer et al. (2013) as follows:

$$\begin{aligned}
f_{X,Y}(x, y) &= \frac{\partial}{\partial x} P(X \le x, Y = y) \\
&= \frac{\partial}{\partial x} P(X \le x, Y \le y) - \frac{\partial}{\partial x} P(X \le x, Y \le y - 1) \\
&= \frac{\partial}{\partial x} C(F_X(x), F_Y(y) \mid \theta) - \frac{\partial}{\partial x} C(F_X(x), F_Y(y - 1) \mid \theta) \\
&= f_X(x)\left(D_1(F_X(x), F_Y(y) \mid \theta) - D_1(F_X(x), F_Y(y - 1) \mid \theta)\right).
\end{aligned} \tag{2.6}$$

The first partial derivatives of the copula families under consideration can be found in Table 2 (Kramer et al., 2013). Now that a construction for the dependency between the variables is determined, a model for the marginal distribution function of each variable is constructed in the next section.

Table 2

First Partial Derivatives of Bivariate Copulas

Adapted from "Total Loss Estimation Using Copula-based Regression Models" by Kramer, N., Brechmann, E. C., Silvestrini, D., Czado, C., 2013, Insurance: Mathematics and Economics, 53, p. 837. Copyright 2013 by Elsevier B.V.
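As an illustration of Equation 2.6, the following R sketch builds the joint density from a Gamma marginal for claim sizes, a zero-truncated Poisson (ZTP) marginal for claim numbers and the first partial derivative of the Frank copula. The helper names (`dztp`, `pztp`, `d1_frank`, `djoint`) and all parameter values are hypothetical and are not taken from `CopulaRegression'; the Gamma parametrisation by a mean and a dispersion (shape 1/delta, rate 1/(mu*delta)) is an assumption that anticipates Section 2.2.1.

```r
# Zero-truncated Poisson pmf and cdf (support y = 1, 2, ...)
dztp <- function(y, lambda) ifelse(y >= 1, dpois(y, lambda) / (1 - exp(-lambda)), 0)
pztp <- function(y, lambda) pmax(ppois(y, lambda) - exp(-lambda), 0) / (1 - exp(-lambda))

# First partial derivative of the Frank copula with respect to u
d1_frank <- function(u, v, theta) {
  a <- expm1(-theta * u)
  b <- expm1(-theta * v)
  k <- expm1(-theta)
  exp(-theta * u) * b / (k + a * b)
}

# Joint density of (X, Y) following Equation 2.6
djoint <- function(x, y, mu, delta, lambda, theta) {
  shape <- 1 / delta
  rate <- 1 / (mu * delta)
  u <- pgamma(x, shape = shape, rate = rate)
  dgamma(x, shape = shape, rate = rate) *
    (d1_frank(u, pztp(y, lambda), theta) - d1_frank(u, pztp(y - 1, lambda), theta))
}

# Integrating x out should give back the ZTP probability of y
p_y2 <- integrate(djoint, 0, Inf, y = 2, mu = 2, delta = 0.5,
                  lambda = 0.3, theta = 3)$value
c(integrated = p_y2, ztp = dztp(2, 0.3))
```

The final comparison is a sanity check rather than a proof: integrating the joint density over all claim sizes should recover the marginal probability of the claim number, whatever the copula parameter.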

2.2 Generalized Linear Models

In this section a model for the marginal distributions of the concerned variables is determined. With the use of GLM the marginal distributions for the variables claim numbers and claim sizes can be estimated.

The GLM, developed by Nelder and Wedderburn (1972), is the generalization of linear regression. With a GLM a dependent, or response, variable with a probability distribution from the exponential family can be expressed in terms of its covariates, also called the explanatory variables. This results in the following model (Parsa & Klugman, 2011) representing the relationship between a dependent variable and its explanatory variables:

$$E(Y \mid X_1 = x_1, \ldots, X_k = x_k) = g^{-1}(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k), \tag{2.7}$$

where $g$ is the link function that defines the relationship between the linear predictor of the covariates $X_1, \ldots, X_k$ and the mean of the response variable $Y$. The coefficients of the covariates are denoted by $\beta_0, \ldots, \beta_k$.

2.2.1 Model for Marginal Distributions

For the data considered in this thesis appropriate probability distributions to model the variables have to be chosen. The parameters of the candidate distributions are first estimated roughly through the Method of Moments (MoM), merely to check whether each distribution is a plausible choice before the Maximum Likelihood Estimation later on.

Since claim numbers often follow a Poisson distribution and we are only looking at positive claim numbers, using a Zero-Truncated Poisson (ZTP) distribution with parameter $\lambda > 0$ for the claim numbers is considered a rational option. To check whether this is a proper model for the claim numbers, a graphical comparison of the model and the empirical distribution of the claim numbers is performed. The ZTP parameter for this comparison is determined by MoM estimation. In MoM estimation, the sample mean $\bar{Y}$, where $Y$ is the random variable of claim numbers, is equated with the population mean $E[Y]$, and the resulting value of $\lambda$ is used as the estimate of the parameter. This leads to the following equation:

$$E[Y] = \frac{\lambda}{1 - \exp(-\lambda)} = \bar{Y}, \quad \text{with } \lambda > 0. \tag{2.8}$$

This equation can be solved numerically using software. Afterwards the estimated parameter $\hat{\lambda}$ is inserted in the ZTP model for the marginal distribution of claim numbers, which can then be compared to the original data of claim numbers in a plot.
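Equation 2.8 has no closed-form solution, so a root finder is needed. The sketch below uses base R's `uniroot`; the claim counts are hypothetical and only serve to make the example runnable. It relies on the sample mean being larger than one, which holds whenever at least one policy has more than one claim.

```r
# Hypothetical claim counts per policy (all positive, as in a claims-only sample)
y <- c(rep(1, 90), rep(2, 8), rep(3, 2))
y_bar <- mean(y)

# Mean of a zero-truncated Poisson distribution with parameter lambda
ztp_mean <- function(lambda) -lambda / expm1(-lambda)

# Solve ztp_mean(lambda) = y_bar; the root lies between 0 and y_bar
lambda_hat <- uniroot(function(l) ztp_mean(l) - y_bar,
                      interval = c(1e-8, y_bar), tol = 1e-12)$root
lambda_hat
```

The search interval works because the ZTP mean tends to one as lambda goes to zero and always exceeds lambda itself.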

Furthermore, claim sizes are often approximated by a Gamma distribution, so the GLM for claim sizes is fitted using a Gamma distribution. In the same way as with the ZTP model, this model is later compared to the original data for claim sizes in a plot. The parameters required to plot the curve can also be estimated by MoM as follows:

$$E[X] = \frac{\alpha}{\beta} = \bar{X} \tag{2.9}$$

for the mean, and for the variance:

$$\mathrm{Var}[X] = \frac{\alpha}{\beta^2} = s_X^2. \quad \text{So, } \hat{\alpha} = \frac{\bar{X}^2}{s_X^2} \text{ and then } \hat{\beta} = \frac{\hat{\alpha}}{\bar{X}}, \tag{2.10}$$

where $\alpha$ is the shape parameter and $\beta$ is the rate parameter of the Gamma distributed model for claim sizes, and $s_X^2$ and $\bar{X}$ are the sample variance and sample mean of the dataset with claim sizes respectively.
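The moment estimators for the Gamma model can be computed directly. The following R sketch uses hypothetical average claim sizes; the variable names `alpha_hat` (shape) and `beta_hat` (rate) are chosen for illustration and match the shape/rate arguments of base R's `dgamma`.

```r
# Hypothetical average claim sizes
x <- c(420, 1310, 250, 3900, 780, 1650, 560, 2200)

x_bar <- mean(x)
s2 <- var(x)                     # sample variance (denominator n - 1)

alpha_hat <- x_bar^2 / s2        # shape
beta_hat <- alpha_hat / x_bar    # rate

# Density curve that could be drawn over a histogram of the claim sizes:
# curve(dgamma(x, shape = alpha_hat, rate = beta_hat), add = TRUE)
c(shape = alpha_hat, rate = beta_hat)
```

By construction the fitted Gamma distribution reproduces the sample mean and the sample variance exactly.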

For the copula-based regression method Kramer et al. (2013) use the following representation of marginal densities for claim sizes $X$ and claim numbers $Y$ respectively:

$$f_X(x \mid \mu, \delta) = \frac{1}{x\,\Gamma(1/\delta)} \left(\frac{x}{\mu\delta}\right)^{1/\delta} \exp\left(-\frac{x}{\mu\delta}\right), \quad \text{for } x > 0, \tag{2.11}$$

with mean parameter $\mu > 0$ and dispersion parameter $\delta > 0$, and

$$f_Y(y \mid \lambda) = \frac{\lambda^y}{y!\,(1 - \exp(-\lambda))} \exp(-\lambda), \quad \text{for } y = 1, 2, \ldots, \tag{2.12}$$

with parameter $\lambda > 0$. The relation of the above mentioned parameters of the Gamma and ZTP distributions to the mean and variance of the distributions is shown in Table 3, as in Kramer et al. (2013).

Table 3

Relation of Parameters to Distribution Average and Variance

From "Total Loss Estimation Using Copula-based Regression Models" by Kramer, N., Brechmann, E. C., Silvestrini, D., Czado, C., 2013, Insurance: Mathematics and Economics, 53, p. 831. Copyright 2013 by Elsevier B.V.
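The densities in Equations 2.11 and 2.12 can be checked numerically against base R. The sketch below assumes that the mean/dispersion form corresponds to `dgamma` with shape 1/delta and rate 1/(mu*delta), which implies a Gamma mean of mu and a variance of mu^2 * delta; the function name `dgamma_md` and the parameter values are hypothetical.

```r
# Gamma density in the mean/dispersion form of Equation 2.11
dgamma_md <- function(x, mu, delta) {
  1 / (x * gamma(1 / delta)) * (x / (mu * delta))^(1 / delta) * exp(-x / (mu * delta))
}

x <- c(0.5, 1, 2.5, 7)
mu <- 2
delta <- 0.5
direct <- dgamma_md(x, mu, delta)
builtin <- dgamma(x, shape = 1 / delta, rate = 1 / (mu * delta))
all.equal(direct, builtin)

# ZTP mean and variance obtained by summing over the pmf of Equation 2.12
lambda <- 0.3
y <- 1:100
p <- lambda^y / (factorial(y) * (1 - exp(-lambda))) * exp(-lambda)
ztp_mean <- sum(y * p)
ztp_var <- sum(y^2 * p) - ztp_mean^2
c(mean = ztp_mean, variance = ztp_var)
```

The summed ZTP moments can be compared with the closed forms lambda / (1 - exp(-lambda)) for the mean and mean * (1 + lambda - mean) for the variance.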

2.2.2 Selection of Covariates

Before continuing to the next step, it is important that the significant covariates for the two variables, claim numbers and claim sizes, are determined. This is done by fitting a GLM to the marginals and determining through analysis of deviance which covariates are significant. Most freely available statistical software can produce these results; in this thesis the `anova' call in R produces the analysis of deviance table. In the analysis of deviance we look at the difference in deviance per added covariate. As stated in Kaas et al. (2008), for a covariate to be significant, the decrease in deviance after scaling has to be larger than the 95% quantile of the Chi-squared distribution with $\Delta df$ degrees of freedom, that is, at a 5% significance level:

$$\frac{\Delta Dev}{\phi} > 95\%\ \text{quantile of } \chi^2(\Delta df), \tag{2.13}$$

where $\Delta df$ is the decrease in degrees of freedom and $\Delta Dev$ is the decrease in deviance of the model when adding the covariate. The parameter $\phi$ is the scale parameter and can be estimated by $\hat{\phi} = \text{Resid. Dev} / \text{Resid. df}$, where Resid. Dev and Resid. df are the residual deviance and residual degrees of freedom respectively of the fullest model available, in other words, the model with most covariates added. So if Equation 2.13 is satisfied, the added covariate is significant and stays in the model. Afterwards the change in deviance for adding another covariate to the model is analysed. If Equation 2.13 is not satisfied, the covariate concerned is left out. This process of adding covariates and analysing the corresponding deviances is repeated for all available covariates until an optimal model is found.
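The selection rule of Equation 2.13 can be sketched in R on simulated data. Everything in this example is hypothetical: the covariates `g` (with a real effect) and `z` (pure noise), the sample size and the coefficients. Only the base R functions `glm`, `anova` and `qchisq` are used, and the column names are those of the table returned by `anova` for a `glm` fit.

```r
set.seed(1)
n <- 500
g <- factor(sample(c("A", "B"), n, replace = TRUE))  # covariate with a real effect
z <- rnorm(n)                                        # pure noise covariate
mu <- exp(1 + 1.0 * (g == "B"))
x <- rgamma(n, shape = 2, rate = 2 / mu)             # simulated claim sizes

fit <- glm(x ~ g + z, family = Gamma(link = "log"))
tab <- anova(fit)                                    # sequential analysis of deviance

# Scale estimate from the fullest model: residual deviance / residual df
phi_hat <- tail(tab[["Resid. Dev"]], 1) / tail(tab[["Resid. Df"]], 1)

dev_drop <- tab[["Deviance"]][-1]                    # first row is the null model
df_drop <- tab[["Df"]][-1]
significant <- dev_drop / phi_hat > qchisq(0.95, df_drop)
names(significant) <- rownames(tab)[-1]
significant
```

Because `anova` adds terms sequentially, the outcome can depend on the order in which the covariates appear in the formula.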

However, the approach outlined above can only be applied to the variable of claim sizes, which is modelled through a Gamma distribution, because no analysis of deviance call is available in the regular software for a ZTP distributed variable. For the variable of claim numbers it is therefore assumed that the covariates significant to this variable are the same as the covariates significant to claim sizes.

The significant covariates are represented in an $n \times p$ design matrix $R$, where $n$ is the number of policies taken into account and $p$ is the number of columns, which is determined by the covariate levels that differ from the base levels against which all outcomes are compared. The first column of this matrix consists of a vector of ones and represents the base level of each covariate. The following columns represent the separate levels of each covariate, except for the base levels covered by the first column. A one in a row of any column after the first indicates that the policy has that specific covariate level instead of the base level. A useful tool to obtain this design matrix is available in R, namely the function `model.matrix' from the package `stats' (R Development Core Team, 2012).
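A small example shows the structure of such a design matrix. The data frame below is hypothetical; only the call to `model.matrix` reflects the tool named in the text.

```r
# Hypothetical policies with two categorical covariates
policies <- data.frame(
  gender = factor(c("F", "M", "M", "F")),
  area = factor(c("A", "B", "C", "A"))
)

# Intercept column plus one indicator column per non-base level
R <- model.matrix(~ gender + area, data = policies)
R
```

Here the base levels are gender F and area A, so a policy with those levels has a one in the first column only.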

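The GLMs for the marginal distributions are then specified as follows (Kramer et al., 2013):

$$X_i \sim \text{Gamma}(\mu_i, \delta) \quad \text{with } \ln(\mu_i) = r_i^\top \boldsymbol{\alpha}, \tag{2.14}$$

$$Y_i \sim \text{ZTP}(\lambda_i) \quad \text{with } \ln(\lambda_i) = \ln(e_i) + r_i^\top \boldsymbol{\beta}, \tag{2.15}$$

for $i = 1, 2, \ldots, n$, where $r_i$ is the covariate vector of observation $i$, that is, row $i$ of the design matrix $R$, and $e_i$ denotes the exposure time of policy $i$.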

2.3 Copula-based Regression Model

Now that the significant covariates have been selected, the marginal distribution models conditional on these covariates can be inserted in Equation 2.6. The joint probability density model as in Kramer et al. (2013) is thus defined as

$$f_{X,Y}(x, y \mid \mu, \delta, \lambda, \theta) = f_X(x \mid \mu, \delta)\left(D_1(F_X(x \mid \mu, \delta), F_Y(y \mid \lambda) \mid \theta) - D_1(F_X(x \mid \mu, \delta), F_Y(y - 1 \mid \lambda) \mid \theta)\right), \tag{2.16}$$

for $x > 0$ and $y = 1, 2, \ldots$ So the joint probability density model of the two variables depends on the parameters $\mu, \delta, \lambda$ of the marginal distributions and the parameter $\theta$ of the copula function, where the parameter $\theta$ is assumed to be the same for all policyholders.

2.4 Maximum Likelihood Estimation

The parameters of the joint probability density model are now to be estimated through Maximum Likelihood Estimation. In the estimation, the GLMs of the marginal distributions and the copula model for the joint structure of these marginals are fitted simultaneously. The following parameter vector, as stated and explained by Kramer et al. (2013), is estimated:

$$v := (\boldsymbol{\alpha}^\top, \boldsymbol{\beta}^\top, \delta, \theta)^\top \in \mathbb{R}^{p+p+2}. \tag{2.17}$$

The log-likelihood function of the joint distribution model in Equation 2.16 is defined as

$$l(v \mid x, y) = \sum_{i=1}^{n} \ln\left(f_{X,Y}(x_i, y_i \mid v)\right), \tag{2.18}$$

with $x = (x_1, \ldots, x_n)^\top \in \mathbb{R}^n$ and $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$, and the resulting estimates from maximization are denoted as

$$\hat{v} := \underset{v}{\operatorname{argmax}}\; l(v \mid x, y). \tag{2.19}$$
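To make Equations 2.17 to 2.19 concrete, the following R sketch evaluates the log-likelihood for one parameter vector. It is not the implementation in `CopulaRegression'; it assumes the Frank copula, a single design matrix for both margins, and hypothetical toy data, and the function names `pztp`, `d1_frank` and `loglik` are made up for this example.

```r
# Zero-truncated Poisson cdf and first partial derivative of the Frank copula
pztp <- function(y, lambda) pmax(ppois(y, lambda) - exp(-lambda), 0) / (1 - exp(-lambda))
d1_frank <- function(u, v, theta) {
  a <- expm1(-theta * u)
  b <- expm1(-theta * v)
  exp(-theta * u) * b / (expm1(-theta) + a * b)
}

# Log-likelihood of Equation 2.18 for v = (alpha, beta, delta, theta)
loglik <- function(v, x, y, R, exposure) {
  p <- ncol(R)
  alpha <- v[1:p]
  beta <- v[(p + 1):(2 * p)]
  delta <- v[2 * p + 1]
  theta <- v[2 * p + 2]
  mu <- as.vector(exp(R %*% alpha))                  # Gamma means
  lambda <- as.vector(exposure * exp(R %*% beta))    # ZTP parameters
  shape <- 1 / delta
  rate <- 1 / (mu * delta)
  u <- pgamma(x, shape = shape, rate = rate)
  dens <- dgamma(x, shape = shape, rate = rate) *
    (d1_frank(u, pztp(y, lambda), theta) - d1_frank(u, pztp(y - 1, lambda), theta))
  sum(log(dens))
}

# Hypothetical toy data: intercept plus one binary covariate
R <- cbind(1, c(0, 1, 0, 1, 1))
x <- c(1.2, 3.4, 0.8, 2.5, 4.1)
y <- c(1, 2, 1, 1, 3)
exposure <- c(1, 0.5, 1, 0.8, 1)
v <- c(0.5, 0.3, -1, 0.2, 0.5, 2)
loglik(v, x, y, R, exposure)
```

A function of this shape could in principle be handed to a general-purpose optimiser such as `optim` with `control = list(fnscale = -1)` to maximise instead of minimise, although the constraints on delta and theta would then need to be handled, for instance by a transformation.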

For convenience Kramer et al. (2013) apply a transformation to $\theta$, so that the domain restrictions on $\theta$ cause no complications in the numerical calculations; see Table 1 for the range of $\theta$ for each copula family. The transformation by function $g$ is defined as follows:

$$g(\theta) = \frac{1}{2} \ln\left(\frac{1 + \theta}{1 - \theta}\right). \tag{2.20}$$

The log-likelihood function can now be optimized with respect to $(\boldsymbol{\alpha}^\top, \boldsymbol{\beta}^\top, \delta, g(\theta))^\top$.
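A minimal sketch of the transformation, assuming Equation 2.20 reads g(theta) = 1/2 ln((1 + theta) / (1 - theta)). In that form it coincides with the inverse hyperbolic tangent, which maps the open interval from -1 to 1 onto the whole real line, so an optimiser can search without bounds and the result is mapped back with `tanh`. The function names `g` and `g_inv` are chosen for illustration.

```r
g <- function(theta) 0.5 * log((1 + theta) / (1 - theta))
g_inv <- function(z) tanh(z)      # inverse transformation

theta <- c(-0.9, -0.2, 0, 0.5, 0.99)
z <- g(theta)                     # unrestricted values on the real line
all.equal(z, atanh(theta))        # same as base R's atanh
all.equal(g_inv(z), theta)        # back-transformation recovers theta
```

This form only makes sense for a parameter restricted to the interval from -1 to 1; parameters with a different range would need a different transformation.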

This maximization is done for all four copula families considered. It is carried out with the function `copreg' from the R-package `CopulaRegression' provided by Kramer and Silvestrini (2012). To execute `copreg', the observations of both the claim numbers and claim sizes, the two design matrices (which are the same design matrix in our case, since we use the same covariates to model both variables), the copula family and the exposure of the claim numbers are needed. When the model is fitted with the necessary arguments, `copreg' returns the estimated parameter values as well as the maximized likelihood and some other values for both the selected copula family and the independence model.

2.4.1 Finding the Best Fitting Model

First, it is important to find out which of the four copula families provides the best fit. This can be measured by Akaike's Information Criterion (AIC) (Kaas et al., 2008). AIC examines a model by considering its complexity as well as its goodness of fit. On the one hand, a model should not have too many parameters, which causes `overfitting': too few degrees of freedom remain, which is in principle not the purpose of fitting a model. On the other hand, a model should not have so few parameters that the fit to the sample data is insufficient. The copula model that results in the lowest AIC is considered to be the model providing the most balanced or `best' fit. AIC is defined as

$$\text{AIC} = -2 \times \text{maximized log-likelihood} + 2k, \tag{2.21}$$

where $k$ denotes the number of fitted parameters. Once the best fitting copula family is determined, this model is compared to the model with the independence assumption, also by measuring the AIC. The resulting `best' model and the model assuming independence are considered in the following section, where the most suitable model for the sample data is obtained.
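The comparison by Equation 2.21 amounts to a few lines of R. The log-likelihood values below are hypothetical placeholders, not results from this thesis, and the number of covariate columns p = 20 is likewise an assumption for illustration.

```r
# Hypothetical maximised log-likelihoods, one per fitted copula family
loglik <- c(gauss = -41250.3, clayton = -41262.8, gumbel = -41249.1, frank = -41255.6)
k <- 2 * 20 + 2          # p + p regression coefficients, delta and theta (p = 20 assumed)

aic <- -2 * loglik + 2 * k
aic
names(which.min(aic))    # family with the lowest AIC
```

Since all four copula models have the same number of parameters, ranking them by AIC is equivalent to ranking them by maximised log-likelihood; the penalty term only matters when models with different k are compared.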


2.4.2 Best Fitting Model Parameters

For the determination of the parameters $\hat{\mu}, \hat{\delta}, \hat{\lambda}, \hat{\theta}$ of the resulting models, the estimates from the Maximum Likelihood Estimation are required. The estimated parameters $\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}}, \hat{\delta}, \hat{\theta}$ are returned by the function `copreg' from the R-package `CopulaRegression' (Kramer & Silvestrini, 2012), where $\hat{\boldsymbol{\alpha}}$ is the vector of estimated coefficients of the covariates for claim sizes and $\hat{\boldsymbol{\beta}}$ is the vector of estimated coefficients of the same covariates for claim numbers. Now $\hat{\mu}$ and $\hat{\lambda}$ have to be calculated from the estimates returned by `copreg' for the parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ as follows:

$$\hat{\mu}_i = \exp(r_i^\top \hat{\boldsymbol{\alpha}}), \tag{2.22}$$

$$\hat{\lambda}_i = \exp(\ln(e_i) + r_i^\top \hat{\boldsymbol{\beta}}), \tag{2.23}$$

where $r_i$ is the covariate vector of policy $i$ and $e_i$ is the exposure time for policy $i$. The joint copula model can now be completed by inserting the estimated parameters and the best fitting copula in the joint density model in Equation 2.16.
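Equations 2.22 and 2.23 translate into two matrix products. The design matrix, exposures and coefficient values in this sketch are hypothetical; in practice the coefficient vectors would come from the fitted model.

```r
# Hypothetical design matrix (intercept + one indicator), exposures and estimates
R <- cbind(intercept = 1, genderM = c(0, 1, 1))
exposure <- c(1, 0.5, 0.25)
alpha_hat <- c(7.0, 0.2)    # claim size coefficients (illustrative values)
beta_hat <- c(-2.0, 0.1)    # claim number coefficients (illustrative values)

mu_hat <- as.vector(exp(R %*% alpha_hat))                       # Equation 2.22
lambda_hat <- as.vector(exp(log(exposure) + R %*% beta_hat))    # Equation 2.23
cbind(mu_hat, lambda_hat)
```

The exposure enters as an offset on the log scale, so halving the exposure halves the ZTP parameter of an otherwise identical policy.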

2.5 Policy Loss Estimation

Now that the best fitting joint model has been found and its parameters estimated, the individual and total loss estimates of the policies can be obtained through both the dependent and the independent model.

Firstly, individual policy losses $L_i$ and the total policy loss $T$ can be calculated as follows (Kramer et al., 2013):

$$L_i = X_i \cdot Y_i, \quad \text{for } i = 1, \ldots, n, \tag{2.24}$$

$$T = \sum_{i=1}^{n} L_i, \tag{2.25}$$

where $X_i$ is the average claim size of policy $i$ and $Y_i$ is the number of claims of policy $i$.

For the determination of the mean and variance of the total loss $T$, the distribution of the total loss is needed. As Kramer et al. (2013) state, by the Central Limit Theorem the total loss $T$ asymptotically follows a normal distribution. So

$$\frac{1}{\sigma_T}\left(T - \sum_{i=1}^{n} \mu_{L_i}\right) \xrightarrow{D} N(0, 1), \tag{2.26}$$


such that the variance and mean of the total loss can be approximated by

$$\sigma_T^2 = \sum_{i=1}^{n} \sigma_{L_i}^2, \tag{2.27}$$

$$\mu_T = \sum_{i=1}^{n} \mu_{L_i}, \tag{2.28}$$

respectively, where $\sigma_{L_i}^2$ is the variance of each policy loss $L_i$ and $\mu_{L_i}$ is the mean of each policy loss $L_i$.

First, $\sigma_{L_i}^2$ and $\mu_{L_i}$ can be calculated as follows in the case of independence:

$$\mu_{L_i} = E[L_i] = E[X_i Y_i] = E[X_i]\,E[Y_i], \tag{2.29}$$

$$\sigma_{L_i}^2 = \mathrm{Var}[L_i] = \mathrm{Var}[X_i Y_i] = \mathrm{Var}[X_i]\,\mathrm{Var}[Y_i] + \mathrm{Var}[X_i]\,E[Y_i]^2 + E[X_i]^2\,\mathrm{Var}[Y_i]. \tag{2.30}$$

The formulas for these expected values and variances can be found in Table 3.
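Under independence the policy moments and the total loss follow from the marginal moments alone. The R sketch below uses hypothetical parameter values and assumes the standard moments implied by Equations 2.11 and 2.12: a Gamma mean of mu with variance mu^2 * delta, and a ZTP mean of lambda / (1 - exp(-lambda)) with variance mean * (1 + lambda - mean).

```r
# Hypothetical per-policy parameters
mu <- c(1200, 950, 2100)       # Gamma means
delta <- 0.8                   # Gamma dispersion
lambda <- c(0.10, 0.25, 0.05)  # ZTP parameters

# Marginal moments
ex <- mu
vx <- mu^2 * delta
ey <- lambda / (1 - exp(-lambda))
vy <- ey * (1 + lambda - ey)

# Equations 2.29 and 2.30
mu_L <- ex * ey
var_L <- vx * vy + vx * ey^2 + ex^2 * vy

# Equations 2.27 and 2.28, plus a normal-approximation quantile of T
mu_T <- sum(mu_L)
sigma_T <- sqrt(sum(var_L))
qnorm(0.995, mean = mu_T, sd = sigma_T)
```

With only three policies the normal approximation is of course poor; the last line merely shows how the approximate distribution of the total loss could be used once many policies are summed.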

Second, $\sigma_{L_i}^2$ and $\mu_{L_i}$ are calculated for the case of dependence. For this the probability distribution of the losses is required. It was derived by Kramer et al. (2013) via the joint probability density function of losses and claim numbers. This joint density (all parameters are left out for clarity) was defined as follows, with $(L, Y)^\top$ on $\mathbb{R}^+ \times \{1, 2, \ldots\}$:

$$f_{L,Y}(l, y) = \frac{\partial}{\partial l} P(L \le l, Y = y) = \frac{\partial}{\partial l} P\left(X \le \frac{l}{y}, Y = y\right), \quad \text{as } X = \frac{L}{Y}. \tag{2.31}$$

Then substituting $l/y$ by $x$ gives

$$f_{L,Y}(l, y) = \frac{\partial}{\partial x} P(X \le x, Y = y)\,\frac{\partial x}{\partial l} = f_{X,Y}\left(\frac{l}{y}, y\right)\frac{1}{y}. \tag{2.32}$$

The probability density of the losses can then be found by summing this joint density over all values of $Y$:

$$f_L(l \mid \mu, \delta, \lambda, \theta) = \sum_{y=1}^{\infty} \left[D_1\left(F_X\left(\tfrac{l}{y} \mid \mu, \delta\right), F_Y(y \mid \lambda) \mid \theta\right) - D_1\left(F_X\left(\tfrac{l}{y} \mid \mu, \delta\right), F_Y(y - 1 \mid \lambda) \mid \theta\right)\right] \frac{1}{y}\, f_X\left(\tfrac{l}{y} \mid \mu, \delta\right), \quad \text{for } l > 0. \tag{2.33}$$

Returning to the calculation of the means $\mu_{L_i}$ and variances $\sigma_{L_i}^2$ of the losses in the case of dependence, we get the following:

$$\mu_{L_i} = E[L_i] = \int_0^\infty l\, f_L(l)\, dl, \tag{2.34}$$

$$E[L_i^2] = \int_0^\infty l^2 f_L(l)\, dl, \tag{2.35}$$

$$\sigma_{L_i}^2 = E[L_i^2] - E[L_i]^2.$$

The vector of mean policy losses $[\mu_{L_1} \ldots \mu_{L_n}]$ for the dependent case can be obtained using the function `epolicy_loss' from the R-package `CopulaRegression' (Kramer & Silvestrini, 2012). Due to an apparent flaw in the R-code, the policy loss variance vector $[\sigma_{L_1}^2 \ldots \sigma_{L_n}^2]$ that is returned by the function is incorrect. This vector is therefore calculated `manually'; the R-code for this calculation can be found in Appendix I.
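The calculation in Equations 2.33 to 2.35 can be sketched with numerical integration in base R. This is not the code of Appendix I and not the code of `CopulaRegression'; it assumes the Frank copula, truncates the infinite sum at a hypothetical `y_max`, and uses made-up function names (`pztp`, `d1_frank`, `dloss`, `loss_moments`) and parameter values.

```r
# Zero-truncated Poisson cdf and first partial derivative of the Frank copula
pztp <- function(y, lambda) pmax(ppois(y, lambda) - exp(-lambda), 0) / (1 - exp(-lambda))
d1_frank <- function(u, v, theta) {
  a <- expm1(-theta * u)
  b <- expm1(-theta * v)
  exp(-theta * u) * b / (expm1(-theta) + a * b)
}

# Loss density of Equation 2.33, with the infinite sum truncated at y_max
dloss <- function(l, mu, delta, lambda, theta, y_max = 50) {
  shape <- 1 / delta
  rate <- 1 / (mu * delta)
  y <- 1:y_max
  vapply(l, function(li) {
    x <- li / y
    u <- pgamma(x, shape = shape, rate = rate)
    sum((d1_frank(u, pztp(y, lambda), theta) -
           d1_frank(u, pztp(y - 1, lambda), theta)) *
          dgamma(x, shape = shape, rate = rate) / y)
  }, numeric(1))
}

# Mean and variance of one policy loss (Equations 2.34 and 2.35)
loss_moments <- function(mu, delta, lambda, theta) {
  m1 <- integrate(function(l) l * dloss(l, mu, delta, lambda, theta), 0, Inf)$value
  m2 <- integrate(function(l) l^2 * dloss(l, mu, delta, lambda, theta), 0, Inf)$value
  c(mean = m1, variance = m2 - m1^2)
}

loss_moments(mu = 2, delta = 0.5, lambda = 0.5, theta = 3)
```

The claim sizes here are on a scale of order one so that the default settings of `integrate` behave well; for claim sizes in the thousands, rescaling the losses before integrating would be a sensible precaution.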

Afterwards the sums over these means and variances can be taken as in Equation 2.27 and Equation 2.28, which gives the estimate of the total loss for that particular model. The results from the independence model and the best fitting dependence model can then be compared.

Next, the chosen data and model choices for the application of this theory are outlined.


Chapter 3 Application

To apply the theory discussed in the previous chapter, data on insurance claims was needed. The original objective was to find a dataset with car insurance data over a period of at least two years. However, data on car insurance claims is hard to find and often not freely available, so a dataset from the package `insuranceData' by Dominiak and Trzesiok (2015) was chosen. According to Dominiak and Trzesiok this R-package is made especially for the testing of regression models. It includes, amongst other datasets, the dataset `dataCar', which contains 67856 observations of claims data on one-year policies issued in 2004 or 2005. Of these observations, 4624 produced actual claims. In Table 4, the available variables and their descriptions as given by Dominiak and Trzesiok (2015) are shown.

For the application of the method only actual claims were required, so the observations in the dataset where `clm' equals one were selected. For these observations, the average claim size per policy was calculated by dividing the total claim size per policy `claimcst0' by the number of claims per policy `numclaims'. For the application of the theory the software R (R Development Core Team, 2012) has been used. R is a software environment and programming language which is used broadly for analytic and scientific purposes. It is able to execute complex calculations and scripts and can produce neat graphics. Apart from the many possibilities R offers, the choice of dataset and the package used for implementing the theory of this thesis require the use of R. The R-code used for this analysis can be found in Appendix I.
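The filtering step and the average claim size can be sketched as follows. The rows are hypothetical and only borrow the column names `clm', `numclaims' and `claimcst0' from the dataset; the column name `avg_size` is introduced for this example.

```r
# Hypothetical rows using the column names of `dataCar`
dat <- data.frame(
  clm = c(0, 1, 1, 0, 1),
  numclaims = c(0, 1, 2, 0, 3),
  claimcst0 = c(0, 500, 1800, 0, 2400)
)

claims <- subset(dat, clm == 1)                         # keep policies with a claim
claims$avg_size <- claims$claimcst0 / claims$numclaims  # average claim size per policy
claims
```

With the package installed, the real data could presumably be loaded with `data(dataCar, package = "insuranceData")` and passed through the same two steps.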

In the article of Kramer et al. (2013), the method is said to be more accurate than the traditional model of independence, and the use of copula theory to model dependencies amongst variables is gaining popularity. Anticipating the results in the next chapter, I expect to see whether implementing the `CopulaRegression' package (Kramer & Silvestrini, 2012) works properly, and I also expect to find that the results of the dependent models are indeed more accurate than the results of the independence model, since I expect that there is dependence between the variables claim numbers and claim sizes.

Table 4

Contents of Dataset `dataCar' from Package `insuranceData'

Name        Description
veh_value   vehicle value, in $10,000s
exposure    0-1
clm         occurrence of claim (0 = no, 1 = yes)
numclaims   number of claims
claimcst0   claim amount (0 if no claim)
veh_body    vehicle body, coded as: BUS CONVT COUPE HBACK HDTOP MCARA MIBUS PANVN RDSTR SEDAN STNWG TRUCK UTE
veh_age     1 (youngest), 2, 3, 4
gender      a factor with levels F M
area        a factor with levels A B C D E F
agecat      1 (youngest), 2, 3, 4, 5, 6
X_OBSTAT_   a factor with levels 01101 0 0 0

Adapted from "A Collection of Insurance Datasets Useful in Risk Classification in Non-life Insurance" by Dominiak, A. W., Trzesiok, M., 2015.


Chapter 4 Results and Analysis

In this chapter, the copula regression method discussed earlier is applied to the chosen dataset.

First, the chosen marginal distributions of the variables claim numbers and claim sizes are verified. The left panel of Figure 1 compares the model for claim numbers with the data: the original claim numbers are displayed as a histogram and the density of the ZTP model, with parameter $\lambda$ estimated through MoM, is represented by the curve. The left panel shows that overall the claim numbers are indeed approximately ZTP distributed, so the use of a ZTP distribution in the GLM of claim numbers is justified. The parameter $\lambda$ of the final model that will be obtained through Maximum Likelihood Estimation is a vector that depends on the covariates and therefore differs in value for each policy.

Figure 1

The right panel of Figure 1 compares the model for claim sizes with the original data, where the original claim sizes are again displayed as a histogram and the density of the Gamma model, with parameters estimated through MoM, is represented by the curve. From Figure 1 it can be concluded that overall the claim sizes are roughly Gamma distributed, so the use of a Gamma distribution in the GLM of claim sizes is reasonable. The parameter $\mu$ of the Gamma distribution in the final model that will be estimated through Maximum Likelihood is also a vector that depends on the covariates, so the value of $\mu$ differs for each policy. Next, we continue with determining the relevant covariates for these marginal distribution models.

For selecting the covariates through analysis of deviance, the `anova' call in R is used. The `anova' function is applied to the `glm' function, which regresses the average claim sizes on the implemented covariates. In the first analysis all covariates are included; its output is shown in Table 5, where `p5' represents the vector of average claim sizes and `p4' the vector of claim numbers.

Table 5

> anova(glm(p5 ~ 1+veh_value+veh_body+veh_age+gender+area+age_cat, family=Gamma(link=log),weights=p4))

Since the fullest model is the model where all covariates are included, the scale factor can be estimated by

$$\hat{\phi} = \frac{\text{Resid. Dev}}{\text{Resid. df}} = \frac{7359.6}{4584} = 1.605497. \tag{4.1}$$
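How the scale factor and the deviance decreases can be read off an `anova' table is sketched below on simulated data. The data frame, the covariates `x1' and `x2', the weights and the coefficients are all hypothetical; only the calls `glm', `anova', `deviance', `df.residual' and `qchisq' are standard R.

```r
set.seed(42)
n <- 400
# Simulated policies (illustration only): one numeric and one factor covariate
d <- data.frame(
  x1 = runif(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  w  = sample(1:3, n, replace = TRUE)   # number of claims, used as weight
)
mu  <- exp(7 + 0.5 * d$x1)              # x2 has no true effect
# Average of w claims, each Gamma with shape 2 and mean mu
d$y <- rgamma(n, shape = 2 * d$w, rate = 2 * d$w / mu)

fit <- glm(y ~ x1 + x2, family = Gamma(link = "log"), weights = w, data = d)
tab <- anova(fit)                       # sequential analysis of deviance

# Scale factor: residual deviance over residual df of the fullest model
phi_hat <- deviance(fit) / df.residual(fit)

# Scaled deviance decrease per added term against the chi-squared quantile
dev_drop <- tab$Deviance[-1]            # first row is the null model
df_drop  <- tab$Df[-1]
scaled   <- dev_drop / phi_hat
result   <- data.frame(term = rownames(tab)[-1], scaled,
                       crit = qchisq(0.95, df_drop),
                       significant = scaled > qchisq(0.95, df_drop))
result
```

Because `anova' adds the terms sequentially, the order of the covariates in the formula determines which decrease in deviance is attributed to which covariate.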

The 95% critical values of the Chi-squared distribution for the relevant degrees of freedom are

$$
\begin{aligned}
\text{95\% critical value of } \chi^2_{(df)} &= \text{95\% critical value of } \chi^2_{([13;\ 12;\ 5;\ 3;\ 1])} \\
&= [22.36203;\ 21.02607;\ 11.0705;\ 7.814728;\ 3.841459].
\end{aligned}
\tag{4.2}
$$

The scaled decreases in deviance per added covariate, obtained by dividing each decrease in deviance by the estimated scale factor $\hat{\phi} = 1.605497$, are respectively

$$
\begin{aligned}
\frac{\Delta \text{Dev}}{\hat{\phi}} &= \frac{[35.986;\ 74.424;\ 3.610;\ 37.194;\ 57.296;\ 51.497]}{\hat{\phi}} \\
&= [22.41424;\ 46.35574;\ 2.248525;\ 23.16666;\ 35.68739;\ 32.07543].
\end{aligned}
\tag{4.3}
$$

The first two added covariates, veh_value and veh_body, have scaled decreases in deviance higher than the 95% critical value of the Chi-squared distribution with the corresponding degrees of freedom, so these added covariates are significant. For the third added covariate, veh_age, the scaled decrease in deviance is lower than the 95% critical value, so this covariate is not significant and is therefore removed from the model. A new model is now estimated without veh_age, which results in the analysis of deviance in Table 6.
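The comparison above can be reproduced from the reported numbers alone. In the sketch below the deviance decreases and the scale factor are taken from the text; the degrees of freedom 13, 12 and 3 are stated in the appendix, while 1, 5 and 5 for gender, area and age_cat are an assumption based on the number of factor levels.

```r
# Values reported in the analysis of deviance
phi_hat  <- 7359.6 / 4584    # scale factor, about 1.605497
dev_drop <- c(veh_value = 35.986, veh_body = 74.424, veh_age = 3.610,
              gender = 37.194, area = 57.296, age_cat = 51.497)
# Degrees of freedom per covariate; the last three are assumed
df_drop  <- c(veh_value = 13, veh_body = 12, veh_age = 3,
              gender = 1, area = 5, age_cat = 5)

scaled      <- dev_drop / phi_hat       # scaled decreases in deviance
crit        <- qchisq(0.95, df_drop)    # 95% critical values
significant <- scaled > crit
round(cbind(scaled, crit, significant), 4)
```

Only veh_age falls below its critical value, which matches the decision to drop it from the model.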

Table 6

> anova(glm(p5 ~ 1+veh_value+veh_body+gender+area+age_cat, family=Gamma(link=log),weights=p4))

The scaled decreases in deviance per added covariate for the model without veh_age are respectively

$$
\begin{aligned}
\frac{\Delta \text{Dev}}{\hat{\phi}} &= \frac{[35.986;\ 74.424;\ 37.929;\ 57.488;\ 51.110]}{\hat{\phi}} \\
&= [22.41424;\ 46.35574;\ 23.62446;\ 35.80698;\ 31.83438].
\end{aligned}
\tag{4.4}
$$

Comparing the scaled decreases in deviance with the 95% critical values in (4.2), it can be concluded that all remaining covariates in the model are significant and should therefore stay in the model.

So a design matrix R is created involving the covariates `vehicle value', `vehicle body', `gender', `area' and `age category'. For this the R-function `model.matrix' (R Development Core Team, 2012) is used. This matrix R is used as design matrix for both the vectors claim numbers and claim sizes in the dataset.
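What `model.matrix' produces, and how the design matrix turns coefficient vectors into per-policy parameters, can be sketched on a toy example. The data frame `policies' and the coefficient values below are hypothetical; the log links and the log-exposure offset for the claim numbers follow the calculations in the appendix.

```r
# Toy policy data (illustration only); the thesis uses five factor covariates
policies <- data.frame(
  gender   = factor(c("F", "M", "M", "F")),
  area     = factor(c("A", "B", "C", "A")),
  exposure = c(0.5, 1.0, 0.25, 0.8)
)

# model.matrix() adds an intercept column and dummy-codes each factor,
# dropping the first level of every factor as the reference category
R <- model.matrix(~ gender + area, data = policies)

# Hypothetical coefficient vectors, one entry per column of R
alpha <- c(7.5, 0.10, -0.20, 0.05)    # claim size (Gamma mean, log link)
beta  <- c(-2.0, 0.05, 0.10, -0.15)   # claim number (ZTP, log link)

# Log-link parameters for all policies at once; the claim number model
# carries log(exposure) as an offset
mu     <- as.vector(exp(R %*% alpha))
lambda <- as.vector(exp(log(policies$exposure) + R %*% beta))
```

The matrix product evaluates all policies in one step, so no loop over the rows of R is needed.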

The Maximum Likelihood Estimation over the two GLM regression models and the copula that combines these models is performed next. For this the `copreg' function (Kramer & Silvestrini, 2012) in the `CopulaRegression' package is executed, with as input the two vectors of average claim sizes and claim numbers, the design matrix R for both vectors, the vector of exposures of the claim numbers and the choice of copula family. The function has been executed for all four copulas; the resulting maximized log-likelihood values, numbers of estimated parameters and AIC values are shown in Table 7.

Table 7

AIC for Each Copula Family

Copula family   Number of parameters   Max. value of log-likelihood   AIC
Gauss           76                     -40466.36                      81084.72
Clayton         76                     -40460.2                       81072.40
Gumbel          76                     -40466.65                      81085.30
Frank           76                     -40465.19                      81082.38

The copula family that gives the lowest AIC value is Clayton's copula, so Clayton's copula is preferred among the four copula families. Next, the results of the independence model are compared to the results of the dependence model with Clayton's copula; these results are shown in Table 8.

Table 8

AIC for Independence and Dependence with Clayton's Copula

Copula family   Number of parameters   Max. value of log-likelihood   AIC
Independence    75                     -41383.91                      82917.82
Clayton         76                     -40460.2                       81072.40

The model with Clayton's copula gives a lower AIC than the independence model and is thus preferred over the independence model.
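The AIC values in Tables 7 and 8 follow from AIC = 2k - 2 log L, with k the number of estimated parameters. The sketch below recomputes them from the reported log-likelihoods; the data frame `fits' is only a convenient container for the table values.

```r
# Maximized log-likelihoods and parameter counts from Tables 7 and 8
fits <- data.frame(
  model  = c("Independence", "Gauss", "Clayton", "Gumbel", "Frank"),
  npar   = c(75, 76, 76, 76, 76),
  loglik = c(-41383.91, -40466.36, -40460.20, -40466.65, -40465.19)
)

# AIC = 2 * (number of parameters) - 2 * (maximized log-likelihood)
fits$AIC <- 2 * fits$npar - 2 * fits$loglik

fits[order(fits$AIC), ]                  # models ranked from best to worst
best <- fits$model[which.min(fits$AIC)]  # model with the lowest AIC
```

The gap between the independence model and any of the copula models is far larger than the gaps among the four copula families.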

Now that the best fitting model is found, the estimated parameters $\hat{\delta}$ and $\hat{\theta}$ can directly be taken from the output of the `copreg' function (Kramer & Silvestrini, 2012), and the vector parameters $\hat{\mu}$ and $\hat{\lambda}$ can be calculated from the estimated coefficient vectors $\hat{\alpha}$ and $\hat{\beta}$ by applying the theory in Section 2.4.2. Afterwards the estimated parameters and the first partial derivative of Clayton's copula (see Table 2) are substituted in Equation 2.16. This results in the following best fitting joint density function for average claim sizes and claim numbers:

$$
\begin{aligned}
f_{X,Y}(x,y) &= f_X(x)\Big(D_1\big(F_X(x), F_Y(y) \mid \hat{\theta}\big) - D_1\big(F_X(x), F_Y(y-1) \mid \hat{\theta}\big)\Big) \\
&= f_X(x)\Big(\big(F_X(x)^{-\hat{\theta}} + F_Y(y)^{-\hat{\theta}} - 1\big)^{-1/\hat{\theta}-1} F_X(x)^{-\hat{\theta}-1} \\
&\qquad\quad - \big(F_X(x)^{-\hat{\theta}} + F_Y(y-1)^{-\hat{\theta}} - 1\big)^{-1/\hat{\theta}-1} F_X(x)^{-\hat{\theta}-1}\Big),
\end{aligned}
\tag{4.5}
$$

where $F_X(x)$ is the Gamma cumulative distribution function and $f_X(x)$ the Gamma probability density function with estimated mean parameter $\hat{\mu}$ and estimated dispersion parameter $\hat{\delta}$, and where $F_Y(y)$ is the ZTP cumulative distribution function with estimated parameter $\hat{\lambda}$. Setting $\hat{\theta} = 0$, understood as the limit $\hat{\theta} \to 0$ for Clayton's copula, yields the independence model.
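Equation (4.5) can be evaluated with a few lines of base R. The sketch below is not the implementation of the `CopulaRegression' package: the helper names are hypothetical, the parameter values are invented for a single policy, and the Gamma margin is assumed to be parameterised by its mean mu and dispersion delta with variance delta * mu^2.

```r
# Gamma margin with mean mu and dispersion delta (assumed: variance = delta * mu^2)
pgam <- function(x, mu, delta) pgamma(x, shape = 1 / delta, rate = 1 / (mu * delta))
dgam <- function(x, mu, delta) dgamma(x, shape = 1 / delta, rate = 1 / (mu * delta))

# ZTP cumulative distribution function, with F_Y(y) = 0 for y < 1
pztp_sketch <- function(y, lambda) {
  ifelse(y < 1, 0, (ppois(y, lambda) - exp(-lambda)) / (1 - exp(-lambda)))
}

# First partial derivative of Clayton's copula with respect to u
D1_clayton <- function(u, v, theta) {
  (u^(-theta) + v^(-theta) - 1)^(-1 / theta - 1) * u^(-theta - 1)
}

# Joint density of average claim size x and claim number y, Equation (4.5)
djoint <- function(x, y, mu, delta, lambda, theta) {
  u <- pgam(x, mu, delta)
  dgam(x, mu, delta) *
    (D1_clayton(u, pztp_sketch(y, lambda), theta) -
     D1_clayton(u, pztp_sketch(y - 1, lambda), theta))
}

# Hypothetical parameter values for a single policy
mu <- 2000; delta <- 1.3; lambda <- 0.15; theta <- 0.35
x  <- 1500

# Summing over all claim numbers should return the Gamma density at x
marginal <- sum(djoint(x, 1:60, mu, delta, lambda, theta))
```

The sum over y telescopes to D1(u, 1) - D1(u, 0) = 1, so recovering the Gamma density is a quick consistency check on the formula.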

Furthermore, the estimated parameters are substituted in the model for policy losses in Equation 2.33, which results in the following probability density function for the policy loss:

$$
f_L(l \mid \hat{\mu}, \hat{\delta}, \hat{\lambda}, \hat{\theta}) = \sum_{y=1}^{\infty} \Big[D_1\big(F_X(\tfrac{l}{y} \mid \hat{\mu}, \hat{\delta}), F_Y(y \mid \hat{\lambda}) \mid \hat{\theta}\big) - D_1\big(F_X(\tfrac{l}{y} \mid \hat{\mu}, \hat{\delta}), F_Y(y-1 \mid \hat{\lambda}) \mid \hat{\theta}\big)\Big] \frac{1}{y} f_X(\tfrac{l}{y} \mid \hat{\mu}, \hat{\delta}).
\tag{4.6}
$$

When filling in $D_1$ for Clayton's copula and omitting the parameters, this equals

$$
\sum_{y=1}^{\infty} \Big[\big[F_X(\tfrac{l}{y})^{-\hat{\theta}} + F_Y(y)^{-\hat{\theta}} - 1\big]^{-1/\hat{\theta}-1} F_X(\tfrac{l}{y})^{-\hat{\theta}-1} - \big[F_X(\tfrac{l}{y})^{-\hat{\theta}} + F_Y(y-1)^{-\hat{\theta}} - 1\big]^{-1/\hat{\theta}-1} F_X(\tfrac{l}{y})^{-\hat{\theta}-1}\Big] \frac{1}{y} f_X(\tfrac{l}{y}), \quad \text{for } l > 0.
\tag{4.7}
$$

Now $\mu_{L_i}$ and $\sigma^2_{L_i}$ can be calculated for both the independence and the dependence model and summed over all policies to obtain the expected total loss $\mu_T$ and the total loss variance $\sigma^2_T$. For the independence model this results in an expected total loss of 9,446,977 with a standard deviation of 75,621.35. For the dependence model the expected total loss is 9,581,179 with a standard deviation of 13,563.9. The real total loss for all policies in the dataset equals 9,314,604, so the independence model gives the closest approximation of the total loss (about 1.4% above the real total loss, against about 2.9% for the dependence model). However, the standard deviation of the independence model is more than five times that of the dependence model. The dependence model thus combines a close approximation of the total loss with a smaller standard deviation, and because its estimate is higher it is the more conservative approximation. For car insurance companies it is preferable to reduce the uncertainty in the estimation of the company's loss, so the dependence model with Clayton's copula is preferred. This model also resulted in the lowest AIC, so according to that criterion it gives the most balanced fit to the original dataset.
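For the independence model the policy loss moments have a closed form, which gives a way to cross-check the truncated sums used in the appendix. The sketch below assumes a Gamma claim size with mean mu and variance delta * mu^2 and a ZTP claim number; the per-policy parameter values and the helper names are hypothetical.

```r
# Closed-form ZTP moments: E[Y] = lambda / (1 - exp(-lambda)) and
# E[Y^2] = (lambda + lambda^2) / (1 - exp(-lambda))
ztp_mean <- function(lambda) lambda / (1 - exp(-lambda))
ztp_var  <- function(lambda) {
  (lambda + lambda^2) / (1 - exp(-lambda)) - ztp_mean(lambda)^2
}

# Mean and variance of L = X * Y for independent X and Y:
# Var(XY) = Var(X) Var(Y) + Var(X) E[Y]^2 + E[X]^2 Var(Y)
loss_moments_indep <- function(mu, delta, lambda) {
  ey <- ztp_mean(lambda)
  vy <- ztp_var(lambda)
  vx <- delta * mu^2
  list(mean = mu * ey, var = vx * vy + vx * ey^2 + mu^2 * vy)
}

# Hypothetical per-policy parameters (illustration only)
mu     <- c(1800, 2100, 2500)
lambda <- c(0.10, 0.15, 0.30)
delta  <- 3.0

m <- loss_moments_indep(mu, delta, lambda)
total_mean <- sum(m$mean)        # expected total loss
total_sd   <- sqrt(sum(m$var))   # policies are treated as independent
```

Summing the variances to obtain the total loss variance relies on the policies being independent of each other.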

In the end, the method using the R-package `CopulaRegression' (Kramer & Silvestrini, 2012) and some additional R-functions has delivered a best fitting model for the dataset used in this thesis.


Conclusion

The focus in this thesis was on finding a best fitting model for the dataset through copula regression using the R-package `CopulaRegression' (Kramer & Silvestrini, 2012). By selecting the significant covariates via analysis of deviance, only the relevant covariates `vehicle value', `vehicle body', `gender', `area' and `age category' were used. These were passed to the function `copreg' via a design matrix constructed from these covariates. Through this function, Maximum Likelihood Estimation was then applied jointly to the GLMs for the marginal distributions of the two variables claim numbers and claim sizes and to the copula function representing the dependence between the two variables. When analysing the results of the four different copula families, the Clayton copula turned out to have the lowest AIC. The next step was to compare the results of the Clayton copula with the results of the independence model; in this case too, the AIC for Clayton's copula was the lowest. To check the results of both the independence model and the dependence model with Clayton's copula, the total loss was calculated. The dependence model with Clayton's copula gave the most conservative result with a close fit. So if a best fitting model through copula regression is needed for the calculation of the total loss of the policies in this dataset, the Clayton model with the parameters estimated by this method can be used.

Altogether, the bivariate copula-based regression method using the R-package `CopulaRegression' (Kramer & Silvestrini, 2012) gives the ability to find a best fitting model where dependence amongst claim numbers and claim sizes is allowed and where the various effects of relevant covariates are taken into account.

5.1 Further Research

For further research it might be better to use data from a larger insurance company and over a longer time period, to see what results the method leads to when a dataset with a large number of actual claims is analysed. Data with policies covering a period longer than a year could also be used.

Further, the assumed relevant covariates for the claim numbers with ZTP model were not validated through an analysis of deviance. It would be better if the choice of covariates for both variables was validated, however.

In this thesis I made use of the R-package in which the theory of the method is implemented and in which only four copula families are considered. It might be worthwhile to also consider other copula families in future studies and test whether they provide a better fit in terms of AIC or some other criterion.

Also in this thesis, the marginals for claim sizes and claim numbers were, as in the article of Kramer et al. (2013), assumed to be Gamma and ZTP distributed respectively. In a continuation of this thesis it might be valuable to consider other marginal distributions as well.

For future studies this R-package might be extended to include these extra options, but already the package and the method provide a useful tool to model claim numbers and claim sizes together while taking their dependency and covariates into account.

The combination of two of today's most popular statistical analysis techniques, GLM regression and copula representation of multivariate probability distributions, provides a flexible working method for car insurance companies to model their claims data and thus their policy losses. This adaptable combination of techniques, which allows for dependency and is broadly applicable, might well be exploited further to obtain even more convenient and efficient techniques for statistical analysis in the future.


Dominiak, A. W., & Trzesiok, M. (2014). A Collection of Insurance Datasets Useful in Risk Classification in Non-life Insurance. R-package version 1.0.

Durante, F., & Sempi, C. (2015). Principles of Copula Theory. Boca Raton: Chapman and Hall/CRC.

Kaas, R., Goovaerts, M., Dhaene, J., & Denuit, M. (2008). Modern Actuarial Risk Theory (2nd ed.). Heidelberg: Springer.

Kramer, N., & Silvestrini, D. (2015). Bivariate Copula Based Regression Models. R-package version 0.1-5.

Kramer, N., Brechmann, E. C., Silvestrini, D., & Czado, C. (2013). Total loss estimation using copula-based regression models. Insurance: Mathematics and Economics (53), 829-839.

Lundberg, F. (1903). Approximerad Framställning af Sannolikhetsfunktionen. II. Återförsäkring af Kollektivrisker. Uppsala: Almqvist & Wiksells Boktr.

Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized Linear Models. Journal of the Royal Statistical Society. Series A (General), 135 (3), 370-384.

Nelsen, R. B. (2006). An Introduction to Copulas. New York: Springer.

Parsa, R. A., & Klugman, S. A. (2011). Copula Regression. Variance, 45-54.

R Development Core Team (2012). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/. ISBN 3-900051-07-0.

R Development Core Team (2012). The R Stats Package. R-package version 3.4.0.

Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Stat. Univ. Paris (8), 229-231.


In this appendix, the R code with which the copula-based regression method is implemented on the automobile claims data using the R-package `CopulaRegression' (Kramer & Silvestrini, 2012) is shown and explained.

rm(list=ls(all=TRUE))

# First the required packages are loaded
library(insuranceData)
library(VGAM)
library(CopulaRegression)
library(SimplicialCubature)

# Then the dataset `dataCar' from the package `insuranceData' is read
data(dataCar)

# Only the policies with actual claims are selected from the data
claimsdata <- dataCar[dataCar$clm==1,]

# Vector containing claim numbers is defined
p4 <- claimsdata[,4]

# Vector containing total claim sizes is defined
p5 <- claimsdata[,5]

# Changing p5 from total claim size to average claim size per policy
p5 <- p5/p4

# Defining the length of the actual claims data vectors
N4 <- length(p4)
N5 <- length(p5)

# MME for the lambda parameter of the ZTP distribution
meanp4 <- mean(p4)
f <- function(lambda){lambda/(1-exp(-lambda))-meanp4}
# The root is searched on a positive interval, since lambda > 0
# and f is undefined at lambda = 0
MME_lambda <- uniroot(f,c(10^-8, 10^9))$root

# A histogram is produced comparing the original density
# of the data to the estimated density
hist(p4,freq=FALSE, breaks=c(0,0.5,1.5,2.5,3.5,4),ylim=c(0,1),
     xlim=c(0,4),xlab = "Claim Number per Policy",
     main = "ZTP Model Claim Numbers")
curve(dztp(x, lambda=MME_lambda), add=TRUE, col="darkblue", lwd=2)

# MME for beta (inverse scale par) of the gamma model
MME_beta <- mean(p5)/var(p5)

# MME for alpha (shape par) of the gamma model
MME_alpha <- mean(p5)*MME_beta

# A histogram is produced comparing the original density
# of the data to the estimated density
hist(p5,freq=FALSE, breaks=20,ylim=c(0,0.0004),
     xlim=c(0,56000),xlab = "Claim Size per Policy",
     main = "Gamma Model Claim Sizes")
curve(dgamma(x, shape=MME_alpha, rate=MME_beta), add=TRUE,
      col="darkblue", lwd=2)

# Arranging vehicle values in categories in vector veh_val
veh_val <- ceiling(claimsdata[,1])

# Adding this covariate to the dataset
claimsdata <- cbind(claimsdata,veh_val)

# Defining exposure vector
expo <- claimsdata[,2]

# Defining covariate vectors
veh_value <- claimsdata[,12]
veh_body <- claimsdata[,6]
veh_age <- claimsdata[,7]
gender <- claimsdata[,8]
area <- claimsdata[,9]
age_cat <- claimsdata[,10]


# Making sure the category numbers represent factors
veh_value <- as.factor(veh_value)
veh_age <- as.factor(veh_age)
age_cat <- as.factor(age_cat)

# Analysis of deviance for covariates relevant to claimsize
anova(glm(p5 ~ 1+veh_value+veh_body+veh_age+gender+area+age_cat,
          family=Gamma(link=log),weights=p4))

# Scalefactor = 7359.6/4584=1.605497

# Obtaining quantiles of Chi-squared distribution
qchisq(0.95,13) # =22.36203 veh_value is significant with 95% confidence
qchisq(0.95,12) # =21.02607 veh_body is significant
qchisq(0.95,3)  # =7.814728 veh_age is not significant
qchisq(0.95,1)  # =3.841459
qchisq(0.95,5)  # =11.0705

# Removing veh_age from the model results in the following,
# all covariates are significant, for delta(deviance)/scale_factor >
# qchisq(delta(df)) with 95% confidence
anova(glm(p5 ~ 1+veh_value+veh_body+gender+area+age_cat,
          family=Gamma(link=log),weights=p4))

# Defining the design matrix for claimsize with significant covariates
R <- claimsdata[,c(6,8,9,10,12)]
R <- model.matrix(~ veh_value+veh_body+gender+area+age_cat,data=R)
# One design matrix R will be used for both variables claimsizes
# and claimnumbers

# Now the copula regression is performed for each copula family
# Gauss' Copula fit with R as design matrix
copreg(p5,p4,R=R,S=R,family=1,exposure = expo)
# llGauss <- -40466.36
# npar <- 76
# AIC <- 81084.72

# Clayton's Copula fit with R as design matrix
copreg(p5,p4,R=R,S=R,family=3,exposure = expo)
# llClayton <- -40460.2
# npar <- 76
# AIC <- 81072.4

# Gumbel's Copula fit with R as design matrix
copreg(p5,p4,R=R,S=R,family=4,exposure = expo)
# llGumbel <- -40466.65
# npar <- 76
# AIC <- 81085.3

# Frank's Copula fit with R as design matrix
copreg(p5,p4,R=R,S=R,family=5,exposure = expo)
# llFrank <- -40465.19
# npar <- 76
# AIC <- 81082.38

# So the copula that gives the lowest AIC value is Clayton's copula,
# so Clayton's copula is preferred.

# Now the results of the independence model are compared to the
# results of dependence with Clayton's copula.
# The independence model (estimates referred to as 'parameter0')
# using Clayton's copula results in the following estimates:
# ll0 <- -41383.91
# npar0 <- 75
# AIC0 <- 82917.82

# So Clayton's copula is preferred over the independence model,
# for it results in the lowest AIC.

# Now the resulting loss using both dependence and independence
# will be compared.
# Again the independence model estimates are referred to as
# 'parameter0'.

# These are the alpha0 and beta0 returned by `copreg' for Clayton's copula;
# the fit is stored once so that the model is not estimated twice
fitClayton <- copreg(p5,p4,R=R,S=R,family=3,exposure = expo)
alpha0 <- fitClayton$alpha0
beta0 <- fitClayton$beta0

delta0 <- 3.013296
theta0 <- 0 #(independence)

# Mu and lambda are calculated for the independence model
mu0 <- c(1:N5)
for(i in 1:N5){
  mu0[i] <- exp(R[i,]%*%alpha0)}
lambda0 <- c(1:N5)
for(i in 1:N5){
  lambda0[i] <- exp(log(expo[i])+R[i,]%*%beta0)}

# Estimation of mean and variance of the claim number per policy,
# needed for the total loss under independence
mean.y <- c(1:N5)
var.y <- c(1:N5)
for(i in 1:N5){mean.y[i] <- sum(1:500*dztp(1:500,lambda0[i]))}
for(i in 1:N5){exp.value.y_2 <- sum((1:500)^2*dztp(1:500,lambda0[i]))
  # The mean of policy i is subtracted, not the whole vector mean.y
  var.y[i] <- (exp.value.y_2-mean.y[i]^2)
}

# Resulting mean for total loss with the independence model
muL0 <- mu0%*%mean.y # reported in the thesis: 9,446,977
VarXiYi0 <- c(1:N5)
for(i in 1:N5){
  VarXiYi0[i] <- delta0*mu0[i]^2*var.y[i] +
    delta0*mu0[i]^2*mean.y[i]^2 + mu0[i]^2*var.y[i]}

# Resulting total loss variance and standard deviation for the
# independence model
VarL0 <- sum(VarXiYi0) # reported in the thesis: 5718589112
sdL0 <- sqrt(VarL0)    # reported in the thesis: 75621.35

# Continuing on the case where we use Clayton's Copula fit with R as
# design matrix for both claimsize and claimnumber

# All estimated parameters are defined
npar <- 76

# These are the alpha and beta returned by `copreg' for Clayton's copula
fitClayton <- copreg(p5,p4,R=R,S=R,family=3,exposure = expo)
alpha <- fitClayton$alpha
beta <- fitClayton$beta
delta <- 1.286966
theta <- 0.3531286
mu <- c(1:N5)
for(i in 1:N5){
  mu[i] <- exp(R[i,]%*%alpha)}
lambda <- c(1:N5)
for(i in 1:N5){
  lambda[i] <- exp(log(expo[i])+R[i,]%*%beta)}

# Estimation of the mean and variances of the individual losses
# using the package `CopulaRegression'; the output is stored in eplClayton
eplClayton <- epolicy_loss(mu,delta,lambda,theta,family=3,y.max=300,
                           compute.var = TRUE)
MeanLossesClayton <- eplClayton$mean
VarLossesClayton <- eplClayton$var

# Estimation of means and variance of total loss
muLClayton <- sum(MeanLossesClayton) # 9,581,179
VarLClayton <- sum(VarLossesClayton) # -21019032654, incorrect

# So there is a flaw in the code for the calculation of variances so
# it will be calculated 'manually'

# Calculating loss from original data
Realloss <- p5%*%p4 # 9,314,604

# Calculating epolicy_loss for dependence 'manually' by loop over
# each element over the vector of policies

# Calculating means vector
Mean.loss.dep <- c(1:N5)
for(i in 1:N5){
  exp.value.l <- function(l){l*dpolicy_loss(l,mu[i],
    delta,lambda[i],theta,family = 3)}
  Mean.loss.dep[i] <- integrate(exp.value.l,0,Inf)$value
}
Sum.loss.dep <- sum(Mean.loss.dep) # 9,581,179 in accordance with epolicy_loss function

# Calculating variance vector
Var.loss.dep <- c(1:N5)
for(i in 1:N5){
  exp.value.l <- function(l){l*dpolicy_loss(l,mu[i],
    delta,lambda[i],theta,family = 3)}
  exp.value.l_2 <- function(l){l^2*dpolicy_loss(l,mu[i],
    delta,lambda[i],theta,family = 3)}
  Var.loss.dep[i] <- integrate(exp.value.l_2,0,Inf)$value-
    integrate(exp.value.l,0,Inf)$value^2
}
Sum.var.dep <- sum(Var.loss.dep) # var=183979325, sd=13563.9 which is smaller than with independence