Explaining the claim frequency in a personal liability insurance portfolio

Vera Makhan

Master’s Thesis to obtain the degree in Actuarial Science and Mathematical Finance
University of Amsterdam
Faculty of Economics and Business
Amsterdam School of Economics

Author: Vera Makhan
Student nr: 5742544
Email: veramakhan@gmail.com
Date: September 7, 2014

Supervisor: dr. K. Antonio
Second reader: prof. dr. R. Kaas


Abstract

a.s.r. is a Dutch insurance company. The claim frequency of personal liability insurance (PLI) at a.s.r. can be explained by personal features of the policyholder. To investigate which features have a significant impact, Generalized Linear Models (GLMs) can be used. However, GLMs do not account for the serial correlation present in a longitudinal dataset. Extensions of the GLM are available which do account for this serial correlation; Generalized Linear Mixed Models (GLMMs) and Generalized Estimating Equations (GEE) are two examples. GLMMs account for serial correlation by adding random effects to the model and require fully specified distributional assumptions. GEE accounts for serial correlation by adjusting the variance-covariance matrix and does not require fully specified distributional assumptions. In this thesis, these three methods are investigated to derive the explanatory variables of the claim frequency of PLI. We find that the significant variables resulting from the GLM do not differ from those of the GEE method. The standard errors of the GEE model are strictly smaller than those of the GLM, with an average decline of 7%. When looking at predictive power, GEE outperforms GLM by having both a smaller sum of squared prediction errors and a smaller sum of absolute prediction errors. The GLMM method could not be investigated for the data at hand, since it requires more memory resources than available. The outcomes of the GEE model are therefore taken as the final and best model outcomes of this study. From the outcomes of the GEE, some conclusions about the covariates can be drawn. First, the more densely populated a district is, the less likely a policyholder in it will claim. Second, the more policies a person has in other branches of a.s.r., the more likely he/she will claim. Next, policyholders living in a large variant of a housing type (for example a large apartment, or a large detached house) tend to claim more. Finally, not only the number of children in a household matters, but also the age of the eldest child in the household. Apparently, households with older children (13+ years) claim the most of all life stages.

Keywords: Claim frequency, Generalized Estimating Equations, Generalized Linear Mixed Models


Contents

1 Introduction

2 Theoretical framework
2.1 Claim frequency
2.2 General linear models vs Generalized linear models
2.3 Poisson distribution for claim frequencies
2.3.1 Poisson distribution
2.3.2 Poisson distribution with exposure
2.3.3 Maximum likelihood
2.4 Claim frequency
2.4.1 Poisson distribution and Maximum likelihood
2.4.2 GLM model for claim frequency
2.5 Serial correlation
2.6 Descriptive statistics
2.6.1 Cramer’s V correlation
2.7 Goodness of fit statistics
2.7.1 Deviance test
2.7.2 AIC and BIC
2.8 Software: SAS

3 Data
3.1 Data description
3.2 One way analyses
3.2.1 Variables of a.s.r.
3.2.2 Variables of Cendris
3.2.3 Variables of Experian
3.2.4 Conclusion

4 GLM analysis
4.1 Model derivation
4.2 GLM outcomes
4.3 Model refinements
4.3.1 GLIMMIX analysis
4.3.2 GEE analysis
4.3.3 CONTRAST statement

5 Conclusion

Appendix A: Cramer’s V correlation matrix

Appendix B: GLM outcomes

Bibliography


Introduction

a.s.r. a.s.r. is a Dutch insurance company offering different insurance products. It has over 4,000 employees. With a turnover of €4.3 billion, a.s.r. is one of the largest insurance companies in the Netherlands.

Motivation and current situation The portfolio of a.s.r. can be segmented into three groups:

• bad insureds
This group consists of a small percentage of policyholders who structurally cost the insurer a lot of money. Their claim frequency is significantly higher than average.

• average insureds
This is the largest group of policyholders. They claim on average.

• good insureds
This group contains a small percentage of policyholders who claim significantly less than average and are highly profitable for a.s.r.

One way to meet the increasing costs caused by the group of bad insureds is to increase the premium. Until now, a.s.r. has raised the premium for all insureds alike. In this way, the good insureds are penalized for the negative claiming behavior of the loss-making group. They become more likely to switch to another insurer where the premium is lower, while the bad insureds stay in the portfolio, because the increase in premium is smaller than their claim amounts.

Solution In August 2013 a.s.r. started the project Individual Pricing to resolve this problem. This project aims to shrink the group of bad insureds by penalizing them with a premium raise. To be able to do this, bad insureds have to be identified based on their historical claiming behavior.

Research question The claiming behavior of insureds can be measured by the claim frequency. Denuit et al. (2007) describe that a priori risk characteristics can explain the number of claims. The claim frequency is defined as the number of claims incurred per unit of exposure. It is interesting to investigate which features of a person can indicate a high claim frequency. These features can be used in the Individual Pricing project to recognize the bad insureds and they can also be used in the premium structure. This thesis will study variables which significantly explain the claim frequency. Generalized Linear Models (GLMs), developed by Nelder and Wedderburn (1972), are a popular and commonly used technique to research this type of problem. The standard GLM assumes that observations are independent. The available dataset however, contains several years of data per policyholder and thus several records per policyholder (longitudinal dataset). Observations of different policyholders can be assumed independent, but observations


corresponding to the same policyholder cannot be assumed independent. The standard GLM does not account for this serial correlation. Generalized estimating equations (GEEs) and generalized linear mixed models (GLMMs) are two extensions which do allow for correlation between observations. These two extensions will be used to improve the model.

Personal liability insurance To identify all bad insureds of a.s.r., all insurance portfolios should be investigated. This would be a time-intensive study. Therefore, this thesis will focus on one insurance product: personal liability insurance (PLI). PLI covers damage not intentionally done by the insured to persons and to the property of persons, up to €1,250,000 or €2,500,000 per incident (depending on the policyholder’s chosen coverage). It also covers damage to the insured caused by somebody else’s criminal offence. One can additionally insure damage done to a babysitter or his/her property by the insured’s children or animals. PLI has a relatively simple premium structure. Three variables determine the premium: the insured amount (€1,250,000 or €2,500,000 per incident), the composition of the household (single or multiple persons) and the deductible (€0, €100, €500, and €250 if the incident is caused by children/animals). This relatively simple structure of PLI makes it interesting to investigate whether more variables can be added. To do so, not only claim frequency but also claim severity has to be investigated. The outcomes of this study can be used in a similar study on claim severity to investigate whether more variables can be added to the structure.

Structure of thesis The main goal of this thesis is to identify the impact of risk factors on the expected claim frequency for personal liability insurance. To achieve this, the first goal is to examine which variables are suitable for inclusion in the model. This will be done with one-way analyses. Secondly, with this selection of features, a standard GLM analysis will be done. Adding and deleting covariates will lead to a final set of significant factors. Next, the GEE and GLMM methods will be used to account for serial correlation in the dataset. Similarities and differences in the outcomes of the analyses will be discussed to conclude which features indicate a higher than expected claim frequency.


Theoretical framework

2.1 Claim frequency

The main focus of this study is to investigate the claim frequency of PLI. The claim frequency is defined, for each level m of covariates, as the sum of total claims in that level divided by the total exposure in that level. In formula form, this can be written as follows:

\[ \text{Claim frequency}_m = \frac{\sum_{l=1}^{n} \text{Claim}_{ml}}{\sum_{l=1}^{n} \text{Exposure}_{ml}} \tag{2.1} \]

2.2 General linear models vs Generalized linear models

General linear models To study the effects of different variables on claim frequency in the personal liability insurance, linear regression analyses can be done. Anderson et al. (2007) describe the linear model. A linear regression model can be expressed as follows:

\[ Y_i = x_i \beta + \varepsilon_i \tag{2.2} \]

For observation i, with i = 1, ..., n, Y_i is the observed response variable, x_i = (1, x_2, ..., x_p) is a vector of p explanatory variables with the first 1 included for an intercept term, β = (β_1, ..., β_p)' is a vector of p corresponding parameters to be estimated by the model, and ε_i is the error term, which is assumed to be N(0, σ²) distributed. The linear model seeks to express the observed response variable Y as a linear combination of the parameters β plus an error term ε.

If one wants to model the observed claim frequency, a general linear model is not fully suitable because:

1. It assumes normally distributed random errors. Since the number of claims cannot take negative values, the random errors cannot be normally distributed.

2. The mean is linear in the regression parameters, thus of the form x_i β. Many insurance risks tend to vary multiplicatively with rating factors. A multiplicative model is easier to explain and easier to handle for an insurer, and therefore more plausible. (Anderson et al., 2007, p.10)

Generalized linear models Generalized linear models (GLMs) generalize the ordinary linear models in two directions. Firstly, the random errors are allowed to have distributions other than the normal. They can have any distribution from the exponential dispersion family, which includes, apart from the normal distribution, the Poisson, Gamma, (negative) binomial and inverse Gaussian distributions. Secondly, the mean is not a linear function of the covariates, but it may be linear on some other scale.


A generalized linear model consists of three components (Kaas et al., 2008):

1. stochastic component
The observations i = 1, ..., n of the dependent variable Y_i are assumed to be independent random variables with a density in the exponential dispersion family.

2. systematic component
This component is the linear predictor. It is a linear combination of the covariates x_ij and the coefficients β_j, with j = 1, ..., p indicating the number of covariates.

\[ \eta_i = \sum_{j=1}^{p} x_{ij}\beta_j, \qquad i = 1, \ldots, n \tag{2.3} \]

3. link function
The link function g(·) transforms the expectation µ_i of the dependent variable Y_i to the linear predictor η_i as follows:

\[ g(\mu_i) = \eta_i = \sum_{j=1}^{p} x_{ij}\beta_j + \xi_i \tag{2.4} \]

In the last formula ξ_i is the offset term. If the effect of an explanatory variable is known, one includes an offset term as a known effect in the model rather than estimating a parameter β for this covariate. (Anderson et al., 2007, p.18)

2.3 Poisson distribution for claim frequencies

2.3.1 Poisson distribution

A discrete random variable N has a Poisson distribution with parameter λ > 0 if, for k = 0, 1, 2, ..., the probability mass function of N is given by

\[ \Pr(N = k) = \frac{\lambda^k}{k!} e^{-\lambda} \]

A typical feature of the Poisson distribution is its equidispersion: the variance of the Poisson distribution is equal to its mean (Denuit et al., 2007, p.15):

\[ V[N] = E[N] = \lambda \]

When V[N] > E[N], the distribution is overdispersed. When V[N] < E[N], the distribution is underdispersed. (Denuit et al., 2007, p.4 and p.14)

2.3.2 Poisson distribution with exposure

It is possible that policyholders are insured for only a fraction of a year and not a whole year. If the fraction of the year is indicated with E, then the Poisson probability mass function can be modified as follows:

\[ \Pr(N = k) = \frac{(\lambda E)^k}{k!} e^{-\lambda E} \tag{2.5} \]


2.3.3 Maximum likelihood

The maximum likelihood method seeks the value of the parameter vector β that makes the observed data most likely to have occurred, given the statistical model. This value is called the maximum likelihood estimator (MLE) and for a parameter β is denoted \(\hat{\beta}\). The MLE is found by maximizing the likelihood function. The likelihood function L(β) is specified as follows:

\[ L(\beta) = \prod_{i=1}^{n} \Pr(N_i = k_i \mid \beta) \tag{2.6} \]

As a sum is mathematically simpler to maximize than a product, the logarithm of the likelihood function is maximized:

\[ \log L(\beta) = \sum_{i=1}^{n} \log \Pr(N_i = k_i \mid \beta) \tag{2.7} \]

Since the logarithm is a monotonic transformation, the value of the parameter that maximizes L(β) is the same as that which maximizes log L(β).

2.4 Claim frequency

2.4.1 Poisson distribution and Maximum likelihood

To model the claim frequency with a GLM, the distribution of the claim frequency has to be chosen. Because the number of claims can only take non-negative integer values and most insureds have zero claims, the Poisson distribution is commonly used for claim frequency. The distribution is corrected for exposure, since policyholders can be insured for fractions of years (see equation (2.5)). The likelihood function for a Poisson distribution with exposure is:

\[ L(\lambda; Y, E) = \prod_{i=1}^{n} \frac{(\lambda E_i)^{k_i}}{k_i!} e^{-\lambda E_i} \tag{2.8} \]

and the log-likelihood function is

\[ l(\lambda; Y, E) = \sum_{i=1}^{n} \left[ k_i \log(\lambda E_i) - \lambda E_i - \log(k_i!) \right] \tag{2.9} \]

The MLE \(\hat{\lambda}\) that maximizes (2.9) is then:

\[ \hat{\lambda} = \frac{\sum_{i=1}^{n} k_i}{\sum_{i=1}^{n} E_i} \tag{2.10} \]

Applying this to the whole dataset gives a \(\hat{\lambda}\) of 0.0467.
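As a sanity check on the closed-form estimator (2.10), one can compare it against a direct numerical maximization of the log-likelihood (2.9). The following Python sketch is illustrative only; the claim counts and exposures are made-up toy values, not the thesis data:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

# Hypothetical toy portfolio: claim counts k_i and exposures E_i.
k = np.array([0, 0, 1, 0, 2, 0, 1])
E = np.array([1.0, 0.5, 1.0, 0.75, 1.0, 0.25, 0.5])

def neg_loglik(lam):
    # Negative of eq. (2.9): sum_i [k_i log(lam E_i) - lam E_i - log(k_i!)]
    return -np.sum(k * np.log(lam * E) - lam * E - gammaln(k + 1))

lam_closed = k.sum() / E.sum()   # eq. (2.10)
lam_numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0),
                              method="bounded").x
# Both estimates agree up to numerical tolerance.
```

On the thesis dataset, the same closed-form estimator yields the reported \(\hat{\lambda}\) = 0.0467.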

Distribution of the data Figure 2.1 shows the empirical distribution of the number of claims and the Poisson distribution with λ = 0.0467. Since not all policies have an exposure of one, the sum of the exposures is used to determine the empirical distribution. The Poisson model fits the data well; it somewhat underestimates the probability of zero claims and overestimates the probability of one claim.

Next to comparing the distribution plots, to check whether the Poisson distribution can be used, one can compare the average annualized frequency mN with the empirical


Figure 2.1: Empirical distribution versus Poisson distribution of the number of claims.

variance \(S_N^2\) (Charpentier, 2014, p. 523). For a Poisson distribution, when no explanatory variables are taken into account, the dispersion parameter φ in equation (2.11) has to be equal to 1 (Kaas et al., 2008, p. 246).

\[ S_N^2 = \varphi \times m_N \tag{2.11} \]

\(m_N\) and \(S_N^2\) are defined as follows:

\[ m_N = \frac{\sum_{i=1}^{n} \text{Claims}_i}{\sum_{i=1}^{n} \text{Exposure}_i}, \qquad S_N^2 = \frac{\sum_{i=1}^{n} (\text{Claims}_i - m_N \times \text{Exposure}_i)^2}{\sum_{i=1}^{n} \text{Exposure}_i} \tag{2.12} \]

Applying the above formulas to the whole dataset gives the following outcomes:

\[ m_N = 0.0467, \qquad S_N^2 = 0.0494, \qquad \varphi = 1.0630 \]

Since the value of φ is close to 1, we decide to use the Poisson distribution in the model.
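The exposure-weighted quantities in (2.11)–(2.12) translate directly to code. A minimal Python sketch (the three-policy toy data in the example call are invented and are not the thesis portfolio):

```python
import numpy as np

def dispersion_check(claims, exposure):
    """Return (m_N, S_N^2, phi) as defined in eqs. (2.11)-(2.12)."""
    claims = np.asarray(claims, dtype=float)
    exposure = np.asarray(exposure, dtype=float)
    m_n = claims.sum() / exposure.sum()
    s_n2 = np.sum((claims - m_n * exposure) ** 2) / exposure.sum()
    return m_n, s_n2, s_n2 / m_n   # phi solved from eq. (2.11)

# Toy portfolio: three policies with 0/1/0 claims and partial exposures.
m_n, s_n2, phi = dispersion_check([0, 1, 0], [1.0, 1.0, 0.5])
```

On the thesis data these formulas give the reported values m_N = 0.0467, S_N² = 0.0494; a φ close to 1 supports the equidispersion assumption.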

2.4.2 GLM model for claim frequency

The dependent variable to investigate is the number of claims Y_i during an exposure period E_i. From section 2.3.2, this variable can be assumed \(Y_i \sim \mathrm{POI}(E_i \lambda_i)\). Section 2.2 described that the link function is a component of the GLM. For the Poisson distribution, the canonical link function is the logarithm. With this logarithmic link function, the model looks like:

\[ g(\lambda_i) = \log(\lambda_i) = \eta_i = \sum_{j=1}^{p} x_{ij}\beta_j \tag{2.13} \]

\[ \lambda_i = g^{-1}(\eta_i) = e^{\sum_{j=1}^{p} x_{ij}\beta_j} \tag{2.14} \]

and the distribution of Y_i becomes

\[ Y_i \sim \mathrm{POI}\left(e^{\sum_{j=1}^{p} x_{ij}\beta_j + \log(E_i)}\right) \tag{2.15} \]


Maximum likelihood method in the GLM model for claim frequency Now that the assumption for λ has been made and the distribution function, the link function and the offset term have been set, the maximum likelihood method can be used to derive the β_j’s. The maximum likelihood method seeks the β_j’s which produce the observed claims with the highest probability. Since \(\lambda_i = e^{\sum_{j=1}^{p} x_{ij}\beta_j}\), the log-likelihood function in (2.9) becomes:

\[ l(\beta; Y, E) = \sum_{i=1}^{n} \left[ Y_i \left( \sum_{j=1}^{p} x_{ij}\beta_j + \log(E_i) \right) - e^{\sum_{j=1}^{p} x_{ij}\beta_j} E_i - \log(Y_i!) \right] \]

Maximizing the log-likelihood function is done numerically when fitting the GLM. This procedure derives for every covariate class x_j a corresponding β_j.

2.5 Serial correlation

The standard GLM assumes the response variables Y_1, ..., Y_n to be independent. However, when one has a longitudinal dataset, the several years of data belonging to one policyholder are not independent of each other. The heterogeneity and serial correlation underlying the dataset lead to overdispersion. The overdispersion, in turn, causes underestimation of the standard errors of the estimated parameters (Hinde and Demétrio, 1998, p.1). Extensions to standard GLMs have been developed to allow for serial correlation between observations, such as generalized estimating equations (GEE) and generalized linear mixed models (GLMMs). These two extensions will be discussed in this section.

Generalized estimating equations Generalized estimating equations (GEE), developed by Liang and Zeger in 1986, is an extension of the standard GLM which corrects for serial correlation. GEE relaxes the assumption of independence by adjusting the variance-covariance matrix (Var-Cov) of the dependent variable for correlation. In this way, the Var-Cov matrix is no longer a diagonal matrix with, in the Poisson case, the λ_i as diagonal elements.

Denuit et al. (2007) explain the workings of the GEE method in their book Actuarial Modelling of Claim Counts. Every policyholder i generates a sequence of claim numbers \(N_i = (N_{i1}, \ldots, N_{iT_i})\). It is questionable to assume that the claims within the sequence N_i are independent, since these claims all belong to policyholder i. However, independence between the policyholders can be assumed. To account for this dependence, the Var-Cov of N_i can be adjusted. Denuit et al. (2007) explain that the idea behind the GEE method is to replace the Var-Cov in the independence case (say A_i) by the Var-Cov in the serial correlation situation V_i in the likelihood equations. For independent Poisson distributed data, the likelihood equations and Var-Cov A_i are:

\[ \sum_{i=1}^{n} X_i^T (n_i - E[N_i]) = 0 \tag{2.16} \]

\[ A_i = \begin{pmatrix} \lambda_{i1} & 0 & \cdots & 0 \\ 0 & \lambda_{i2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{iT_i} \end{pmatrix} \]

with the vector \(X_i = (X_{i1}, \ldots, X_{ip})\) containing all observable risk characteristics related to policyholder i. Noting that \(\frac{\partial}{\partial \beta} E[N_i] = A_i X_i\), the likelihood equations in (2.16) can be rewritten as

\[ \sum_{i=1}^{n} \left( \frac{\partial}{\partial \beta} E[N_i] \right)^T A_i^{-1} (n_i - E[N_i]) = 0. \tag{2.17} \]

When the Var-Cov A_i is replaced by the Var-Cov V_i, the likelihood equations for serially correlated data can be defined as equation (2.18):

\[ \sum_{i=1}^{n} \left( \frac{\partial}{\partial \beta} E[N_i] \right)^T V_i^{-1} (n_i - E[N_i]) = 0 \tag{2.18} \]

If V_i is defined as in equation (2.19), it takes overdispersion into account, since \(V[N_{it}] = \varphi \lambda_{it} > \lambda_{it} = E[N_{it}]\) when φ > 1. According to Denuit et al. (2007), V_i also takes the dependence between the observations into account, through the working correlation matrix \(R_i(\alpha)\). This is the essential matrix that the GEE method estimates. One can define the working correlation matrix oneself, or one can assume some structure for this matrix, for example AR(1) or unstructured.

\[ V_i = \varphi A_i^{1/2} R_i(\alpha) A_i^{1/2} \tag{2.19} \]

The β’s are computed as a solution of the likelihood equations in (2.18). These equations can be solved by iterating between a modified Fisher scoring for β and moment estimation of α and φ (Liang and Zeger, 1986, p.16). Denuit et al. (2007) describe this iterative process as follows:

1. An initial estimate of β is computed under the independence assumption.

2. Then the current working correlation matrix is computed, based on the estimated β, the standardized residuals and the assumed structure of \(R_i(\alpha)\).

3. Next, the Var-Cov V_i is estimated.

4. Finally, β is updated.

GEE is an alternative to the maximum likelihood method: since no full likelihood is specified, the likelihood and its derivatives cannot be computed. (Denuit et al., 2007, p. 101)

Generalized linear mixed models Next to GEE, the generalized linear mixed model (GLMM) is an extension of the GLM that accounts for serial correlation. Garrido and Zhou (2007) give a clear explanation of the GLMM. The GLMM assumes that the observations Y_ij are conditionally independent given the random effect U_i, and that they follow a distribution from the exponential dispersion family. Here i denotes the subject and j the observation within the subject. The random effects are assumed \(U_i \sim N(0, D)\), with D the Var-Cov. The variance of the observations Y_ij, conditional on the random effects, is given by \(V[Y_{ij} \mid U_i = u_i] = A_i^{1/2} R_i A_i^{1/2}\). In this formula, R_i is the variance-covariance matrix for the random effects and A_i is a diagonal matrix containing the variance functions of the model. These variance functions express the variance of Y_ij as a function of its mean µ_ij.

The GLMM adds a random effect next to the fixed effects in the linear predictor, as equation (2.20) shows. In this way an explicit probability model arises which explains the origin of the correlations. The GLMM uses the maximum likelihood method to determine the β’s. With the maximized value of the likelihood function, the AIC and BIC values can be derived.

\[ \eta_{ij} = x_{ij}\beta + t_{ij}u_i \tag{2.20} \]

In this equation \(\beta = (\beta_1, \ldots, \beta_p)'\) is the vector with fixed effects, \(u_i = (u_{i1}, \ldots, u_{iq})'\) is the vector with random effects, \(x_{ij} = (x_{ij1}, \ldots, x_{ijp})'\) are the covariates relating to the fixed effects and \(t_{ij} = (t_{ij1}, \ldots, t_{ijq})'\) are the covariates relating to the random effects.

The log-likelihood function for this GLMM takes the form:

\[ l(\beta, D) = -\frac{nk}{2}\ln(2\pi) - \frac{n}{2}\ln|D| + \sum_{i=1}^{n} \ln \int_{\mathbb{R}^k} e^{\,l_i(\beta, v) - \frac{1}{2} v' D^{-1} v}\, dv \tag{2.21} \]

where

\[ l_i(\beta, u_i) = \sum_{j=1}^{n_i} \left[ (x_{ij}\beta + t_{ij}u_i)y_{ij} - b(x_{ij}\beta + t_{ij}u_i) \right] \tag{2.22} \]

(Garrido and Zhou, 2007, p.65)

Garrido and Zhou (2009) sum up a few numerical techniques described by other authors to obtain the estimates, such as maximum likelihood with numerical quadrature and penalized quasi-likelihood by Demidenko (2004) and restricted pseudo-likelihood by Antonio and Beirlant (2007). They also describe that the log-likelihood function in (2.21) can be solved by two types of numerical algorithms. The first type is known as linearization methods, which are usually doubly iterative; Schabenberger and Gregoire (1996) list numerous algorithms based on Taylor series for the case of clustered data alone. The second type is based on integral approximations, which are singly iterative. Various techniques are used to compute the approximation: Laplace methods, quadrature methods, Monte Carlo integration, and Markov chain Monte Carlo methods. The main disadvantage of the first type is the absence of a true objective function for the overall optimization; its advantage is that it yields a simpler linearized model, for which fitting only the mean and variance of the linearized form is sufficient. The main disadvantage of the second type is that complex covariance structures cannot be fitted with techniques based on numerical integration, since the integrals become too complex. (Garrido and Zhou, 2007, p.66)
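As a small illustration of the quadrature-based integral approximations, the per-subject integral in (2.21) can be approximated with Gauss–Hermite quadrature. The Python sketch below is an assumption-laden toy (a single Poisson random intercept, invented observations and parameter values), not the thesis specification:

```python
import numpy as np
from scipy.special import gammaln

def marginal_loglik(y, beta0, sigma, n_nodes=30):
    """log of int prod_j Pois(y_j; exp(beta0 + u)) * N(u; 0, sigma^2) du,
    via Gauss-Hermite: int f(u) phi(u) du ~ pi^{-1/2} sum_k w_k f(sqrt(2) sigma x_k)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0) * sigma * nodes              # substitution u = sqrt(2) sigma x
    y = np.asarray(y, dtype=float)
    lam = np.exp(beta0 + u)[:, None]              # conditional Poisson means per node
    cond = np.sum(y * np.log(lam) - lam - gammaln(y + 1.0), axis=1)
    return np.log(np.sum(weights * np.exp(cond)) / np.sqrt(np.pi))

# With sigma -> 0 the random effect vanishes and the marginal log-likelihood
# reduces to the ordinary Poisson log-likelihood at beta0.
ll = marginal_loglik([1, 0, 2], beta0=0.0, sigma=1e-8)
```

This is the "integral approximation" family of algorithms mentioned above; GLIMMIX’s linearization methods take a different, doubly iterative route.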

GEE vs GLMM Jiang (2007) mentions disadvantages of GLMMs. One disadvantage of the GLMM is that the likelihood function typically does not have a closed-form expression and may involve high-dimensional integrals which cannot be derived analytically. Another concern about the GLMM is that misspecification of the model undermines the efficiency of the likelihood-based methods. Model misspecification often occurs in the analysis of longitudinal datasets. Non-likelihood-based methods such as GEE are therefore computationally attractive. Unlike the GLMM, GEE does not require a full model specification of the data distribution: GEE only requires specification of the mean functions to ensure consistency of the GEE estimator. Ballinger (2004) sums up some GEE drawbacks. If the number of subjects is small, the estimate of the variance produced under GEE can be highly biased. Errors in the parameter estimates can be made if the distribution of the dependent variable and the link function used to linearize the regression are misspecified. Although GEE can handle missing data, it assumes that the data are missing at random; the parameter estimates may be affected when the probability of missing data depends on previous values of the response variable.

2.6 Descriptive statistics

2.6.1 Cramer’s V correlation

When adding variables in the GLM, the correlation matrix will be used to see if variables correlate with each other. When this is the case, interaction effects have to be added in


the model and subsequently investigated for significance. Most of the available variables are either categorical or ordinal (see table 3.1). Categorical variables have two or more categories, without an intrinsic ordering of the categories. Ordinal variables are similar to categorical variables, except that the categories are ordered. Therefore the Cramer’s V correlation is used. Anderson et al. (2007) give the following formula to calculate the correlation:

\[ V = \sqrt{ \frac{ \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} }{ \min(a-1,\, b-1) \times n } } \tag{2.23} \]

where

a = number of levels of factor one
b = number of levels of factor two
n_ij = amount of the insurance period measure for the i-th level of factor one and the j-th level of factor two
\(n = \sum_{i,j} n_{ij}\)
\(e_{ij} = \frac{\sum_i n_{ij} \times \sum_j n_{ij}}{n}\)
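Equation (2.23) is straightforward to implement. A minimal Python sketch follows; the 2×2 example tables are invented toys, whereas the thesis applies the formula to exposure-weighted contingency tables of rating factors:

```python
import numpy as np

def cramers_v(table):
    """Cramer's V of eq. (2.23) for a contingency table with a rows, b columns."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    # Expected cell counts e_ij = row total * column total / n.
    e = np.outer(t.sum(axis=1), t.sum(axis=0)) / n
    chi2 = np.sum((t - e) ** 2 / e)
    a, b = t.shape
    return np.sqrt(chi2 / (min(a - 1, b - 1) * n))

v_perfect = cramers_v([[10, 0], [0, 10]])   # perfectly associated factors -> 1
v_none = cramers_v([[5, 5], [5, 5]])        # independent factors -> 0
```

Values near 1 would flag factor pairs whose interaction effects should be considered in the GLM.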

2.7 Goodness of fit statistics

The following goodness of fit statistics can be used to rank different GLMs.

2.7.1 Deviance test

Deviance derivation To validate a GLM, one looks at the difference between the optimal likelihood of the concerned model and the maximally attainable likelihood of the full model, which has a parameter for every observation. If the likelihood ratio is defined as the maximized likelihood under the concerned model divided by the likelihood of the full model, the scaled deviance is -2 times the logarithm of this ratio. When the scaled deviance is multiplied by the dispersion parameter φ, one obtains the deviance D. For a Poisson distribution, the dispersion parameter is equal to one and thus vanishes from the equation.

\[ \frac{D}{\varphi} = -2 \log \frac{\hat{L}}{\tilde{L}} \tag{2.24} \]

where \(\hat{L}\) is the likelihood of the concerned model and \(\tilde{L}\) is the likelihood of the full model. (Kaas et al., 2008, p. 246)

Deviance test To compare two nested models, the analysis of deviance can be used. Two models are nested if one model arises by restricting a parameter in a more complex model to be zero. For example, let model 2 be an expansion of model 1, obtained by adding an explanatory variable to model 1. The addition of the explanatory variable causes a reduction of the degrees of freedom equal to the number of levels of the additional variable minus one (in the case of a continuous variable, it is equal to one). On the other hand, the maximized likelihood will increase due to the addition of an explanatory variable. To decide whether model 2 fits significantly better than model 1, the gain in likelihood and the loss in degrees of freedom have to be compared with each other. With the analysis of deviance, the difference in deviance ∆D of the two nested models is tested against the critical value of a χ² distribution, because the ∆D statistic is χ²(k)


distributed, with k the number of extra parameters estimated (Kaas et al., 2008, p.248).

\[ \Delta D = D_1 - D_2 = -2\log\frac{\hat{L}_1}{\tilde{L}} - \left(-2\log\frac{\hat{L}_2}{\tilde{L}}\right) = -2\log\hat{L}_1 + 2\log\tilde{L} - 2\log\tilde{L} + 2\log\hat{L}_2 = -2\log\hat{L}_1 + 2\log\hat{L}_2 \]

If the drop in deviance exceeds the χ² critical value, model 2 fits significantly better and model 1 is rejected. If this is not the case, one can only conclude that the null hypothesis that the extra parameters are actually equal to zero is not rejected.
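The deviance test fits in a few lines of Python; the log-likelihood values in the example call are invented purely for illustration:

```python
from scipy.stats import chi2

def deviance_test(loglik1, loglik2, k, alpha=0.05):
    """Analysis of deviance for nested models 1 (smaller) and 2 (larger):
    Delta D = -2 log L1 + 2 log L2 is chi^2(k) distributed under the null
    that the k extra parameters are zero."""
    delta_d = -2.0 * loglik1 + 2.0 * loglik2
    p_value = chi2.sf(delta_d, df=k)
    return delta_d, p_value, bool(p_value < alpha)

# Toy example: one extra parameter raises the log-likelihood from -105 to -103.
dd, p, reject = deviance_test(-105.0, -103.0, k=1)
```

Here ΔD = 4.0 on one degree of freedom, so the extra parameter would (just) be judged significant at the 5% level.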

2.7.2 AIC and BIC

Two other methods to test whether model 2 is an improvement over model 1 are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). These two methods penalize the inclusion of too many parameters. Even though a model with many parameters will fit the data excellently, it will have limited utility due to the few degrees of freedom it has left. If n is defined as the number of observations, L the log-likelihood of the concerned model and k the number of parameters, AIC and BIC can be derived as follows (Kaas et al., 2008, p.248):

AIC = −2L + 2k (2.25)

BIC = −2L + k log n (2.26)
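Equations (2.25) and (2.26) translate directly to code. A minimal Python sketch (the log-likelihood, parameter count and sample size below are toy values):

```python
import math

def aic(loglik, k):
    # eq. (2.25): AIC = -2L + 2k
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    # eq. (2.26): BIC = -2L + k log n
    return -2.0 * loglik + k * math.log(n)

# Toy comparison: BIC's k log n penalty exceeds AIC's 2k once n > e^2 ~ 7.4,
# so for realistic portfolio sizes BIC punishes extra parameters harder.
a = aic(-100.0, 5)
b = bic(-100.0, 5, 100)
```

Lower values indicate the preferred model under either criterion.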

2.8 Software: SAS

The analyses will be done in SAS (Statistical Analysis System). As the name says, SAS is a software package a user can use to perform statistical analyses. It contains ready-to-use procedures and one can also program manually. GLM, GEE and GLMM models are available as procedures in SAS. This section shows the procedures used in this thesis.

GLM SAS code A standard GLM can be fitted in SAS with the GENMOD procedure.

PROC GENMOD DATA=dataset;
  CLASS Classified_variable1 (ref=first) Classified_variable2 (ref="5")
        Classified_variable3 (ref=last) / param=ref;
  MODEL Nclaims = Classified_variable1 | Classified_variable2 Classified_variable3
        / LINK=log DIST=poisson OFFSET=lexposure TYPE1 TYPE3;
  CONTRAST 'hypothesis' variable 0 0 -1 1 0 0 0 0 0;
RUN;

In the CLASS statement, the categorical variables have to be mentioned and one can define the reference levels. In this case, the reference levels are the levels which contain the most exposure and thus represent the major part of the portfolio. The default is REF=LAST: when the REF option is not used, SAS uses the last level of the variable as reference level. In the MODEL statement, the "|" sign between variables requests both the single direct effects and the interaction effects of the variables. The logarithmic link function, Poisson distribution and offset variable are explicitly defined in the procedure. The CONTRAST statement is added to the code to test the hypothesis that β_i = β_j; with this, insignificant levels of covariates can be regrouped.


The CONTRAST statement tests this hypothesis. Finally, the TYPE1 and TYPE3 options tell SAS to perform these two sums-of-squares significance tests. The Type I SS test depends on the order in which the model terms are fit; the Type III SS test is model-order independent.

GEE SAS code The GEE method is, as said before, an extension of the standard GLM. In SAS, the GEE method can be invoked with the REPEATED statement, which specifies the covariance structure of multivariate responses for GEE model fitting in the GENMOD procedure.

PROC GENMOD DATA=dataset;
  CLASS Classified_variable1 (ref=first) Classified_variable2 (ref="5")
        Classified_variable3 (ref=last) /param=ref;
  MODEL Nclaims= Classified_variable1 | Classified_variable2 Classified_variable3
        /LINK=LOG DIST=POISSON offset=lexposure type3;
  REPEATED SUBJECT=polisnr /WITHINSUBJECT=jaar TYPE=unstr CORRW;
RUN;

The SUBJECT option defines the effect within which the serial correlation is present; each value of this effect identifies a new subject or cluster. If the dataset is not sorted and not all subjects have measurements for all years, the WITHINSUBJECT option must be used: the effect defined there specifies the order of the measurements within subjects. Then, as mentioned in section 2.5, the structure of the working correlation matrix has to be defined in a GEE method. This is done with the TYPE option. Available structures are: unstructured, autoregressive, exchangeable, independent and m-dependent, or one can define the matrix oneself with the fixed(matrix) or user(matrix) type. With the CORRW option, the estimated working correlation matrix is displayed.
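The named structures differ only in how many parameters they spend on the within-subject correlation. As a rough illustration (plain Python, not SAS; the correlation value 0.3 and the four yearly measurements are made up), the exchangeable, AR(1) and independent matrices look like:

```python
# Illustrative working correlation matrices for T yearly observations
# per policy; alpha is a hypothetical estimated correlation parameter.
T, alpha = 4, 0.3

# Exchangeable: the same correlation between any two years.
exch = [[1.0 if i == j else alpha for j in range(T)] for i in range(T)]

# First-order autoregressive: correlation decays as alpha^|i-j|.
ar1 = [[alpha ** abs(i - j) for j in range(T)] for i in range(T)]

# Independent: the identity matrix, i.e. an ordinary GLM.
ind = [[1.0 if i == j else 0.0 for j in range(T)] for i in range(T)]

print(exch[0])                           # [1.0, 0.3, 0.3, 0.3]
print([round(x, 3) for x in ar1[0]])     # [1.0, 0.3, 0.09, 0.027]
```

The unstructured type estimates every off-diagonal entry separately, which is the most flexible choice but also the most expensive in parameters.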

GLMM SAS code    Similar to GEE, GLMM is also an extension of the GLM method: it adds random effects to the model. In contrast to GEE, in a GLMM the distribution of the observations is fully specified, which makes it possible to estimate the β's by maximum likelihood. In SAS, GLMM has its own procedure, GLIMMIX.

PROC GLIMMIX DATA=dataset METHOD=RSPL IC=PQ;
  CLASS Classified_variable1 Classified_variable2 Classified_variable3;
  MODEL Nclaims= Classified_variable1 | Classified_variable2 Classified_variable3
        /OFFSET=lexposure DIST=poisson;
  RANDOM int /SUBJECT=polisnr;
RUN;

The meaning of the statements is in line with the GENMOD statements. RSPL (Residual Subject-based Pseudo-Likelihood) is the default estimation method of GLIMMIX. With IC one can determine how the information criteria in the goodness-of-fit statistics table are computed. SAS explains that the IC=PQ option requests that the penalties include the number of fixed-effects parameters when estimation in models with random effects is based on a residual (restricted) likelihood.


Cramer’s V correlation SAS code    The Cramer’s V correlation can be calculated in SAS with the FREQ procedure, taking exposure as weight.

proc freq data=dataset;
  weight Exposure;
  tables variable_1*(other variables) /chisq noprint;
run;
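For intuition, Cramer’s V can also be computed by hand from a (possibly exposure-weighted) contingency table as sqrt(chi² / (n · (min(r, c) − 1))). A minimal sketch in plain Python (the 2×2 tables below are made up):

```python
import math

def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows.

    Cell values may be exposure weights rather than raw counts.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Pearson chi-square statistic against independence.
    chi2 = sum(
        (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(len(table))
        for j in range(len(table[0]))
    )
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# Perfect association gives V = 1; independence gives V = 0.
print(cramers_v([[10, 0], [0, 10]]))  # 1.0
print(cramers_v([[5, 5], [5, 5]]))    # 0.0
```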


Chapter 3

Data

Before modelling, some preliminary analysis can be done to understand the data. For all variables captured in the dataset, the distribution of exposure among their levels is examined. Each level of a variable must contain at least 2% (a best-practice threshold) of the total exposure, otherwise the GLM maximum likelihood optimization procedure may not converge (Anderson et al., 2007, p. 41). Next, direct relations between claim frequency and explanatory variables can be investigated by plotting the claim frequency per variable level. Since such one-way analyses can be distorted by correlation and interaction effects between variables, while a GLM adjusts for these, the outcomes of the two can differ.
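The one-way figures in this chapter are all built the same way: per level of a variable, the claim frequency is the total number of claims divided by the total exposure in that level. A minimal sketch in plain Python (the toy policy-year records are made up):

```python
from collections import defaultdict

# Hypothetical policy-year records: (level, number of claims, exposure in years).
records = [
    ("Single", 0, 1.0), ("Single", 1, 0.5), ("Multiple", 1, 1.0),
    ("Multiple", 0, 1.0), ("Multiple", 2, 2.0),
]

claims = defaultdict(float)
expo = defaultdict(float)
for level, n, e in records:
    claims[level] += n
    expo[level] += e

# Exposure-weighted claim frequency per level: Single 1/1.5, Multiple 3/4.
freq = {lvl: claims[lvl] / expo[lvl] for lvl in expo}
print(freq)
```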

3.1

Data description

Develop dataset    The dataset consists of 1.7 million records observed over a period of 6.5 years (January 2007 to August 2013). The first 5 years of data, 2007-2011, are used to develop the model. This dataset contains 292,381 policies and has an average claim frequency of 4.7%. For the GEE analysis, the dataset is randomly divided into two datasets, Subset 1 and Subset 2, because running the total dataset requires more memory resources than available. The data modifications made in the develop dataset are also made in the validation dataset.

Validation dataset    The data of the last 1.5 years, 2012 to mid-2013, are only used to validate the model and are therefore not used in the preliminary analysis.

Geographical illustration of the dataset    The maps in figure 3.1 show the distribution of the number of policies and the distribution of the number of claims across the country. In North-East and South-West Holland a.s.r. has the fewest PLI clients.

Figure 3.1: Left: Distribution of claims among the country. Right: Distribution of exposure among the country.


Data sources    A selection of data from a.s.r. and from the external databases of Experian and Cendris is used to create the dataset for this study. Cendris has information about the number of households and inhabitants per official zip code of PostNL. This information is reliable and can be used in this study; for this thesis, the number of households and the number of inhabitants per zip code are used, see also section 3.2.2. More information about Cendris can be found on their website http://www.cendris.nl/. The Experian data contains information as of March 2013, at address level, and is linked to all years in the dataset; the assumption is thus made that the Experian data has not changed over the last 6.5 years. The Experian data is based on a model, so it is possible that the information is not correct for the whole portfolio; according to Experian, it should be reliable for the majority of the dataset. In this study socio-economic, socio-demographic and housing information of Experian is used (see section 3.2.3). More information about Experian can be found on their website http://www.experian.nl/.

Data modifications    The data from the different databases are linked to each other based on zip code and house number. Because the currently available insured sums are €1,250,000 and €2,500,000, only policies with one of these insured sums are taken into account. The continuous variables are percentile binned; this is done to perform the one-way analysis and is done before the selection of insured sums. In this way, all variables included are described either as categorical or ordinal. If necessary, variables are regrouped such that all levels contain more than 2% (based on best practice) of the total exposure.
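Percentile binning assigns each record a decile label by rank. A rough plain-Python sketch (the thesis does this in SAS; real data with ties at the decile edges can make the bins slightly uneven, which this toy version ignores):

```python
def decile_bin(values):
    """Assign each value a decile label 10, 20, ..., 100 by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # rank runs over 0..n-1; map it to the decile labels 10..100.
        labels[i] = 10 * (rank * 10 // len(values) + 1)
    return labels

ages = list(range(18, 58))      # 40 hypothetical policyholder ages
bins = decile_bin(ages)
print(bins[:4], bins[-4:])      # [10, 10, 10, 10] [100, 100, 100, 100]
```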

(20)

Table 3.1: Summary of variables

Capacity of household (cap): indicates the number of persons in the household of the policyholder. Categories: Single; Multiple.

Own risk (er): Categories: €0; €100.

Own risk on children/animals (r erk): indicates whether the policyholder has own risk on children/animals. Categories: €0; €250.

Gender (r gender): Categories: M; O; V.

Age (age): indicates the age of the policyholder. Continuous.

Age grouped (r age): the quantile binned version of Age. Categories: 10 = 18-32 years; 20 = 33-36 years; 30 = 37-42 years; 40 = 43-46 years; 50 = 47-51 years; 60 = 52-57 years; 70 = 58-63 years; 80 = 64-72 years; 90 = 73-105 years; 100 = False years.

Customer duration (cd): Continuous.

Customer duration grouped (r cd): the quantile binned version of Customer duration. Categories: 10 = <= 2 years; 20 = 3-4 years; 30 = 5-6 years; 40 = 7-9 years; 50 = 10-13 years; 60 = 14-17 years; 70 = 18-22 years; 80 = 23-27 years; 90 = 28-37 years; 100 = >= 38 years.

Extra coverage (extra): indicates whether the policyholder has extra coverages. Categories: Yes; No.

Number of VAS policies (r vas): indicates how many policies the policyholder has in the traffic branch. Categories: 0 = 0; 1 = 1; 2 = 2; 3 = > 2.

Number of BAS policies (r bas): indicates how many policies the policyholder has in the fire branch. Categories: 0 = 0; 1 = 1; 2 = 2; 3 = > 2.

Number of VAR policies (r var): indicates how many policies the policyholder has in branches other than fire and traffic. Categories: 0 = 0; 1 = 1; 2 = > 1.

Number of restrictive clauses (r restr): Categories: 0 = 0; 1 = > 0.

Number of expanding clauses (exp): Categories: 0 = 0; 1 = > 0.

Number of households in area (nhh): indicates how many households are located in the living area of the policyholder. Continuous.

Number of households in area grouped (r nhh): the quantile binned version of Number of households in area. Categories: 10 = <= 11; 20 = 12-14; 30 = 15-16; 40 = 17-18; 50 = 19-20; 60 = 21-23; 70 = 24-25; 80 = 26-30; 90 = 31-43; 100 = >= 44.

Number of inhabitants in area (ninw): indicates how many inhabitants live in the living area of the policyholder. Continuous.

Number of inhabitants in area grouped (r ninw): the quantile binned version of Number of inhabitants in area. Categories: 10 = <= 24; 20 = 25-31; 30 = 32-36; 40 = 37-41; 50 = 42-45; 60 = 46-50; 70 = 51-57; 80 = 58-68; 90 = 69-95; 100 = >= 96.

Number of children in household (nch): indicates how many children live in the household of the policyholder. Categories: 1 = 1 child; 2 = 2 children; 3 = 3 children; 9 = No registration of children known.

Age of the youngest child in household (agech): indicates how old the youngest child in the household is. Categories: 1 = Young children (youngest 0-5 years); 2 = Older children (youngest 6-12 years); 3 = youngest 13-18 years; 4 = Adult children (youngest 19-24 years); 9 = No registration of children known.

Life stage (life): indicates the life stage of the policyholder. Categories: 1 = Single or couple, < 35 years; 2 = Family with children, eldest child 0-5 years; 3 = Family with children, eldest child 6-12 years; 4 = Family with children, eldest child 13-19 years; 5 = Family with children, eldest child 20+ years; 6 = Single or couple, 35-49 years; 7 = Single or couple, 50-64 years; 8 = Single or couple, 65 years or older.

Gross family income (fi): indicates the gross family income of the policyholder. Categories: 1 = Below modal; 2 = Modal; 3 = 1.5 times modal; 4 = 2 times modal; 5 = More than 2 times modal.

Working situation (work): indicates the working situation of the policyholder. Categories: 1 = Full-time; 2 = Part-time; 3 = Pensioner; 4 = Student; 5 = Without job.

Education (edu): indicates the degree of education of the policyholder. Categories: 1 = Low; 2 = Medium; 3 = High; 4 = University.

Parcel size (parc): indicates the parcel size (in m2) where the policyholder lives. Categories: Missing; < 97; 98-136; 137-158; 159-196; 197-256; 257-367; 368-798; >= 799.

Housing type (house): indicates the type of house the policyholder lives in. Categories: 1 = Detached House Large; 2 = Detached House Medium; 3 = Detached House Small; 4 = Semi-detached house Large; 5 = Semi-detached house Medium; 6 = Semi-detached house Small; 7 = End of terrace Large; 8 = End of terrace Medium; 9 = End of terrace Small; 10 = Terrace or link houses Large; 11 = Terrace or link houses Medium; 12 = Terrace or link houses Small; 13 = Apartments.

3.2

One way analyses

Table 3.1 gives an outline of all variables included in this study. These variables will be investigated in this study to explain the claim frequency of personal liability insurance (PLI).

3.2.1 Variables of a.s.r.

Capacity of household    The capacity of household (cap) indicates whether the insured lives alone or with more people. Together with own risk, this is currently the only variable used in the premium definition and it must therefore be included in this investigation. Most of the insureds live in multiple-person households (see figure 3.2); this can be with a partner, with children but no partner, or with both partner and children. a.s.r. unfortunately did not record this specific information about the household composition; they plan to do so from 2014 on, but for this thesis the information is not available. As for the claim frequency, households with more than one person have a higher claim frequency than single-person households (5.5% versus 2.4%), as shown in figure 3.2.

Figure 3.2: One way analyses of the capacity of households. The left figure shows the distribution of exposure among the different levels of the capacity of households. The right figure shows the average claim frequency per level of capacity of households.

Own risk and Own risk on children and animals    As mentioned before, next to cap, own risk (er) is the other variable which defines the premium of the insurance product and is thus essential to include in this study. A policyholder can choose an own risk to reduce the premium. The premium for personal liability insurance is relatively low, so an own risk of 100 euros in return for a premium reduction is not interesting for most of the insureds. The histograms in figure 3.3 show that only one percent of the policies has an own risk. Because the level with a nonzero own risk contains less than 2% of the total exposure, this variable does not contain enough information to add to the model. Own risk will therefore not be discussed further in this study as an explanatory variable.

A way to roughly indicate whether the insured has children is to look at the own risk on children/animals (erk). Only multiple households can use this own risk. Initially it was meant for people with children, but in reality mostly people without children have this own risk in their insurance, since erk lowers their premium. Because they do not have children, the additional own risk will not raise costs for them, as they will not file claims caused by children. Together with the assumption that people with children have a higher expected claim frequency, one would expect insureds with erk to have a lower claim frequency. This is confirmed by the plot in figure 3.3.



Figure 3.3: One way analyses of the own risk and own risk on children/animals. The left figure shows the distribution of exposure among the different levels of excess on children/animals. The histogram in the middle shows the average claim frequency per level of excess on children/animals. The right figure shows the distribution of exposure per level of excess.

Gender    Most of the policyholders are males. A small part of the portfolio has unknown sex, labelled O. These unknown genders will not be deleted from the dataset, because other valuable information would then also be deleted; moreover, by keeping these records it can be investigated whether not knowing the sex has an influence on claim frequency. Although sex does not say much if the policyholder lives with his/her partner, because then the policy holds for both of them (male and female in most cases), it can be of influence if the policyholder lives alone. It is therefore of interest to investigate the cross table of sex and cap. For the single households, 50% is male, 47% is female and the other 3% is unknown (see figure 3.4); thus this variable contains enough information to include in the model. The right plot in figure 3.4 shows an interaction effect between gender and cap, since the pattern in claim frequency among genders differs between Single and Multiple households. The plot of claim frequencies in the middle of figure 3.4 shows that women tend to claim less than men (4.4% versus 4.8%).

Figure 3.4: One way analyses of gender. The left figure shows the distribution of exposure among the genders. The central figure shows the average claim frequency per gender. The right figure shows the claim frequency per gender for Single and Multiple households. The pattern in claim frequency differs between the two levels of cap, which indicates an interaction effect between gender and cap.

Age    For some policyholders, the date of birth is incorrect or not recorded at all in the database. These policyholders have an age less than 18 years (0.04% of the database) or larger than 105 years (11.4% of the database). Ages smaller than 18 years and larger than 105 years are grouped in one level (the 100th percentile). The ages in this level are considered unknown, but are not left out of the model because of the other information they carry, and so the effect of these unknown ages on claim frequency can still be investigated. Most of the dates of birth are recorded correctly. Age is a continuous variable and is also categorized based on the percentiles, again before the selection of insured sums; the categorized age variable is named r age. The one way analysis of claim frequency in figure 3.5 shows that insureds older than 58 years claim less. Up to age 42 the claim frequency is increasing and it decreases thereafter. Ages 30-50 years commonly correspond to households with children, and these ages show a higher claim frequency. The interaction plot of age and cap in figure 3.5 shows different patterns in claim frequencies among ages for Single and Multiple households, indicating an interaction effect. For the last percentile, the ages are unknown and therefore no conclusion can be drawn for this group. In the GLM analysis, the interaction effect of age and cap has to be investigated.

Figure 3.5: One way analyses of age. On the x-axes, ages are shown in years. The left figure shows the distribution of exposure among the age groups. The figure in the middle shows the average claim frequency per age group. The right figure shows the claim frequency per age level for Single and Multiple households. The pattern in claim frequency differs between the two levels of cap, which indicates an interaction effect between age and cap.

Customer duration    Next to personal liability insurance, a.s.r. also offers other products such as housing insurances, car insurances, life insurances and mortgages; the last two branches have long maturities. We assume that the customer duration can reach up to 75 years, since the assumption is made that people cannot be older than 105 years. Customer durations larger than 75 years are assumed to be false and are deleted from the dataset. Most of the policyholders have been customers of a.s.r. for less than 10 years (see figure 3.6). The customer duration (cd) is also quantile binned (r cd). The claim frequency in figure 3.6 shows a pattern similar to that among age levels: increasing from 4.6% up to a claim frequency of 5.5% at a customer duration of 13 years, and decreasing to 1.9% for higher customer durations. From this pattern it can be concluded that loyal policyholders with a customer duration above 22 years claim less than the overall average claim frequency.

Extra coverage In the personal liability insurance, policyholders can choose to ex-pand their coverage with the so called sterdekking (extra). With this extra coverage, the following three situations regarding baby-sitting are covered.



Figure 3.6: One way analyses of customer duration. On the x-axes, customer durations are shown in years. The left figure shows the distribution of exposure among the different groups of customer duration. The right figure shows the average claim frequency per customer duration group.

• Damage caused while someone nonprofessional is baby-sitting your child younger than 14 years old.

• Damage caused when a person lets your child of six years or younger into his/her area, for example by letting your child sit on his/her lap.

• Damage occurring when someone nonprofessional is watching your pets.

These three situations are not covered in the standard personal liability insurance. The variable extra coverage can only take two values, zero and one, indicating whether one does or does not have the extra coverage.

The claim frequency plot in figure 3.7 shows that policyholders with an extra coverage do claim more than those who do not have this extra coverage.


Figure 3.7: One way analyses of extra coverage. The left figure shows the distribution of exposure among the two levels of extra coverage. The right figure shows the average claim frequency of extra coverage.

Number of policies in other branches of a.s.r.    A policyholder can also have other products from a.s.r., and this information can be an explanatory factor of claim frequency. A policyholder who has more products of a.s.r. will probably be more committed to a.s.r. and claim less; the plots of claim frequency in figure 3.8, however, show the opposite. a.s.r. stores the data of fire products in the database called BAS, data of traffic products in the database called VAS, and data of other non-life insurance products in VAR. Most of the personal liability insurance policyholders do not have policies in VAS and VAR products. For BAS products this is different: most of the people with a personal liability insurance of a.s.r. also have one or two products in BAS. These three variables are regrouped such that each level contains more than 2% of the total exposure and thus enough information to include the variables in the model. Table 3.1 shows how these variables are grouped.


Figure 3.8: One way analyses of the number of policies in other branches. These figures show the distribution of exposure and the average claim frequency among the different levels of the number of VAS, BAS and VAR policies.

Number of restrictive and expanding clauses    If someone takes out a personal liability insurance, he/she can add clauses to this insurance. These clauses can either be restrictive or expanding. When someone has expanding clauses, more is covered in the insurance and it is likely that more claims will occur; the opposite expectation holds for restrictive clauses. The number of policies with expanding clauses (exp) is less than 1% of the total exposure (figure 3.9), so almost all policies do not have these clauses and this variable does not contain enough information to add to the model. For the restrictive clauses (r restr), on the other hand, 16% has one or more of these clauses and this variable will be taken into account in this study (figure 3.9). For the number of restrictive clauses, one and two clauses are grouped together, because the level with two clauses contains less than 2% of the total exposure. As for the claim frequency, whether someone has restrictive clauses or not has apparently no influence on his/her claim frequency according to this one-way analysis (see figure 3.9).



Figure 3.9: One way analyses of clauses. The left figure shows the distribution of exposure among the different levels expanding clauses. The histogram in the middle shows the average claim frequency per level of expanding clauses. The right figure shows the average claim frequency per level of restrictive clauses.

3.2.2 Variables of Cendris

The following two variables from the Cendris database are used in this study.

Number of households and inhabitants per district    These two variables, shown in figures 3.10 and 3.11, indicate the quantile binned number of households (r nhh) and inhabitants (r ninw) per zip code. They could relate to claim frequency, because the PLI covers damages done by the policyholder to the property of someone else, for example the neighbour's car. So when someone lives in a crowded area with more households, it is more likely that he/she will damage the neighbours' property than someone who lives alone in an area.

According to the Cendris data, PLI policyholders do not particularly live in a crowded or uncrowded area, see figures 3.10 and 3.11. These figures also show that the claim frequency does not differ much among the percentiles of the number of households and inhabitants. The only thing that can be remarked is that the last percentile has a lower claim frequency than the other levels, for both variables. This contradicts the previous hypothesis.

Figure 3.10: One way analyses of the number of households (in thousands) in the area. On the x-axes are the number of households shown in thousands. The left figure shows the distribution of exposure among the number of households (in thousands). The right figure shows the average claim frequency of the number of households (in thousands) in the area.


Figure 3.11: One way analyses of the number of inhabitants (in thousands) in the area. On the x-axes, the number of inhabitants is shown in thousands. The left figure shows the distribution of exposure among the number of inhabitants. The right figure shows the average claim frequency per level of the number of inhabitants in the area.

3.2.3 Variables of Experian

The following selection of variables from the Experian database is used in this study.

Check on Experian data    To check whether the data from Experian mostly coincide with the data from a.s.r., the information about the number of children in the household is tabled against the capacity of household. Within the single households, one would expect no children; the table showed that 88.8% of the single households has no registered children. Further, for households with one child and with more than two children, 93.0% respectively 93.9% is registered as multiple households. We can assume that this information coincides and that the Experian data can indeed be used. Another check is the check on secured families against capacity of household: if one is assigned to the secured family group, the majority must be in a multiple household. In this dataset this holds for 90 percent.
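Such a consistency check is just a cross-tabulation plus a row percentage. A minimal sketch in plain Python (the toy records are made up; the real percentages come from the a.s.r./Experian data):

```python
from collections import Counter

# Hypothetical records: (capacity of household, children registered yes/no).
records = ([("Single", "no")] * 88 + [("Single", "yes")] * 12
           + [("Multiple", "yes")] * 60 + [("Multiple", "no")] * 40)

cross = Counter(records)  # counts per (cap, children) cell

# Share of single households with no registered children.
single_total = cross[("Single", "no")] + cross[("Single", "yes")]
share = cross[("Single", "no")] / single_total
print(round(share, 3))  # 0.88
```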

Variables with information about children    Since a.s.r. does not register information about children of policyholders, this information can only be derived from the Experian data. From the number of children in a household, the life stage and the age of the youngest child in a household, one can see that most of the policyholders do not have children (71.7 percent) (see the plots in figure 3.12). Also, the more children one has, the more likely he/she will claim. A remarkable outcome of the one way analysis is that households with children between 12-19 years old claim more than households without children or with younger children.

Socio-economic variables Other interesting factors to investigate are family income, education and working situation. In figure 3.13 the plots for these variables are shown. From the data one can see that most of the policyholders of PLI are categorized in the highest income class and have an income of more than two times modal (33.3 percent). The remaining 66.6 percent is almost equally distributed among the other classes. When looking at the claim frequency plot, one can see that the more income a household has, the more he/she is likely to claim.

The PLI portfolio is mainly medium educated. The histograms of education show no clear pattern in claim frequency: low and high education have a lower claim frequency than medium and university education.


Figure 3.12: One way analyses of variables which contain information about children. These figures show their exposure among the different levels and average claim frequen-cies per level. The variables are the number of children, the age of the youngest child in the household and the life stage.

They cannot be merged with another group, because students can either be part-time workers, full-time workers or unemployed. The claim history histogram of working situation shows that pensioners have the lowest claim frequency of all working levels, 2.4% in 2011. This is in line with the observation that older people claim less.

Housing information    Together with housing type, parcel size is a way to indicate in what kind of house one lives. Someone with a large parcel is likely to live in a detached home, with an own parking lot and a garden where his/her children can play. Someone with a small parcel is likely to live in an apartment or linked home. Because housing type and parcel size are highly correlated, they will not be added to one model together; the variable which explains the claim frequency best will be added. From the histograms in figure 3.15 one can see that most of the policyholders live in apartments and linked houses. One can also see that the larger the parcel of the insured, the more likely he/she is to claim, and that policyholders living in apartments claim significantly less than policyholders living in other housing types.


Figure 3.13: One way analyses of the family income, education and working situa-tion. These figures show their exposure among the different levels and average claim frequencies per level.

3.2.4 Conclusion

The outcomes of the one way analyses are mostly as expected. Multiple-person households claim more, older people (older than 58 years) claim less and loyal policyholders with a cd above 22 years claim less. Households with children claim more, and the analyses show that the age of the children living in a household seems to be of interest: households with older children, aged 6-18 years, have a higher claim frequency. Next, the socio-economic variables show that policyholders with a higher income claim more and that policyholders with a low education claim less. The housing variables show that the larger the premise of a policyholder is, the more he/she will claim, and that policyholders living in apartments claim less. A remarkable outcome is that the more policies a policyholder has in other branches of a.s.r., the more likely he/she will claim. The one way analyses also show that there seems to be an interaction effect between cap and age and between cap and gender. Finally, exp and er do not contain enough information to be added to the model. These variables will therefore not be further investigated in the GLM, GLMM and GEE models.


Vera Makhan — claim frequency of personal liability insurance

[Figure 3.14 here: bar charts of exposure (thousands) and average claim frequency by parcel size group (Missing, <97, 98−136, 137−158, 159−196, 197−256, 257−367, 368−798, ≥799 m2).]

Figure 3.14: One way analyses of parcel size. The x-axes show the parcel sizes in m2. The left figure shows the distribution of exposure among the different groups of parcel size. The right figure shows the average claim frequency per group of parcel size.

[Figure 3.15 here: bar charts of exposure (thousands) and average claim frequency by housing type (types 1−13).]

Figure 3.15: One way analyses of housing type. The left figure shows the distribution of exposure among the different housing types. The right figure shows the average claim frequency per housing type.


GLM analysis

Deriving a set of significant explanatory variables in a GLM is an iterative process. One cannot conclude from a single GLM whether the included variables are significant or not, because adding and deleting variables may change the significance structure of the included variables.

There is not one best way to derive this set of factors. One can begin with a full model, a null model, or a model containing a selection of explanatory variables which are known to be important. When one begins with a full model, all explanatory variables are included in the model, and per iteration the insignificant factors are deleted from the model one at a time. When beginning with a null model, the first model contains only an intercept; one factor is then added to the model per iteration and tested for significance. Per iteration, the GENMOD procedure in SAS® derives goodness-of-fit statistics, such as the deviance and the likelihood. With these statistics, one can perform the Deviance test and derive AIC and BIC values to decide whether the inclusion or exclusion of a specific factor improves the model or not.
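The AIC, BIC and Deviance-test computations described here can be sketched in a few lines. The following is a minimal Python illustration of these formulas (it is not the SAS GENMOD output itself), and the deviance numbers in the example at the bottom are hypothetical:

```python
import math
from scipy.stats import chi2

def aic(loglik, k):
    """Akaike information criterion: -2*loglik + 2*k, with k parameters."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2*loglik + k*log(n), with n observations."""
    return -2.0 * loglik + k * math.log(n)

def deviance_test(dev_reduced, dev_full, df_added, alpha=0.05):
    """Deviance (likelihood-ratio) test: the larger model is preferred when
    the drop in deviance exceeds the chi-square critical value for the
    degrees of freedom added by the extra factor."""
    drop = dev_reduced - dev_full
    critical = chi2.ppf(1.0 - alpha, df_added)
    return bool(drop > critical)

# Hypothetical example: adding a 3-level factor (2 extra parameters)
# lowers the deviance from 10250 to 10230; the drop of 20 exceeds the
# chi-square critical value chi2(0.95, 2) of about 5.99, so the factor
# would be kept.
keep = deviance_test(10250.0, 10230.0, df_added=2)
```

The same chi-square comparison is what GENMOD's deviance output is used for in the procedure described above.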

4.1 Model derivation

Deriving variable To derive which variable should be added, the Deviance test and the AIC and BIC values can be used. This test and these criteria are described in section 2.7. For each model, the Deviance and the Likelihood are derived by the GENMOD procedure in SAS. With the Likelihood and formulas (2.25) and (2.26), the AIC and BIC values can be calculated. For the Deviance test, the drop in Deviance must be larger than the critical value of a χ2 statistic for the increase in degrees of freedom caused by adding the variable. The variable which causes the largest drop in the AIC and BIC statistics and which comes out as best through the Deviance test is added to the model. Next, all remaining variables are again added to this best model, one at a time. Again, the variable which has the lowest AIC and BIC values and comes out as best through the Deviance test is added to the model. This procedure is repeated until all significant factors are added to the model. Although this is a time-consuming method, it determines per step the best significant variable of all available variables to include in the model. Table ?? shows the Deviance test, AIC and BIC values for three models. The null model contains no variables and has the largest Deviance, AIC and BIC possible for the given dataset. In the next two models, r age and cap are added to the null model. One can see that both r age and cap improve the model, since for both variables the drop in Deviance is larger than the critical value of a χ2 statistic for the increase in degrees of freedom, and they both decrease the AIC and BIC values. In this example, r age will be added to the null model, since this variable comes out as best through the Deviance test and yields the lowest AIC and BIC values.
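The stepwise forward-selection procedure described above can be sketched as follows. This is an illustrative Python skeleton, not the SAS workflow used in the thesis: `fit_aic` stands in for fitting a GLM on a candidate variable set and returning its AIC, and the toy AIC values (as well as the identifiers `r_age` and `cap`) are hypothetical stand-ins mimicking the example in the text.

```python
def forward_select(candidates, fit_aic):
    """Greedy forward selection: per step, refit with each remaining
    candidate added and keep the one giving the lowest AIC; stop as soon
    as no candidate lowers the AIC of the current model."""
    selected, remaining = [], list(candidates)
    best_aic = fit_aic(selected)               # AIC of the null model
    while remaining:
        aic_new, v_best = min((fit_aic(selected + [v]), v) for v in remaining)
        if aic_new >= best_aic:                # no improvement: stop
            break
        selected.append(v_best)
        remaining.remove(v_best)
        best_aic = aic_new
    return selected, best_aic

# Toy AIC "oracle" standing in for actual GLM fits (illustrative values only):
toy_aic = {(): 1000.0, ("r_age",): 950.0, ("cap",): 970.0,
           ("r_age", "cap"): 940.0}
fit_aic = lambda vars_: toy_aic.get(tuple(vars_), float("inf"))

order, final_aic = forward_select(["r_age", "cap"], fit_aic)
# r_age is added first (largest AIC drop), then cap
```

In practice one would also apply the Deviance test and BIC at each step, as the text describes; the loop structure stays the same.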
