Modelling Lapse of Individual Policies in an Automobile Insurance Firm

A thesis presented for the degree of Master of Science Econometrics

Kleissen M.T.C.

Master's Thesis Econometrics, Operations Research and Actuarial Studies
Supervisor: Drs. Vullings

Contents

1 Introduction
2 Data
  2.1 Data construction
  2.2 Data description
3 Models
  3.1 Survival Models
    3.1.1 The basics
    3.1.2 The Cox PH model
  3.2 Prediction Models
    3.2.1 Logistic regression
    3.2.2 Elastic net
    3.2.3 Random Forest
    3.2.4 Gradient Boosted Trees
    3.2.5 Cross-validation, metrics & interpretation
4 Results
  4.1 Survival models
    4.1.1 Are large premium changes related to a higher lapse probability of insurance contracts?
    4.1.2 Do shared frailty models help in explaining (unobserved) heterogeneity after controlling for a large number of geographic and behavioural factors?
    4.1.3 Which (combination of) factors have the largest impact on lapse time?
  4.2 Prediction models
    4.2.1 Logit, Elastic net, Random Forest and Gradient Boosted Trees
    4.2.2 Which covariates matter?
5 Discussion


Modelling Lapse of Individual Policies in an Automobile Insurance Firm

Martijn Kleissen

Abstract

We discuss lapse and its main risk drivers in the automobile insurance industry. For this, we use survival models to explain and prediction models to predict lapse rates. These models are applied to an extensive data set. For the survival models, we use the Cox Proportional Hazards model and discuss three subquestions. In this, we study consumers' price sensitivity, dependence between contracts related to a single policyholder and the performance in fit of combinations of grouped variables. For the prediction models, we study and compare multiple machine learning models based on prediction performance. In addition, we study the variable importance for each model and discuss the impact of variables by Partial Dependence Plots (PDPs). The results indicate a symmetric effect of a change in the total premium on the probability of lapse. Furthermore, we find a significant heterogeneity term for the two studied frailty distributions. That is, both the Gamma and Log-normal frailty terms are statistically significant. Moreover, the model including all combinations of variables has the best fit. For the prediction models, we find that the Gradient Boosted Trees model has the best performance among all our models. However, we do note that the Gradient Boosted Trees did not considerably outperform the other models. Also, the models agreed that the age of the vehicle was the most important predictor of lapse.

1 Introduction

The goal of this paper is to investigate and find the main risk drivers of lapse of individual policy contracts in the automobile insurance industry. We study the impact of geographical, contract-specific, behavioural and macroeconomic factors. For this, survival models are used. Furthermore, we study multiple machine learning models and compare them based on prediction performance.

Lapse rates impact the profit, growth and risk composition of an insurer's portfolio. Insurers benefit from modelling lapse rates as they gain better insights into future portfolio compositions. The data we use is extensive and mainly based on information provided by the insured person or on publicly available sources. Therefore, the results of our paper are interesting to automobile insurers within as well as outside the Netherlands.

Academic literature uses varying definitions of lapse. For our research, we define lapse as both mid-term and end-term cancellation. Modelling lapse is highly relevant for retention policy purposes. For example, Brockett et al. (2008) showed that lapse of one insurance policy contract indicates the churn of the insurance contract holder.


We study three questions for our survival models.

First, we answer the question Are large premium changes related to a higher lapse probability of insurance contracts?. That is, we study the price sensitivity of customers. We use Inverse Probability of Treatment Weighting (IPTW), combine it with a machine learning model and apply causal inference in a survival analysis context. Guelman and Guillén (2014) used Propensity Score Matching and a Generalized Additive Model in a regression context to study price sensitivity of automobile insurance. They studied end-of-term renewal. We extend their research by studying the effect of price change shocks on non-renewal and mid-term cancellation by using survival models with time as the dependent variable, instead of the binary variable lapse.

Second, we answer the question Do frailty models help in explaining (unobserved) heterogeneity after controlling for a large number of geographic and behavioural factors?. For this, we extend the Cox PH model with a shared frailty term that captures the dependence between the car contracts of a single policyholder.

Third, we answer the question Which (combination of) factors have the largest impact on lapse time?. For this, we study which combinations of variables explain most of the variation in lapse. We compare combinations of groups of covariates based on goodness-of-fit measures.

In short, for our survival models, we are interested in the following questions:

• Are large premium changes related to a higher lapse probability of insurance contracts?

• Do frailty models help in explaining (unobserved) heterogeneity after controlling for a large number of geographic and behavioural factors?

• Which (combination of) factors have the largest impact on lapse time?

Whilst survival models are appropriate for inference, they are generally not suitable for prediction.2 Hence, in the second part of the thesis, we survey and compare some machine learning techniques that have been used to build predictive customer lapse or churn models. Among other things, we compare a logistic regression (Kiesenbauer, 2012; Günther et al., 2014), to a logistic regression with both l1- and l2-regularization (Zou and Hastie, 2005), also called elastic net, to a Gradient Boosted Trees model (Lemmens and Croux, 2006) and a Random Forest model (Babaoglu et al., 2017).

The logistic regression is a type of Generalized Linear Model with a logit link function. In the elastic net model, we combine the logistic regression with a penalty term as a convex combination of ridge and lasso penalties. In the Random Forest model, we combine many trees, which are optimized using bootstrap, in one model. The final predictions are decided by majority voting, where each tree gets one vote. For the Gradient Boosted Trees, we build an ensemble model in a stage-wise fashion using classification trees.

2This is generally the case for insurance companies, because of time-varying covariates.

Each model has its up- and downsides: A logistic regression can be more easily interpreted than a Random Forest model. The logistic regression and elastic net do not automatically take interaction effects into account (unless you explicitly add interaction effects), whilst Random Forest and Gradient Boosted Trees do.3 However, in the case of Gradient Boosted Trees, we should explicitly guard against overfitting, whilst for the other two models this comes more naturally. Furthermore, the logistic and elastic net regressions are built around the Generalized Linear Model (GLM). GLMs are the main workhorse of insurance rate-making and dash-boarding software. This has a potential practical benefit: for a LASSO (elastic net with only l1-regularization) or logistic model, one only needs to change the dataset to be scored.

Our analysis of an extensive dataset of automobile insurance contracts identifies some indicators that help explain and predict customer lapse. These results are relevant to insurers within the Netherlands as well as outside the Netherlands. Furthermore, we contribute to the current literature by showing the practical applications of both survival models and prediction models in an insurance customers context.

This paper is structured as follows. Section 2 discusses the data and introduces the variables. Section 3 explains the methodology. Section 4 gives an overview of the results. Section 5 presents a discussion and conclusion.

2 Data

The data was extracted from a large database from one of the largest (automobile) insurers in the Netherlands. This data set contains just under 20 million records of insured parties' contracts, contract renewals and contract modifications. Since individual policyholders are followed over multiple years, the data has a panel structure.

2.1 Data construction

In Table 1, we created some artificial data to show the data format. We created policyholders with some Begin and End dates together with some contract-specific data $X_i$ and geographic data $Z_i$, $i \in \{1, \ldots, 28\}$. The Begin and End dates do not necessarily denote contract starts or contract lapses. Since the characteristics corresponding to an insured are allowed to change, they denote periods of constant characteristics corresponding to some car insurance contract. Some policyholders have right-censoring: the event lies in the future. For example, for policyholder 4, we do not observe her/his lapse. The data has unequal spacing between the observations, as can be seen in the simulated data of Table 1: Some people have a yearly change in the contract, whilst others have their contract changed quarterly or more often. These changes can also be irregular. For example, we notice that policyholder 1 has many irregularly spaced changes in his contract. Note that the contract-specific and geographic factors do not have to remain constant over the consumer's contract duration.

Table 1 further shows that some policyholders have had multiple insured cars over the years. Consumers can hold multiple car contracts at the same time, but also multiple car contracts sequentially. For example, we notice that policyholder 3 has had multiple car insurance contracts at the same time. The dependence between the car contracts of a single policyholder is a possible source of unexplained heterogeneity.

In principle, people keep the same policy number if they have changed their contract. The data contains, among other things, insurance contract, geographic, policyholder-specific and car-specific data.

Tab. 1: Example data format.

This table reports the format of the data set for illustrative purposes. We note that policyholders can hold multiple car contracts. $X_i$ and $Z_i$, $i \in \{1, \ldots, 28\}$, are vectors containing contract-specific and geographic data, respectively. Lapse is a binary variable that equals 1 when the policyholder lapses on her/his car contract and 0 when s/he does not. Begin and End denote the times in dd/mm/yyyy of measurements of the car contracts.

Policyholder  Car  Begin       End         Contract  Geographic  Lapse
1             A    01/01/2017  01/02/2017  X1        Z1          0
1             A    02/02/2017  13/02/2017  X2        Z2          0
1             A    14/02/2017  14/07/2017  X3        Z3          0
1             A    15/07/2017  25/02/2018  X4        Z4          0
1             A    26/02/2018  25/08/2018  X5        Z5          0
1             A    26/08/2018  28/10/2018  X6        Z6          0
1             A    29/10/2018  02/02/2019  X7        Z7          0
1             A    03/02/2019  06/06/2019  X8        Z8          0
1             A    07/06/2019  15/09/2019  X9        Z9          0
1             A    16/09/2019  24/10/2019  X10       Z10         1
2             B    06/02/2014  31/01/2015  X11       Z11         0
2             B    01/02/2015  01/08/2015  X12       Z12         0
2             B    02/08/2015  02/02/2016  X13       Z13         0
2             B    03/03/2017  01/02/2018  X14       Z14         0
2             B    02/02/2018  02/04/2018  X15       Z15         0
2             B    03/04/2018  06/05/2018  X16       Z16         0
2             B    07/05/2018  21/08/2018  X17       Z17         0
2             B    22/08/2018  31/07/2019  X18       Z18         0
2             B    01/08/2019  31/07/2020  X19       Z19         0
3             C    18/04/2015  17/04/2016  X20       Z20         0
3             C    18/04/2016  12/08/2016  X21       Z21         0
3             C    13/08/2016  12/09/2017  X22       Z22         0
3             C    13/09/2017  06/11/2017  X23       Z23         1
3             D    02/03/2016  01/03/2017  X24       Z24         0
3             D    02/03/2017  01/03/2018  X25       Z25         0
3             D    02/03/2019  01/03/2019  X26       Z26         0
3             D    02/03/2019  01/03/2020  X27       Z27         0
4             E    01/01/2019  31/12/2019  X28       Z28         0

Therefore, in the data we have a length-time bias, a type of selection bias. For example, the data set contains insureds who signed their contract in 2005. Since we do not observe consumers who lapsed before January 2013, we have a bias towards long-term customers, as discussed in Porta (2016). This is known in survival analysis as left-truncation. Since we expect long-term consumers to be different from the regular customers, we remove all consumers whose contract was signed before 1 January 2013. We do this to obtain a (time-)unbiased estimate of the survival distribution.

Furthermore, contracts that started in the year 2013 lapsed in the same year, id est there was only end-term non-renewal and no mid-term cancellation. This was not present for the rest of the years.

Finally, this leaves us with data from 1 January 2014 to 30 September 2019, slightly less than 6 years of data. From this dataset we sampled 50,000 policyholders without replacement. This leaves us with a total of 425,587 observations.

We add the number of changes in the contract specifications of individual contract holders in one year as a new variable. We argue that consumers who have multiple contract changes are more price sensitive than consumers who do not. Secondly, we add the country of origin of the car as a new variable. We do this since it can provide us with extra information in estimating some missing car-specific variables. Thirdly, we extend the data set to contain macroeconomic variables. We include the Gross Domestic Product (GDP), the Consumer Confidence Index (CCI) and the unemployment rate. A priori, it is unclear how economic booms and busts affect the probability of lapse. We argue that in times of economic busts, consumers are more price sensitive than before and therefore more likely to lapse on their car contract. However, consumers postpone buying a new car in economic busts, which makes them less likely to lapse.

For the survival models, we used the whole data set. For the prediction models, we considered yearly prediction and we only used the observations and their corresponding covariates at some starting point in time and, from there, at every next full year (defined as 365.25 days). We call these end-of-year snapshots.
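To make the snapshot construction concrete, the following is a minimal pandas sketch of how such end-of-year snapshots could be built from interval data of the format in Table 1. The column names and the example data are hypothetical and only mirror that format; the thesis does not prescribe a particular implementation.

```python
import pandas as pd

# Hypothetical long-format data: one row per period of constant characteristics.
df = pd.DataFrame({
    "policyholder": [1, 1, 1],
    "begin": pd.to_datetime(["2017-01-01", "2017-07-15", "2018-02-26"]),
    "end":   pd.to_datetime(["2017-07-14", "2018-02-25", "2018-08-25"]),
})

def yearly_snapshots(group, year_length=365.25):
    """Return the row active at the contract start and at every full year after it."""
    start = group["begin"].min()
    horizon = group["end"].max()
    snaps = []
    k = 0
    while True:
        snap_date = start + pd.Timedelta(days=k * year_length)
        if snap_date > horizon:
            break
        active = group[(group["begin"] <= snap_date) & (group["end"] >= snap_date)]
        if not active.empty:
            row = active.iloc[0].copy()
            row["snapshot_date"] = snap_date
            snaps.append(row)
        k += 1
    return pd.DataFrame(snaps)

snapshots = df.groupby("policyholder", group_keys=False).apply(yearly_snapshots)
print(snapshots)
```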

2.2 Data description

Fig. 1: Development of new consumers and lapses.

In this Figure we have the counts of new consumers and lapsed consumers starting from 1 January 2014. We note the upward trend of lapsed consumers.

We use a total of 14 coverages related to car insurance. Every insurance except the customary liability insurance (in Dutch: Wet aansprakelijkheidsverzekering motorrijtuigen) is optional. We believe that the coverages in the summary tables are self-explanatory. The net premia are yearly amounts in euros.

The geographic data include the province of the policyholder, for example Zuid-Holland. We also know how many times an insured changed their contract and their pay frequency.

. .

Car characteristics were requested from the RDW (Dienst Wegverkeer), a Dutch government agency. Data on weight, length, construction year, catalogue value, dummy variable for imported, energy label, fuel type and country of origin was collected for the car models.

A marketing company constructed factors at neighbourhood level. Factors such as … . Hence, results obtained by using these factors cannot be extended to other insurance companies. However, the insurance company does use these variables for premium calculation. Therefore, we decided to include these variables in the treatment weighting process of the first subquestion Are large premium changes related to a higher lapse probability of insurance contracts?, but exclude them from the survival and prediction models. We give a description of these factors in the Appendix in Table 13. In this, we describe the factors using a few keywords.

Descriptive statistics of the numeric variables are in Table 4. We used the weighted mean and weighted standard deviation, where the weights are calculated as the proportion of days in one year the observation is in the dataset. We define a year as 365.25 days.4

Tables 2 and 3 show summary statistics for the factor variables. These were weighted according to $w_i = \#\text{days}_i / 365.25$ as well.

Tab. 2: Summary statistics categorical variables (Part I).

Note that the numbers are all in percentages. Energy label: 0 is A, 1 is B, 2 is C, 3 is D, 4 is E, 5 is F. Fuel type: 0 is Gasoline, 1 is Diesel, 2 is Electric, 3 is Hybrid, 4 is LPG, 5 is Else. Country of origin: 0 is American, 1 is French, 2 is German, 3 is Italian, 4 is either Japanese or South-Korean, 5 is unknown, 6 is Swedish, 7 is British. Gender: 0 is male, 1 is female, 2 is unknown. The marketing factors will not be discussed, as previously mentioned.

Observations for which some geo-factor was missing were deleted. Car characteristics which were missing were substituted by regression imputation using the car's country of origin and the remaining, i.e. non-missing, car characteristics.5 Covariates with more than 50% missing entries were deleted. Ordered categorical variables with more than five categories are treated as continuous. Variables which were missing for some contract holder at some point in time, but not in the period before, are imputed using Last Observation Carried Forward.

4 The weighted mean is defined as $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$, where $w_i = \#\text{days}_i / 365.25$, $n$ is the number of observations, $x_i$ is the value for observation $i$ and $\bar{x}$ is the weighted sample mean. The weighted standard deviation is defined as $\sqrt{\frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}{\frac{k-1}{k}\sum_{i=1}^{n} w_i}}$, where $n$ is the number of observations, $k$ is the number of nonzero weights, $w_i$ are the weights (as previously defined), $x_i$ are the observations and $\bar{x}$ is the weighted mean.

Tab. 3: Summary statistics categorical variables (Part II).

Descriptive statistics for the binary variables. The table shows the distribution related to the coverages, paying frequency and province. For the coverages and the car and geographic variables, we denote 0 as no/absent and 1 as yes/present. For the paying frequency, we denote 0 as monthly payment and 1 as yearly payment.

Tab. 4: Summary statistics continuous variables.

This table reports the summary statistics of our working sample of numeric variables from 1 January 2014 to 30 September 2019. Mean, SD and Missing are scaled according to the number of days in a year the value for the covariate was in the dataset. So, for example, if a value for age of driver of 40 was observed for 200 days, we scaled this value using the weight 200/365.25, i.e. on a yearly basis according to the time the value was in the dataset.

3 Models

In this section, we discuss the survival and prediction models.

3.1 Survival Models

We describe the general notation for survival models. In addition, we define the Cox Proportional Hazards model, describe its notation and likelihood function. Causal inference and frailty modelling in the survival context are postponed to the relevant Results subsections.

3.1.1 The basics

One of the reasons we prefer survival analysis over regression models is the fact that regression analysis does not make a distinction between end-term non-renewal and mid-term cancellation. But most importantly, survival models treat right-censoring in a natural way. Right-censoring occurs when subjects do not experience the event in question before the end of the study. It is in itself informative to know whether the event did or did not occur. Furthermore, using survival models it is easy to take time-varying covariates into account. This becomes more apparent when the time intervals between variations become increasingly dissimilar.

To explain our models, we first need to introduce some notation for continuous survival models. Let $T$ denote the independent and identically distributed (i.i.d.) random variable of the survival time, $t$ its observed value and $f(\cdot)$ its probability density function. The survivor function is then defined as
$$S(t) = 1 - \Pr(T \le t) = 1 - \int_0^t f(s)\,ds. \quad (1)$$

The instantaneous rate of the event at time $t$, conditional on survival up to $t$, is called the hazard function, id est
$$\lambda(t) = \lim_{\Delta t \downarrow 0} \frac{\Pr(t \le T \le t + \Delta t \mid T \ge t)}{\Delta t} = \lim_{\Delta t \downarrow 0} \frac{\int_0^{t+\Delta t} f(s)\,ds - \int_0^{t} f(s)\,ds}{\Delta t} \times \frac{1}{S(t)} = \frac{f(t)}{S(t)}. \quad (2)$$

Note that we only need one of $\lambda(t)$, $S(t)$ or $f(t)$ to find the other two functions. Now, by combining (1) and (2), we are able to find the integrated hazard function
$$\Lambda(t) = \int_0^t \lambda(s)\,ds.$$

The integrated hazard function is also called the cumulative hazard function. It is related to the probability of failure before time $t$ through $S(t) = \exp(-\Lambda(t))$.
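To make these relations concrete, the sketch below evaluates $f$, $S$ and $\lambda$ for an arbitrarily chosen Weibull survival time and checks numerically that $f(t)/S(t)$ equals the known Weibull hazard; the Weibull choice is purely illustrative and not part of the thesis.

```python
import numpy as np
from scipy.stats import weibull_min

shape, scale = 1.5, 2.0                       # arbitrary Weibull parameters
t = np.linspace(0.1, 5.0, 50)

f = weibull_min.pdf(t, shape, scale=scale)    # density f(t)
S = weibull_min.sf(t, shape, scale=scale)     # survivor function S(t) = 1 - F(t)

hazard = f / S                                # lambda(t) = f(t) / S(t)
cum_hazard = -np.log(S)                       # Lambda(t) = -log S(t)

# Sanity check: f/S matches the closed-form Weibull hazard.
closed_form = (shape / scale) * (t / scale) ** (shape - 1.0)
print(np.allclose(hazard, closed_form))
```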

3.1.2 The Cox PH model

The Cox Proportional Hazards (Cox PH) model has the following form
$$\underbrace{\lambda(t \mid x, \beta)}_{\text{hazard}} = \underbrace{\lambda_0(t)}_{\text{baseline hazard}} \times \underbrace{\phi(x, \beta)}_{\text{partial hazard}},$$
where $\beta = (\beta_1, \ldots, \beta_p)'$ is a $p \times 1$ vector of coefficients and $x = (x_1, \ldots, x_p)'$ is the vector of explanatory variables. The most common form for $\phi(\cdot, \cdot)$ is the exponential function, because of its ease of interpretation and non-negativity. Then, the Cox PH is written as
$$\lambda(t \mid x, \beta) = \lambda_0(t)\,\exp(x'\beta).$$

It is convenient to choose a parametric distribution for the baseline hazard. Popular choices are the Weibull, exponential and Gompertz distributions (Hanagal, 2011). However, since we do not know the baseline hazard, we prefer to use a semiparametric model.

The Cox Proportional Hazards model is a semi-parametric model since the partial hazard is estimated parametrically, whilst we do not specify the distribution of the baseline hazard. The Cox Proportional Hazards model assumes that the ratio of the hazards of two individuals, say individuals $i$ and $j$, is constant over time. That is, we should have that the ratio of the hazards, which is also called the relative hazard, is $\exp\!\big(\beta'(x_i - x_j)\big)$, where $x_i$ denotes the covariates corresponding to observation $i$.

Since we have data with time-varying covariates, we change x to x(t) to denote the changes in the variable over time.

For observation i we use the following definitions:

• $Y_i$ is a possibly censored failure time random variable that is defined as $\min(T_i, C_i)$, where $T_i$ is a random variable denoting the failure time and $C_i$ is a random variable denoting the censoring time. We denote the realization of $Y_i$ as $y_i$.

• $\Delta_i$ is the failure/censoring random variable defined as $\Delta_i = \mathbf{1}_{T_i \le C_i}$, where $\mathbf{1}$ is the indicator function. We denote the realization of $\Delta_i$ as $\delta_i$.

• $R(y_i)$ is the set of spells at risk at time $y_i$. That is, $R(y_i)$ contains all observations still under study just prior to $y_i$.

Cox (1972) proposed the partial likelihood function defined as
$$L_p(\beta) = \prod_{i=1}^{n} \left\{ \frac{\exp\!\big(x_i'(y_i)\beta\big)}{\sum_{l \in R(y_i)} \exp\!\big(x_l'(y_i)\beta\big)} \right\}^{\delta_i}. \quad (3)$$
This is called the partial likelihood function, since we do not estimate the baseline hazard $\lambda_0(t)$. For event ties, we use the Breslow approximation (Breslow, 1974).

Instead of optimizing (3) with respect to $\beta$, we optimize the following partial log-likelihood w.r.t. $\beta$:
$$l_p(\beta) = \sum_{i=1}^{n} \delta_i \left\{ x_i'(y_i)\beta - \log\!\Big[ \sum_{l \in R(y_i)} \exp\!\big(x_l'(y_i)\beta\big) \Big] \right\}.$$

Cox PH models are particularly useful for hypothesis testing. We do not have to make a distinction between mid-term cancellation and end-of-term non-renewal. Furthermore, the Cox PH model can easily be extended to allow for time-varying parameters. However, it is not possible to obtain estimates of the probability of survival at particular time points for different observations. This is due to the way the Cox PH is estimated. The baseline hazard is treated as a nuisance parameter, which gives the output an interpretation of relative hazards. By estimating the baseline hazard using parametric models, we are able to make estimates of probabilities in time. However, since we have time-varying covariates, this becomes non-trivial: for prediction, we should know the covariates for each individual at every moment in time. Now, knowing the development of the covariates for some individual implies that we are aware whether the observation has failed or not. This creates a tautology.
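As an illustration of how such a Cox model with time-varying covariates could be fitted in practice, below is a minimal sketch using the CoxTimeVaryingFitter from the Python lifelines package on randomly generated long-format data. The column names and data are hypothetical; the thesis itself does not prescribe a particular software implementation.

```python
import numpy as np
import pandas as pd
from lifelines import CoxTimeVaryingFitter

rng = np.random.default_rng(1)
rows = []
for pid in range(1, 41):                      # 40 hypothetical car contracts
    n_periods = rng.integers(1, 5)            # periods of constant covariates
    start = 0.0
    for p in range(n_periods):
        stop = start + rng.uniform(30, 400)   # interval length in days
        lapse = int(p == n_periods - 1 and rng.uniform() < 0.4)
        rows.append({"id": pid, "start": start, "stop": stop,
                     "premium": rng.normal(500, 60),
                     "bm_change": rng.integers(-2, 3),
                     "lapse": lapse})
        start = stop

df = pd.DataFrame(rows)

ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="lapse", start_col="start", stop_col="stop")
ctv.print_summary()   # exp(coef) gives relative hazards for the time-varying covariates
```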

3.2 Prediction Models

In this section, we introduce the prediction models we use. We describe the logistic regression, elastic net (and its l1- and l2-regularizations), Random Forests and Gradient Boosted Trees.

3.2.1 Logistic regression

Following Dobson and Barnett (2018), we use the generic terms success and failure to describe the binary outcome variable of interest. We define the following random variable for individual $i$:
$$Y_i = \begin{cases} 1, & \text{if the outcome is a success} \\ 0, & \text{if the outcome is a failure,} \end{cases}$$
where we define $y_i$ as the $i$th observed value of the random variable $Y_i$. Furthermore, we define the short-hand notations for the probabilities as $\pi_i = \Pr(Y_i = 1)$ and $1 - \pi_i = \Pr(Y_i = 0)$. We could model the above using a linear model; however, this has the obvious limitation that the probabilities would not be restricted to the interval $[0, 1]$. Therefore, it is convenient to use the logit link function, defined as
$$\log\!\left(\frac{\pi_i}{1 - \pi_i}\right) = x_i'\beta,$$
where $x_i'$ is a $1 \times p$ covariate vector and $\beta$ is the $p \times 1$ parameter vector. Of course, the above can be solved for $\pi_i$ as
$$\pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}. \quad (4)$$

This gives the likelihood function
$$L(\beta) = \prod_{i=1}^{N} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}. \quad (5)$$

From (5) and after some algebra, we find the following log-likelihood function:
$$l(\beta) = \sum_{i=1}^{N} y_i x_i'\beta - \sum_{i=1}^{N} \log\!\big(1 + \exp(x_i'\beta)\big). \quad (6)$$

The estimates of $\beta$ are found by optimizing (6) with respect to $\beta$. Generally, this is solved by using iterated weighted least squares; see Dobson and Barnett (2018) for a discussion. The predicted probabilities can be found by substituting $\hat\beta$ for $\beta$ in (4). The predicted classes, the binary variables of interest, are determined by using a cut-off point in predicted probability. For this, we choose a cutoff value and classify inputs with probability greater than the cutoff as one class and below the cutoff as the other. Generally, a predicted probability of 0.5 or larger implies a success and smaller than 0.5 a failure.
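A minimal sketch of this procedure in Python with scikit-learn is given below; the synthetic data stand in for the real policy covariates, which are of course not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lapse data: X are covariates, y the binary lapse indicator.
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probabilities pi_hat and the 0.5 cut-off described above.
pi_hat = logit.predict_proba(X_test)[:, 1]
y_pred = (pi_hat >= 0.5).astype(int)
print("share predicted as success:", y_pred.mean())
```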

3.2.2 Elastic net

Zou and Hastie (2005) introduced the elastic net penalty. For this, we write the following penalized log-likelihood
$$(\hat\beta_0, \hat\beta) = \operatorname*{arg\,min}_{(\beta_0, \beta)\in\mathbb{R}\times\mathbb{R}^p} \Big\{ -\tfrac{1}{N}\, l(\beta_0, \beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \Big\},$$
where $\lambda_1, \lambda_2 \ge 0$ and $l(\beta_0, \beta)$ is the log-likelihood with intercept $\beta_0$ and parameter vector $\beta$. Using the Bernoulli log-likelihood, this becomes
$$(\hat\beta_0, \hat\beta) = \operatorname*{arg\,min}_{(\beta_0, \beta)\in\mathbb{R}\times\mathbb{R}^p} \Big\{ -\tfrac{1}{N} \sum_{i=1}^{N} \tilde{y}_i(\beta_0 + \tilde{x}_i'\beta) + \tfrac{1}{N} \sum_{i=1}^{N} \log\!\big(1 + \exp(\beta_0 + \tilde{x}_i'\beta)\big) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \Big\}, \quad (7)$$
where $\tilde{y}_i$ is centred, that is $\sum_{i=1}^{N} \tilde{y}_i = 0$, and each column of $\tilde{x}$ is scaled to have a mean of 0 and a variance of 1. Scaling and centering is sensible: variables with large magnitudes will otherwise disproportionally affect the solution.

The summation $\sum_{j=1}^{p} |\beta_j|$ is called the $l_1$-norm and $\sqrt{\sum_{j=1}^{p} \beta_j^2}$ is called the $l_2$-norm. Removing the $l_2$-norm in (7) leaves us with the LASSO (Least Absolute Shrinkage and Selection Operator). The Lasso causes a subset of the estimated coefficients $\hat\beta_j$ to be exactly zero for a sufficiently large value of the tuning parameter $\lambda_1$. The LASSO thus provides a form of variable selection. Removing the $l_1$-norm leaves us with the Ridge regression. The Ridge penalty shrinks the coefficients of correlated variables towards each other. This is different from the LASSO, which picks one of the predictors and sets the others to 0. This can also be seen from the fact that $|b|$ is much larger than $b^2$ for small $b$.

The elastic net regression is a compromise between the Ridge and LASSO regressions, and was derived as an extension of the LASSO (Zou and Hastie, 2005). The LASSO regression tends to select one variable from a group of highly correlated variables whilst ignoring the others. The elastic net tends to select (or not select) the correlated features together (Hastie et al., 2015). For example, in our case, this would mean that all binary variables related to the provinces are likely to be either included or excluded together. This is also the reason for the name elastic net; from the original (2005) paper: "It is like a stretchable fishing net that retains 'all the big fish'.".

In practice, the two tuning parameters are often combined by writing the penalization term as
$$\lambda \Big[ \tfrac{1}{2}(1 - \alpha) \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j| \Big], \quad (8)$$
which varies between the LASSO ($\alpha = 1$) and Ridge ($\alpha = 0$). Note that for $\lambda > 0$ and $\alpha < 1$, the elastic net is strictly convex and a unique minimum exists. In practice, this implies that correlations or duplications (perfect multicollinearity) in the predictor matrix do not preclude a unique solution.

It can be directly noticed that the elastic net model has, in theory, at least as good prediction accuracy as either the LASSO or Ridge, since the LASSO and Ridge are special cases of the elastic net model. However, hyper-parameter tuning is more computationally intensive in the elastic net case: Instead of searching a one-dimensional grid of tuning parameters, we now search a two-dimensional space. Because of this, it is possible in practice for the LASSO and Ridge to outperform the elastic net.

The elastic net model also has an intuitive geometric interpretation, see Figure 2. In Figure 2, we have the contour plots of the penalties of the Ridge, Lasso and elastic net. This graph is from the original elastic net paper of Zou and Hastie (2005), except for the maroon and gray coloured circle and capsule. Figure 2 shows the two-parameter constraint region for the ridge and LASSO regressions. The (- - -)-square is the constraint region of the $l_1$-regularization with $|\beta_1| + |\beta_2| \le t$. The (-.-.)-circle is the constraint region of the $l_2$-regularization with $\beta_1^2 + \beta_2^2 \le t$, i.e. the Ridge model. The shape between these is a combination of the $l_1$ and $l_2$ penalties, and is the penalty of the elastic net model setting $\alpha = 0.5$. Note the difference between the shapes of the parameter constraint regions in the maroon circle: The LASSO constraint has 90-degree angles, whilst the Ridge constraint has no corners. The corners encourage some $\hat\beta_j$s to be set to zero. The contour in the gray capsule encourages sharing of coefficients.

To tune our elastic net model, we use the following grid for the parameters of the constraint in (8); a sketch of such a grid search is given after the list.

• We use $\alpha \in \{0.05, 0.10, \ldots, 1\}$.

• We use 17 different values for $\lambda$. We use $\lambda$s from 0.00011 to the $\lambda$ which sets all the coefficients to zero for $\alpha = 0.001$, spaced linearly on a log-scale. We choose a non-zero $\alpha$ since for $\alpha = 0$ we would need $\lambda = \infty$ for the coefficients to be set to zero.
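The sketch below shows how such a grid search could look with scikit-learn. Note that scikit-learn parameterizes the penalty through l1_ratio (playing the role of α) and C (an inverse penalty strength, roughly 1/λ), so the grid values here are only illustrative stand-ins, reduced in size, and not the exact grid used in the thesis; the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

# Elastic net logistic regression: the 'saga' solver supports the mixed l1/l2 penalty.
pipe = Pipeline([
    ("scale", StandardScaler()),   # centre and scale the covariates, as in (7)
    ("clf", LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000)),
])

param_grid = {
    "clf__l1_ratio": [0.05, 0.25, 0.5, 0.75, 1.0],   # role of alpha in (8)
    "clf__C": np.logspace(-3, 1, 9),                 # inverse penalty strength, log-spaced
}

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```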

3.2.3 Random Forest

Before we discuss the Random Forest model, we first discuss its building blocks: Decision Trees. Since our variable of interest, lapse, is a binary variable, we focus on classification trees. Classification trees consist of the following parts:

• Root nodes: entry node; first split.

• Internal nodes: a set of splits.

• Leaf nodes: final outcome, also called class labels.

Classification trees partition the variable space into disjoint regions $R_j$, $j = 1, 2, \ldots, J$. Following Hastie et al. (2009), we assign a value $\gamma_j$ to each such region and use the predictive rule $x_i \in R_j \Rightarrow f(x_i) = \gamma_j$, where $x_i$ is the set of independent covariates for observation $i$ and $f(\cdot)$ is a tree. For classification, it is natural to use the most prevalent class for $\gamma_j$. Formally, a tree is then expressed as
$$T\big(x_i; \{R_j, \gamma_j\}_{j=1}^{J}\big) = \sum_{j=1}^{J} \gamma_j I(x_i \in R_j). \quad (9)$$

Fig. 2: Geometric interpretation of the regularizations.

The graph is from the original Zou and Hastie (2005) elastic net paper, except for the maroon and gray coloured circle and capsule. The graph shows the two-parameter constraint region for the ridge and LASSO regressions. The (- - -)-square is the constraint region of the $l_1$-regularization, i.e. the LASSO model. The (-.-.)-circle is the constraint region of the $l_2$-regularization (squared $l_2$-norm), i.e. the Ridge model. The shape between is a combination of the $l_1$ and $l_2$ penalties, and is the penalty of the elastic net model.

Let $\hat{p}_j$ denote the proportion of observations of class $j$ in node $t$. Then, the Gini index is defined as
$$\phi_G(p) = \sum_{j=1}^{J} \hat{p}_j (1 - \hat{p}_j),$$
where $p = (p_1, \ldots, p_J)$. The Gini index can be seen as the expected error rate if the class label for a node is randomly chosen. Note that $\hat{p}_j$ is the probability of a random entry belonging to class $j$, whilst $(1 - \hat{p}_j)$ is the probability of misclassification.

If split $s$ from parent node $t$ sends a proportion $P_L$ of the population to the left and a proportion $P_R$ to the right, we let $p_L = (p_{1,L}, \ldots, p_{J,L})$ be the class proportions in the left node and $p_R = (p_{1,R}, \ldots, p_{J,R})$ be the class proportions in the right node. Now, the goodness-of-split is defined as
$$\theta(s, t) = \phi_G(p) - P_L\, \phi_G(p_L) - P_R\, \phi_G(p_R).$$
For each split, we want to find a goodness-of-split measure that is as large as possible.
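A small sketch of these two quantities, evaluated for a hypothetical binary split, is given below; the class proportions are invented for illustration only.

```python
import numpy as np

def gini(p):
    """Gini index phi_G(p) for a vector of class proportions p."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def goodness_of_split(p_parent, p_left, p_right, P_L, P_R):
    """theta(s, t) = phi_G(p) - P_L * phi_G(p_L) - P_R * phi_G(p_R)."""
    return gini(p_parent) - P_L * gini(p_left) - P_R * gini(p_right)

# Hypothetical parent node with 50/50 no-lapse/lapse, split 60/40 into purer children.
theta = goodness_of_split([0.5, 0.5], [0.8, 0.2], [0.05, 0.95], P_L=0.6, P_R=0.4)
print(round(theta, 3))   # larger values indicate a better split
```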

Figure 3 shows an example of a decision tree based on CART, fitted on our data using only the geo-factors and binary indicators of specific coverages related to the car insurance.

Decision Trees are known for their ease of interpretation, but notorious for their potential over-fit. An algorithm is susceptible to over-fit if a small change in the data causes a large change in the overall model. This is due to the way the classification trees are constructed: because of the hierarchical nature of the classification trees, a large error at the start of the tree propagates down to the splits below.

Fig. 3: This figure shows an example of a classification tree fitted by CART.

The geo-factors and binary variables for coverages are the variables we use for the classification tree. The first split was by dek ard. Hence, the largest goodness-of-split was achieved by splitting the consumers based on dek ard, which is the root node. The nodes on the bottom are the leaf nodes, which give the final prediction of either 0 (no-lapse) or 1 (lapse). The final predictions are based on the most prevalent class in the node. The percentages denote the proportion of data left in the nodes. Note that for the leaf nodes the percentages sum to 100%, as they should. The fractions in the nodes show the fraction of the no-lapse class (left) and the fraction of the lapse class (right), which sum to 1.

The Random Forest model combines many bootstrapped classification trees into one model. The "bootstrap" part is due to the fact that the algorithm randomly selects, with replacement, a subsample of the training data for each tree and, for each split, randomly selects a subsample of the variables. These variables are then used to fit the tree.

The main idea is that trees are noisy, but unbiased on average. The idea of a Random Forest is to average these noisy trees to reduce prediction variance. In classification trees, this is done in a plurality voting scheme where each tree "votes" on the class outcome. To be precise, for a given number of bootstrap iterations, we draw a bootstrap sample Z* of size N from the training data. For each tree and split, we sample a subset of the variables and determine the optimal variable split by the Gini index. This is repeated B times. Final class predictions are done by using majority voting: Each individual tree determines a class prediction and the class with the maximum number of votes becomes the prediction of the Random Forest. There are many tuning parameters for the Random Forest. The implementations of the Random Forest model also differ in the options of tuning parameters. We discuss three of those tuning parameters:

• Variable candidates for each split.

• Number of trees.

• Node size of the leaf node.

By sampling the variable candidates for each split, not only the most influential variables are chosen for splits, but also less useful, but still highly relevant, variables.

The number of trees will not be tuned. The error rate stabilizes for a certain number of trees. We consider 500 to be large enough for our model. However, as a sensitivity analysis, we train a second Random Forest model and check whether it produces a similar result.

The node size hyper-parameter sets the minimum number of observations in a leaf node. For a dataset with a large number of noisy variables, we want the node size of the leaf node to be larger than for a dataset with many useful predictors.

Note, however, that there are more hyper-parameters which could be tuned. For example, the maximum number of leaf nodes, the total number of nodes and the size of an internal node necessary to make a split are also possible hyper-parameters.

To conclude, we use the following tune-grid to optimize our Random Forest model

• Variable candidates for each split: p ∈ {4, 5, 6, 7, 8, 9, 10, 11, 12}.

• Node size of the leaf node: M ∈ {1, 10, 100, 1000}.

Hence, we fit a total of 9 × 4 = 36 models. The same hyper-parameters are used for the weighting procedure for the first section of the survival models.
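A sketch of this 9 × 4 grid with scikit-learn's RandomForestClassifier is given below; max_features and min_samples_leaf play the roles of the variable candidates per split and the leaf node size, and the synthetic data are again only a stand-in for the real policy data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

rf = RandomForestClassifier(n_estimators=500, random_state=1)

param_grid = {
    "max_features": list(range(4, 13)),        # variable candidates for each split
    "min_samples_leaf": [1, 10, 100, 1000],    # node size of the leaf node
}

search = GridSearchCV(rf, param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```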

3.2.4 Gradient Boosted Trees

Formally, following (9) and defining the set $\Theta = \{R_j, \gamma_j\}_{j=1}^{J}$, we find the optimal parameter vector $\Theta$ by solving
$$\hat\Theta = \operatorname*{arg\,min}_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j),$$

where $L(\cdot, \cdot)$ is some differentiable loss function. The boosted tree model is built using the sum of classification trees
$$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m),$$
in a forward step-wise manner, where $\Theta_m = \{R_{jm}, \gamma_{jm}\}_{j=1}^{J_m}$ for classification tree $m$. For each step, we should solve
$$\hat\Theta_m = \operatorname*{arg\,min}_{\Theta_m} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\big).$$

For binary classification, we can use the negative binomial log-likelihood as the loss function, that is, we use
$$L(y_i, \hat{y}_i) = -y_i \log\frac{1}{1 + e^{-\hat{y}_i}} - (1 - y_i)\log\frac{e^{-\hat{y}_i}}{1 + e^{-\hat{y}_i}}. \quad (10)$$

The above is solved using gradient coordinate descent (hence, the name Gradient Boosted Trees/Machines).

The algorithm for Gradient Boosted Trees works as follows. We first set a starting value $f_{k0} = 0$ for each classification group. Since we have a binary classification problem, we have $k \in \{1, 2\}$.

Then, for each $m \in \{1, \ldots, M\}$, where $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, J_m\}$ and $k \in \{1, 2\}$, we perform the following steps.

1. First, we calculate the classification-group pseudo-residuals, defined as the negative gradient
$$r_{ikm} = -\left[\frac{\partial L\big(y_i, f_k(x_i)\big)}{\partial f_k(x_i)}\right]_{f = f_{m-1}}. \quad (11)$$
This is the gradient after which the Gradient Boosted Trees are named. For binary classification with the loss function defined in (10), (11) reads
$$y_{ik} - \frac{e^{f_k(x_i)}}{\sum_{l=1}^{2} e^{f_l(x_i)}}.$$

2. In the second step, we fit a regression tree to the targets $r_{ikm}$, giving regions $R_{jkm}$. The regions $R_{jkm}$ are the terminal nodes of the regression tree.

3. In the third step, we choose a gradient descent step size. We calculate
$$\gamma_{jkm} = \frac{1}{2} \times \frac{\sum_{x_i \in R_{jkm}} r_{ikm}}{\sum_{x_i \in R_{jkm}} |r_{ikm}|\,(1 - |r_{ikm}|)}.$$

4. In the fourth step, we update
$$f_{km}(x) = f_{k,m-1}(x) + \eta \sum_{j=1}^{J_m} \gamma_{jkm}\, I(x \in R_{jkm}),$$
where $\eta$ is the learning rate. We update the solution by adding a shrunken version of the new tree. After these four steps, we move on to the next value of $m$.

We tune the following hyper-parameters for the Gradient Boosted Trees model; a sketch of such a grid search is given after the list.

• The learning parameter $\eta$. The hyper-parameter $\eta$ acts as a shrinkage parameter for the model. It is a multiplicative value on the new tree and decides how much information is used for the fit of the new tree. Ridgeway (2007) suggests setting $\eta$ very small ($\eta < 0.1$). We use $\eta \in \{0.001, 0.01\}$.

• The number of trees $M$. Unlike the Random Forest model, the number of trees needs to be tuned.6 If $M$ is too small, we have suboptimal predictions; if $M$ is too large, we overfit the data.7 However, this is partly dependent on the learning rate $\eta$. We use $M \in \{50, 200, 500, 1000\}$.

• The number d of splits in each tree. We try d ∈ {1, 2, 8, 9, 10, 11, 12}.
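As a rough sketch, such a grid could be searched with scikit-learn's GradientBoostingClassifier as below; learning_rate, n_estimators and max_depth approximately correspond to η, M and d, the data are synthetic, and the exact implementation used in the thesis may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

gbt = GradientBoostingClassifier(random_state=1)

param_grid = {
    "learning_rate": [0.001, 0.01],            # eta
    "n_estimators": [50, 200, 500, 1000],      # M, the number of trees
    "max_depth": [1, 2, 8, 9, 10, 11, 12],     # roughly d, the splits per tree
}

search = GridSearchCV(gbt, param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```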

3.2.5 Cross-validation, metrics & interpretation

To evaluate our algorithms, we make use of k-fold cross-validation. In k-fold cross-validation, the data set is randomly split into k equally sized sub-samples, also called folds. For each k, one of the folds is the validation set, whilst the remaining k − 1 folds are the data on which the algorithm is fit. For classification models, we compute an error metric on the validation set. Combined with k-fold cross-validation, it is defined as
$$CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{Err}_i,$$
where $\text{Err}_i$ is the Area Under the Receiver Operating Characteristic curve, often denoted as AUROC or AUC, for cross-validation iteration $i$. The AUC will be explained later in the text. Cross-validation is used to test the generalizability of the algorithm. We fit a model on training data and test how the model performs on previously unseen data. This gives us evidence for or against potential overfit. Often k = 5 or k = 10 is used. For k = N, where N is the number of observations, we have Leave-one-out cross-validation (LOOCV). Efron (1983) shows that LOOCV is almost unbiased but has a high degree of variance. This leads to unreliable estimates. Generally, lower values of k lead to more bias, but a lesser degree of variance. Also, lower values of k are computationally more tractable. For our models, we make use of k = 5 cross-validation.

6 My hypothesis is that some people think the number of trees in the Random Forest model is a to-be-tuned hyper-parameter, since it is for the boosting approach.

We split the data in 80% training data and 20% test data. The model is fitted using the k = 5 cross-validation on the training data. We discuss the models by using confusion matrices and the AUC.
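Below is a minimal sketch of this 80/20 split combined with 5-fold cross-validated AUC on the training part; the model shown is just a placeholder for any of the four candidate models, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = RandomForestClassifier(n_estimators=500, random_state=1)

# 5-fold cross-validated AUC on the training data: CV_(5) is the mean of the fold AUCs.
fold_auc = cross_val_score(model, X_train, y_train, scoring="roc_auc", cv=5)
print(round(fold_auc.mean(), 3))
```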

In Table 5, we show the general form of the confusion matrix and the definitions for the True positive rate and the True negative rate.

Tab. 5: Confusion matrix.

The confusion matrix with some definitions. We define the True positive rate and the True negative rate. The bold letters denote the abbreviations used for the True positive rate and True negative rate definitions.

                          True Class
Predicted Class           Success            Failure
Success                   True Positives     False Positives    True positive rate = TP / P
Failure                   False Negatives    True Negatives     True negative rate = TN / N
Column sums:              P                  N

The true positive rate is also known as sensitivity and the true negative rate is also known as specificity. Using the generic terms success and failure to describe the binary outcome, sensitivity is the probability of predicting success given true state is success, whilst specificity is the probability of predicting failure given true state is failure.

Sensitivity and specificity trade off against each other: increasing one by moving the cut-off generally decreases the other. A cut-off point of 0.5 is sensible and will be used for our confusion matrices.

Having estimated the models, we can vary this cut-off point and determine how the true positive and false positive rates change. The plotted values are known as the receiver operating characteristic curve, commonly known as the ROC curve. The ROC graph is a two-dimensional plot with the true positive rate on the vertical axis and the false positive rate on the horizontal axis. The ROC curve illustrates the trade-off between the true positive and false positive rates. The area under the ROC curve is known, unsurprisingly, as the Area Under the Curve (AUC). The AUC is equal to the probability that a randomly chosen true positive is scored higher than a randomly chosen true negative. Naturally, we want an AUC as large as possible. The benefit of the AUC compared to other metrics is that we do not have to categorize the predicted probabilities into responses, which would result in a loss of information (Pontius and Millones, 2011). The test data will give the final Area Under the Curve (AUC). The AUC will then be used to compare the final prediction models.
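The sketch below, continuing the synthetic setting of the earlier snippets, computes the confusion matrix at the 0.5 cut-off, the test-set AUC and the points of the ROC curve with scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_test = clf.predict_proba(X_test)[:, 1]

# Confusion matrix at the 0.5 cut-off and the threshold-free AUC.
print(confusion_matrix(y_test, (p_test >= 0.5).astype(int)))
print("AUC:", round(roc_auc_score(y_test, p_test), 3))

# ROC curve: false positive rate vs true positive rate over all cut-offs.
fpr, tpr, thresholds = roc_curve(y_test, p_test)
```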

Fig. 4: An example of a ROC curve.

This is an example of a Receiver Operating Characteristic (ROC) curve. The ROC curve illustrates the trade-off between the true positive and false positive rates. The gray line is the performance of some model. The maroon line denotes a "model" based on random guesses; in that case, the AUC is approximately 0.5 and the model has no discriminating power to distinguish between the positive class and the negative class. (The horizontal axis shows the False Positive Rate and the vertical axis the True Positive Rate, both from 0 to 1.)

For the Random Forest, variable importance cannot be read off directly, since a forest consists of many trees with many different ways to split the data. Therefore, we make use of a permutation algorithm: Accuracy is measured using the observations which were not used for fitting the tree, also called the Out-Of-Bag (OOB) observations. To calculate the importance of variable j, we permute the values of the jth variable and re-compute the accuracy. We average the decrease in accuracy over all trees. Then, the average reduction in accuracy is used as a measure of variable importance. For the Gradient Boosted Trees, we use the reduction in the loss function for each split, following Ridgeway (2007).

To interpret our machine learning models, we make use of Partial Dependence Plots (PDPs), following Friedman (2001). These show how the response variable changes as we change the variables of interest whilst averaging over the effects of the other variables.
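A sketch of both tools with scikit-learn is given below; permutation_importance implements the permutation idea described above (although the thesis computes it out-of-bag within the Random Forest rather than on a held-out set), PartialDependenceDisplay draws the PDPs (it assumes matplotlib is available), and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_train, y_train)

# Permutation importance: average drop in AUC when one variable is shuffled.
imp = permutation_importance(rf, X_test, y_test, scoring="roc_auc",
                             n_repeats=5, random_state=1)
print(imp.importances_mean.round(3))

# Partial dependence of the predicted class probability on the first two variables.
PartialDependenceDisplay.from_estimator(rf, X_test, features=[0, 1])
```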

4 Results

This section discusses the results for the survival and prediction models.

4.1 Survival models

We discuss the three subquestions as described in the Introduction. Again, they are the following:

1. Are large premium changes related to a higher lapse probability of insurance contracts?

2. Do frailty models help in explaining (unobserved) heterogeneity after controlling for a large number of geographic and behavioural factors?

3. Which (combination of) factors have the largest impact on lapse time?

4.1.1 Are large premium changes related to a higher lapse probability of insurance contracts?

In this section, we extend Guelman and Guillén (2014). We propose Inverse Probability of Treatment Weighting (IPTW) in a survival model context using a Random Forest algorithm. We estimate Hazard Ratios for binned changes in the total premium. The total premium is not equal to the sum of net premia as discussed in the Data description section, because the calculation of the sum of net premia differs across our data sources. We define the total premium as the sum of premia for the coverages Car collision, Fire and storm damage, Theft, Burglary, No-claim protection, Window cover, Vandalism and Customary liability insurance.

Customers can adjust their coverages over time. This implies that changes in the total premium can be due to specific changes in the insurance contract. Therefore, we excluded changes in total premium which were accompanied by a change in coverages, because premium changes due to changes in the contract should be anticipated by the consumer.

Within 2 weeks before contract renewal, consumers receive a new premium proposal. They have a small time-frame (fewer than 10 business days) to either accept or refuse this offer. If refused, the consumer has to find a new car insurer. If accepted, or if the consumer fails to refuse, the consumer will pay the offered premium from then on. The consumer still has the option, if the new contract took effect, to look for a new insurer in the meantime. The remaining premium of the contract will then be refunded. The above implies that some people who got a premium change will switch insurers before contract renewal. These consumers are not observed. This gives a potential bias for some of the results. We will discuss this issue later in the text.

Estimating the effect of a change in premium on lapse is subject to potential selection bias, since, as Guelman and Guillén (2014) put it, "historical rate changes are reflective of a risk-based pricing exercise". Although there is a commercial side to insurance pricing, insurers' main focus is "fair" pricing: the premium is based on estimated individual risk. This makes measuring the price sensitivity and the impact of a price shock unique. To estimate the impact of a change in premium, we use the change in total premium as discussed previously.

Guelman and Guillén (2014) used Propensity Score Matching and a Generalized Additive Model in a regression context to study price sensitivity of automobile insurance. We mimic a randomized trial by using IPTW (Robins et al., 2000). We can adjust for a set of confounders $C$ when measuring the effect of a treatment $A$ by weighting observation $i$ by the inverse probability weights
$$w_i = \frac{1}{P(A_i = a_i \mid C_i = c_i)}. \quad (12)$$

Weighting by $w_i$ creates a pseudo-population in which the set of confounders no longer predicts the treatments and in which one can infer the relation between the treatment and the event of interest. In other words, we create a pseudo-population for which the covariates and the treatment assignment are independent of each other. We would expect this in a randomized experiment (Thoemmes and Ong, 2016). Generally, stabilized weights are preferred over the weights in (12) (Robins et al., 2000). Stabilized weights are defined as
$$sw_i = \frac{P(A_i = a_i)}{P(A_i = a_i \mid C_i = c_i)}. \quad (13)$$

We note that equation (13) would be different in case of feedback.

The numerator can be estimated from the data by calculating the proportion of the different treatments. The denominator, id est the probability of a treatment given the set of confounders, can be estimated using a parametric model. Most of the time, the logistic regression is used. However, in recent years, the interest has shifted to non-parametric models, for example Zhu et al. (2015), Li et al. (2016) and Pirracchio et al. (2015). This is because possible misspecifications, such as non-linearity of the functional form of the relation between the pretreatment covariates and the treatment assignment, might bias the results.
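The sketch below illustrates how such stabilized weights could be computed from a non-parametric treatment model, here a random forest classifier over a categorical treatment standing in for the binned premium change; the data and column choices are hypothetical, and in practice one would use out-of-fold (cross-validated) predicted probabilities, as the thesis does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: C are confounders, A is a categorical treatment (4 premium-change bins).
C, A = make_classification(n_samples=3000, n_features=8, n_informative=5,
                           n_classes=4, random_state=1)

# Denominator of (13): P(A_i = a_i | C_i = c_i), estimated non-parametrically.
treatment_model = RandomForestClassifier(n_estimators=500, random_state=1).fit(C, A)
p_cond = treatment_model.predict_proba(C)[np.arange(len(A)), A]

# Numerator of (13): the marginal treatment proportions P(A_i = a_i).
p_marg = np.bincount(A) / len(A)
sw = p_marg[A] / p_cond   # stabilized weights

print(round(sw.mean(), 3), round(sw.min(), 3), round(sw.max(), 3))
```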

To include the weights in the Cox PH, we write the partial likelihood function of equation (3), as described in Section 3, as
$$L_p(\beta) = \prod_{i=1}^{n} \left\{ \frac{\exp\!\big(x_i'(y_i)\beta\big)}{\sum_{j \in R(y_i)} \exp\!\big(x_j'(y_i)\beta\big)\, sw_j(y_i)} \right\}^{sw_i(y_i)\,\delta_i}.
$$
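Once stabilized weights are available, a weighted Cox model with robust standard errors could be fitted as in the sketch below, assuming the Python lifelines package: weights_col passes the stabilized weights and robust=True requests the sandwich standard errors mentioned later in this section. The toy data are entirely hypothetical.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 500

# Toy data: duration until lapse/censoring, event indicator, a treatment dummy,
# the BM-change control, and precomputed stabilized weights sw.
df = pd.DataFrame({
    "duration": rng.exponential(scale=120, size=n),
    "lapse": rng.integers(0, 2, size=n),
    "premium_drop_10pct": rng.integers(0, 2, size=n),
    "bm_change": rng.integers(-2, 3, size=n),
    "sw": rng.uniform(0.5, 1.5, size=n),
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="lapse",
        weights_col="sw", robust=True)
cph.print_summary()   # exp(coef) column gives the hazard ratios
```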

To deduce causal effects, we need a number of assumptions. The most well-known assumption is the assumption of no unmeasured confounders, also called exchangeability. This assumption can be formalized as follows. Again, if we denote $A$ as the random variable denoting the treatment, $C$ as its set of corresponding confounders and the outcome (event time) as $T$, we should have $T^{(a)} \perp A \mid C$, where $a$ is the observed treatment. This condition states that there should be no variables which impact the propensity for all possible treatments $a$ and which are not included in our weighting function. That is, treatment assignment is based solely on covariate data and is independent of the potential outcomes. This assumption is untestable and should be critically reviewed and argued by the researcher. Note that this assumption takes into account whether we used the correct functional form of the relation between the treatment and the confounding covariates. In our case, we believe this assumption is satisfied, since our data set contains all the covariates which are used for premium calculation by the insurance company. Furthermore, the exceptionally large number of observed covariates in our dataset gives us a way to control for a large degree of multivariate variation between groups.

The second assumption we need is positivity. Mathematically, we should have $P(A = a \mid C = c) > 0$ for all $a$ and $c$. That is, each person should have a non-zero probability to be in each of the treatment groups. Again, this assumption is untestable.

The third (and last) assumption is consistency. This states that the actual outcome is equal to the potential outcome under the treatment administered (Rubin, 2009). It means that the treatment is defined unambiguously. Cole and Frangakis (2009) describe this informally as "have I defined exposure to include the causally relevant features?".9 This is, of course, satisfied in our case. If all three conditions are satisfied, there is no conceptual difference between a randomized experiment and an observational study (Fitzmaurice et al., 2008).

The insurance premium a consumer has to pay is largely based on previous claim history. In the Netherlands, this is translated to a Bonus-Malus system (BM system). In general, if one places a claim, she/he moves down (a few) steps, whilst if one does not place a claim in a year, she/he moves up one step.

Because of this system, it is important to control for a change in the BM steps if we want to study the impact of insurance premium changes on lapse.10 This is because a consumer can anticipate a change in their insurance premium amount. It is the decision of the consumer whether to file a claim or not after a car accident. Customers consider whether the claim benefits outweigh the increase of premium due to a lower BM step. In short, a consumer can see an increase in premium coming and hence should not be surprised by a premium increase next period. Furthermore, accounting for the BM changes in the Cox PH model gives us a way to account for the differences in competing insurers' BM premium tables: The insurance company with the lowest premium might be different depending on the step on the BM scale. Since we use individual car contracts, insurance of an extra car does not play a role in premium change. We control for a change in BM step by including it both in the weighting and in the final Cox PH model. This is known as doubly robust estimation and partly solves possible misspecification of the weighting model (Klein et al., 2014). Furthermore, it provides us with more precise estimates (Klein et al., 2014). The estimated weights for each percentage price change are plotted in Figure 5. The estimated weights are obtained via cross-validation and hyper-parameter tuning. Figure 5 gives us evidence that positivity holds; however, this argument is based on the assumption that we used the correct model.

9 Cole and Frangakis (2009) discuss how the specification of aspirin intake should change as we change the studied outcome. For example, to study the gastrointestinal impact of aspirin, it is important whether the aspirin was buffered or not, whilst for a study on the increased risk of heart attacks due to aspirin, it is possible to ignore whether the aspirin was buffered or not.

10 As a side note, it is possible to insure oneself at the insurer against dropping in steps on the insurer's internally-used BM scale after placing a claim. Since then the

We need to check whether our weighting procedure produced a pseudo-population with an equal distribution of covariates across the treatments. We did this for a small selection of our variables. We use short-hand notation for the treatments. We denote $Tr_1 = (-\infty, -10]$, $Tr_2 = (-10, -5]$, $Tr_3 = (-5, 0)$, $Tr_4 = (0, 5]$, $Tr_5 = (5, 10]$, $Tr_6 = (10, 15]$ and $Tr_7 = (15, \infty)$ as the binned percentage premium changes.

The results of some summary statistics on weighting are in Table 6. We note that the weighting procedure produced means and standard deviations which are similar over the treatments.

Fig. 5: Distribution of estimated weights for each treatment.

Here we have plotted the distribution of individual weights per treatment group (%change in total premium: (-Inf,-10], (-10,-5], (-5,0), (0,5], (5,10], (10,15], (15,Inf)), with the weights on the horizontal axis and the density on the vertical axis. If our model is true, the assumption of positivity is satisfied.

Recall that there is only a short period (fewer than 10 business days) for customers to find a new insurer.

Table 7 gives the results for a change in the total premium after controlling for a change in the BM scale. We removed observations where the percentage premium change was 0.11 For sensitivity analysis, we tried different strategies of binning the data; however, this did not significantly change our results. The result could indicate whether policyholders consider their premium bump or decline "fair". We used robust standard errors. This accounts for the fact that the pseudo-population may be larger or smaller than the real population. We note that the treatment of a premium change of (−∞, −10] reduces survival. For this treatment, we find a value of $e^{0.351} \approx 1.42$ as the Hazard Ratio (p-value < 0.01). This implies that a treatment of (−∞, −10] is associated with a 42% additional risk of lapse over the baseline treatment (0, 5], ceteris paribus. Also, the premium increase (15, ∞) produces a Hazard Ratio significantly larger than 1. The overall results are roughly V-shaped and imply that consumers have a higher probability of lapsing after both large premium decreases (that is, for the intervals (−∞, −10] and (−10, −5]) and large premium increases (that is, for the intervals (10, 15] and (15, ∞)). The results are graphically illustrated in Figure 6. There, the V-shaped effect is clearly visible.

11 Because this gives no extra information. Although, when we included the zeroes, we

Tab. 6: Results weighting on the covariates.

This table shows the means and standard deviations of the treatment groups before and after the weighting procedure. All numbers are rounded to the nearest integer. We notice that the weighting process produces a covariate distribution across treatments which is closer than before the weighting. We argue that the weighting procedure was successful.

Tab. 7: Results Cox PH in a 4-month contract-change interval.

This table shows the results for the 4-month contract-change interval. Note: * p<0.1; ** p<0.05; *** p<0.01.
Extra note: We removed observations where the percentage change of premium was 0.

Parameter                     Coef.    Robust se(Coef.)   HR (95% CI)       p-value
% Change in total premium:
(−∞, −10]                     0.351    0.045              (1.300, 1.554)    ***
(−10, −5]                     0.106    0.044              (1.020, 1.212)    **
(−5, 0)                       0.028    0.035              (0.961, 1.101)
(0, 5]                        baseline
(5, 10]                      −0.037    0.040              (0.891, 1.042)
(10, 15]                      0.091    0.060              (0.974, 1.231)
(15, ∞)                       0.300    0.055              (1.208, 1.498)    ***
Change in BM (Control):      −0.028    0.007              (0.959, 0.987)    ***

Fig. 6: Effect of the percentile change in total premium on the Hazard Ratio.

Here we have plotted the estimated Hazard Ratios with 95% confidence intervals for the 4-month lapse setting, per bin of the percentile change in total premium and relative to the baseline bin (0, 5]. We note that large premium decreases are associated with an increase in lapse probability.


We test the proportional hazards assumption, which requires that the covariate effects do not vary with time or any function of time. We use a Pearson correlation coefficient between the scaled Schoenfeld residuals and time for each covariate. Significance is tested using a chi-squared test statistic at the 5% significance level. We reject the null of proportional hazards. According to Allison (2014), this is not necessarily an issue.[12] He argues that the coefficient can then be read as a rough average effect. We believe that the use of an average hazard is reasonable in a four-month setting; the proportional hazards assumption is, after all, never precisely true (Therneau et al., 2020).
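The scaled Schoenfeld check can be carried out along the following lines. This is only an illustration of the test, reusing the fitted model and data frame from the earlier weighted-Cox sketch; the thesis itself reads as if it used R's cox.zph routine.

# Sketch: test the proportional hazards assumption via scaled Schoenfeld residuals
# against time (time_transform="identity" correlates the residuals with time itself).
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

def check_proportional_hazards(cph: CoxPHFitter, cox_df: pd.DataFrame) -> None:
    ph_test = proportional_hazard_test(cph, cox_df, time_transform="identity")
    ph_test.print_summary()    # chi-squared statistic and p-value per covariate
    # lifelines bundles the same test with diagnostic hints:
    # cph.check_assumptions(cox_df, p_value_threshold=0.05)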

We note that the impact of large premium changes corresponds to a Hazard Ratio larger than 1 on both the positive and the negative domain. We have two hypotheses for this result. Our first hypothesis is that a number of customers are inert and become active after a (large) price change, whether positive or negative. After a positive price change, the customer might want to look for another car insurance firm that sells the insurance at a lower price. After a negative price change, the customer may think “Did I pay too much for my car insurance all these years?”, which potentially instigates them to go looking for another insurer. We call this the shock effect. Our second hypothesis concerns the modelling approach. Since consumers had only a short time to find a new insurer (10 business days), it is possible that the results for the positive price changes are biased downwards, because we do not observe consumers who changed insurer within those 10 business days. We call this the selection effect.

We study, as a sensitivity analysis, whether changing the definition of lapse has an impact on the results. We change the period during which we consider lapse to be possibly influenced by a price change. We set this time-frame somewhat arbitrarily at 4 months, but there is no reason why it could not be shorter or longer. Therefore, we also study the effect of the price shock for 6 months, 2 months, 1 month and 2 weeks. The results are in Table 8. We note from Table 8 that the Hazard Ratios stay reasonably constant over time. That is, for different time-frames, we find similar statistical significances, signs and sizes of the effects. This gives us evidence for the proportional hazards assumption; the Cox PH model is generally accepted as robust to minor violations of the PH assumption (Barraclough et al., 2011).

[12] Interpretation would be problematic if the survival curves would cross. For example,

Again, the results are likely biased to a degree. However, we believe that the signs and significances of the coefficients still hold, since we consider 10 business days a short period of time in which to change insurer. Furthermore, it is unlikely that the selection effect plays a major role in the results for the negative price changes.


Tab. 8: Results Cox PH for a number of intervals.

Here are the results for the weighted Cox PH for a number of intervals: half a month, 1 month, 2 months and 6 months. Half a month was defined as 14 days, 1 month as 30 days, 2 months as 61 days and 6 months as 183 days.

Note: *p<0.1; **p<0.05; ***p<0.01

Extra Note: We removed observations where the percentile change of the total premium was 0.

Parameter                  Coef.     Robust se(Coef.)   HR 95% CI          p-value

% Change in total premium (half a month):
(−∞, −10]                   0.586     0.113              [1.441, 2.240]     ***
(−10, −5]                   0.206     0.095              [1.020, 1.480]     **
(−5, 0)                    -0.128     0.084              [0.746, 1.038]
(0, 5]                      baseline
(5, 10]                     0.062     0.102              [0.871, 1.298]
(10, 15]                    0.336     0.150              [1.041, 1.878]     **
(15, ∞)                     0.330     0.101              [1.142, 1.695]     ***
Change in BM (Control)     -0.065     0.009              [0.921, 0.953]     ***

% Change in total premium (1 month):
(−∞, −10]                   0.516     0.080              [1.431, 1.962]     ***
(−10, −5]                   0.251     0.084              [1.091, 1.514]     ***
(−5, 0)                    -0.0002    0.066              [0.879, 1.137]
(0, 5]                      baseline
(5, 10]                    -0.005     0.072              [0.864, 1.147]
(10, 15]                    0.271     0.107              [1.063, 1.617]     **
(15, ∞)                     0.418     0.093              [1.266, 1.822]     ***
Change in BM (Control)     -0.059     0.010              [0.923, 0.962]     ***

% Change in total premium (2 months):
(−∞, −10]                   0.421     0.060              [1.360, 1.710]     ***
(−10, −5]                   0.138     0.060              [1.020, 1.292]     **
(−5, 0)                     0.034     0.047              [0.943, 1.134]
(0, 5]                      baseline
(5, 10]                    -0.007     0.053              [0.896, 1.101]
(10, 15]                    0.141     0.077              [0.990, 1.340]     *
(15, ∞)                     0.370     0.070              [1.262, 1.662]     ***
Change in BM (Control)     -0.039     0.009              [0.944, 0.980]     ***


4.1.2 Do shared frailty models help in explaining (unobserved) heterogeneity after controlling for a large number of geographic and behavioural factors?


Shared frailty models extend the Cox PH model with a random effect that takes the dependence between observations belonging to a certain group into account.

Since a large fraction of policyholders holds multiple car contracts, either at the same time or sequentially, we have a natural clustering in the data. Adding a random-effect estimator to clusters of data reflects that observations within clusters are more similar than observations between clusters. Shared frailty models can take (some) unobserved covariates into account through the dependence of car contracts. They account for unobserved heterogeneity such as price sensitivity, the propensity to buy a new car, the degree of informedness about competing insurers' prices, and brand loyalty. For the ith subject (i = 1, ..., n_j) of the jth cluster (j = 1, ..., s), the hazard function conditional on the frailty term is defined as

\[
\lambda_{ij}(t \mid x, \beta) = \lambda_0(t)\, u_j \exp\{ x_{ij}'(t) \beta \}.
\]

The u_j denotes the multiplicative effect of cluster j; that is, it acts as a factor on the hazard function. The u_j is assumed to come from some distribution U_j. For identification purposes, we set the mean of U_j to 1 and leave its variance as an unknown value. A non-zero value for the variance of the frailty term implies dependence between the event times. Most models with a frailty component are solved using the Expectation-Maximization algorithm (EM-algorithm)[13], which is generally slow. However, for the Gamma and Log-normal frailties there exist closed-form solutions for the hazard function (Therneau et al., 2003). For the Gamma frailty, we have the following penalized partial log-likelihood, where γ_j = log(u_j) is distributed as the log of i.i.d. gamma distributed random variables. This partial log-likelihood should be maximized w.r.t. both β and γ:

\[
l_p(\beta, \gamma) = \sum_{j=1}^{s} \sum_{i=1}^{n_j} \delta_{ij} \left\{ A_{ij} - \log\left[ \sum_{(k,l) \in R(y_{ij})} \exp A_{kl} \right] \right\} \tag{14}
\]
\[
\qquad + \frac{1}{\theta} \sum_{j=1}^{s} \left[ \gamma_j - \exp(\gamma_j) \right], \tag{15}
\]

[13] The EM-algorithm iterates between an expectation and a maximization step. The

where A_{ij} = x_{ij}'(y_{ij}) β + γ_j Z_{ij}(y_{ij}). Here Z_{ij} equals 1 if car contract i belongs to policyholder j, and 0 otherwise. The above equation is solved using Newton-Raphson iterations. The γ_j-term is a variable that is shrunk towards zero (Therneau and Grambsch, 2000). The parameter θ is the variance of γ_j and controls the degree of heterogeneity between the clusters/groups/centres. Larger values of θ imply closer positive relationships between the car contracts corresponding to a policyholder. The parameter θ can be transformed into Kendall's τ, which measures the degree of association between outcomes (event times) within the same centre (Oakes, 1982).[14] For the Gamma frailty, the relation is τ = θ/(θ + 2). A general disadvantage of frailty models is their inability to describe negative dependence within clusters. For our study, we do not consider this to be a problem.
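To make this mapping concrete, consider a purely hypothetical Gamma frailty variance of θ = 0.5 (an illustrative value, not an estimate from our data):

\[
\tau = \frac{\theta}{\theta + 2} = \frac{0.5}{0.5 + 2} = 0.2,
\]

i.e. a modest positive association between the lapse times of contracts belonging to the same policyholder.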

For the Log-normal frailty, we replace the penalty term (15) with $-\tfrac{1}{2\theta} \sum_{j=1}^{s} \gamma_j^{2}$, where γ_j follows a Gaussian distribution. The variance of γ_j is again θ.

Because of computational limitations, we sampled 10,000 of the 50,000 policyholders. We plotted the number of contracts held by the sampled individual policyholders in Figure 7 and checked whether the sample is a good representation of our total data set; we found this to be the case.
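Sampling at the policyholder (cluster) level, so that all contracts of a sampled policyholder stay together, can be sketched as follows; the column name policyholder_id is an assumption.

# Sketch: sample whole policyholders (clusters), keeping all of their contracts.
import numpy as np
import pandas as pd

def sample_policyholders(df: pd.DataFrame, n: int = 10_000, seed: int = 1) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    sampled_ids = rng.choice(df["policyholder_id"].unique(), size=n, replace=False)
    return df[df["policyholder_id"].isin(sampled_ids)].copy()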

[14] Kendall's τ is defined as

\[
\tau = \Pr[(X_1 - X_2)(Y_1 - Y_2) > 0] - \Pr[(X_1 - X_2)(Y_1 - Y_2) < 0] = 2\Pr[(X_1 - X_2)(Y_1 - Y_2) > 0] - 1,
\]

where (X_1, Y_1) and (X_2, Y_2) are continuous random vectors drawn from the same joint distribution.

Fig. 7: Number of contracts held by the sampled individual policyholders.

The number of car contracts refers to individual car insurance contracts, that is, one contract per car, and counts both contracts held at the same time and contracts held sequentially.


In the frailty models, a Hazard Ratio is interpreted as the change in the within-cluster relative risk. For example, the Hazard Ratio of having No-claim insurance coverage in the Gamma frailty model indicates that, once we account for intragroup correlation via the Gamma distributed shared frailty term, for a given level of frailty the hazard with No-claim coverage is about four-fifths of the hazard without No-claim coverage.

We notice that adding a frailty term has little influence on the results: the estimated coefficients are similar, as are the confidence intervals. This indicates that the analysis is robust to the inclusion or exclusion of a frailty term.

Tab. 9: Results of the Cox PH frailty models for a selection of the variables.

This table shows the results of the Cox PH models with and without frailty for a selection of our variables. We only show the dummy coverage variables.

Variable                                                   No frailty          Gamma frailty       Log-normal frailty
                                                           HR [95% CI]         HR [95% CI]         HR [95% CI]

Dummy coverage:
Car accessory                                              1.18 [1.07, 1.30]   1.05 [0.97, 1.13]   1.07 [0.99, 1.15]
Car collision                                              0.96 [0.91, 1.02]   0.96 [0.90, 1.02]   0.96 [0.90, 1.02]
Fire and storm damage                                      1.09 [0.99, 1.19]   1.05 [0.95, 1.16]   1.07 [0.97, 1.18]
Theft                                                      Not enough variation.
Burglary                                                   0.94 [0.87, 1.02]   0.90 [0.82, 0.99]   0.91 [0.83, 1.004]
No-claim insurance                                         0.84 [0.78, 0.88]   0.81 [0.77, 0.86]   0.82 [0.78, 0.87]
Roadside assistance car abroad                             2.58 [1.37, 4.87]   1.49 [1.35, 1.64]   2.73 [1.54, 4.84]
Roadside assistance car in the Netherlands                 1.14 [0.89, 1.48]   1.26 [1.00, 1.58]   1.19 [0.94, 1.50]
Car window cover                                           0.79 [0.73, 0.86]   0.79 [0.72, 0.87]   0.79 [0.72, 0.87]
Vandalism                                                  Not enough variation.
Life insurance occupants                                   0.74 [0.58, 0.94]   0.78 [0.62, 0.98]   0.80 [0.64, 1.01]
Replac. vehicle abroad due to car breakdown                Not enough variation.
Replac. vehicle in the Netherlands due to car breakdown    0.86 [0.67, 1.12]   0.86 [0.68, 1.09]   0.86 [0.67, 1.09]
Replacement vehicle abroad due to car damage               0.30 [0.16, 0.56]   0.27 [0.15, 0.48]   0.29 [0.17, 0.51]
Non-life insurance occupants                               1.30 [1.12, 1.51]   1.31 [1.11, 1.54]   1.31 [1.11, 1.54]


car contract A of policyholder 1 lapses later than car contract B of policyholder 2, but car contract B of policyholder 1 lapses sooner than car contract B of policyholder 2. This interpretation also holds for policyholders with more than two car contracts. For the Log-normal frailty term, we find an estimated variance of 0.278. The estimated variances are best described graphically: we have plotted the distributions of the Gamma and Log-normal frailty terms in Figure 8 and notice that the histograms look similar. The dotted lines denote the first and third quartiles, respectively. The quartiles of the estimated distribution of Gamma frailty terms are 0.81 and 1.15, with a median of 0.98. Thus, policyholders at the first quartile have a (1 − 0.81/0.98) × 100% ≈ 17.3% lower risk of lapse than those at the median, whilst those at the third quartile have a (1.15/0.98 − 1) × 100% ≈ 17.3% higher risk. For the Log-normal frailty, the first quartile is 0.84, the third quartile is 1.17 and the median is 1.00.

Now it is, of course, interesting to test the significance of the frailty term. Since the variance of the frailty term is bound to be non-negative, we have the following one-sided test:

\[
H_0: \theta = 0 \quad \text{versus} \quad H_1: \theta > 0.
\]


Fig. 8: Distribution of estimated frailties.

This figure shows the distributions of the estimated Gamma frailty and Log-normal frailty terms. We notice that the distributions are very similar. The left dotted lines denote the first quartiles, whilst the right dotted lines denote the third quartiles.


Under the null hypothesis, the likelihood ratio test statistic is asymptotically distributed as $\tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1$. The log partial likelihood is -167786.3 for the Gamma frailty versus -163559.4 for the model without frailty, which gives a chi-square statistic of 8453.8 on one degree of freedom and a p-value of $p = \tfrac{1}{2}\Pr(\chi^2_1 > 8453.8) = 0.000$. Similarly, for the Log-normal frailty, we find a p-value of 0.000 for the test on the significance of the frailty term. To be on the safe side, we can use $\chi^2_1$ as the limiting distribution under the null; doubling the p-values, we still find highly significant results.
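The mixture p-value reported above can be reproduced from the test statistic alone; scipy is used here purely for illustration.

# Sketch: p-values for the frailty variance under the 0.5*chi2_0 + 0.5*chi2_1
# mixture null and under the conservative chi2_1 null.
from scipy.stats import chi2

lrt_stat = 8453.8                        # chi-square statistic reported in the text
p_mixture = 0.5 * chi2.sf(lrt_stat, df=1)
p_conservative = chi2.sf(lrt_stat, df=1)
print(p_mixture, p_conservative)         # both are numerically zero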

One of the uses of the frailty term is to model the effects of hidden covariates (Hanagal, 2019). A possible practical implication of this result is that there remains some (unobserved) heterogeneity in the data that cannot be explained by the included variables but can be captured by a random intercept across policyholders. Frailty models therefore give us, potentially, a useful way to check whether we have enough data to explain most of the variation. This applies to the setting of policyholders holding multiple car contracts, but potentially also to policyholders holding a single contract. In the case of car insurance, the frailty term might capture variation across policyholders such as price sensitivity, the propensity to buy a new car, the consumers' degree of informedness about competing insurers' prices and brand loyalty.
