
Dynamic Model Selection Procedures in the Automobile Insurance Context

Student: Rieneke Getkate Student Number: s2497212

Abstract. The developments in big data and statistical learning techniques are believed to allow actuaries to improve the pricing models of non-life insurance. However, the consequences of applying such techniques to insurance ratemaking have not been fully determined and are therefore an important topic of research in the actuarial sciences. In this research, model selection algorithms are adapted to and applied in the context of a car insurance data set. Both subset algorithms and a shrinkage method are applied to the GLMs used to estimate the frequency and severity of claims. The resulting models are tested on multiple goals within insurance ratemaking, including predictive accuracy, applicability within the industry and robustness of the models. Subsequently, their performance is compared to that of the rating model currently used by the insurer. Using traditional stopping rules, it is shown that model selection algorithms are able to improve predictive accuracy at the cost of applicability and robustness. When studying the paths of subset selection algorithms, it is even possible to find models that improve the combined performance of rating models on the given goals. This implies that model selection algorithms should not be perceived as a black box producing a single best model, but can serve as an important source of information on model dynamics.

Contents

1 Introduction
2 GLM-Framework
  2.1 Current Model
  2.2 Frequency GLMs
  2.3 Severity GLMs
  2.4 Distribution Selection
3 Data
  3.1 Dependent Variables
    3.1.1 Claim Frequency
    3.1.2 Claim Severity
  3.2 Independent Variables
    3.2.1 Categorical Variables
    3.2.2 Correlation
4 Model Performance Measures
  4.1 Prediction
    4.1.1 Measures of Prediction Accuracy
    4.1.2 Simulation Method
    4.1.3 Simulation Results
  4.2 Other Performance Measures
    4.2.1 Model Fit
    4.2.2 Applicability
    4.2.3 Robustness
5 Model Selection Algorithms
  5.1 Stepwise Selection
  5.2 Autometrics
  5.3 Results
    5.3.1 Frequency Models
    5.3.2 Severity Models
6 LASSO
  6.1 Model Definition
  6.2 Results
    6.2.1 Frequency Models
    6.2.2 Severity Models
7 Conclusion and Discussion
Bibliography
Appendices
  A Variable Binning
  B Cramér's V
  C Simulation Method
  D Pseudo Code Algorithms Chapter 5
  E Results Frequency Autometrics

1 | Introduction

Over the past few years the rise of big data has caught the interest of the non-life insurance industry. This development has considerably increased the number of potential risk factors that can be included in a rating model, which is used to determine insurance premia. Insurers believe that, among these newly introduced potential risk factors, factors might exist that improve their current rating model. However, the increasing number of candidate risk factors complicates the task of risk factor selection. Therefore, insurers are interested in the application of the available statistical learning methods in the ratemaking process. Statistical learning is a branch of machine learning that uses data-driven methods for making statistical decisions. When applied in the ratemaking process, this practice is also referred to as dynamic rating. Within statistical learning, several techniques have been developed to systematically compare different model specifications. These methods are referred to as model selection algorithms, which, due to their risk-based nature, are considered to allow insurers to select better rating models. The research question therefore is: to which extent can model selection algorithms improve rating models?

2 | GLM-Framework

In non-life insurance pricing it is common practice to estimate the total loss per policy instance using separate frequency and severity models. The frequency is defined as the number of claims per policy instance and the severity is defined as the cost of a claim, conditional on the occurrence of a claim. The total estimated claim cost for a policy instance, which defines the pure premium, is then determined as the product of the estimated frequency and severity. Frequency as well as severity are commonly predicted using Generalised Linear Models. In this research several possibilities of implementing statistical learning techniques within this GLM-framework will be discussed. Furthermore, the GLMs that are currently used to predict claim frequency and severity will be used as a benchmark for the performance of the models produced by model selection algorithms. This chapter introduces the general GLM-structure and the benchmark.

The imposed GLM-structure yields pricing models that can be explained to policyholders relatively easily, which increases their trust in the fairness of the model. Furthermore, restriction to the GLM-framework yields models that adhere to the structures found in the insurance industry. Therefore, these models are easier to implement within insurance companies. The use of different model classes, on the other hand, would introduce high costs of adapting operational structures within insurance companies. Hence, the restriction to the GLM-framework is a more appealing approach than the expansion beyond GLMs. While restriction to GLMs decreases the overall flexibility of the model, the flexibility within the GLM class will be exploited as fully as possible using statistical learning techniques.

In a Generalised Linear Model (GLM), a response y_i is estimated in two steps. First, an appropriate probability distribution F of the individual y_i's is selected. Then, the expectation of this distribution is estimated using a linear model and the inverse of a link function g. This link function is defined as the function that links the expectation µ_i of the ith response to a linear function. This linear function is defined as

    g(\mu_i / o_i) = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \dots + x_{im}\beta_m = x_i'\beta,    (2.1)

where the vector x_i contains the values of all included explanatory variables corresponding to observation i, and β contains the parameters. Furthermore, o_i defines the offset, which contains variables with a fixed multiplicative effect on the expected value. In insurance pricing, the log-link function is often used, that is

    \mu_i = o_i E(y_i) = o_i \exp(x_i'\beta).    (2.2)

This implies that the effect of the individual independent variables is assumed to be multiplicative instead of additive. Since this assumption is commonly made in the insurance industry, the log-link function will be used throughout this research. We assume that the response y_i, where i denotes the ith individual, is distributed as follows:

    E(y_i) = g^{-1}(x_i'\beta)    (2.3)
    y_i \sim F(\theta_i),    (2.4)
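To make this structure concrete, the following minimal sketch (not part of the original thesis) fits a Poisson frequency GLM with a log link and an exposure offset in Python using statsmodels; the data frame, column names and parameter values are purely illustrative assumptions.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical policy-level data: claim counts, policy duration (exposure) and one rating factor.
rng = np.random.default_rng(0)
n = 10_000
policies = pd.DataFrame({
    "claims": rng.poisson(0.04, size=n),
    "duration": rng.uniform(0.1, 1.0, size=n),
    "region": rng.choice(["north", "south", "west"], size=n),
})

# Poisson GLM with the (default) log link; log(duration) enters as an offset,
# which yields the multiplicative structure mu_i = o_i * exp(x_i' beta) of (2.2).
frequency_glm = smf.glm(
    "claims ~ C(region)",
    data=policies,
    family=sm.families.Poisson(),
    offset=np.log(policies["duration"]),
).fit()
print(frequency_glm.summary())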


where θ_i is a function of x_i, β and, if present, the additional parameters φ of F. Here, β and φ are constant over the observations. Using the probability distribution and the link function, the log-likelihood can be computed. The estimates of β and φ are found by maximising the log-likelihood. This information, as well as additional information on GLMs, can among others be found in Jong and Heller (2008) and Dobson and Barnett (2008).

2.1 Current Model

The insurance company that provides the data currently uses separate GLMs for frequency and severity estimations. These GLMs and their performance will be used as a benchmark for the performance of the model selection algorithms.

The available risk factors that can be used in a rating model include policy information, policyholder information, regional information and vehicle information. In the current rating model, the claim frequency is estimated using a Poisson GLM, 7 independent variables and an offset including 2 variables. These 7 independent variables include 1 risk factor concerning the policy characteristics, 3 risk factors concerning the region, 1 risk factor concerning individual information and 2 risk factors describing car characteristics. In the offset of the frequency GLM, the Bonus-Malus discount and the duration of the policy instance are considered. Since this offset is predetermined, it will be used in all frequency models considered throughout this research.

The currently used severity model is a Gamma GLM including 6 variables. In this case only 2 of the 3 previously mentioned regional variables are included. The other risk factors occur in both models. In the severity models, no offset is included.

In this chapter, the Poisson and Gamma distributions will be introduced. Furthermore, their alternatives will be mentioned. Section 2.4 introduces the method applied for distribution selection in the frequency-severity modelling context. In section 3.1, the observed dependent variables will be introduced. There, using the methods of section 2.4, it is shown that the Poisson and Gamma distributions are appropriate choices for the frequency and severity GLMs respectively.

2.2 Frequency GLMs

Since the claim frequency of an observation, denoted by c_i, is a non-negative integer, a discrete distribution should be used. The Poisson is a natural candidate for count responses. Combined with a log-link function and offset o_i, the resulting model is given by

    E(c_i) = \lambda_i = o_i \exp(x_i'\beta),    (2.5)
    f(c_i) = e^{-\lambda_i} \frac{\lambda_i^{c_i}}{c_i!}.    (2.6)

The log-likelihood function that is to be optimised is thus given by

    LL(c, \beta) = \sum_{i=1}^{n} \left( -o_i \exp(x_i'\beta) + c_i \left( \log(o_i) + x_i'\beta \right) - \log(c_i!) \right),    (2.7)

where c is a vector containing all individual claim counts c_i. The offset contains variables that have a fixed multiplicative effect on the expected claim frequency.
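As a worked illustration of equation (2.7), the hedged sketch below evaluates this log-likelihood directly; the design matrix X, count vector c, offset vector o and coefficient vector beta are assumed inputs and are not taken from the thesis.

import numpy as np
from scipy.special import gammaln

def poisson_loglik(beta, X, c, o):
    """Evaluate the Poisson log-likelihood (2.7) with offset o."""
    lam = o * np.exp(X @ beta)        # lambda_i = o_i * exp(x_i' beta), as in (2.5)
    # sum_i [ -lambda_i + c_i * log(lambda_i) - log(c_i!) ]; gammaln(c + 1) = log(c!)
    return np.sum(-lam + c * np.log(lam) - gammaln(c + 1))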


2.3 Severity GLMs

The severity s_i is defined as the cost corresponding to a claim. This response is continuous and strictly positive. The Gamma distribution is a common choice in insurance for the distribution of s_i. The Gamma distribution includes a shape parameter k and a scale parameter θ_i. Combined with a log-link function, the resulting model is given by

    E(s_i) = k \theta_i = \exp(x_i'\beta),    (2.8)
    f(s_i) = \frac{1}{\Gamma(k)\,\theta_i^{k}} \, s_i^{k-1} e^{-s_i/\theta_i}.    (2.9)

The log-likelihood function that should be optimised when estimating the GLM is thus given by

    LL(s, k, \beta) = \sum_{i=1}^{n} \left( -\log(\Gamma(k)) - k \log\left( \frac{\exp(x_i'\beta)}{k} \right) + (k-1)\log s_i - \frac{s_i}{\exp(x_i'\beta)/k} \right),    (2.10)

where s is a vector containing all observed claim amounts s_i. There is no offset included in the severity models. Jong and Heller (2008) propose, among others, the Log Normal distribution as a possible alternative to the Gamma distribution.
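Analogously to the frequency model, a Gamma severity GLM with a log link can be sketched as follows. This is an illustration under assumed data and column names, not the insurer's model; note that the log link must be requested explicitly, since it is not the statsmodels default for the Gamma family, and the link class name can differ between statsmodels versions.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical claim-level data: claim severities and one rating factor.
rng = np.random.default_rng(1)
claims_df = pd.DataFrame({
    "severity": rng.gamma(shape=0.9, scale=1700 / 0.9, size=5_000),
    "car_age": rng.choice(["new", "mid", "old"], size=5_000),
})

# Gamma GLM with an explicit log link, so that E(s_i) = exp(x_i' beta) as in (2.8).
severity_glm = smf.glm(
    "severity ~ C(car_age)",
    data=claims_df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()
print(severity_glm.summary())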

2.4 Distribution Selection

A distribution for the response variable of a GLM model should be selected with care. As mentioned before, several alternatives exist for both the frequency and the severity models. Using the risk factors included in the current models, these distributions will be compared and the most appropriate distribution will be selected.

In the case of frequency models, the Poisson model can be perceived as a special case of the Negative Binomial distribution that has no overdispersion. Therefore, Jong and Heller (2008) propose to use a likelihood ratio test between a Negative Binomial and a Poisson GLM in order to test for overdispersion. In addition to this test of overdispersion on the claim frequency data, quantile-quantile plots (QQ-plots) based on the deviance residuals are inspected for the candidate distributions of both the frequency and the severity models. Augustin et al. (2012) developed a method for constructing QQ-plots in the context of GLMs, based on the deviance residuals. The deviance residuals constitute a measure of the distance between the saturated and the fitted models. The saturated model represents a perfect fit, where all predictions are equal to the observed values. The log-likelihood of the saturated model is denoted by \overline{LL} and that of the fitted model by \widehat{LL}. Then the resulting deviance is given by

    \Delta \equiv 2(\overline{LL} - \widehat{LL}).    (2.11)

Due to the additive nature of log-likelihood functions, the deviance can be split into individual contributions. The deviance residuals are based on this concept and given by

    \delta_i = \mathrm{sign}(y_i - \hat{y}_i) \sqrt{2(\overline{LL}_i - \widehat{LL}_i)}.    (2.12)

3 | Data

For the application of the model selection algorithms, data from an all-risk automobile insurance are used. This chapter describes the used data and the unique characteristics of data used in insurance pricing. The data consist of information on policy instances that had full coverage for some period in 2011-2016. For analysis purposes, policies that were active during two calendar years are split into two separate policy instances. This feature, together with the inflow and outflow of insureds in the portfolio, implies that the average duration of a policy instance is 170.8 days. The full data set contains 931,985 policy instances of 160,108 insureds. The data set containing all policy instances incurring one or more claims includes 35,440 observations of 30,676 insureds. These data will be used for the estimation of the severity models.

The used policies have coverage for several classes of losses. Due to their relative importance in the share of total losses, the focus will be on all-risk losses, which indicate losses to the insured's own vehicle caused by the insured. Together with liability losses, these all-risk losses make up about 95% of the pure premium for an all-risk insurance. Since the predictive accuracy of the model will be tested using a hold-out sample and liability claims have a relatively long development period, the liability values observed in the test set would often not be fully developed. The all-risk losses are used because they are short-tailed, implying that the vast majority of the losses in the test set is fully developed. For those few claims that are not yet fully developed, the estimated future costs are included in the claim total.

3.1 Dependent Variables

In this section characteristics of the claim frequency and severity, which are the dependent variables, will be discussed. In particular, the distributions of the dependent variables will be selected using the methods proposed in section 2.4.

3.1.1 Claim Frequency

The claim frequency is defined as the number of claims that occurred during a policy instance. Most policy instances (96%) incurred zero claims. Of those policy instances with a positive claim frequency only 2.98% incurred more than one claim. Table 3.1 presents a few descriptive statistics of the claim frequency per policy instance.

As described in section 2.2, the Poisson distribution has the property that mean and variance are equal. From table 3.1, it can be observed that the squared standard deviation is slightly higher than the mean.

Table 3.1: Descriptive statistics claim frequency.

mean   sd   sd²   min   max

The likelihood ratio test as proposed by Jong and Heller (2008) resulted in a P-value of 0.767, indicating that there is no evidence of overdispersion. Additionally, QQ-plots based on the method of Augustin et al. (2012) are inspected. The deviance residuals of the Poisson distribution are given by

    \hat{\delta}_i = \mathrm{sign}(c_i - \hat{\lambda}_i) \sqrt{2\left\{ c_i \ln\left( \frac{c_i}{\hat{\lambda}_i} \right) - (c_i - \hat{\lambda}_i) \right\}}.    (3.1)
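The sketch below computes the Poisson deviance residuals of (3.1) and plots them against standard normal quantiles. This is a simplified stand-in for the construction of Augustin et al. (2012), intended only to illustrate the residual formula; the fitted values and counts are simulated here to keep the example self-contained.

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

def poisson_deviance_residuals(c, lam_hat):
    """Deviance residuals (3.1) of a Poisson GLM, with the convention 0 * ln(0) = 0."""
    ratio = np.where(c > 0, c / lam_hat, 1.0)
    term = np.where(c > 0, c * np.log(ratio), 0.0)
    return np.sign(c - lam_hat) * np.sqrt(2.0 * (term - (c - lam_hat)))

# Hypothetical fitted frequencies and observed counts, only to make the sketch runnable.
rng = np.random.default_rng(2)
lam_hat = rng.uniform(0.01, 0.10, size=5_000)
c = rng.poisson(lam_hat)

resid = np.sort(poisson_deviance_residuals(c, lam_hat))
theoretical = st.norm.ppf((np.arange(1, resid.size + 1) - 0.5) / resid.size)
plt.scatter(theoretical, resid, s=4)
plt.plot(theoretical, theoretical, color="grey")   # 45-degree reference line
plt.xlabel("theoretical quantiles")
plt.ylabel("ordered deviance residuals")
plt.show()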

Applying the method of Augustin et al. produces the plot as depicted in figure 3.1a. The deviation from the 45-degree line in the middle of the plot is caused by the integer nature of the claim frequency. The current GLM estimated a slightly lower total number of claims than is observed, causing this deviation. The Negative Binomial distribution produces an almost identical picture, as can be observed in figure 3.1b. Since the current rating model uses the Poisson distribution and there are no significant differences between the Poisson and the relatively more complicated Negative Binomial distribution, the Poisson distribution is selected for the frequency GLMs throughout this research.

(a) Poisson GLM. (b) Negative Binomial GLM.

Figure 3.1: QQ-plots based on the deviance residuals of the candidate distributions in the claim frequency GLMs.

3.1.2 Claim Severity

The claim severity is given by the pure cost of a single all-risk claim. For those few policy instances that incurred more than one claim, the claim severity is given by the mean severity of their incurred losses. Table 3.2 gives a few descriptive statistics of the claim severity. In figure 3.2 histograms of the severity are given on the interval [0, 10,000]. To both figures, the densities of the candidate distributions as described in section 2.3 are added. It can be observed that the Gamma distribution shows more mass in the tails than the Log Normal distribution.

Table 3.2: Descriptive statistics claim severity.

mean    sd      min    max
1,566   1,783   0.11   50,910


Using the method as described in section 2.4, the QQ-plots for both candidate distributions are computed. The deviance residuals of GLMs using these distributions are given by

    \hat{\delta}_{gamma,i} = \mathrm{sign}(s_i - \hat{\theta}_i \hat{k}) \sqrt{2\hat{k}\left( \ln\left( \frac{\hat{\theta}_i \hat{k}}{s_i} \right) + \frac{s_i - \hat{\theta}_i \hat{k}}{\hat{\theta}_i \hat{k}} \right)}    (3.2)
    \hat{\delta}_{lognormal,i} = \frac{1}{\hat{\sigma}} \left( \ln(s_i) - \hat{\mu}_i \right)^2.    (3.3)

Using the approach as described in section 2.3, the two QQ-plots as given in figure 3.3 are computed. It can be observed that the Gamma GLM produces deviance residuals that follow the 45-degree line for the most part. However, the right tail is underestimated. The Log Normal distribution severely overestimates the mass in the left tail, while properly estimating the right tail. Based on these two plots, the Gamma GLM is selected for all severity models.

(a) Gamma GLM. (b) Log Normal GLM.

Figure 3.3: QQ-plots based on the deviance residuals of the candidate distributions in the claim severity GLMs.

3.2 Independent Variables

3.2.1 Categorical Variables

As mentioned by Goldburd et al. (2006), it is common practice in insurance ratemaking to transform continuous variables, such as a driver's age, into categorical variables. This is done because these continuous variables often display a non-linear effect on the claim frequency and severity. The transformation to categorical variables allows insurers to define a non-linear relation between the risk factor and the dependent variable. In order for the categorisation to work, the resulting categories should identify risk-homogeneous groups and have a sufficiently large size to prevent overfitting. Clijsters (2015) introduced a data-driven approach to variable binning that outperformed other common options for variable binning. In this approach, a GAM, which is a more flexible generalisation of the GLM, is fit to the residuals of the rating model excluding the risk factor of interest. Then, a regression tree is used to provide a risk-homogeneous binning. Appendix A describes and analyses this option in comparison to the currently used binning of risk factors. Since this method is shown to outperform the current binning, all continuous variables are binned using this approach. In order to improve the flexibility of the GLMs, the number of categories for binned variables is usually relatively high in comparison to the number of categories in naturally categorical variables. For the continuous variables that are currently included in the model, the same number of categories is used. For the newly introduced variables we use 10 levels. In addition to the number of categories, a minimum category size should be specified in order to prevent overfitting. Categories in the full data set contain at least 750 policy instances, implying that the claim database will include around 30 observations for the smallest categories. For naturally categorical variables this implies that on some occasions small categories are joined into one group. Missing values are labelled as a separate category. This category often contains fewer observations, thus the estimated coefficients cannot be used for interpretation.
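The following sketch illustrates only the regression-tree step of such a data-driven binning: a shallow tree is fit to (pseudo-)residuals of the rating model and its split points define the bin edges. It is not Clijsters' (2015) exact procedure (in particular, the GAM step is omitted), and the function, column names and settings such as 10 bins and a minimum size of 750 are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def bin_continuous_factor(x, residual, max_bins=10, min_size=750):
    """Bin a continuous risk factor by fitting a shallow regression tree to
    residuals of the rating model; the tree's split points become bin edges."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_bins, min_samples_leaf=min_size)
    tree.fit(np.asarray(x, dtype=float).reshape(-1, 1), np.asarray(residual, dtype=float))
    thresholds = np.sort(tree.tree_.threshold[tree.tree_.feature == 0])  # internal nodes only
    edges = np.concatenate(([-np.inf], thresholds, [np.inf]))
    return pd.cut(x, bins=edges)

# Hypothetical usage: bin driver age using residuals of the frequency model without that factor.
# policies["driver_age_bin"] = bin_continuous_factor(policies["driver_age"], freq_residuals)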

3.2.2 Correlation

Another important aspect of insurance data is the correlation between risk factors. Due to the categorical nature of the risk factors, traditional measures of correlation between continuous variables, such as Pearson's Rho and Kendall's Tau, cannot be applied without assuming an ordering of the categories. Cramér (1946) proposed a test of independence of two categorical variables that involves a measure with an interpretation similar to that of a correlation. The proposed association measure lies in the interval [0, 1], where 0 implies independence and 1 implies perfect association. Appendix B describes Cramér's V mathematically. Figure 3.4 depicts the association matrix for the 23 candidate risk factors. As mentioned before, the candidate risk factors are selected from several sources. These roughly comprise four categories: policy information, regional demographics, policyholder information and vehicle information. These categories contain 2, 3, 3 and 15 potential risk factors respectively. It is to be expected that within these categories risk factors show a relatively high association. For example, within vehicle information, electric cars and cars that drive on diesel usually have a higher price than cars that drive on regular gasoline. Figure 3.4 indeed shows high association within the categories, especially for the car and regional information. The higher association within blocks implies that it is more difficult to distinguish true and noise parameters. On the other hand, the high association implies that a decent model can be attained by including relatively few risk factors. Additionally, it can be observed that some variables, especially those in the car category, show a slightly higher association with all variables.
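A minimal sketch of computing Cramér's V for two categorical risk factors, assuming Python with pandas and scipy; the variable names are hypothetical.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(a, b):
    """Cramér's V between two categorical variables: 0 = independent, 1 = perfect association."""
    table = pd.crosstab(a, b)
    chi2_stat = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2_stat / (n * k))

# Hypothetical usage on two risk factors:
# v = cramers_v(policies["region"], policies["fuel_type"])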

4 | Model Performance Measures

The goal of applying statistical learning techniques in insurance ratemaking is to improve the rating model. To establish to which extent this is possible, the different goals of a rating model should be defined. Furthermore, measures of model performance should be selected for these different goals. This chapter serves this purpose. One of the most important aspects of the performance of a rating model is its ability to predict future losses. Therefore, section 4.1 describes several measures of predictive accuracy and selects the most appropriate one for both the frequency and the severity models using simulations. In addition, section 4.2 describes measures of in-sample model fit, applicability and robustness.

4.1 Prediction

A major goal of a rating model is to predict future claim costs accurately. James et al. (2013) state that in the modern statistics literature, it is common practice to apply resampling methods to test the predictive accuracy of models. While in-sample measures of fit are designed to indicate which risk factors are relevant, these measures are prone to favour large models which overfit the data. The concept of overfitting is, among others, described in Hastie et al. (2016) and James et al. (2013). Generally, the more complex the model is, the better the fit is in the training sample. In the test sample, extra complexity generally decreases the prediction error until a certain point of complexity, after which extra complexity increases the prediction error. This increased prediction error indicates that the model is overfit and the extra complexity should therefore be avoided. A depiction of overfitting, based on Hastie et al. (2016), is given in figure 4.1. Since the goal of ratemaking is to predict future losses based on the data available from the past few years, the most recent year of data from the data set as described in chapter 3 provides a natural hold-out sample. Therefore, the used training set contains all policy instances from 2011 to 2015 and the policy instances from 2016 are used as the test set. The training set contains 729,820 policy instances of 114,776 insureds, while the test set contains 202,165 policy instances from 108,198 insureds. The severity data set is split in a similar way, yielding a training set of 28,449 observations and a test set of 6,991 observations.


In the literature, many different approaches to measuring predictive accuracy using a hold-out sample are proposed. Most of these measures are evaluated using linear models on relatively small data sets. Since this situation is quite different from the frequency-severity GLM modelling approach on the large data sets used here, it cannot be assumed that all of these measures are appropriate for model selection in the insurance context. Therefore, data sets similar to the data described in chapter 3 are simulated and the prediction measures are tested. Section 4.1.1 introduces the tested measures, and section 4.1.2 describes the simulation method. Section 4.1.3 concludes by discussing the results of the simulations and selecting the most appropriate measures. Here, y denotes a response vector and y_i denotes the ith observation in this

vector. Furthermore, ci and si specifically indicate claim frequency and severity respectively.

4.1.1 Measures of Prediction Accuracy

Predictive accuracy can be measured for all policy instances combined (aggregately), or for each policy instance separately (individually). Accurate aggregate prediction is of interest to an insurer, since an insurer is interested in predicting the total value of claims accurately. On the other hand, insurers benefit from accurate individual predictions, since underestimating the claim cost attracts bad risks to the insurer, while overestimating individual claim costs shrinks the portfolio of the insurer due to the high premium. Note, however, that these considerations are relative to the prices in the market: outperforming the market in individual predictive accuracy is beneficial for the insurer, while even better performance only helps to have a more accurate overall fit. In this section 7 measures of predictive accuracy will be introduced.

Mean Prediction Bias. Lord et al. (2010) selected a few simple methods for measuring prediction: the mean prediction bias, the mean absolute deviance and the mean squared prediction error. These methods provide intuitive measures of different aspects of predictive accuracy. The aggregate prediction performance is easily measured by the mean prediction bias, which is defined by

    MPB = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i).    (4.1)

In case of a perfect fit, there will be no difference between y_i and \hat{y}_i, resulting in an MPB of zero. Therefore, when using this measure for predictive accuracy, the value should be close to zero. Hence, the absolute value of the MPB should be minimised.

Mean Absolute Deviance. The mean absolute deviance is given by

    MAD = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|.    (4.2)

This measure calculates the mean distance of the prediction to the observed value. Consequently, the focus of this measure is on individual prediction error. As mentioned before the goal is to minimise the error for each observation, thus this measure should be minimised.

Mean Squared Prediction Error. The mean squared prediction error is given by

    MSPE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2.


Out-of-sample log-likelihood. Similar to in-sample likelihood comparisons between models, the prediction accuracy of a model can be measured using the out-of-sample value of the likelihood function. In the Poisson framework, the log-likelihood is given by:

    LL_{OOS,poisson} = \sum_{i=1}^{N} \left[ -\hat{c}_i + c_i \log(\hat{c}_i) - \log(c_i!) \right].    (4.3)

For the Gamma GLM, we have:

    \hat{\theta}_i = \frac{\hat{s}_i}{\hat{k}},    (4.4)
    LL_{OOS,gamma} = \sum_{i=1}^{N} \left[ -\hat{k} \log(\hat{\theta}_i) + (\hat{k} - 1)\log(s_i) - \frac{s_i}{\hat{\theta}_i} - \log(\Gamma(\hat{k})) \right].    (4.5)

Corresponding to the in-sample log-likelihood, the out-of-sample log-likelihood gives a monotone transformation of the probability of observing the results of the test data set under the assumption that the underlying model is the true model. This measure thus favours models under which the joint set of observations of the test data is likely to occur. The measure should be maximised in order to maximise the predictive value of the models.
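For reference, the error measures (4.1)-(4.5) can be computed directly from observed and predicted values, as in the hedged sketch below; the function and argument names are assumptions, not the thesis implementation.

import numpy as np
from scipy.special import gammaln

def mpb(y, y_hat):            # mean prediction bias, (4.1)
    return np.mean(y - y_hat)

def mad(y, y_hat):            # mean absolute deviance, (4.2)
    return np.mean(np.abs(y - y_hat))

def mspe(y, y_hat):           # mean squared prediction error
    return np.mean((y - y_hat) ** 2)

def ll_oos_poisson(c, c_hat):  # out-of-sample Poisson log-likelihood, (4.3)
    return np.sum(-c_hat + c * np.log(c_hat) - gammaln(c + 1))

def ll_oos_gamma(s, s_hat, k_hat):  # out-of-sample Gamma log-likelihood, (4.4)-(4.5)
    theta_hat = s_hat / k_hat
    return np.sum(-k_hat * np.log(theta_hat) + (k_hat - 1) * np.log(s)
                  - s / theta_hat - gammaln(k_hat))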

Correlation. Another intuitive approach is to use the correlation between the observations y and the predictions \hat{y}. This approach was first used by Nelson (1972). The correlation is given by

    Cor = \frac{\frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{sd(y)\, sd(\hat{y})}.    (4.6)

A perfect predictor produces observations that are equal to their predicted values and thus yields a correlation of one. Therefore, the correlation should be maximised. Note that under equal predictions the correlation cannot be computed, as it depends on variation in both the predicted and the observed values. Thus, a correlation coefficient cannot be estimated using a null model that does not include an offset.

Area under the Receiver Operating Characteristic curve. Due to the relatively low number of policy instances incurring more than one claim, the transformation of the claim frequency to a dummy indicating a positive claim count does not cause a great loss of information. Therefore, methods designed for logistic outcomes can be applied for measuring the predictive accuracy of claim frequency models. A commonly used measure in this context is the ROC. The measure is based on the number of correct risk classifications. Using a given threshold, the predictions for the test sample exceeding this threshold are classified as positive predictions. The ratios of correctly classified positives and negatives to the total number of observed positives and negatives, as calculated from these predictions, are called the sensitivity and specificity respectively. Note that these two form a tradeoff: increasing the threshold decreases the number of falsely classified positives at the cost of introducing more falsely classified negatives. The ROC uses this tradeoff as a measure of the model's ability to separate future positives and negatives. The ROC measures the area underneath the plot of the sensitivity against 1 minus the specificity for different thresholds. In the ideal situation this area is equal to 1, since irrespective of the threshold the sensitivity and specificity are equal to one. When there is no predictive power the ROC curve is a diagonal line and the area equals 1/2. Figure 4.2a illustrates the ROC curve; more information and background on the ROC is among others given in Kuhn and Johnson (2013).


(a) Example of an ROC curve. For the thresholds of 0.25 and 0.75 the specificity and sensitivity are given. The marked surface represents the ROC.

(b) Lorenz curve with marked surface representing the Gini-coefficient.

Figure 4.2: Illustration of the prediction measures based on the ROC and Lorenz curve.

Gini-coefficient. On the x-axis of the Lorenz curve, the relative cumulative predicted values are plotted. These predicted values are ordered in increasing size. On the y-axis, the corresponding relative cumulative claims are reported. If the model perfectly predicts the number of claims, the observed cumulative number of losses exactly follows the 45-degree line. Therefore, the area between the curve of cumulative losses and the 45-degree line shows all relative inefficiencies of the proposed model. A good predictive model minimises the area between the curve of cumulative losses and the 45-degree line. Figure 4.2b shows such a surface. Like the MPB, this measure allows observations to compensate for each other and thus measures aggregate predictive value instead of individual predictive value.
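The classification- and Lorenz-based measures can be sketched as follows. The AUC is computed on the dummy "at least one claim", and the Gini surface is a simplified numerical approximation of the area between the cumulative-loss curve and the 45-degree line; both functions are illustrative assumptions rather than the exact definitions used in the thesis.

import numpy as np
from sklearn.metrics import roc_auc_score

def auc_claim_indicator(c, c_hat):
    """Area under the ROC curve for the dummy 'at least one claim'."""
    return roc_auc_score((np.asarray(c) > 0).astype(int), np.asarray(c_hat))

def gini_surface(y, y_hat):
    """Approximate area between the 45-degree line and the curve of relative
    cumulative observed losses, with observations ordered by increasing prediction."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(np.asarray(y_hat, dtype=float))
    cum_losses = np.cumsum(y[order]) / y.sum()
    cum_share = np.arange(1, y.size + 1) / y.size
    return np.mean(cum_share - cum_losses)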

4.1.2 Simulation Method

In order to test the accuracy and the stability of the results of the implementation of machine learning techniques, data simulation is a common tool. Using simulation methods that represent the unique characteristics as described in chapter 3 provides the opportunity to select measures of predictive accuracy that perform well in the given data context. The data simulation adapted to this data framework consists of simulating independent variables and subsequently simulating dependent variables based on an assumed true model and these independent variables. The approach will be discussed in this section.

In the first step, the independent variables are simulated. Due to the low pseudo-R2 observed in the full


size, the mean is the relevant quantile. All observations lower than the mean would be attributed to the first category, while all others are attributed to the second category. This categorisation yields a simulated data set of categorical variables representing the correlation of the predefined correlation matrix and the category sizes. In appendix C, the continuous method is described mathematically and tested in the context of the risk factors that are included in the current model for claim frequency. The simulated data show a mean absolute difference with the true correlations of about 25% of these true correlations, where correlation is measured using Kendall’s Tau. This is considered to be an acceptable deviation and therefore the continuous method is used to simulate data.

In order to represent the full data while not exhausting computations, the simulated data sets consist of 12 risk factors. For Poisson simulations, the simulated data set size is 100,000, while for Gamma simulations, the data set size is 30,000. The continuous method requires the use of a predefined correlation matrix. For each simulated data set, a new correlation matrix is simulated. As discussed in section 3.2.2, the correlation observed within the data sources exceeds the correlation with risk factors from different data sources. In addition, some data sources showed increased overall correlation in comparison to other data sources. To represent this data clustering, the simulated correlation matrix features 4 risk factors that show higher correlation with the risk factors in their block, based on the observed values for the regional variables. Furthermore, 3 other risk factors represent a block that shows high correlation with the variables of their block and a lower, but still relatively high, correlation to all other risk factors. These values are based on the observed values for the car variables.

In addition to the correlation matrix, the continuous method uses predefined intervals to categorise the continuous variables that are simulated. These intervals are simulated based on the category sizes in the original data set. Appendix C further elaborates on the technical aspects of the predetermined values used for the simulations.

250 sets of independent variables are simulated in order to determine the best prediction measure for the frequency and severity GLMs independently. Based on these data sets, dependent variables are simulated using an assumed true model. For all risk factors, each of the categories is assigned a model coefficient drawn from the uniform distribution on the interval [-0.1, 0.1]. For a given observation i, denote the assigned coefficient corresponding to its category for the jth variable by \tilde{\beta}_{ij}. The intercept used to estimate claim frequency is log(0.07) and the intercept used to estimate claim severity is log(1700). These values are chosen to represent the mean of the original data set and the coefficients of the current rating model.

Let m denote the number of risk factors included in the true model. The performance of the measures of predictive accuracy is tested for 7 different true models, with m in the set {12, 10, 8, 6, 4, 2, 0}. Since the order of the characteristics of the risk factors is randomised, the first m variables are included. Under the assumed true model, the dependent variables are simulated. This is done for each observation by computing µ_i as the exponential of the sum of the intercept and the coefficients assigned to the first m risk factors of the ith observation. Thus, µ_i = exp(intercept + \sum_{j=1}^{m} \tilde{\beta}_{ij}). Using this mean, the response is simulated. For the claim frequency, a Poisson model with λ = µ_i is used. The claim severity is simulated using a shape parameter k = 0.9 and a scale parameter equal to θ_i = µ_i / 0.9. The responses are added to the simulated data.
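A compressed sketch of this simulation design is given below, under stated assumptions: correlated standard normal variables are drawn from a predefined correlation matrix (an equicorrelation matrix here instead of the block structure described above), cut at fixed quantiles into categories, and combined with uniform coefficients to simulate Poisson responses. All parameter values in the sketch are illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n_obs, n_factors, n_cats, m_true = 100_000, 12, 5, 6

# Step 1: correlated continuous variables from a predefined correlation matrix
# (illustrative equicorrelation here), cut at fixed normal quantiles into categories.
corr = 0.3 * np.ones((n_factors, n_factors)) + 0.7 * np.eye(n_factors)
z = rng.multivariate_normal(np.zeros(n_factors), corr, size=n_obs)
cuts = norm.ppf(np.linspace(0, 1, n_cats + 1)[1:-1])   # equal-sized categories
x_cat = np.digitize(z, cuts)                           # categories 0 .. n_cats - 1

# Step 2: each category of each risk factor gets a coefficient from U[-0.1, 0.1];
# the linear predictor uses the intercept plus the first m_true risk factors.
coef = rng.uniform(-0.1, 0.1, size=(n_factors, n_cats))
eta = np.log(0.07) + sum(coef[j, x_cat[:, j]] for j in range(m_true))
mu = np.exp(eta)

# Step 3: simulate the responses. Claim frequency: Poisson with mean mu.
claims = rng.poisson(mu)
# Claim severity (with intercept log(1700)) would analogously use a Gamma draw, e.g.
# severities = rng.gamma(shape=0.9, scale=mu_severity / 0.9)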


4.1.3 Simulation Results

From the 250 simulations performed for both the frequency and severity models, the mean number of included risk factors according to each of the prediction measures is reported in figures 4.3a and 4.3b respectively. In these figures, the mean estimated number of included variables is compared to the 45-degree line. This line represents perfect estimation of the number of variables included in the true model. Here, the true model size represents the number of included simulated risk factors in addition to the intercept. Since it can be assumed that in insurance ratemaking the true model includes a positive subset of the potential risk factors, the focus will be on the performance of the predictive measures for true models including 2 to 10 risk factors. Table 4.1 presents the mean absolute difference between the estimated number of included variables and the number of variables in the true model for this interval.

(a) Frequency simulations (b) Severity simulations

Figure 4.3: Results of the simulations for both frequency and severity in the form of the average estimate of the number of included variables per true model size and prediction measure.

From figure 4.3 several characteristics of the prediction measures can be observed. Note that, in contrast to the other measures, the MPB and the Gini-coefficient have a slope close to zero. This implies that these measures hardly react to an increase of the number of variables in the true model. Underlying this poor performance is the fact that, in contrast to the other proposed measures, these two measures represent aggregate prediction performance. That is, they allow the impacts of individual observations on the total measure to offset each other. The observed lines imply that measures of aggregate prediction are inappropriate for identifying the true model.

Furthermore, the Cor and the ROC both showed a poor fit when the true model size is equal to zero. This is caused by the fact that these measures cannot be computed under equal predictions as is the case when the true model contains only an intercept. Since the focus is on models including 2 to 10 risk factors, the misspecification by the Cor and ROC when the true model consists of only an intercept does not affect the selected prediction measure.


It is convenient to select the same measure of prediction for both the frequency and the severity models. However, it is considered more important to measure predictive value accurately. Since the results from the frequency and severity models differ substantially, a separate measure is chosen in each case.

Figure 4.3a shows that for frequency simulations, the Cor and the ROC lie the closest to the 45-degree line in the interval [2,10]. Over this range, both measures slightly overestimate the number of variables included in the true model, performing better when a larger number of risk factors is included in the true model. Table 4.1 shows that the Cor performs slightly better with an average mean absolute difference of 1.61 between the true and estimated number of parameters. Since, additionally, the Cor is more easily computed and conceptually more straightforward than the ROC, the Cor is selected as the best performing measure in frequency simulations.

Concerning the severity simulations, figure 4.3b shows that all measures but the Gini and the MPB follow the 45-degree line quite closely, while slightly overestimating the number of variables included in the true model. From table 4.1 it can be observed that the out-of-sample log-likelihood produces the lowest average mean absolute difference. Therefore, the out-of-sample log-likelihood is selected for measuring the predictive accuracy of severity models.

Table 4.1: Mean absolute difference between the true and estimated number of variables for each prediction measure per true model size in the interval [2, 10].

                   Frequency Simulations                        Severity Simulations
True Model Size  MPB   MAD   MSPE  LL.oos  Cor   ROC   Gini   MPB   MAD   MSPE  LL.oos  Cor   Gini
10               4.92  1.43  2.88  2.75    1.16  1.15  8.39   5.23  0.67  0.55  0.53    0.57  7.38
8                4.12  2.22  2.67  2.68    1.53  1.56  6.79   4.38  0.99  0.77  0.70    0.73  5.56
6                3.84  3.25  2.07  2.10    1.63  1.74  4.83   3.70  0.99  0.68  0.78    0.92  4.07
4                3.81  4.10  1.77  1.74    1.94  1.80  3.00   3.88  1.45  0.84  0.81    1.15  2.67
2                4.29  4.72  1.20  1.20    1.77  2.16  1.48   4.06  1.14  0.90  0.77    1.30  1.49
Mean             4.20  3.14  2.12  2.09    1.61  1.68  4.90   4.25  1.05  0.75  0.72    0.93  4.24

4.2 Other Performance Measures

In addition to measuring predictive accuracy, some other aspects of model performance should be measured. While accurate prediction is one of the major goals of ratemaking, other aspects of a model are important for the application to insurance practice. In this section measures for three other aspects of the resulting model are proposed. These measures represent the in-sample fit, applicability and robustness of the model.

4.2.1 Model Fit

The easiest way to verify the performance of a model is to use the in-sample model fit. A commonly used in-sample performance measure is the log-likelihood. Often, this measure is penalised for the number of included variables, as is done in Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). The objective of this penalty is to prevent the inclusion of variables that have only a small contribution to the fit, and thus to prevent the model from overfitting. The AIC will be used to measure the in-sample fit and is defined by

    AIC = -2\widehat{LL} + 2p,    (4.7)

where p denotes the number of estimated parameters.


4.2.2 Applicability

In choosing the optimal rating model, the cost of application of the given model should be considered as well. A major cost of using a risk factor in a model is the cost of data collection and preparation. However, in order to consider the use of a risk factor in the context of measuring model performance, the data should already be collected and prepared. Therefore, these are sunk costs that cannot be included in an applicability measure. Hence, only the application costs after determining the optimal model are considered. These costs involve the sheer number of risk factors and the costs caused by missing data for certain risk factors. The applicability measure includes the number of included risk factors, since a larger model size indicates higher costs of setting up the final premium. Furthermore, with increasing model size it becomes increasingly difficult to explain why risk factors are included and to interpret their use.

Besides the model size, the measure of applicability includes the fraction of missing values f_j in the full data set for each included risk factor. When a variable that has missing values is included in a model, it is common practice to use the highest coefficient occurring for a category of that given risk factor, since there is no evidence that it is possible to reduce that given price. This risk-averse strategy on average overprices these risks. This can be prevented by trying to retrieve more data for the given subject. Both solutions, however, are quite costly; thus including variables with a high number of missing values is undesirable.

The full measure is therefore given by

    Applicability = J + \sum_{j=1}^{J} f_j,

where J denotes the number of included variables. Note that the applicability score represents the cost of model application and is preferably low.
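A small sketch of this score, assuming the included risk factors and their fractions of missing values are available; the names are hypothetical.

def applicability(included_factors, missing_fraction):
    """Applicability score: number of included risk factors plus the sum of their
    fractions of missing values in the full data set (lower is better)."""
    return len(included_factors) + sum(missing_fraction[f] for f in included_factors)

# Hypothetical usage:
# score = applicability(["region", "car_age"], {"region": 0.0, "car_age": 0.12})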

4.2.3 Robustness

In the insurance industry, it is desirable to offer premia that are robust. That is, if a new year of data is added and the full data set is used to estimate the same model, the estimated coefficients should not change substantially. If model coefficients change considerably over time, this indicates a change in the process generating claims. While such changes do occur, changes in coefficients can also be an indicator of overfitting the data. Furthermore, robust premia are relevant since they improve the interpretability of premia, which benefits transparency towards the policyholders. Moreover, robust premia are thought to increase customer retention.

There are two options for measuring the robustness of the estimated coefficients. First, one could measure the differences in the estimated coefficients directly. However, since changes in coefficients aggregate in the pricing of a policy, the total change in the price is the most relevant. Since the effect of robustness is only visible to policyholders through their premium, the used measure focuses on the total premium. A measure of robustness can be formulated parallel to the relativity measure as introduced by Frees et al. (2013). First, we estimate the given GLM on the training data set and use this to estimate the premium p_{training,i} for each observation i in the hold-out sample. Then, the same model is estimated using both the training and the test data set and used to establish the premium p_{(training, test),i} for each observation. The measure used to measure robustness is then given by

    Relativity = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{p_{(training, test),i}}{p_{training,i}} - 1 \right|,

where N is the number of observations in the hold-out sample.
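The relativity can be computed from the two premium vectors as in the sketch below; the use of the absolute value reflects the reading of the measure adopted here, and the function name and inputs are assumptions.

import numpy as np

def relativity(premium_training, premium_training_plus_test):
    """Mean absolute relative change in the hold-out premia when the model is
    re-estimated on the training plus test data."""
    p0 = np.asarray(premium_training, dtype=float)
    p1 = np.asarray(premium_training_plus_test, dtype=float)
    return np.mean(np.abs(p1 / p0 - 1.0))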

5 | Model Selection Algorithms

When selecting a rating model, one of the most important decisions to make is which risk factors have to be included in the rating model. In chapter 3 the set of candidate risk factors is described. One might want to construct all possible models based on this set of risk factors and compare them based on the criteria described in chapter 4. However, since the full set of risk factors includes 23 candidates, the total number of models that can be estimated becomes 2^23 = 8,388,608. If estimating a single model would take one minute, this process would take 15.96 years. Since GLM estimations based on large data sets generally take more than one minute, alternatives to estimating all models should be considered. Model selection algorithms provide a method of both systematically and efficiently searching the model space. James et al. (2013) describe several methods of linear model selection and regularisation. These approaches are divided into three classes: subset selection, shrinkage and dimension reduction. The first two classes contain methods that can be applied in the context of variable selection. In this chapter, two subset selection methods are introduced and applied. Subset selection techniques select a subset of risk factors to be included in the final model. Many different approaches to subset selection have been proposed in the literature. Castle et al. (2013) review and compare subset selection algorithms in the context of linear regression. They distinguish algorithms that iteratively remove risk factors (general-to-specific) and algorithms that iteratively add risk factors (specific-to-general). In addition to these stepwise approaches, James et al. (2013) describe hybrid approaches. The difference between the two types of algorithms is that stepwise approaches move in one direction, while hybrid approaches allow for the reversal of steps. Since stepwise approaches are often based on making an optimal step in each iteration, they are called greedy algorithms. Hybrid approaches, on the other hand, are more flexible.

The two algorithms considered in this chapter are the stepwise and the autometrics algorithms. The forward and backward stepwise algorithms are studied due to their straightforward properties. The autometrics algorithm performed best in the simulation study of Castle et al. and is therefore considered as the second subset selection algorithm in this research. Autometrics, as proposed by Doornik (2007), is a general-to-specific subset selection algorithm. New models are created by removing one insignificant variable at a time. When there are multiple insignificant variables, this method results in a tree of outcomes. Both methods classify as stepwise algorithms. However, due to the multiple paths searched by the autometrics algorithm, it is more flexible than the simple forward and backward stepwise algorithms.

5.1 Stepwise Selection

In forward stepwise selection, we start with the model containing only the intercept and, if present, the offset. In each iteration, one additional risk factor is selected. This is done by estimating all models where one of the risk factors that were not included earlier is included. Among the union of these models and the existing model in the given iteration, the updating rule selects the best model. When the existing model outperforms all models including an extra risk factor, this model is selected as the final model by the algorithm. Note that this implies that the algorithm's stopping rule equals its updating rule. Otherwise, the model is updated and we move to the next iteration. In backward stepwise selection, we start with the full model including all candidate risk factors. In each iteration, one risk factor is deleted from the model formula. This is done by estimating all models where one of the included risk factors is deleted from the model given in that iteration. Again, the best model is selected among these models. The iterations end when none of the possible updates outperforms the model given at that iteration.

There are several different options to define updating rules measuring the performance of the possible steps. For the updating and stopping rules James et al. propose the use of the AIC, the BIC or the adjusted R². Castle et al. use the AIC. In chapter 4 it is argued that predictive performance is one of the most important measures. However, if one would include, for example, predictive performance as an element of the updating rule, the resulting models would still be subject to overfitting relative to the test data set. Furthermore, using the measures for prediction prohibits the application of an algorithm on the full data set. Therefore, one of the traditional updating rules is used. Since the AIC can be computed straightforwardly for GLMs, we follow the literature by using the AIC to measure the best performance. Since the stepwise algorithm only optimises iteratively and moves in one direction, it is not guaranteed that the forward and backward algorithms result in the same final model. Algorithm 2, as given in appendix D, illustrates the forward stepwise selection in pseudo code.
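A hedged sketch of greedy forward selection of categorical risk factors by AIC is given below; it mirrors the idea of Algorithm 2 but is not the thesis implementation, and the statsmodels-based helper, formula interface and variable names are assumptions.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def forward_stepwise_aic(data, response, candidates, family, offset=None):
    """Greedy forward selection of categorical risk factors for a GLM by AIC;
    the stopping rule equals the updating rule, as described above."""
    def fit(factors):
        terms = " + ".join(f"C({f})" for f in factors) or "1"
        return smf.glm(f"{response} ~ {terms}", data=data,
                       family=family, offset=offset).fit()

    selected = []
    best_aic = fit(selected).aic                  # start from the null model
    while True:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            return selected
        aic, factor = min((fit(selected + [f]).aic, f) for f in remaining)
        if aic >= best_aic:                       # no candidate improves the AIC: stop
            return selected
        selected.append(factor)
        best_aic = aic

# Hypothetical usage for a frequency model:
# chosen = forward_stepwise_aic(policies, "claims", ["region", "car_age", "fuel_type"],
#                               sm.families.Poisson(), offset=np.log(policies["duration"]))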

The stepwise algorithms will be run on the data as described in chapter 3. The frequency data set is too large to run the full algorithms due to memory constraints. Therefore, the algorithms are run on one fifth of the data; 145,964 observations. In order to maintain the same training-test ratio, the test set also contains one fifth of the data, thus 40,433 observations. Both these data sets are drawn as random samples without replacement from the original data sets. For the severity models, the full sample can be used. For the candidate models estimated in both algorithms, the proposed measures from chapter 4 are computed. The best models based on these criteria are compared to the ones selected by the stopping rule of the algorithm.

In addition to measuring the robustness of the models when adding the test set, as is done in the relativity measure, the robustness of the algorithms will be tested. This will be done based on the frequency models. In addition to the existing sample from the data, two other subsets are sampled from the data. These samples are such that all observations occur in at most one of the three samples. Then, again, the algorithms are run on these two samples. Finally, it is checked to which extent these subsets produce the same risk factors. Underlying this test is the assumption that, by random sampling from the data set, the samples are homogeneous and should thus produce the same resulting model.

5.2 Autometrics

With the autometrics algorithm, Doornik (2007) extends existing tree-search algorithms that iteratively exclude all variables in multiple paths. The autometrics algorithm performs subset selection in a more flexible manner than stepwise selection, since it considers multiple reduction paths and compares the resulting final models. However, its design still classifies as a general-to-specific stepwise algorithm. In the research of Castle et al. (2013), the autometrics algorithm outperformed the other subset selection algorithms. In this section, first the tree-search algorithm is described, then the additions of Doornik are explained and, finally, the details of the application to the insurance data are set out.


For each risk factor in a given model, its significance level is computed by a likelihood ratio test between the given model and the model where the given variable is excluded. We define the likelihood ratio test as follows: let M1 denote the given model and let M2 denote the model where the ith variable is excluded. In addition, let df_M denote the degrees of freedom in a given model and denote the log-likelihood of a model by LL_M. Then, the p-value of the likelihood ratio test is given by

    P = 1 - F_{\chi^2}\left( 2(LL_{M_1} - LL_{M_2});\ df_{M_2} - df_{M_1} \right),    (5.1)

where F_{\chi^2}(\cdot;\ df) denotes the cumulative distribution function of the chi-squared distribution with df degrees of freedom. Due to the use of categorical variables, the deletion of one risk factor results in an increase of the number of degrees of freedom by x - 1, where x denotes the number of categories of the given risk factor.
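Equation (5.1) corresponds to the following sketch for two nested fitted GLMs, assuming the models are fitted with statsmodels (which exposes the log-likelihood as llf and the model degrees of freedom as df_model); this is an illustration, not the thesis code.

from scipy.stats import chi2

def lr_test_pvalue(fit_full, fit_reduced):
    """P-value (5.1) of the likelihood ratio test between a fitted GLM and the
    fitted GLM with one categorical risk factor removed."""
    stat = 2.0 * (fit_full.llf - fit_reduced.llf)
    extra_params = fit_full.df_model - fit_reduced.df_model   # x - 1 for a factor with x categories
    return chi2.sf(stat, extra_params)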

When the root branches are formed, the tree search starts its reduction from the first root branch. At each branch, the given model is computed and the significance of each of the risk factors is determined. From this branch, new branches are formed by all models excluding one of the risk factors, again ordered from highest p-value to lowest p-value. From each branch, the path in which the most insignificant variable is removed at every step is followed first, until only one variable remains. After this model, the next branch that is not fully developed is visited.

If a root branch is fully developed, we move to the next root branch. Since the previous root branch already excluded the most insignificant variable from the full model, this risk factor will not be deleted in any of the branches originating in this root branch or any of the following root branches. The remaining risk factors are called the free variables. The number of free variables thus decreases by one in each following root branch. Figure 5.1 shows an example of a tree search where there are four variables: A, B, C and D. At each branch, the variables that are still included are mentioned. The numbers indicate the order in which the models are visited. For clarity of the example, it is assumed that the order of insignificance does not change over the different branches. This example, including the figure, is taken from Castle and Shephard (2009).

Figure 5.1: An example of the tree-search algorithm. The numbers indicate the order in which the nodes are visited. The letters indicate the included variables.

Doornik (2007) adds several features to this algorithm in order to reduce the number of visited branches, while still systematically searching the model space. Underlying the added elements is the assumption that the best model only includes the (jointly) significant risk factors and discards insignificant variables. For the application to the given data, the significance level p_a will be set to 0.01, which is commonly used in the economic literature when a high number of observations is present. Three of the added elements are described below. Algorithm 3, as can be found in appendix D, describes the autometrics algorithm in pseudo code.


terminal branch. Terminal branches are fully developed and no further reduction is run on these models. This implies that the significance level of the model and the risk factors form the stopping rule for the autometrics algorithm.

Bunching. On some occasions, a group of variables is highly insignificant. Iteratively reducing the model by each of these variables involves a high computational cost. Therefore, the most insignificant variables are grouped together. Subsequently, it is tested whether the model excluding this group of variables differs significantly from the full model. If it does not, the bunch is excluded at once and all intermediate branches are disregarded. If the reduced model is significantly different from the full model, the most significant variable of the bunch is removed from the bunch and the backtest with respect to the full model is performed again. If necessary, this process is repeated until only one variable remains in the bunch and it can be concluded that bunching is not an option. There is a tradeoff between the possibility of finding a large bunch and the risk of having to perform multiple backtests until an appropriate bunch is found. In order to manage this tradeoff, Doornik proposes to include in the first proposed bunch the variables with a p-value larger than $p_b^*$. He defines

p_b = \max\left(\tfrac{1}{2}\, p_a^{1/2},\; p_a^{3/4}\right),    (5.2)

p_b^*(k_b) = p_b^{1/2}\left[1 - \left(1 - p_b^{1/2}\right)^{k_b}\right],    (5.3)

where $k_b$ is the size of the bunch. By design, bunching is not performed on the full model, but only from the root branches onwards.
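As a quick illustration of equations (5.2) and (5.3), the small sketch below evaluates the bunching thresholds for the significance level $p_a = 0.01$ used in this thesis and a few hypothetical bunch sizes $k_b$; it is an assumed example, not code from the thesis.

```python
# Evaluate the bunching thresholds of equations (5.2) and (5.3).
def bunch_thresholds(p_a: float, k_b: int) -> tuple[float, float]:
    p_b = max(0.5 * p_a ** 0.5, p_a ** 0.75)                    # equation (5.2)
    p_b_star = p_b ** 0.5 * (1.0 - (1.0 - p_b ** 0.5) ** k_b)   # equation (5.3)
    return p_b, p_b_star

for k_b in (1, 5, 10, 20):                                      # hypothetical bunch sizes
    p_b, p_b_star = bunch_thresholds(0.01, k_b)
    print(f"k_b = {k_b:2d}: p_b = {p_b:.4f}, p_b* = {p_b_star:.4f}")
```

For a single-variable bunch the threshold reduces to $p_b$ itself, and for large bunches it approaches $p_b^{1/2}$, so larger bunches are proposed more liberally.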

Minimal contrasts. Doornik runs the tree search in two rounds. In the first round, branches that include (among others) all variables that are also included in an already found terminal are skipped. This round results in one or more terminals. In the second round, the previously skipped branches are revisited. In order to prevent these branches from resulting in the same terminal model as found before, it is tested whether a contrasting bunch can be found. A contrasting bunch is a bunch from the existing model with nested terminal(s) such that the resulting model does not have any nested terminals. The bunch considered is the minimal bunch that adheres to this criterion. For the previously skipped nodes, the p-value of each of the included risk factors is calculated. Then, the included free variables are ordered from largest to smallest p-value. Next, we determine the minimal i such that the model excluding the first i elements of the ordered free variables does not nest any of the previously found terminals. The first i elements of the ordered free variables form the minimal contrast. The model excluding this minimal contrast is computed and backtested with respect to the full model. If the model is not significantly different from the full model, the bunch is accepted and we move on as with a normal bunch. If the bunch results in a model that is significantly different from the full model, the branch is not developed further.
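The minimal-contrast search itself can be sketched in a few lines. The fragment below is a hypothetical illustration (the names and data structures are my own, not the thesis implementation) of how the smallest prefix of the ordered free variables whose removal breaks the nesting of all previously found terminals could be determined; the resulting candidate model would still have to pass the backtest against the full model.

```python
# Find the minimal contrast: the shortest prefix of the free variables
# (ordered by decreasing p-value) whose removal leaves a model that no
# longer nests any previously found terminal.
def minimal_contrast(ordered_free, included, terminals):
    """ordered_free: free variables sorted by decreasing p-value;
    included: all variables in the current branch; terminals: list of sets."""
    current = set(included)
    for i in range(1, len(ordered_free) + 1):
        candidate = current - set(ordered_free[:i])
        if not any(term <= candidate for term in terminals):  # nests no terminal
            return ordered_free[:i], candidate
    return None, None        # no contrast exists; the branch is not developed

terminals = [{"A", "C"}, {"B", "C"}]                          # hypothetical terminals
contrast, model = minimal_contrast(["D", "C", "B", "A"], ["A", "B", "C", "D"], terminals)
print(contrast, model)       # e.g. ['D', 'C'] and the remaining model {'A', 'B'}
```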


5.3 Results

5.3.1 Frequency Models

5.3.1.1 Stepwise Algorithm

Figure 5.2: Performance of the frequency models visited by the forward and backward stepwise algorithms for each of the measures introduced in chapter 4: (a) AIC, (b) correlation, (c) applicability, (d) relativity. The dotted lines indicate the forward algorithm and the solid lines the backward algorithm. For all three data splits, the results are combined in one plot per measure; the line colour indicates which data split is used.

Both the forward and the backward stepwise algorithms are applied to three parts of the data set. The stepwise algorithm decides which model is optimal using the AIC. For data split 1, the forward and backward algorithms resulted in the same optimal model, including 12 variables. Similarly, both algorithms resulted in the same optimal model, including 11 risk factors, when using data split 2. Nine of the selected risk factors occur in both data split 1 and data split 2. For split 3, the forward algorithm includes 10 variables, while the backward algorithm includes 11 risk factors; 8 of these occur in both the forward and the backward algorithm. A total of 6 risk factors are selected in all three data splits by both the forward and the backward stepwise algorithm.


The performance of the model increases notably until about 4 risk factors are included and remains roughly constant until more than 20 variables are included. When considering the applicability and the robustness (measured by the relativity) of the models, it is clear that the inclusion of extra variables comes at a cost in both these aspects. Therefore, based on all criteria simultaneously, the model including only 4 risk factors appears to be optimal.

In table 5.1, information on the resulting models based on the AIC and the correlation is given for each split. In addition to these two selection criteria, the model from the forward stepwise algorithm including 4 risk factors is reported as well. The number of included variables is reported both in total and per data source. As mentioned in chapter 3, there are 2 policy risk factors, 3 regional risk factors, 3 individual risk factors and 15 car variables. Table 5.1 shows that a relatively low number of car variables is included in the frequency models. Additionally, regional variables are not included in the smallest models. The models selected on the combined measures do compete with the currently applied rating model: they offer similar predictive performance under generally better conditions for applicability and robustness of the coefficients. However, in the first data split the risk factor selection resulted in a higher relativity score than the current model.

Table 5.1: Model size and performance for the resulting frequency models of the forward and backward algorithms on the three data splits. In addition, the performance of the current model is reported.

Forward Stepwise Algorithm

                          Number of Included Variables        Model Performance
Criterion   Data Split    total  policy  region  individual  car    AIC       Cor     Applicability  Relativity
AIC         1             12     2       2       2           6      48,648.1  0.1030  12.87          0.0945
AIC         2             11     2       1       2           6      47,396.3  0.1080  12.27          0.0736
AIC         3             10     2       2       2           4      48,574.6  0.0993  10.96          0.0641
Cor         1             5      1       1       1           2      48,682.8  0.1044  5              0.0824
Cor         2             10     2       1       1           6      47,397.2  0.1081  11.27          0.0730
Cor         3             10     2       2       2           4      48,574.6  0.0993  10.96          0.0641
Combined    1             4      1       0       1           2      48,700.2  0.1026  4              0.0815
Combined    2             4      1       0       1           2      47,435.3  0.1070  4              0.0492
Combined    3             4      1       0       1           2      48,622.0  0.0947  4              0.0454

Backward Stepwise Algorithm

                          Number of Included Variables        Model Performance
Criterion   Data Split    total  policy  region  individual  car    AIC       Cor     Applicability  Relativity
AIC         1             12     2       2       2           6      48,648.1  0.1030  12.87          0.0945
AIC         2             11     2       1       2           6      47,396.3  0.1080  12.27          0.0736
AIC         3             11     1       2       2           6      48,553.5  0.0956  12.72          0.0829
Cor         1             13     2       2       3           6      48,649.2  0.1031  14.31          0.0950
Cor         2             16     2       3       3           8      47,423.4  0.1081  18.07          0.0949
Cor         3             13     2       2       3           6      48,554.2  0.0957  14.72          0.0833

Current Frequency Model

                          Number of Included Variables        Model Performance
Criterion   Data Split    total  policy  region  individual  car    AIC       Cor     Applicability  Relativity
-           1             7      1       3       1           2      48,700.5  0.1023  7.003          0.0570
-           2             7      1       3       1           2      47,439.2  0.1043  7.003          0.0547
-           3             7      1       3       1           2      48,624.1  0.1023  7.003          0.0581

5.3.1.2 Autometrics.


algorithm. 19 of the root branches were skipped in the first round due to nested terminals. For 11 of these root branches model contrasts were found, and these were considered in the further search for terminals. In the second round an additional 14 terminals were found. The average terminal includes 17 variables. Within the process 16 bunches were applied, 2 of which were as large as 12 variables. In total 66 branches were visited. In split 1.2, all 23 risk factors were insignificant at the 1% level. Hence, 23 root branches were formed in the initial stage of the autometrics algorithm. In the first round 3 of the roots were considered, resulting in the formation of 15 terminals of average size 12.8. A total of 36 branches were saved for the second round due to the existence of nested terminals. For 23 of these branches model contrasts were found and these branches were developed further. The second round resulted in 31 terminals, 12 of which contained 22 variables. In the entire process, 182 branches were formed and 22 bunches were performed.

Due to the computational complexity of the algorithm, it had to be run on a relatively small sample in the context of the frequency models. As a consequence, the number of visited branches, and hence the number of terminals, is quite high. The full set of terminals found is given in appendix E for both data split 1.1 and data split 1.2. In table 5.2 the variables and performance of the best terminals in both splits are listed alongside the performance of the current pricing model in the first data split. Note that the smaller data set size has increased the relativity, even though the ratio between training and test data remains constant.

For data split 1.1, the terminal including 6 variables performs best on all measures proposed in chapter 4. However, note that the current model produces a better out-of-sample prediction with a lower relativity. Similarly, the best terminals of split 1.2 produce slightly lower predictive accuracy at a higher applicability cost. Hence, in both data splits, the autometrics algorithm did not manage to select a model outperforming the current rating model.

Furthermore, note that the autometrics algorithm behaved quite differently in the two data splits. The algorithm is therefore not robust to data set changes when using relatively small data sets that should, hypothetically, behave similarly.

Table 5.2: Model size and performance of the best performing models as selected by the autometrics algorithm. The current model is added as the benchmark for the model performance.

             Number of Included Variables        Model Performance
Data split   total  policy  region  individual  car    AIC       Cor     Applicability  Relativity
1.1          6      1       1       2           2      24,765.8  0.0957  6.00           0.1204
1.1          10     1       1       2           6      24,838.9  0.0896  11.87          0.1278
1.2          6      1       1       1           3      23,639.8  0.1002  6.00           0.1261
1.2          6      0       2       1           3      23,546.2  0.1054  6.00           0.0950
1.2          7      1       2       1           3      23,541.4  0.1064  7.00           0.0950
1.2          7      0       2       2           3      23,543.8  0.1045  7.00           0.0969
1.2          8      1       2       2           3      23,539.4  0.1054  8.00           0.0971
1.2          15     2       3       2           8      23,610.7  0.1039  17.72          0.1389

Current Model
1.1          7      1       3       1           2      24,807.6  0.1012  7.00           0.0739
1.2          7      1       3       1           2      23,542.4  0.1065  7.00           0.0849

5.3.2 Severity Models

5.3.2.1 Stepwise Algorithm


Figure 5.3: Performance of the severity models visited by the forward and backward stepwise algorithms for each of the measures introduced in chapter 4: (a) AIC, (b) out-of-sample log-likelihood, (c) applicability, (d) relativity. The dotted lines indicate the forward algorithm, the solid lines the backward algorithm.

In addition to the AIC, the out-of-sample log-likelihood, the applicability and the relativity are computed for each of the estimated models. The results are shown in figure 5.3. From figure 5.3a it can be observed that the AIC decreases quite rapidly with the inclusion of the first 4 variables. Afterwards, the decrease is less rapid, but still visible until the minimum at 15 variables is reached. In the backward algorithm, the decrease caused by the deletion of risk factors is slow and quite constant. The out-of-sample log-likelihood shows a large increase with the inclusion of the first five variables and remains roughly constant when 5 or more risk factors are included. As observed for the frequency models, both the applicability and the relativity increase with the number of variables included. Therefore, when considering all measures jointly, a model including 5 variables is favourable. Table 5.3 reports the resulting models based on the AIC and the out-of-sample log-likelihood for both the forward and the backward stepwise algorithm. In addition to these two models, the model including 5 risk factors as visited by the forward stepwise algorithm is reported as well. From the number of variables included per data source, it is clear that the relative importance of the car variables is larger for the claim severity than for the claim frequency.


therefore skipped in the first round. In the second round, 14 models were visited. These models consisted of 10 root branches and 4 regular branches that nested one or both of the 2 terminals found. For none of the visited models was a contrasting bunch found. Therefore, the two terminals found in the first round are the only two proposed final models.

Table 5.3 reports these two terminals. Both models have the same parent node, include 13 risk factors and differ in only one variable within the car category. Therefore, the resulting models differ only slightly. When choosing between these two models, one needs to consider the tradeoff between a good fit in and out of sample and the relative change of the premium.

Table 5.3: Model size and performance for the resulting models of the stepwise and autometrics algorithms. The current model is added as the benchmark for the model performance.

Forward Stepwise Algorithm

             Number of Included Variables        Model Performance
Criterion    total  policy  region  individual  car    AIC        LLOOS    Applicability  Relativity
AIC          15     1       2       1           11     471,347    -58,918  18.39          0.0860
LLOOS        11     1       1       1           8      471,373    -58,917  13.63          0.0749
Combined     5      0       0       1           4      471,481    -58,921  5              0.0442

Backward Stepwise Algorithm

             Number of Included Variables        Model Performance
Criterion    total  policy  region  individual  car    AIC        LLOOS    Applicability  Relativity
AIC          15     1       2       1           11     471,347    -58,918  18.39          0.0860
LLOOS        16     1       2       1           12     471,348    -58,915  19.91          0.0859

Autometric Algorithm

             Number of Included Variables        Model Performance
Model        total  policy  region  individual  car    AIC        LLOOS    Applicability  Relativity
1            13     1       2       1           9      473,066.7  -58,912  15.63          0.0780
2            13     1       2       1           9      473,069.7  -58,919  15.63          0.0774

Current Severity Model

             Number of Included Variables        Model Performance
Model        total  policy  region  individual  car    AIC        LLOOS    Applicability  Relativity
-            6      1       2       1           2      471,566.1  -58,968  6.003          0.0429

Comparison. When comparing the risk factors selected by the autometrics algorithm with those selected by the forward and backward stepwise algorithms (based on the AIC), many similarities are found. Recall that the forward and backward stepwise algorithms both selected the same model, which includes 15 variables. 13 of these variables are included in model 1 of table 5.3, while 12 are included in model 2. The overlap in the selection of risk factors is thus strikingly high.


6 | LASSO

In the previous chapter, several subset selection algorithms were discussed. Subset selection algorithms deal directly with the inclusion and exclusion of variables and are therefore a straightforward approach. However, the literature describes some drawbacks of these methods. Harrell (2015) argues that a model resulting from subset selection methods cannot be viewed as a prespecified model. His main concern is that in subset selection algorithms, both the updating and the stopping rules based on hypothesis testing are only valid for prespecified hypotheses. However, during the process of updating the set of included variables, the hypotheses do change. As a result, standard errors are often biased downwards and p-values become too small. These considerations do not only concern pure hypothesis testing; methods such as the AIC are also designed with the comparison of a set of prespecified models in mind. Harrell therefore argues that models resulting from subset selection should be interpreted with care.

Derksen and Keselman (1992) studied the problems with subset selection algorithms. They performed a Monte Carlo study, simulating multiple data sets and reporting the results of the subset selection algorithms. In their research, they encounter problems related to those stated by Harrell. They find that external factors, such as the number of variables initially considered, affect the final number of variables selected. This is largely due to correlations within the data set. They conclude that in the presence of high correlation, the identification of the true variables becomes increasingly difficult.
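The effect described by Harrell and by Derksen and Keselman can be reproduced with a few lines of simulation. The sketch below is an assumed toy example (not taken from their study or from this thesis): the most significant of twenty pure-noise predictors is selected, and the p-value of that selected variable falls below 0.05 far more often than the nominal level suggests.

```python
# Toy Monte Carlo: selecting the "best" of many noise variables by p-value
# makes the reported p-value of the chosen variable look far too small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, reps = 200, 20, 500
smallest_p = []
for _ in range(reps):
    X = rng.normal(size=(n, p))       # pure noise predictors
    y = rng.normal(size=n)            # response unrelated to X
    pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(p)])
    smallest_p.append(pvals.min())    # p-value of the selected variable

print("share of replications with selected p-value < 0.05:",
      np.mean(np.array(smallest_p) < 0.05))   # far above the nominal 5%
```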

By using cross-validation based on a test sample, part of this problem is overcome. However, alternatives to the subset selection method should be considered as well. A proposed alternative to subset selection is the use of shrinkage methods. Harrell explains the benefit of shrinkage in the context of overfitting: in out-of-sample forecasting, the high predictions often are too high, while the low predictions are too low. Therefore, the estimated values should lie closer to the mean of the observations. This can be achieved by shrinking the coefficients.
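To make the idea of coefficient shrinkage concrete before turning to the penalties themselves, the toy sketch below (an assumed example, not thesis code) contrasts the two forms of shrinkage discussed next for an orthonormal design: ridge regression scales every coefficient towards zero, whereas the soft-thresholding associated with the L1 penalty sets the smallest coefficients exactly to zero, which is what allows the LASSO to perform variable selection.

```python
# Toy comparison of two shrinkage rules for hypothetical OLS estimates under an
# orthonormal design (exact scaling of the penalty parameter depends on the
# convention used): ridge shrinks proportionally, soft-thresholding truncates.
import numpy as np

def ridge_shrink(beta_ols: np.ndarray, lam: float) -> np.ndarray:
    """Proportional shrinkage towards zero, as in ridge regression."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols: np.ndarray, lam: float) -> np.ndarray:
    """Soft-thresholding: shrink by lam and truncate at exactly zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([2.5, -1.2, 0.4, 0.05])        # hypothetical OLS estimates
for lam in (0.1, 0.5, 1.0):
    print(f"lambda = {lam}: ridge ->", np.round(ridge_shrink(beta_ols, lam), 3),
          "| lasso ->", np.round(lasso_shrink(beta_ols, lam), 3))
```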

James et al. (2013) describe two methods of shrinkage: ridge regression and the LASSO. Both realise shrinkage by adding a penalty on the estimated coefficients to the optimisation problem. Ridge regression, as proposed by Hoerl and Kennard (1970), shrinks the coefficients gradually towards zero. Tibshirani (1996) proposed the LASSO, in which a different penalty from the ridge regression is used; this penalty is given by

\lambda \sum_{j=1}^{p} |\beta_j|,    (6.1)
