On the improvement of risk classification using boosted trees to search for interactions

By Jelte ter Braake∗

March 5, 2021


Master’s Thesis Econometrics, Operations Research and Actuarial Studies


Abstract


Contents

1 Introduction
2 Problem formulation
3 Literature review
4 Research question
5 Hypotheses
6 Methodology
   6.1 GLM
   6.2 Gradient tree boosting
   6.3 Interactions
   6.4 Including interactions in GLM
   6.5 Software
7 Data
8 Results
   8.1 GLM
   8.2 Gradient tree boosting
   8.3 Interactions
   8.4 Including interactions in GLM
9 Conclusion and discussion

Appendices
   Appendix A: Newton-Raphson method
   Appendix B: Factor categorisation
   Appendix C: Splines
   Appendix D: H-statistic

1 Introduction

Insurance companies must differentiate premiums to reflect the heterogeneity of risks in their portfolio. If instead an equal premium were charged to all risks, the good risks would accept a more suitable offer, with a lower premium, elsewhere, and the company would be left with the 'bad' risks. These bad risks then pay a premium that is lower than the expected insured loss, and the company is expected to lose money over time.

This problem is called adverse selection. In order to avoid it, insurance companies can rank policyholders in different tariff classes based on their estimated risk profile. Within a tariff class, policyholders pay an equal premium that is meant to reflect their inherent risk. The construction of such tariff classes is known as risk classification, see for example Kaas et al. (2008).

Apart from the problem of adverse selection, there is a trend towards increasing individualisation. People prefer not to pay for damage caused by others, and it is considered fairer for policyholders to pay a premium that better reflects their inherent risk. This gives insurance companies additional motivation to differentiate further between policyholders.

2 Problem formulation

The problem of risk classification is to group policyholders in tariff classes such that for each policyholder, the premium reflects his/her inherent risk as well as possible. This premium is called the pure premium. An insurance company has large amounts of data available containing historic claims and the corresponding characteristics of the policyholders. This can be used to construct the different tariff classes in a data-driven way.

Generalised linear models (GLMs), as developed by Nelder and Wedderburn (1972), are the current industry standard for classifying risk. The success of GLMs in an insurance setting is because of their mathematical and computational tractability, as well as their stability. Furthermore, when utilising GLMs it is relatively straightforward to construct a practical, flexible and interpretable tariff that can be explained easily to stakeholders (Henckaerts et al., 2018). Moving away from this highly embedded standard is nearly impossible in practice, at least in the short term. That is why in this research, we aim to improve risk classification while remaining in the GLM framework.


Since risk classification models typically contain many rating variables, it may seem intuitive that interactions between at least some of these variables exist. However, due to the complexity and abundance of possible interactions, especially higher-order interactions, hardly any interactions are used in many current risk models (Buchner et al., 2017). That is why in this research, we want to improve risk classification by studying interactions, applying a modern machine learning technique that is suitable for modelling interactions in a very flexible way.

3 Literature review

Nelder and Wedderburn (1972) developed generalised linear models (GLMs). They used the technique of iterative weighted linear regressions to obtain maximum likelihood estimates of the parameters, with observations distributed according to any distribution in the exponential family. A relevant example of a distribution in the exponential family is the Poisson distribution, which is often used to model count variables.

GLMs are often used for risk classification; see for example De Jong and Heller (2008) or Denuit et al. (2007) for a complete description. When modelling risk, two components are estimated separately: the claim frequency and the claim severity. The claim frequency is the number of claims per unit of time for which premium has been paid (the exposure), and the claim severity refers to the average claim cost. Combined, these two measures define the total predicted loss for a specific policyholder. Since claim frequency and severity are assumed to be independent, the modelled risk of a policyholder is simply given by the product of the two models, i.e. for a given year, the estimated risk of a specific policyholder equals the estimated frequency times the estimated severity for that year. GLMs have become the industry standard to separately model claim frequency and severity. For predicting the claim frequency, the Poisson distribution is often used, while for the claim severity, the gamma distribution is especially interesting.


Figure 1: Simple example of a classification tree predicting whether a car crosses a road or waits before crossing.

Tree-based methods are powerful and popular tools for classification and prediction. Classification and regression trees (CART) can be used for a discrete dependent variable (classification tree) or a continuous dependent variable (regression tree). CART is an intuitive non-parametric method, which can easily be shown graphically. Starting from the root, an observation passes down the tree through a series of branches at which a decision is made as to which direction to proceed, based on one of the explanatory variables. Ultimately, a terminal node (leaf) is reached and a predicted response is given. In short, trees partition the explanatory variables into a series of boxes that contain the most homogeneous collection of outcomes possible (Breiman et al., 1984). An example is given in figure 1. Here a simple classification tree is shown, in which the decision to cross a road is predicted. The decision variable is categorical: a car crosses a road or waits before crossing. This choice is predicted in the final nodes, i.e. the leaves of the tree, displayed in yellow. Starting from the root on the left, the question is asked whether cars are currently crossing the road. Based on this answer, the data set is split into two parts. If the answer is 'yes', then the car is expected to wait. Otherwise, we move along to the subsequent node, asking whether there is a traffic light present, and the data is again split into two parts, and so on. Since we have a maximum of three subsequent nodes, the tree depth is said to equal three.
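To make this concrete, the sketch below grows such a classification tree in R with the rpart package on a small simulated data set; the data, variable names and deterministic decision rule are purely illustrative and are not taken from the thesis.

```r
# Illustrative only: grow a small classification tree like the one in figure 1.
library(rpart)

set.seed(1)
n <- 200
toy <- data.frame(
  cars_crossing = factor(sample(c("yes", "no"), n, replace = TRUE)),
  traffic_light = factor(sample(c("red", "green"), n, replace = TRUE))
)
# Deterministic rule: wait if cars are crossing or the light is red, else drive.
toy$decision <- factor(ifelse(toy$cars_crossing == "yes" | toy$traffic_light == "red",
                              "wait", "drive"))

fit <- rpart(decision ~ cars_crossing + traffic_light, data = toy, method = "class")
print(fit)                                 # the fitted splits, leaf by leaf
predict(fit, toy[1:3, ], type = "class")   # predicted class for the first rows
```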


Boosting works by repeatedly fitting a weak learner to modified versions of the data, producing a sequence of weak learners. At each successive iteration, the boosting algorithm modifies the weights given to the individual data points and the classification or regression problem is reapplied. Misclassified observations have their weights increased, whereas correctly predicted ones have their weights decreased. Thus in further iterations, observations that are hard to predict receive ever-increasing influence, until the algorithm converges or the maximum number of iterations is reached. Henckaerts et al. (2020) mention that boosted trees are also praised for their ability to model interactions. In their paper, they propose three reasons why this is the case. Firstly, tree-based methods include interaction terms by construction, as splits in nodes further down the tree depend on previous splits, which by definition is an interaction. Therefore, a researcher does not need any preconception on this matter and common specification errors are avoided. Secondly, discontinuous relationships and nonlinear interaction effects are more naturally accommodated by tree-based methods than by multiplicative interactions. Lastly, using joint plots one can visualise interactions between variables, providing a useful tool to intuitively explore interaction effects.

An interaction effect is one possible relationship between two variables. In a causal model, multiple relationships can occur. For example, a direct causal relationship between the dependent and independent variable exists when the independent variable is a direct cause of the dependent variable. Another possible relationship is the moderated causal relationship, in which the relationship between two variables is moderated by a third variable. In other words, the impact of one independent variable on the dependent variable depends on the level of another independent variable. Moderated relationships are often called interaction effects (Jaccard and Turrisi, 2003). An example in risk classification for auto insurance is the interaction between gender and age. Here gender alone does not drive differences in inherent risk, but when moderated by age there are differences (Association of British Insurers, 2010). Young female drivers appear to have a lower risk than their male counterparts. This difference decreases over time until the risk is approximately equal for ages 35 and older.

Because of the complexity and the abundance of possible interactions in a model with many predictors, hardly any interactions are used in many current models. Only a few studies have analysed interaction effects in risk classification in a clear and systematic way. One of them is Buchner et al. (2017), where the authors tried to improve the predictive performance of risk adjustment in health insurance by identifying interactions with regression trees and adding the relevant interactions to the traditional least squares regression. Although this study used single regression trees, which are weak predictors as mentioned earlier, rather than boosted trees, the authors found a marginal increase in the adjusted R².


To quantify interaction effects in gradient boosting machines, different measures have been proposed. The most popular one is Friedman's H-statistic (Friedman and Popescu, 2008). See for example Rajbahadur et al. (2021), where the authors use the H-statistic to measure interactions. The authors also warn researchers and practitioners that feature interactions may hinder the interpretation of a model and recommend that stakeholders always detect feature interactions in their data set. The H-statistic is based on estimated partial dependence functions and can be interpreted as the share of variance that is explained by the interaction, regardless of its particular form. The authors also propose a test statistic to evaluate whether the H-statistic differs significantly from zero, with the null hypothesis assuming no interaction.

More recently, Greenwell et al. (2018) proposed an alternative measure. This measure is also based on estimated partial dependence functions, but takes a different approach: the authors introduce a model-agnostic, partial dependence-based variable importance measure, which can also be used to quantify the strength of potential interaction effects. In their paper, the authors present a simulation study with one true interaction and compare their proposed measure to the H-statistic. In this simulation, the proposed measure identifies the true underlying interaction as having the strongest interaction effect. The H-statistic, however, did not seem to catch the true interaction and instead ranked two other pairs of variables as having the strongest interaction effects, although these predictors did not appear in the true model.

4 Research question

The goal of this study is to investigate whether the predictive performance of the GLM risk classification model for auto insurance can be improved by including interactions. In this research, we apply the ensemble learning method tree-based gradient boosting to hunt for possible interactions, and we try to improve the risk classification model by including the strongest two-way interactions found by the boosting approach in the GLM framework. The research question that will be answered in this thesis is:

"Can a generalised linear risk classification model be improved by including the strongest two-way interactions measured in a tree-based gradient boosting model based on the same set of variables?"


To assess whether potential improvements in predictive power are significant, we use the Akaike information criterion (AIC). This criterion is a measure of relative model fit and introduces a penalty for the additional parameters that need to be estimated to model the interaction. We thus speak of a significant increase in predictive performance when both the absolute and the relative performance measure improve.

5 Hypotheses

Since there is an abundance of possible interactions between the many variables used in risk classification, we expect that by systematically searching for them using boosted trees, without any preconception on the interactions, we can find multiple interesting interaction effects among the variables that are currently used in risk classification models. Moreover, we expect that by including the strongest interactions in the GLM, the risk classification can be improved in terms of predictive power, by better differentiating between classes based on estimated inherent risk.

Based on the literature, we may expect multiple different interactions. The standard example in an auto insurance setting is an interaction effect between age and gender in one corner of the interaction space, namely for young drivers. Young males appear to be riskier than their female counterparts, although this effect declines as age increases (Association of British Insurers, 2010). Furthermore, Yang et al. (2018) find several important interactions involving the type of area (urban or rural) the policyholder lives in: the marginal effects of multiple regressors on the pure premium are greater for policyholders living in an urban area than for those living in a rural area. Su and Bai (2020) find a significant interaction between the driver's age and the vehicle age. They argue that for young drivers the vehicle age has a significant effect on the claim frequency, and that this effect gradually decreases as the driver's age increases. Lastly, Lorentzen and Mayer (2020) report a case study comparing a GLM to different black-box models for modelling claim frequency, among them boosted trees. The strongest two-way interaction resulting from their black-box models is a very strong interaction between the vehicle's age and brand, with an interaction strength that is almost as large as the two main effects together. Besides, they find that the interaction between gas type and brand is substantial as well.

6 Methodology

Recall from section 3 that in risk classification, the frequency and severity are normally modelled separately. Since the model for claim frequency is generally more informative than the severity model, we will investigate claim frequency in this research. Afterwards, the severity can be analysed in a similar way and this combination defines a full tariff plan. We start by estimating a benchmark GLM for the claim frequency, which we will call the original GLM. Then on the residuals of the original GLM, we will fit a tree-based gradient boosting machine. For training this gradient boosting model, we use the same set of variables as included in the original GLM. Boosted trees are very suitable for detecting interaction effects, as the model requires no preconception on this matter. We continue by identifying the strongest two-way interactions from the results of the boosted trees, by making use of a partial dependence-based measure from Greenwell et al. (2018). Lastly, we will include these interactions in the GLM framework.

For a quick overview, see the flowchart in figure 2. In the following subsections, we explain each of the steps in more detail.

[Flowchart: variables and observations → Poisson-family GLM → residuals → gradient boosting fit → variable importance / interaction measure (VIM) → interactions → improved GLM]

Figure 2: Flowchart showing the main steps taken in this research.

6.1 GLM

A probability function fY of a random variable Y that can be written in the general form

$$f_Y(y_i; \theta, \phi) = c(y_i, \phi)\,\exp\!\left(\frac{y_i\theta - \alpha(\theta)}{\phi}\right), \qquad (1)$$

is a member of the exponential family. With θ = ln(λi), α(θ) = exp(θ) = λi, c(yi, φ) = 1/yi!, and φ = 1, equation (1) is equivalent to the Poisson distribution

$$f_Y(y_i; \lambda_i) = P(Y = y_i \mid \lambda_i) = \frac{\lambda_i^{y_i}\exp(-\lambda_i)}{y_i!}, \qquad y_i \in \mathbb{N}, \qquad (2)$$

where the parameter λi denotes the rate. For a single unit of time, the mean and the variance are both equal to λi, i.e. E[Yi] = Var[Yi] = λi. This property of the Poisson distribution, that the mean equals the variance, is called equidispersion. So when fitting a Poisson distribution to the data, the mean cannot be modelled independently of the variance.

We aim to model the claim frequency of policyholder i, which is defined as the number of claims per unit of exposure. The claim frequency is commonly modelled by a GLM with a Poisson error distribution and a log-link function. A link function defines the relationship between the random and the systematic components of a model, i.e.

$$\mathrm{E}[Y_i] \equiv \lambda_i = g^{-1}(\eta_i), \qquad (3)$$

where ηi is the linear predictor, g is the link function, and Yi is the stochastic dependent variable with expectation λi. Equation (3) shows that the expected value λi of Yi is linked to the linear predictor ηi by the link function g, as ηi = g(λi). For the log-link function g(λi) = ln(λi), we have

$$\lambda_i = g^{-1}(\eta_i) = \exp(\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}) = \exp(\beta_0)\cdot\exp(\beta_1 x_{i1})\cdot\ldots\cdot\exp(\beta_p x_{ip}), \qquad (4)$$

where p is the number of regressors in the linear predictor ηi, β = (β0, . . . , βp) is the vector of coefficients to be estimated, and xj = (x1j, . . . , xnj) is the vector containing the n observations of the j-th regressor. To account for the fact that policies are observed over different lengths of time, we include an offset term ξi = ln(ei) based on the exposure ei. Adding this offset to the linear predictor ηi gives

$$\eta_i = x_i\beta + \xi_i = x_i\beta + \ln(e_i), \qquad (5)$$

where xi = (1, xi1, . . . , xip)′ contains the i-th observation of each of the p regressors and an additional one for the intercept. For the expectation of Yi, this results in

$$\mathrm{E}[Y_i] = \lambda_i = g^{-1}(\eta_i) = g^{-1}\!\big(x_i\beta + \ln(e_i)\big) = \exp\!\big(x_i\beta + \ln(e_i)\big) = \exp(x_i\beta)\,e_i. \qquad (6)$$

That is, by adding an offset term to the linear predictor ηi, we weight the expected claim counts by the corresponding exposure, which results in estimating each claim count for the same unit of time.
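As an illustration, a frequency GLM of this form can be fitted in R as sketched below; the data frame and column names (policies, subclaims, exposure, driver_age, vehicle_age) are placeholders rather than the thesis's actual variables.

```r
# Sketch: Poisson frequency GLM with a log link and an exposure offset, cf. (5)-(6).
# 'policies' and its column names are illustrative placeholders.
freq_glm <- glm(
  subclaims ~ driver_age + vehicle_age + offset(log(exposure)),
  family = poisson(link = "log"),
  data   = policies
)
summary(freq_glm)
# Fitted expected claim counts scale with the exposure: E[Y_i] = exp(x_i beta) * e_i
head(fitted(freq_glm))
```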

Next, we want to estimate the coefficients β. We can do this using maximum likelihood estimation. This method seeks the combination of coefficients under which the observed data have the highest probability, given the assumed model form. The likelihood is defined as the product of the probabilities of observing each claim count. The Poisson distribution has the density function given in equation (2). The likelihood function then has the same form as the joint density function of the observations, i.e. the product of the individual densities; however, the joint density is a function of the data given the parameters, while the likelihood is a function of the parameters given the data. For the Poisson likelihood, we get

$$L(\lambda; y) = \prod_{i=1}^{n} f_Y(y_i; \lambda_i) = \prod_{i=1}^{n} \frac{\lambda_i^{y_i}\exp(-\lambda_i)}{y_i!}. \qquad (7)$$

It is convenient to work with the natural logarithm of the likelihood; since a log-transformation does not change the location of the optimum, this does not affect the maximum likelihood estimates we are interested in. Taking logarithms, we get the following log-likelihood ℓ = ln(L):

$$\ell(\lambda; y) = \sum_{i=1}^{n} \Big( y_i\ln(\lambda_i) - \lambda_i - \ln(y_i!) \Big). \qquad (8)$$

With the log-link function λi = exp(xiβ)ei, the log-likelihood reduces to

$$\ell(\beta; y) = \sum_{i=1}^{n} \Big( y_i\big(x_i\beta + \ln(e_i)\big) - \exp(x_i\beta)\,e_i - \ln(y_i!) \Big). \qquad (9)$$

Finally, in order to maximise the log-likelihood we take the derivatives with respect to βm (m = 0, . . . , p) and equate them to zero. We get the following system of equations

$$\frac{\partial \ell(\beta; y)}{\partial \beta_m} = \sum_{i=1}^{n} x_{im}\Big( y_i - \exp(x_i\beta)\,e_i \Big) = 0, \qquad m = 0, \ldots, p, \qquad (10)$$

with xi0 = 1, which can be solved numerically using the Newton-Raphson method to find the estimates of each βm (m = 0, . . . , p). The Newton-Raphson method is a root-finding algorithm able to approximate the roots of a real-valued function such as system (10). For details about the Newton-Raphson method, see Appendix A.
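As a rough illustration of how system (10) can be solved, the sketch below implements a plain Newton-Raphson iteration for the Poisson log-likelihood (9). It is not the thesis's implementation (in practice glm() uses the closely related iteratively reweighted least squares algorithm), and the inputs X, y and e are assumed objects.

```r
# Sketch: Newton-Raphson for the Poisson log-likelihood (9).
# X: n x (p+1) design matrix including a column of ones; y: claim counts; e: exposures.
newton_poisson <- function(X, y, e, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    mu    <- exp(drop(X %*% beta)) * e     # E[Y_i] = exp(x_i beta) e_i
    score <- drop(t(X) %*% (y - mu))       # gradient of (9), i.e. system (10)
    hess  <- -t(X) %*% (X * mu)            # Hessian of (9)
    step  <- drop(solve(hess, score))
    beta  <- beta - step                   # Newton update
    if (max(abs(step)) < tol) break
  }
  beta
}
```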

The independent variables in x = (1, x1, . . . , xp) can be either numerical or categorical, for example age or car brand, respectively. Before modelling, it is necessary to consider how the explanatory variables should enter the model. In the case of categorical variables, several factor levels can be clustered together in order to achieve a more parsimonious model. In particular, levels with a low data volume, unstable parameter estimates or similar characteristics can be combined. For example, the variable car brand has many levels, the majority of which are lesser-known brands with very few observations and therefore unreliable results. These brands can be combined into the factor 'other', containing all infrequently occurring brands that appear to be similar in terms of claim frequency. In other cases, similar levels are grouped. For example, the driver age variable also contains many levels: one for each discrete number of years present in the data. As expected, two adjacent years are often not significantly different from each other. Therefore, age classes containing consecutive years can be formed, which then do get significantly different effects. An algorithm to automatically implement this is given in Appendix B.
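A minimal sketch of the brand-grouping step is given below, assuming a factor column vehicle_brand in a data frame policies and an arbitrary frequency threshold; none of these names or values come from the thesis.

```r
# Illustrative only: collapse infrequent car brands into an 'other' level.
brand_counts <- table(policies$vehicle_brand)
rare_brands  <- names(brand_counts[brand_counts < 500])   # threshold is arbitrary
levels(policies$vehicle_brand)[levels(policies$vehicle_brand) %in% rare_brands] <- "other"
table(policies$vehicle_brand)   # rare levels are now merged into 'other'
```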

Regarding numerical variables, there might exist non-linear effects on (part of) the domain of a variable. In this case, splines are a flexible tool to model such effects. Splines are piecewise polynomials used in curve fitting. For details, see Appendix C.
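For illustration, a cubic spline effect can be added to the frequency GLM with the splines package; again the variable names and the degrees of freedom are placeholders, not choices made in the thesis.

```r
# Sketch: a cubic spline for driver age in the Poisson frequency GLM.
library(splines)
freq_glm_spline <- glm(
  subclaims ~ bs(driver_age, degree = 3, df = 5) + vehicle_brand + offset(log(exposure)),
  family = poisson(link = "log"),
  data   = policies
)
```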

Since the goal is to improve the risk classification of this GLM, we try to explain an additional part of the residuals that are left unexplained by the original GLM by including interaction effects. For this reason, we first fit the original GLM to capture the main effects, and afterwards we try to model the GLM deviance residuals using the gradient boosting method. The deviance measures how far the likelihood of a proposed model is from the likelihood of a perfectly fitting (saturated) model. For a Poisson GLM, the deviance residual S(yi, ŷi) of the i-th observation is given by

$$S(y_i, \hat{y}_i) = \operatorname{sgn}(y_i - \hat{y}_i)\,\sqrt{2\left( y_i\log\!\left(\frac{y_i}{\hat{y}_i}\right) - (y_i - \hat{y}_i)\right)}, \qquad (11)$$

where yi is the observed claim frequency and ŷi is the predicted weighted expectation of the response.

6.2 Gradient tree boosting

Gradient boosting (Friedman, 2001) is a recursive, nonparametric machine learning algorithm. The main idea of gradient boosting is to sequentially add new models to the ensemble, where at each iteration a new weak learner is trained with respect to the error of the whole ensemble so far. Here, we shortly explain the general procedure.

In supervised machine learning problems, there is a data set {(xi, yi)}, i = 1, . . . , n, where xi are the features and yi is the target of observation i. The problem is to estimate the underlying unknown function y = F(x). To measure the fit of the estimated mapping F̂(x), a loss function L(y, F̂(x)) is used. Gradient boosting solves this problem in an iterative manner; an overview of the algorithm is shown in algorithm 1.

Note that in the update step (iv), ν is the shrinkage factor (0 < ν ≤ 1), or learning rate, which controls the size of the update step. Friedman (2001) found that shrinkage with ν < 1 reduces overfitting and improves predictive accuracy. Moreover, the fitted function hm(xi) can be viewed as an approximation of the negative gradient, similar to the gradient descent algorithm. Finally, the final gradient boosting model is given by the sum of the initial constant and all subsequent function updates, F̂(x) = Σ_{i=0}^{M} F̂i(x).

When modelling tree-based gradient boosting, some additional hyperparameters arise. Although this makes the method very flexible, these hyperparameters need to be tuned. They can be divided into three categories: tree-specific parameters, boosting parameters and miscellaneous parameters. Tree-specific parameters affect the individual trees in the model, while the boosting parameters affect the boosting operation. Here follows a short description of each of the hyperparameters:

• Tree-specific parameters:

– Minimum observations in leaf. This parameter defines the minimum number of observations required in a terminal node (leaf). It prevents further splits when the subset of the training set in a node would become too small, and can thus be used to control overfitting. However, for imbalanced problems (which is often the case with insurance claim data), lower values should in general be chosen, since the regions in which the minority class is in the majority tend to be small.

Input: training set {(xi, yi)}, i = 1, . . . , n, a differentiable loss function L(y, F̂(x)), and a number of iterations M.
Output: model F̂(x).

(1) Initialise the model with a constant value:

$$\hat{F}(x) = \hat{F}_0, \qquad \hat{F}_0 = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma), \quad \gamma \in \mathbb{R};$$

(2) for m ∈ {1, . . . , M} do

(i) Calculate the pseudo-residuals:

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = \hat{F}_{m-1}(x)} \qquad \text{for } i = 1, \ldots, n;$$

(ii) Fit a base learner (e.g. a tree) hm(x) to the pseudo-residuals, i.e. train it using the training set {(xi, rim)}, i = 1, . . . , n;

(iii) Compute the multiplier γm by solving the one-dimensional optimisation problem:

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i, \hat{F}_{m-1}(x_i) + \gamma h_m(x_i)\big);$$

(iv) Update the current approximation F̂(x) with the function update F̂m(x) = νγm hm(x):

$$\hat{F}(x) \leftarrow \hat{F}(x) + \hat{F}_m(x) = \sum_{i=0}^{m} \hat{F}_i(x) = \hat{F}_0 + \sum_{i=1}^{m} \nu\gamma_i h_i(x).$$

end

Algorithm 1: The gradient boosting algorithm.

• Boosting parameters:

– Number of trees. Using too many boosting iterations can lead to overfitting. The required number of trees also interacts with the learning rate: with a smaller learning rate, the number of trees necessary to converge is higher.

– Early stopping. To prevent overfitting as a result of the number of trees that is too large (as explained above), as well as save computational costs, an early stopping criterion can be formulated. That is, the training stops when a cross-validation metric does not improve for a certain number of trees, based on a simple moving average. Note that by enabling early stopping, the number of trees can be set at a very large level without the problems of unnecessary computational cost or overfitting.

– Subsample rate. The subsample rate corresponds to the fraction of observations that are used for each subsequent iteration. Rather than using the whole training set, a new random sample of observations, with a size defined by the subsample rate, is drawn (without replacement) in each iteration. Friedman (2002) showed that subsampling (with rate < 1) can greatly improve predictive performance by preventing overfitting while simultaneously reducing computation time.

• Miscellaneous:

– Loss function L. Lastly, the loss function is the function minimised in each split of the trees. For a numeric dependent variable, often the quadratic loss is minimised; another option is to minimise the absolute loss. Notice that for count data, Poisson loss is also possible, but since we model the residuals rather than the frequency itself, this is not an option here. (A sketch of how these hyperparameters map onto a typical R implementation is given below.)
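The sketch below shows how the hyperparameters above map onto the arguments of the R package gbm, one common implementation of gradient boosting; the thesis does not name its boosting package in this extract, and the formula, data and values shown are illustrative only.

```r
# Illustrative mapping of the hyperparameters above to 'gbm' arguments.
library(gbm)
gbm_fit <- gbm(
  glm_residual ~ driver_age + vehicle_age + deductible,  # placeholder formula
  data              = train,        # placeholder training set
  distribution      = "laplace",    # loss function: absolute (Laplacian) loss
  n.trees           = 250,          # number of boosting iterations
  shrinkage         = 0.3,          # learning rate (nu)
  interaction.depth = 5,            # maximum depth of variable interactions per tree
  n.minobsinnode    = 1000,         # minimum observations in a leaf
  bag.fraction      = 0.5           # subsample rate (drawn without replacement)
)
```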

We perform a grid search to tune the hyperparameters. That is, we fit a boosted tree model for many combinations of the hyperparameters in the hyperparameter space and search for the best combination based on the mean absolute error (MAE) on (part of) the validation set. To speed up computation, we use 1/8th of the total number of observations, i.e. for both the training set and the validation set we randomly select 12.5% of the observations without replacement. Since we want to test many combinations, we have to set the learning rate relatively high and the number of iterations relatively low, such that the grid search finishes within a reasonable amount of time. That is, we initially fix the learning rate at ν = 0.3 and the number of trees at 250, such that the boosting model learns relatively fast. We must check that, at this learning rate, the validation error converges within this number of trees; when that is the case, we can tune the remaining hyperparameters. We can start with a wide search, e.g. over the following values:

• Subsample rate = 0.4, 0.6, 0.8, 1;
• Loss function = gaussian, laplacian.

Our grid then consists of searching over 6 · 6 · 4 · 2 = 288 combinations. Note that even with the precautions of smaller data samples and a relatively low number of iterations, this still might take a while. After interpreting the validation error for each of the hyperparameter combinations, we observe a range of possibly optimal values and can search again in a more specific range until we find the optimal values. Note that this grid search is based on part of the training and validation set. Since one might expect that the optimal minimum number of observations in a leaf or the optimal subsample rate depends on the number of observations, we have to test those hyperparameters again on the whole data set as well. When these parameters have been tuned, we finish with tuning the combination of learning rate and number of trees, given the optimal values found for the remaining hyperparameters. Notice that again, our goal is to tune the hyperparameters such that the model predicts the validation set as well as possible. However, now we also have a time constraint: for very small learning rates (< 0.0001), the model might require thousands of iterations to converge, which might take days or even weeks with a very large data set.

When all hyperparameters have been tuned, the training of the final model can start. Note that, as mentioned before, the dependent variable is defined differently than in the GLM: here we want to explain the claim frequency residuals of the GLM rather than the claim frequency itself. Regarding the independent variables, we use the same input data for fitting the gradient boosting model as for fitting the GLM, since we want to capture the interactions of exactly the input variables of the GLM.

Tree-based models not only offer a way to measure interactions, but can also be used to look at the variable importance of the single regressors in the model. The relative importance of the individual regressors is among the most useful descriptions of a black box model like gradient boosting. Breiman et al. (1984) propose a measure to estimate the relative influences Ij of the individual inputs xj on the variation of a tree-based model

F̂(x) over the joint input variable distribution:

$$\hat{I}_j^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^2\,\mathbf{1}(v_t = j), \qquad (12)$$

where the summation is over the nonterminal nodes t of the J-terminal-node tree T, vt is the splitting variable associated with node t, and î²t is the corresponding empirical improvement in squared error as a result of the split of region R into subregions Rl and Rr:

$$i^2(R_l, R_r) = \frac{w_l w_r}{w_l + w_r}\,(\bar{y}_l - \bar{y}_r)^2. \qquad (13)$$

Here, ȳl and ȳr are the left and right daughter response (GLM residual) means, respectively, and wl and wr are the corresponding sums of the weights (exposure).

Friedman (2001) generalises this to a collection of decision trees obtained through boosting, by simply taking the average over all trees:

$$\hat{I}_j^2 = \frac{1}{M}\sum_{m=1}^{M} \hat{I}_j^2(T_m). \qquad (14)$$

Note that equations (13) and (14) are based purely on heuristic arguments, but have been found to produce reasonable results in practice.

6.3 Interactions

A function F(x) exhibits a two-way interaction effect between variables xj and xk if changing the value of variable xj leads to a change in the value of F(x) that depends on xk. Mathematically, in the case of an interaction, we have

$$\mathrm{E}_{x}\!\left[\frac{\partial^2 F(x)}{\partial x_j\,\partial x_k}\right]^2 > 0. \qquad (15)$$

On the other hand, no interaction between xj and xk implies that the function F(x) can be expressed as the sum of two terms f\j and f\k that are independent of xj and xk, respectively. That is, we have

$$F(x) = f_{\backslash j}(x_{\backslash j}) + f_{\backslash k}(x_{\backslash k}), \qquad (16)$$

where x\l = (x1, . . . , xl−1, xl+1, . . . , xp) is the set of input variables x without xl.


As mentioned previously, the interaction measure from Greenwell et al. (2018) is also based on estimated partial dependence functions. A partial dependence plot (PDP) shows the marginal effect that one or two features have on the predicted outcome of a machine learning model, while accounting for the average effect of the other predictors in the model (Friedman, 2001). Let x = (x1, . . . , xp)′ represent the predictors in a model and let F(x) be the prediction function of the model. If we partition x into an interest set zs and the remaining variables zc = x \ zs, then the partial dependence of the response on zs is defined as

$$F_s(z_s) = \mathrm{E}_{z_c}\!\left[F(z_s, z_c)\right] = \int F(z_s, z_c)\,p_c(z_c)\,dz_c, \qquad (17)$$

where pc(zc) is the marginal probability density of zc, i.e. pc(zc) = ∫ p(x) dzs. Equation (17) can be estimated from a set of training data by

$$\hat{F}_s(z_s) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}(z_s, z_{i,c}), \qquad (18)$$

where zi,c (i = 1, . . . , n) are the values of zc that occur in the training sample. We thus average out the effect of all other predictors zc in the model.

Constructing a PDP in practice is rather straightforward. Greenwell et al. (2018) showed that we can use the following algorithm. To simplify, let zs = x1 be the variable of interest, with unique values (x11, . . . , x1k). The partial dependence of the response on x1 can then be constructed with algorithm 2.

Input: the unique predictor values x11, . . . , x1k;
Output: the estimated partial dependence values F̂1(x11), . . . , F̂1(x1k).

for i ∈ {1, . . . , k} do

(1) copy the training data and replace the original values of x1 with the constant x1i;

(2) compute the vector of predicted values from the modified copy of the training data;

(3) compute the average prediction to obtain F̂1(x1i).

end

Algorithm 2: A simple algorithm for constructing the partial dependence of the response on a single predictor x1.
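Algorithm 2 is easy to implement directly; the sketch below is an illustrative, model-agnostic version in R, where model, train and the predictor name are placeholders, and any extra arguments needed by the model's predict method (e.g. n.trees for a boosted model) are passed through the dots.

```r
# Sketch of algorithm 2: manual partial dependence of the response on one predictor.
partial_dependence <- function(model, train, predictor, ...) {
  values <- sort(unique(train[[predictor]]))
  pd <- sapply(values, function(v) {
    modified <- train
    modified[[predictor]] <- v                        # step (1): overwrite with constant
    preds <- predict(model, newdata = modified, ...)  # step (2): predict on modified copy
    mean(preds)                                       # step (3): average the predictions
  })
  data.frame(value = values, partial_dependence = pd)
}
# Hypothetical use with a fitted boosting model:
# pd_age <- partial_dependence(gbm_fit, train, "driver_age", n.trees = 347)
```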

These estimated partial dependence functions can be used to quantify variable importance. Greenwell et al. (2018) propose the following simple and effective partial dependence-based variable importance measure i(x1) for predictor x1, namely the standard deviation of the estimated partial dependence values:

$$i(x_1) = \operatorname{sd}\big\{\hat{F}_1(x_{11}), \ldots, \hat{F}_1(x_{1k})\big\}. \qquad (19)$$

The idea is that the 'flatness' of the PDP is a measure of the variable importance. That is, a relatively flat PDP for a particular variable indicates that the variable does not have much influence on the predicted response, i.e. such a variable is likely to be less important than a variable whose partial dependence varies across a wider range of the predicted response.

Although variable importance measure (19) is model-agnostic, decision trees offer a very natural way of quantifying variable importance. In a decision tree, at each node t a single feature is used to partition the data into two homogeneous groups. The chosen predictor is the one that maximises some measure of improvement ît. The relative importance of predictor xj is the sum of the squared improvements over all internal nodes of the tree for which xj was chosen as the partitioning variable (j = 1, . . . , p); see Breiman et al. (1984) for more details. This idea can be extended to tree ensembles, such as the GBM, where the improvement score for each variable is averaged across all trees. Therefore, for a large number of trees, the variable importance measure is more reliable (Hastie et al., 2009, p. 368).

Greenwell et al. (2018) argue that variable importance measure (19) can also be used to quantify the strength of potential interaction effects. Let l(xi, xj) (i ≠ j) be the standard deviation of the joint partial dependence values F̂ij(xii′, xjj′) for i′ = 1, . . . , ki and j′ = 1, . . . , kj, where ki and kj denote the number of unique observations of features xi and xj, respectively. Essentially, a weak interaction effect would suggest that i(xi, xj) shows little variation when either xi or xj is held constant while the other varies. So let zs = (xi, xj) with i ≠ j be any two predictors in the feature space x. We construct the partial dependence function F̂s(xi, xj), compute i(xi) for each unique value of xj, denoted l(xi | xj), and take the standard deviation of the resulting importance scores. The same can be done for xj, and the two results are averaged together. Large values (relative to each other) are indicative of possible interaction effects.

Note that this interaction metric relies on the fitted (gradient boosting) model. Therefore, it is crucial to properly tune and train this model to attain the best performance possible. Fortunately, that is what we did by using a grid search, as described in the previous subsection 6.2.
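In practice this measure can be computed with the vint() function of the vip package mentioned in subsection 6.5; the sketch below is illustrative, the fitted object and feature names are placeholders, and the n.trees argument is assumed to be passed through to the underlying partial dependence computation for a gbm model.

```r
# Sketch: ranking two-way interaction strength with the partial dependence-based
# measure of Greenwell et al. (2018), via vip::vint(). Names are placeholders.
library(vip)
interaction_strength <- vint(
  object        = gbm_fit,
  feature_names = c("Deductible", "VehicleAge", "PolicyHolderAge"),
  n.trees       = 347            # assumed pass-through to predict.gbm()
)
head(interaction_strength)       # feature pairs ordered by interaction strength
```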

6.4 Including interactions in GLM

A natural way to proceed is to iteratively add the strongest measured interactions to the GLM as multiplicative factors and at each iteration, check if the model improves significantly based on two different measures.

As an absolute measure of predictive performance on the validation data, we will use the exposure-weighted average of the unit Poisson deviance

$$D(y, \hat{y}) = \sum_{i=1}^{n} w_i\,S(y_i, \hat{y}_i) \Big/ \sum_{i=1}^{n} w_i, \qquad (20)$$

where the unit Poisson deviance S(yi, ŷi) is given by

$$S(y_i, \hat{y}_i) = \operatorname{sgn}(y_i - \hat{y}_i)\,\sqrt{2\left( y_i\log\!\left(\frac{y_i}{\hat{y}_i}\right) - (y_i - \hat{y}_i)\right)}, \qquad (21)$$

where yi ≥ 0 is the observed response of the frequency, ŷi > 0 is the predicted expectation of the response and wi ≥ 0 is the exposure, for observations i = 1, . . . , n of the validation data. Recall from subsection 6.1 that we introduced an offset term containing the exposure in order to weight the claim counts when estimating the GLM. Here we want to weight the deviance by the exposure for a different reason: an observation with a higher corresponding exposure contains more information and less variability, and by weighting the deviance we compensate for these differences in variance.

Note that by repeatedly adding interactions to the GLM, we might expect the absolute performance metric to keep improving at least slightly. However, we want to know whether or not the inclusion of a specific interaction results in a significant improvement. Therefore, we also use the Akaike information criterion (AIC) (Akaike, 1973), a relative measure of model fit based on the log-likelihood that penalises the estimation of additional parameters. Based on this criterion, we are able to check whether estimating the extra parameters of an additional interaction is worth including in the model. The AIC is given by

$$\mathrm{AIC} = -2\ln(L) + 2p, \qquad (22)$$

where L is the likelihood (7) (see subsection 6.1) and p is the number of parameters in the GLM that need to be estimated.

We stop iterating when no significant improvement is found and we can then conclude that for even weaker interactions, we will also find no significant improvement.
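To illustrate one iteration of this procedure, the sketch below adds a single multiplicative interaction to the hypothetical freq_glm object from the earlier sketch and compares AIC values; the deviance helper uses the unsigned unit Poisson deviance, a common variant of equations (20)-(21), and all object and column names are placeholders.

```r
# Sketch: add one interaction term and compare relative and absolute fit.
glm_int <- update(freq_glm, . ~ . + deductible:vehicle_age)
AIC(glm_int) - AIC(freq_glm)            # negative difference => better relative fit

# Exposure-weighted mean unit Poisson deviance on the validation set.
wtd_poisson_deviance <- function(y, mu, w) {
  d <- 2 * (ifelse(y > 0, y * log(y / mu), 0) - (y - mu))
  sum(w * d) / sum(w)
}
mu_val <- predict(glm_int, newdata = valid, type = "response")
wtd_poisson_deviance(valid$subclaims, mu_val, valid$exposure)
```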

6.5 Software

Lastly, in this subsection, we mention the software and systems used to implement the models explained in the earlier methodology subsections. For analysing the data and building the models, we use the free, open-source programming language R with the RStudio environment. The software is run on an analysis server with 8 cores.

For fitting the gradient boosting model, we use an R package that provides an implementation of extensions of the AdaBoost algorithm (Freund and Schapire, 1997) and Friedman's gradient boosting machine (Friedman, 2001).

To compute the interaction strength of the variable combinations, we use the package ‘vip’, written by the authors of Greenwell et al. (2018) and implementing algorithm 2, equation (19), and the quantification of interactions as explained in subsection 6.3.

Where possible, we use the packages 'foreach' and 'doParallel' to perform parallel computations in a for loop, for example in the hyperparameter tuning of the gradient boosting method, in calculating the in- and out-of-sample performance of the GBM for each iteration, and in calculating the interaction measures.
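A minimal sketch of such a parallel loop is given below; the grid object and the fitting helper are hypothetical stand-ins for the tuning code, and the 8 workers match the server mentioned above.

```r
# Illustrative parallel evaluation of hyperparameter combinations with
# 'foreach' and 'doParallel'; 'grid' and 'fit_one_combination()' are placeholders.
library(foreach)
library(doParallel)

cl <- makeCluster(8)
registerDoParallel(cl)

results <- foreach(i = seq_len(nrow(grid)), .combine = rbind) %dopar% {
  fit_one_combination(grid[i, ])   # hypothetical helper returning validation MAE etc.
}

stopCluster(cl)
```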

7 Data

We use a large data set with historic claims for the auto insurance portfolio of a Dutch insurance company, and we analyse the collision coverage. Claim experience from the last few years is included in the data set, resulting from a trade-off between more observations and therefore more reliable results on the one hand, and the relevance of more recent observations as well as computational performance on the other hand. Apart from the yearly time variable, the variables can be divided into three classes:

1. individual variables of the policy(holder), e.g. driver age or policy duration;
2. car characteristic variables, e.g. vehicle age or weight;
3. postal code characteristics, e.g. urbanisation level or the province it lies in.

Table 1 shows a small illustration of the data. Notice the different types of variables mentioned above. Furthermore, on the right are the dependent variables. The frequency is the ratio of the number of subclaims to the exposure, i.e. for a given observation i we have

$$\mathrm{frequency}_i = \frac{\mathrm{subclaims}_i}{\mathrm{exposure}_i}. \qquad (23)$$


Table 1: Example of observations in the data.

    policy number   jaar   age   claim-free years   vehicle brand   district type   subclaims   exposure
1   12345678        2019   48    19                 Opel            rural           0           0.248
2   12345678        2019   49    20                 Opel            rural           0           0.752
3   12345679        2019   31    7                  Volkswagen      urban           1           0.416
4   12345679        2019   31    0                  Volkswagen      urban           0           0.099
5   12345679        2019   32    0                  Volvo           urban           0           0.485
6   ...

Figure 3: Histograms of two example variables: (a) driver age and (b) vehicle age. Note that the data shown in this figure contains noise and therefore does not represent the real portfolio, for confidentiality reasons. Moreover, both axes are blanked out.

A new record starts whenever the characteristics of a policy change within a year, for example when a policyholder buys a new car or moves to a new house. In short, in each year every observation corresponds to a set of characteristics with a weight given by the exposure, i.e. the fraction of that year. The exposure-weighted mean claim frequency is equal to 9.0%, meaning that on average, a policyholder files a collision claim approximately once every 11 years.

See figure 3 for the visualisation of two example independent variables. As can be expected, the distribution of the driver age appears similar to a normal distribution. The vehicle age however has a long right tail.


We split the data into a training set, a validation set and a test set. The training set is used to fit the models, while the validation set is used to evaluate a fitted model, so that the model can be adjusted accordingly. Finally, the test set is used to provide an unbiased evaluation of the final model. It is important that the test set has never been used for training or validation, as otherwise the results might be biased.

Note that after splitting the data set, it is important to check for possible coincidental differences between the subsets. Therefore, we compare the means and quantiles of the subsets and look for deviations. We can validate that the randomisation of the splits is successful.
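For completeness, a random split of this kind could be made as in the sketch below; the 60/20/20 proportions and the data frame name are placeholders, since this extract does not report the actual proportions used.

```r
# Illustrative random split into training, validation and test sets.
set.seed(2021)
split <- sample(c("train", "valid", "test"), nrow(policies),
                replace = TRUE, prob = c(0.6, 0.2, 0.2))
train <- policies[split == "train", ]
valid <- policies[split == "valid", ]
test  <- policies[split == "test", ]
```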

8 Results

8.1 GLM

We have a GLM available that is currently used for predicting the pure claim frequency, i.e. the claim frequency without restrictions; with a pure model we want to get insight into risk as precisely as possible. For the final models used for setting the actual premiums, certain restrictions apply: for example, by European law there must be no difference in premium based on gender, and regressors might be added or deleted for commercial reasons. The pure model does not contain any interaction terms yet, and from now on we refer to it as the 'original GLM', which we use as a benchmark model.

The original GLM is estimated using multiple independent variables. Some variables are categorical (dummy) variables, possibly with some levels clustered together, while others are numerical, for which non-linear effects are possibly estimated using splines. Most variables are individual-specific, e.g. the number of claim-free years and the age of the driver. Then there are car-specific variables, e.g. the car brand and the vehicle age. Lastly, there are a few postal-code-specific variables, e.g. the province and the urbanisation level. For confidentiality reasons, we cannot report more details about the characteristics of this model. The model selection is mainly based on the AIC.

The interpretation is rather straightforward for most variables. For example, the categorical variable gender has two levels: 'male' and 'female'. (There are also a few observations with a gender that is neither male nor female; since this group was not significantly different from the male group, these levels are grouped together.) The male group has the most observations and is therefore chosen as the base category. The GLM has then estimated one coefficient, β2, corresponding to females relative to the base category; the coefficient corresponding to the male (and others) category is set to β1 = 0 to prevent perfect multicollinearity. Since we have a log-link function,

$$\ln(\lambda_i) = x_i\beta \;\Longrightarrow\; \lambda_i = \exp(x_i\beta), \qquad (24)$$

we take the exponent of the coefficients to obtain the effect on the frequency. In relative terms, females' claim frequency is exp(β2) = 1.071 times as high as that of males (and others), ceteris paribus.

Numerical variables are interpreted similarly. For some numerical variables, however, cubic splines are also included in the model to capture non-linear effects. For these variables, the interpretation is a little more involved, as multiple coefficients have to be taken into account. One example of such a variable is policyholder age. For convenience, let pi denote the policyholder age and let β4, β5 and β6 be the coefficients corresponding to the polynomials of degree one, two and three, respectively. Then on average, a given policyholder with age pi files

$$\exp\!\left(\beta_4 p_i + \beta_5 p_i^2 + \beta_6 p_i^3\right) \qquad (25)$$

times as many claims compared to the base level (whose contribution is equal to 0 and which corresponds to an unknown age), while holding all remaining factors equal. Curve (25) is shown visually in figure 4. Here we see that young policyholders of around 18 years old have a relatively high risk of collision and therefore pay a relatively high premium. Obviously, young drivers are in general less experienced on the road than older age groups. As experience increases over time, the risk decreases from 17 years onward, until it slightly goes up again with a small local peak at 54 years old. This effect is probably explained by children of the policyholders turning 18 around the parents' age of 50: after the children obtain their driving licence, they start driving their parents' car. An additional cause might be the midlife crisis of the policyholder, although this effect is probably smaller than the previous hypothesis.

When we increase the age even further and the effect just explained has declined again, the risk keeps increasing with age, as expected, because impairment of both sensory and cognitive skills is known to increase with age for elderly people. Note that policyholders with a policyholder age ≥ 88 are clustered together, due to low data frequency as well as high similarity, as mentioned in subsection 6.1. Lastly, observe that the leftmost point in figure 4 represents the factor in case the policyholder age is unknown.

8.2 Gradient tree boosting

Figure 4: Estimated marginal effect (multiplicative factor) of policyholder age on the claim frequency. Note that again some noise is added and the axes are left out for confidentiality reasons.

Example results of the grid search are shown in table 2. Here, we can conclude that minimum observations in leaf = 100 clearly outperforms minimum observations in leaf = 10000.

Table 2: Example results of the grid search.

Learning rate   Minimum observations in leaf   Maximum depth   min(MAE)
0.3             100                            25              0.2157
0.3             100                            15              0.2160
0.3             100                            5               0.2172
0.3             10000                          5               0.2239
0.3             10000                          15              0.2253
0.3             10000                          25              0.2310

Resulting from the grid search, we found the following optimal combination of hyperparameters:

• Tree-specific parameters:

– Minimum observations in leaf = 1000. This number seems to be a good balance between generalisability and detailed relationships.


• Boosting parameters:

– Learning rate = 0.1. In general, a lower learning rate shrinks the impact of each new iteration and therefore helps to prevent over-fitting. However, with a low learning rate the gradient boosting method requires more iterations, as the model takes more time to converge. This effect is shown in figure 5, where the convergence for multiple learning rates is plotted; the trade-off between precision and speed is clearly visible. A value of 0.1 appears to be a good balance between performance and precision.

– Number of trees = 347. With all other hyperparameters set to the values mentioned here, we find that after 347 iterations the out-of-sample prediction error reaches its minimum. See figure 6 for a plot of the cross-validation results. With even more iterations, the in-sample prediction error keeps decreasing, but the validation error slightly increases, which indicates over-fitting after more than 347 iterations.

– Subsample rate = 0.5. Since we have a large data set with many observations, randomly using only half of them in each iteration greatly reduces computation time. Moreover, randomly assigning half of the observations to a particular iteration improves the generalisability of the model, i.e. it increases the out-of-sample predictive performance.

• Miscellaneous:

– Laplacian loss function. The Laplacian loss function appears to outperform the quadratic loss function. The Laplacian loss function measures the absolute error and is therefore less sensitive to outliers.

We train our final GBM with each hyperparameter equal to the value mentioned in this subsection.

8.3 Interactions

Figure 7 shows the ten strongest two-way interactions, as measured by the partial dependence-based variable importance measure. Notice that the interaction strength here is a relative measure, so the absolute values cannot be interpreted separately. For convenience, we normalised the results.


Figure 5: The validation error (MAE) plotted against the number of iterations for learning rates 0.01, 0.05, 0.1 and 0.2.

Figure 6: Training and validation error (MAE) plotted against the number of iterations: (a) training error for each iteration; (b) validation error for each iteration.

Recall that splines are introduced in the model to capture non-linear effects corresponding to a particular variable in the model. By construction, both the level of a spline and the corresponding grouped factor depend on the same underlying variable, for instance the vehicle age. The effect of the vehicle age spline on the claim frequency then depends on the level of the grouped vehicle age factor (young, middle, old), which by definition is an interaction.


Figure 7: The strongest measured two-way interactions, by relative interaction strength. Panel (a) (unfiltered) shows the pairs Deductible*S_curve2, betaalfrequentie*PolicyHolderAge, Deductible*W_curve, VehicleAge*S_curve2, distributielabel*Deductible, Deductible*VehicleAge, Deductible*V_curve, SchadevrijeJaren*S_curve2, PolicyHolderAge*P_curve1 and VehicleAge*V_curve; panel (b) (filtered) shows Deductible*S_curve2, betaalfrequentie*PolicyHolderAge, Deductible*W_curve, VehicleAge*S_curve2, distributielabel*Deductible, Deductible*VehicleAge and Deductible*V_curve.

The filtered results thus give an indication of a two-way interaction between the deductible and the vehicle age. Such a relationship between these variables would imply that the effect of the deductible on the claim frequency depends on the vehicle age. As expected, the sign of the marginal effect with respect to Deductible is negative in the original GLM: a higher deductible results in a lower claim frequency, ceteris paribus. In case of a high deductible, every potential claim for which the damage is worth less than the deductible is not filed. Also in the original GLM, the sign of the marginal effect of the regressor VehicleAge on the claim frequency is negative in general. This can also be expected, as cars decrease in value over time, so the motivation for filing a claim becomes smaller for older cars. These two effects might reinforce each other, resulting in this interaction.

The sign of the interaction between the deductible and the vehicle age is however still unknown, and we need to model a new GLM including this interaction to find out. We might expect a negative sign of the interaction, implying that for a higher vehicle age the effect of the deductible on the claim frequency becomes stronger, i.e. more negative, holding all other factors constant.


8.4 Including interactions in GLM

We proceed by iteratively adding the strongest interactions presented in the previous subsection as new multiplicative factors. The strongest interactions we found are between the deductible and either the grouped vehicle age or the (linear) spline corresponding to the vehicle age.

Since the latter of the two is found to be the strongest, we start by including the multiplicative interaction term Deductible_i · V_curve_i in the GLM. Since Deductible is a factor with 5 levels and V_curve is a numeric variable, the first new candidate model contains 4 additional parameters to be estimated: one for each combination except for a base level whose parameter is set to 0 to prevent perfect multicollinearity. Estimating this model, we observe a difference in AIC of ∆AIC = AIC_new − AIC_original = 0, meaning that the likelihood has increased slightly, although this increase is immediately compensated by the loss in parsimony due to the four additional parameters that need to be estimated. We observe practically equal exposure-weighted deviance residuals, |∆D| = |D_new − D_original| < 0.0001. Lastly, we find that one out of the four levels is statistically significant (with p-value p < 0.05).

Since V_curve only captures the linear part of the spline, we also try to include the interaction with the whole cubic spline. This results in estimating an additional 12 = (5 − 1) · 3 parameters compared to the original model. Estimating this model, we observe a difference in AIC of ∆AIC = AIC_new − AIC_original = −6. We thus find a model that has a better fit based on this relative measure. However, again we find practically equal exposure-weighted deviance residuals, |∆D| = |D_new − D_original| < 0.00001. Lastly, we find that two out of the twelve levels are statistically significant (with p-value p < 0.05). Although the performance measures of this model are better than those of the previous candidate model, which contains an interaction with only the linear part of the spline, we conclude from the lack of improvement in the deviance residuals that this effect is not substantial.

We then also test an interaction between Deductible and the grouped factor VehicleAge. Since the latter variable contains three groups, this interaction adds 8 = (5 − 1)(3 − 1) additional parameters compared to the original GLM. Here we find an AIC increase of ∆AIC = AIC_new − AIC_original = 7, again practically equal exposure-weighted deviance residuals, |∆D| = |D_new − D_original| < 0.0001, and no statistically significant coefficients (with p-value p < 0.05).


Lastly, we test the inclusion in the GLM of a few more interactions following from the gradient boosting model, even though we are not optimistic about the possible improvement, as these interactions should be even weaker than the previous ones. We indeed find that these interactions also do not yield increased performance based on either the AIC or the (exposure-weighted) deviance residuals. For an overview of the GLM performance with the different interactions, see table 4 in appendix E. Notice that there is a relatively small increase in AIC for one model, with an interaction between the vehicle age and (the spline for) the claim-free years (S_curve2). However, as the actual predictive power, as measured by the exposure-weighted deviance residuals, has not improved, we cannot conclude that this interaction term increases predictive power.

9 Conclusion and discussion

In this research, we investigated whether the predictive power of GLMs for risk classification can be improved by using gradient boosting to systematically search for interactions. We trained and tuned a gradient boosting model on the residuals of the original GLM, quantified all possible two-way interactions and found an indication of an interaction between the vehicle age and the deductible. However, after adding this interaction to the GLM framework as a multiplicative interaction, we did not find a significant improvement in our performance measures.

The research question was: "Can a generalised linear risk classification model be improved by including the strongest two-way interactions measured in a tree-based gradient boosting model based on the same set of variables?" Based on our results, the answer is no: including the interactions found by training a gradient boosting model does not improve the predictive performance of the GLM explaining claim frequency in auto insurance. To reinforce this conclusion, however, more research is necessary, which we discuss below.

A first point of discussion is that we allowed interactions of order up to 5 in our gradient boosting model. We expected that by adding the strongest interactions to the GLM as a simple multiplicative interaction, we would increase predictive performance. This appears, however, not to be the case. The question that remains is whether other, more complex forms of interaction between these variables would increase performance. Answering this question requires further research on these particular interactions, as well as on how to include them in the GLM.


Bibliography

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. 2nd International Symposium on Information Theory, 73:1033–1055.

Anderson, D., Feldblum, S., Modlin, C., Schirmacher, D., Schirmacher, E., and Thandi, N. (2007). A Practitioner's Guide to Generalized Linear Models. CAS.

Association of British Insurers (2010). The use of gender in insurance pricing: Analysing the impact of a potential ban on the use of gender as a rating factor. ABI Research Paper, 24.

Bakari, H. K., Adegoke, T. M., and Yahya, A. M. (2016). Application of Newton-Raphson method to non-linear models. Indian Journal of Pure and Applied Mathematics.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. J. (1984). Classification and Regression Trees. Taylor & Francis.

Buchner, F., Wasem, J., and Schillo, S. (2017). Regression trees identify relevant interactions: Can this improve the predictive performance of risk adjustment? Health Economics, 26(1):74–85.

De Jong, P. and Heller, G. Z. (2008). Generalized Linear Models for Insurance Data. International Series on Actuarial Science. Cambridge University Press.

Denuit, M., Maréchal, X., Pitrebois, S., and Walhin, J.-F. (2007). Actuarial Modelling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems. John Wiley & Sons, Ltd.

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.

Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.

Friedman, J. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38:367–378.

Friedman, J. and Popescu, B. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954.


Harrell, F. E. (2001). Regression Modeling Strategies With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Henckaerts, R., Antonio, K., Clijsters, M., and Verbelen, R. (2018). A data driven binning strategy for the construction of insurance tariff classes. Scandinavian Actuarial Journal, 2018(8):681–705.

Henckaerts, R., Côté, M., Antonio, K., and Verbelen, R. (2020). Boosting insights in insurance tariff plans with tree-based machine learning methods. North American Actuarial Journal, 23:1–31.

Jaccard, J. and Turrisi, R. (2003). Interaction Effects in Multiple Regression. SAGE Publications, Inc.

Kaas, R., Goovaerts, M., Dhaene, J., and Denuit, M. (2008). Modern Actuarial Risk Theory: Using R. Springer.

Lorentzen, C. and Mayer, M. (2020). Peeking into the black box: An actuarial case study for interpretable machine learning. SSRN Electronic Journal.

Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3):370–384.

Rajbahadur, G. K., Wang, S., Oliva, G., Kamei, Y., and Hassan, A. E. (2021). The impact of feature importance methods on the interpretation of defect classifiers. IEEE Transactions on Software Engineering, PP.

Stone, C. J., Hansen, M., Kooperberg, C., Truong, Y., Fan, J., Gu, C., Härdle, W., Marron, J. S., Yang, L., Hastie, T., Tibshirani, R., and Huang, J. (1997). Polynomial splines and their tensor products in extended linear modeling: Discussion and rejoinder. Annals of Statistics, 25:1371–1470.

Su, X. and Bai, M. (2020). Stochastic gradient boosting frequency-severity model of insurance claims. PLOS ONE, 15(8):1–24.


Appendices

Appendix A: Newton-Raphson method

Analytical solutions for complex GLMs are usually not easily available, and therefore numerical methods are necessary. The Newton-Raphson method can be used to approximate solutions when no closed form exists, and it can converge quickly.

When we want to find the root of a function $F: \mathbb{R} \to \mathbb{R}$ and we have an initial guess $x_0$, this method makes use of a linear approximation, i.e. the tangent line, to update the initial guess. This process repeats itself until the algorithm converges. If $x_n$ is the current estimate, then the next estimate $x_{n+1}$ is given by

$$x_{n+1} = x_n - \frac{F(x_n)}{F'(x_n)}. \qquad (26)$$

Our goal is to find the root of the system of partial derivatives of the log-likelihood with respect to the parameters $\beta_m$, $m = 0, \dots, k$, given in equation (10). To compute the update in equation (26), we therefore need the derivative of (10), i.e. the second-order partial derivatives of the log-likelihood. Taking the derivative of (10) once more, we get

$$\frac{\partial^2 \ell}{\partial \beta_m^2} = -\sum_{i=1}^{n} x_{im}^2 \exp\Bigl(\sum_{j=0}^{p} \beta_j x_{ij}\Bigr) e_i. \qquad (27)$$

Thus, by substituting equations (10) and (27) into equation (26), we can solve for the parameters $\beta$. We keep iterating until convergence, that is, until the difference $\epsilon$ between two consecutive iterations is very small, for example $|x_n - x_{n-1}| < \epsilon = 0.00001$. For a more detailed treatment of the Newton-Raphson method applied to non-linear models, see Bakari et al. (2016).
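As a small illustration of the update rule in equation (26) (not part of the thesis itself), the sketch below implements one-dimensional Newton-Raphson with the stopping criterion $|x_n - x_{n-1}| < 0.00001$ mentioned above; the root-finding problem is an arbitrary toy example.

```python
# Newton-Raphson sketch for a one-dimensional root-finding problem,
# following the update rule x_{n+1} = x_n - F(x_n) / F'(x_n) of equation (26).
from typing import Callable

def newton_raphson(F: Callable[[float], float],
                   F_prime: Callable[[float], float],
                   x0: float,
                   tol: float = 1e-5,
                   max_iter: int = 100) -> float:
    """Iterate until two consecutive estimates differ by less than `tol`."""
    x = x0
    for _ in range(max_iter):
        x_new = x - F(x) / F_prime(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("Newton-Raphson did not converge")

# toy example: root of F(x) = x^2 - 2, i.e. sqrt(2)
print(newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))  # about 1.41421
```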


Appendix B: Factor categorisation

One automated approach within the GLM framework to categorise a single factor with many levels that have a natural ordering is given in Anderson et al. (2007, p. 111-112). Here, the idea is to replace the single factor with many levels with a series of binary factors and test these binary factors for significance. For example, instead of modelling driver age, the following factors can be created:

• (binary factor 1) age less than 18;
• (binary factor 2) age less than 19;
• . . .
• (binary factor 22) age less than 39;
• (age 40 is the base level in this example)
• (binary factor 23) age less than 41;
• . . .
• (binary factor 82) age less than 100.
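As an illustration of the binary factors listed above (not taken from the thesis), the sketch below constructs the cumulative indicators from a hypothetical driver-age column; the column name is an assumption, while the age range and base level of 40 follow the example.

```python
# Sketch: replace a many-level ordered factor (driver age) by a series of
# cumulative binary factors "age less than k", omitting the base level (age 40).
import pandas as pd

def binary_age_factors(df: pd.DataFrame, age_col: str = "driver_age",
                       low: int = 18, high: int = 100, base: int = 40) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    for k in range(low, high + 1):
        if k == base:
            continue                      # the base level gets no indicator
        out[f"age_lt_{k}"] = (df[age_col] < k).astype(int)
    return out

# each resulting 0/1 column can then be tested for significance in the GLM
df = pd.DataFrame({"driver_age": [17, 25, 40, 63, 99]})
print(binary_age_factors(df).shape)  # (5, 82): binary factors 1 through 82
```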


Appendix C: Splines

A spline is a piece-wise polynomial constrained to match at certain points (knots). Similar to polynomials, splines have an order: the order of a spline is equal to the maximum order of the piece-wise polynomials it consists of. For example, a cubic spline is constructed from cubic polynomials. For our purpose, cubic splines are often sufficient to model most non-linear effects; higher-order splines may fit the data better but are prone to overfitting.

First, consider a spline of order one. Suppose the domain of $x \in \mathbb{R}$ is divided into intervals with three endpoints at $a$, $b$ and $c$, called 'knots'. Then any linear spline function $f$ is given by

$$f(x; a, b, c) = \alpha_0 + \alpha_1 x + \alpha_2 (x-a)_+ + \alpha_3 (x-b)_+ + \alpha_4 (x-c)_+, \qquad (28)$$

where

$$(u)_+ = \begin{cases} u, & \text{for } u > 0, \\ 0, & \text{for } u \le 0. \end{cases}$$

By modelling a slope increment for x in each of the intervals, the function is constrained to join at the knots.

A cubic spline is more flexible and, in contrast to a linear spline, is able to fit curved functions. Cubic splines can be constrained to be smooth at the knots by forcing the first- and second-order derivatives of the function to agree at the knots. Consider again a cubic spline with the three knots $a$, $b$ and $c$. Such a cubic spline $f$ is then given by

$$f(x; a, b, c) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 x^3 + \alpha_4 (x-a)_+^3 + \alpha_5 (x-b)_+^3 + \alpha_6 (x-c)_+^3. \qquad (29)$$


Table 3: Default quantiles for knots.

m    Quantiles
3    0.10   0.50   0.90
4    0.05   0.35   0.65   0.95
5    0.05   0.275  0.50   0.725  0.95
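To make equations (28) and (29) concrete, the sketch below (not the thesis code) builds the truncated-power basis of a cubic spline, placing the knots at the default quantiles of Table 3 for m = 3 knots; the data are simulated purely for illustration.

```python
# Sketch: truncated-power basis of a cubic spline, as in equation (29),
# with knots placed at the default quantiles of Table 3 (m = 3 knots).
import numpy as np

def cubic_spline_basis(x: np.ndarray, knots: np.ndarray) -> np.ndarray:
    """Columns: 1, x, x^2, x^3 and (x - knot)_+^3 for each knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

x = np.random.default_rng(1).normal(size=1000)
knots = np.quantile(x, [0.10, 0.50, 0.90])   # default quantiles for m = 3
B = cubic_spline_basis(x, knots)
print(B.shape)  # (1000, 7): one column per coefficient alpha_0, ..., alpha_6
```

Each column of B can then enter the GLM as an ordinary covariate, so the spline stays within the linear-predictor framework.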


Appendix D: H-statistic

Friedman and Popescu (2008) proposed the H-statistic to test for the presence of an interaction between two specified variables. This measure is based on partial dependence functions. In equation (16), we explained that in the case of no interaction effect between variables $x_j$ and $x_k$, a function $F(x)$ can be decomposed into two parts: one that does not depend on $x_j$ and another that does not depend on $x_k$. Therefore, again assuming no interaction, the partial dependence $F_{jk}$ of $F(x)$ on $(x_j, x_k)$ can be decomposed into the sum of the respective partial dependences on each variable separately:

$$F_{jk}(x_j, x_k) = F_j(x_j) + F_k(x_k). \qquad (30)$$

Then an estimated difference between the left- and right-hand side of equation (30) would indicate the presence of an interaction effect between the variables. Based on this idea, an interaction between the variables $(x_j, x_k)$ can be tested using the statistic

$$H_{jk}^2 = \frac{\sum_{i=1}^{n} \left[ \hat{F}_{jk}(x_{ij}, x_{ik}) - \hat{F}_j(x_{ij}) - \hat{F}_k(x_{ik}) \right]^2}{\sum_{i=1}^{n} \hat{F}_{jk}^2(x_{ij}, x_{ik})}. \qquad (31)$$

The H-statistic thus measures the fraction of the variance of $\hat{F}_{jk}(x_j, x_k)$ that is not captured by $\hat{F}_j(x_j) + \hat{F}_k(x_k)$.
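As an illustration of how equation (31) can be evaluated in practice (this is not the thesis implementation), the sketch below estimates the pairwise H²-statistic for a fitted scikit-learn gradient boosting model by evaluating brute-force partial dependence at the observed data points and centring it, in the spirit of Friedman and Popescu (2008); the simulated data and model are assumptions.

```python
# Sketch of the H-statistic in equation (31): fraction of the variance of the
# two-way partial dependence F_jk not captured by F_j + F_k. Partial dependence
# is evaluated by brute force at the observed data points and centred.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def centred_partial_dependence(model, X: np.ndarray, features: list) -> np.ndarray:
    """Partial dependence of `model` on `features`, evaluated at each row's own values."""
    n = X.shape[0]
    pd_vals = np.empty(n)
    for i in range(n):
        X_mod = X.copy()
        X_mod[:, features] = X[i, features]       # fix the chosen features at row i's values
        pd_vals[i] = model.predict(X_mod).mean()  # average prediction over the data
    return pd_vals - pd_vals.mean()               # centre the partial dependence

def h_squared(model, X: np.ndarray, j: int, k: int) -> float:
    """Pairwise H^2 statistic of equation (31) for features j and k."""
    f_jk = centred_partial_dependence(model, X, [j, k])
    f_j = centred_partial_dependence(model, X, [j])
    f_k = centred_partial_dependence(model, X, [k])
    return float(np.sum((f_jk - f_j - f_k) ** 2) / np.sum(f_jk ** 2))

# toy example: features 0 and 1 interact, feature 2 enters additively
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] * X[:, 1] + X[:, 2] + 0.1 * rng.normal(size=300)
model = GradientBoostingRegressor(random_state=0).fit(X, y)
print(h_squared(model, X, 0, 1))  # interacting pair
print(h_squared(model, X, 0, 2))  # additive pair
```

For these simulated data, the H² value for the interacting pair should be clearly larger than the value for the additive pair.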

Appendix E: Results GLM with interaction

In Table 4 below, we present an overview of each of the interactions tested for improvement, based on both of our performance measures, when included in the GLM as a multiplicative interaction term. The difference in AIC is defined as ΔAIC = AIC_new − AIC_original. Hence, a difference ΔAIC < 0 implies a reduction in the AIC and therefore an improvement. The difference in exposure weighted deviance residuals is defined as |D_new − D_original|, where D = D(y, ȳ) is defined in equation (20). Lastly, the rightmost column shows the number of significant parameters that are part of the interaction term (p-value p < 0.05).

Table 4: Performance measures for GLMs with different included interactions.
