
Universiteit van Amsterdam

Master Thesis Actuarial Science and Mathematical Finance:

Quantitative Risk Management Track

Individual Reserve Forecasting in

Non-Life Insurance using Machine Learning

Author: Jamie Kane (Student Number: 11751053)

Supervisors: Lu Yang, Katrien Antonio (University of Amsterdam); Reinout Kool, Yoeri Arnoldus (Deloitte)

August 15th 2018


Statement of Originality

This document is written by Student Jamie Kane who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


I would like to offer a word of thanks to everyone at Deloitte for their support throughout this thesis. More specifically I would like to thank Reinout Kool and Yoeri Arnoldus for their guidance and critical analysis of my ideas.

Within UvA I would like to thank my supervisors Katrien Antonio and Lu Yang for their support throughout the writing of this thesis.

Thank you very much! Jamie Kane


Within non-life insurance a sum of money is required to be held in reserve for a portfolio to cover claims which have not been paid out. This is composed of money set aside for both claims which are Reported But Not Settled (RBNS claims), and money set aside for an estimation of the claims which have been Incurred But Not Reported (IBNR claims). The total annual IBNR claims are estimated based on the size of the portfolio, development rates of IBNR claims in previous years and the amount of reported claims in the current year. Adding the IBNR estimate to the RBNS claims and settled claims in each year, we have a total expected amount of claims in each year, known as the ultimate claim amount.

The purpose of this research is to use claims data, as in traditional insurance pricing, to extract the expected amount each policyholder will claim in each month. By forecasting the expected claims of each policyholder in each month, we can sum these values to produce an estimation of the total claims in a portfolio on a monthly basis. We shall denote this expected value of claims in each month as NINR: Not Incurred Not Reported claims for each month.

With NINR estimates we propose a new framework for estimating the total claim amount based on information on each of the policyholders within a portfolio. We outline how, by modelling for each calendar month the probability that each policyholder will make a claim and the expected size of such a claim, we obtain the total expected value of claims incurred each month. This allows for further insight which will be useful in determining empirically, in a data driven manner, the total amount of liabilities in each month for an individual and for a portfolio.

At the end of each month, the difference between the observed claim amount and the predicted amount from the beginning of the month would be an estimate of IBNR for this month. This is, conceptually, a new way for an insurer to understand what they might expect in future claim amounts and how this might dictate expected reserves.

We shall investigate modeling this probability and expected size using Classification And Regression Tree (CART) approaches, implemented through gradient boosting machine learning ensemble techniques. This will allow us to model the impact and interactions of different variables in predictive analysis without making underlying parametric assumptions about the data. In doing so we capture empirically which policyholder characteristics are the largest drivers in determining the ultimate claim amount for each individual, and we can use this to make predictions of the amount of reserve to be held for a portfolio.

Keywords: Gradient Boosting, Machine Learning, Reserving, Insurance, XGBoost, Non-Life Insurance, Motor Insurance, Vehicle Insurance, NINR, Not Incurred Not Reported, CART


Contents

1 Introduction
  1.1 Motivation for the new framework: Portfolio composition
  1.2 Motivation for the new framework: Claim variation by month
2 Data
  2.1 Choice of the data
    2.1.1 Attritional claims
  2.2 Data framework: Converting annual data to monthly
  2.3 How this data is used for the response variable in each model
    2.3.1 How the data is used in a model for estimating the probability of making a claim
    2.3.2 How the data is used in a model for estimating the ultimate claim amount
  2.4 Explanatory variables
    2.4.1 Missing values and data conversion
3 Methodology
  3.1 The necessity for two models
  3.2 Model choices
    3.2.1 A model for probability
    3.2.2 A model for claim size
    3.2.3 Combining the two models
  3.3 Machine Learning
    3.3.1 Decision Functions and Trees
    3.3.2 CART: Classification And Regression Trees
    3.3.3 Gradient Boosting
  3.4 XGBoost
    3.4.1 Feature importance
    3.4.2 Parameter tuning
    3.4.3 Cross validation
    3.4.4 Missing data
  3.5 Testing
    3.5.1 Back-testing
4 Analysis
  4.1 Claim size predictions
  4.2 Claim probability predictions
    4.2.1 Optimal parameter tuning
  4.3 NINR: Claim expectation predictions
    4.3.1 Previous years
    4.3.2 Forecasting total reserve
    4.3.3 Erroneous large predictions (probability over 0.5)
  4.4 Insights from the data based on different sets of features
  5.1 Limitations and improvements to the model
  5.2 Further research

Bibliography

Appendices

A Appendix


1 Introduction

In non-life insurance, companies need to hold reserves in the form of capital on their books in order to fulfill future liabilities. If these are underestimated, then the capital of the company must be used to cover the excess losses that month; if they are overestimated, then there is free capital available that could otherwise have been invested in something profitable.

The conventional techniques used to estimate these reserves are the Chain Ladder [1], Bornhuetter-Ferguson [2] or other run-off triangle methods. These are intuitive estimations of the amount of reserve necessary to hold each year, and these estimates offer a simple and reasonably useful prediction based on previous total annual claim values. These methods have been sufficiently effective in practice and are currently used as the standard procedure in industry. This process has a rather large shortcoming in that it does not utilise the full power of the data available.

Traditionally, reserves in insurance are estimated in this way, on an annual (or semi-annual) basis using the end of year (or 6 month) aggregated values. This means that much information and insight about the data and its structure is lost when the data is aggregated into year end totals.

In reserving, IBNR is an acronym for "Incurred But Not Reported" (the accident has happened but the insurance company has not been informed of it yet). RBNS is an acronym for "Reported But Not Settled" (the insurance company is aware of the accident but has not yet completed payments for it).

In triangle and aggregate reserving we look at the rates at which IBNR claims developed into reported claims in previous years (known as loss development factors). We calculate, using the number of active policies in the current portfolio and the loss development rates in the portfolio from previous years, an estimate for the IBNR claims in the most recent year.

We also produce estimates for the size of the liabilities of those claims which have been reported but not settled (RBNS) on an individual case basis; these are expert predictions produced for each reported claim. This is known as the claims reserve. The amount of money to be held in reserve for the outstanding claims reserve is the expected amount of money that will cover both the RBNS claims and the IBNR claims.

Outstanding Claims Reserve = RBNS + IBNR

We can then add this outstanding claims reserve to the amount of settled claims in each year to produce an estimate of the ultimate claim amount in each year.

Ultimate Claim Amount = Settled Claims + Outstanding Claims Reserve

This loss forecast is simply a forecast for the next annual (or semi-annual) required sum. The premise of this research is to more directly and optimally predict the ultimate claim amount for the portfolio, by predicting the expected ultimate claim amount per individual policyholder and aggregating these individual predictions to find the portfolio expected claim amount. This is a model predicting future claims and as such will not relate to past exposure. When the model is applied over a long duration, previous monthly estimates can be used as a model for past exposure.
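As a purely illustrative example of these two relations (the figures below are hypothetical and not taken from the portfolio studied in this thesis), the calculation is simply:

```python
# Hypothetical figures for one accident year (not from the thesis data).
settled_claims = 800_000  # payments already made
rbns = 150_000            # case reserves for reported but not settled claims
ibnr = 50_000             # estimate for incurred but not reported claims

outstanding_claims_reserve = rbns + ibnr                             # 200,000
ultimate_claim_amount = settled_claims + outstanding_claims_reserve  # 1,000,000
print(outstanding_claims_reserve, ultimate_claim_amount)
```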

We forecast the expected amount of ultimate loss that each claimant will incur attributable to each month, and not how much will be paid out or reported in each month; conceptually this should be the same valuation as is used for pricing within insurance. We shall introduce notation for the forecast of the ultimate claims in each month as a new term, "NINR" or Not Incurred, Not Reported claims. The NINR claims can be calculated for each month for each individual policyholder. Taking a sum of all individual policyholder NINR claims in each month, we have an NINR claims estimate for the portfolio for each month. At any point in time, looking backwards at previous NINR estimates, the difference between the NINR estimate and the settled sum is the estimate for the reserve. When we refer to modelling claims throughout this research, this is in the context of modelling the claim occurrence rather than reporting or settlement.

For each individual policyholder the prediction of their ultimate claim amount is not expected to be inherently useful, due to the unlikely nature of a claim occurring. However, by treating each policyholder as an individual risk, under the assumption that a policyholder can make at most one claim per month, we say that for sufficiently many policyholders the sum of these individual predictions will equal the total expected claim. This is reinforced by arguments proposed in Feller [3], which suggest that, as Khintchine's (weak) Law of Large Numbers states, the sample mean of an expanding pool of independent risks will converge towards the true mean values.

Peter England and Richard Verrall question "whether it would not be better to examine individual claims rather than use aggregated data" and postulate that "Ideally, methods need to be found which help provide better estimates of aggregate case reserves. In this respect, models based on individual claims, rather than data aggregated into triangles, are likely to be of benefit." [4]. The same chapter also notes, more specifically, that with the advent of new computers and the increasing acceptance of simulation techniques it is possible to devise a predictive distribution of reserve estimates using simulation methods. In this respect this research will be applying some of these new techniques in order to realise this concept.

For any fixed time period the NINR estimate should be an equivalent estimation of the ultimate claim amount for accidents occurring in the same time period. We use one month in the context of this research but this model could be extended to any fixed time period, where NINR can be calculated before the period begins.

1.1 Motivation for the new framework: Portfolio composition

For each distinct policyholder in a pool of customers in an insurance portfolio we are able to attribute a set of features. These would include such characteristics as the lifestyle of the driver, the driving mannerisms of the driver, characteristics of the vehicle and the claims history of the driver. We can also list the month of the year we are observing as a feature and thus each policyholder will have up to twelve unique NINR predictions given that all other features remain the same in each year. Each policyholder will have m features.


A portfolio composed of drivers producing larger expected claim amounts will produce a higher ultimate claim amount than a portfolio of the same size with drivers with low NINR predictions.

Khintchine's Weak Law of Large Numbers states that the sample mean of an expanding pool of independent risks with a common mean and the same probability distribution, $(X_1 + \dots + X_n)/n$, converges in probability to the true (population) mean $\mu$, or more formally,

$$\lim_{n \to \infty} \frac{X_1 + \dots + X_n}{n} = \mu \qquad (1.1)$$

where $n$ is the number of risks $X_i$, which are independently distributed with a common mean and the same probability distribution.

We can assign $m$ different features to $n$ different data points. If we refer to $i$ as each data point and $j$ as each feature, including the month of the year and the policyholder characteristics, then we can define for each possible distinct set of features an observed loss $X_{ij}$ at each data point $i$. This observed loss may not be known in full until several years after occurrence, as many of these claims will not be reported and as such are still IBNR claims.

For each loss $X_{\cdot j}$ (where all $j$ terms are identical) we can model a mean expectation $\mu_j$ (as in equation 1.1), which would thus be the estimate for NINR claims for each policyholder with this set of features.

Weakening the conditions under which the Weak Law of Large Numbers applies, we remove the requirement of an identical distribution and common mean. We now build on this principle to explain how a model would work for a portfolio of $n$ different policyholders (some of which may have identical features).

A sufficient condition for

$$\lim_{n \to \infty} P\left[\left|\frac{(X_1 - \mu_1) + \dots + (X_n - \mu_n)}{n}\right| > \epsilon\right] = 0, \quad \text{for any } \epsilon > 0 \qquad (1.2)$$

with finite expectation for each risk, $E[X_{ij}] = \mu_j$, and finite variance $\sigma_j^2$, is that the variance of the average of the losses $(X_1 + \dots + X_n)/n$ goes to zero. This variance goes to zero if $(1/n^2)(\sigma_1^2 + \dots + \sigma_n^2)$ also goes to 0 as $n$ tends to infinity. Note that $(\sigma_1^2 + \dots + \sigma_n^2)$ is the variance of $(X_1 + \dots + X_n)$, or equivalently of $((X_1 - \mu_1) + \dots + (X_n - \mu_n))$, and hence $(1/n^2)(\sigma_1^2 + \dots + \sigma_n^2)$ is the variance of $((X_1 - \mu_1) + \dots + (X_n - \mu_n))/n$, which goes to 0 as $n$ tends to infinity for finite variances of the $X_j$. Applying the Chebyshev inequality we can extrapolate from this and apply it to this general case.
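To illustrate this argument numerically, the short simulation below (a sketch using made-up normally distributed risks, not the claims data of this thesis) shows that the deviation of the portfolio average from the average of the true means shrinks as the number of independent, non-identically distributed risks grows:

```python
import numpy as np

rng = np.random.default_rng(42)

for n in (100, 10_000, 1_000_000):
    # Heterogeneous risks: each policyholder has its own mean and variance.
    mu = rng.uniform(10, 50, size=n)      # true monthly claim expectations
    sigma = rng.uniform(5, 25, size=n)    # individual standard deviations
    losses = rng.normal(mu, sigma)        # one realised loss per policyholder
    deviation = abs(losses.mean() - mu.mean())
    print(f"n = {n:>9}: |portfolio average - average of means| = {deviation:.4f}")
```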

This is the crux of my hypothesis. When the total claim amount is built by summing each individual case expectation, rather than estimated from the number of claimants and previous years' totals, and the total claims are then forecast implicitly, I postulate that we should be able to capture trends in changes of portfolio composition. As such, NINR is developed as a better foundation from which to estimate the total reserve.

Some examples of how this may be done, although out of scope for this research, are estimations performed on past NINR estimates. One such estimation would be to subtract the reported claims to date, with the remaining amount as the IBNR for this period. Another implementation would be to multiply the NINR by the proportion of it expected to be reported in each subsequent month, either on a policy or on a portfolio level. For example, if we have a portfolio NINR estimate for February of 100 and we expect 60% to be reported in March and 30% to be reported in April, then we would have an expected IBNR of 40 and 10 in March and April respectively for claims pertaining to accidents in February. These are not researched or proven methods for estimating the outstanding reserve, however that is not the purpose of this research. This offers us a new conceptual approach to estimating reserve considering individual policyholders and forecasting forward.
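The February example above can be written out as a small calculation; this is only a sketch of the out-of-scope idea, with the reporting pattern taken from the example rather than estimated from data:

```python
# Portfolio NINR for accidents occurring in February (from the example above).
ninr_february = 100.0

# Assumed proportion of these claims reported in each subsequent month.
reporting_pattern = {"March": 0.60, "April": 0.30}

remaining_ibnr = ninr_february
for month, proportion in reporting_pattern.items():
    remaining_ibnr -= ninr_february * proportion
    print(f"Expected IBNR at the end of {month}: {remaining_ibnr:.0f}")
# Prints 40 for March and 10 for April, matching the example above.
```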

By developing a model for estimating the expected ultimate claim amount on an individual level, by the linearity of expectation this should be just as valid an estimation for the total portfolio claims, because the expected value of a sum of random variables is equal to the sum of the individual expected values. Modelling in this way, we should be better able to capture the effects of a change in portfolio composition by not assuming claims are made homogeneously throughout the year by every type of policyholder. This is not a direct estimation that can be used for reserves but rather an estimation of total claims, with the difference between this total expectation and the total settlements being used as an estimate for the outstanding reserve.

For a more detailed example, within motor insurance it is common knowledge and implicitly assumed that young drivers will have on average a higher claim amount than older drivers. In an aggregate reserving model we may have a portfolio of a constant number of drivers, with the majority aged 50 to 60 and a minority aged 18 to 20, claiming at a constant rate. If the relative proportions of each age group invert at some point but the total number in the portfolio remains constant, we would expect to see a much higher claim total due to more frequent and more expensive claims from the younger drivers. The classical aggregate models, which estimate previously incurred and unreported claims, could not account for such a change in composition and would forecast an unchanged rate, which could be captured in an NINR model.

1.2 Motivation for the new framework: Claim variation by month

In a conventional reserve model the amount of capital required to cover losses is derived from annual estimates of outstanding claims. As a result we assume that these outstanding estimates pertaining to claims in each month deplete throughout the year by way of a constant rate of settlement. By forecasting total claim amounts within specific months we hope to be able to account for seasonality in total claim amounts between months by an NINR estimate on the portfolio level for each month separately.

We have an example of this seasonal trend in a paper by Crevecoeur, Antonio and Verbelen on reporting delay dynamics [5], where the moving average of the claim occurrence follows a cyclical trend annually. A further example displaying this trend is found in the UK insurance market, presented at a seminar at the actuarial society entitled "Seasonality and the issues of the season". A study [6] refers to an analysis of weather against accident rates in the Netherlands, France and Greece. Again this cyclical trend was spotted, although in this study it was used in weather analysis. Rainfall is analysed (and positively correlated), but the most prominent feature of the study regarding season is temperature. This is positively correlated with accidents on the roads, with a 1°C increase raising injury accidents by 1% on motorways and 2% to 3% on rural roads. Seasonal risk of fatal accidents is also presented in a time series analysis of road safety across Europe [7]. Again this study shows a seasonal trend in accident rates throughout the year.

Rather than looking at the cash flow from the date at which a claim payment is processed, or the date at which a claim is reported, we look at the date at which the event resulting in the claim happened (usually simply referred to as the occurrence or accident date). We thus estimate a probability that, given a policy is active during this month, the policyholder has an accident in this month (assuming that at most one accident can occur per policyholder per month). When we refer to the time of a claim, or when a policyholder makes a claim, in the context of this research we are referencing the time at which the accident occurred and not the reporting or settlement time. We should be able to model the expected amount that each policyholder is likely to claim by combining the probability of making such a claim with the expected ultimate severity of such a claim. Combining these figures, we define the expected amount for each policyholder as the individual NINR, the best forecast for an individual's Not Incurred, Not Reported claims.


NINR is a forecast of the claims that will appear for a specified policyholder in each month, and hence can be used to describe the total claims in the next year, or in the residual lifetime of their contract, by summing the NINR in each month. The NINR could also be aggregated for one particular month over every policyholder with an active policy during that month. This would allow us to estimate the total expected claims, or the "Portfolio NINR", in that month. The Portfolio NINR summed across all 12 months should be our estimate of the total annual ultimate claims for this year.

Figure 1.1 shows a simplified example of how the predictions of NINR for three individuals in each month can be aggregated. Each box represents the size of the NINR in each month. The left column represents the claims total for all policies in all months.

Figure 1.1: Illustration of aggregating monthly NINRs

In practice, although out of scope for this research, we could use the portfolio NINR as a basis from which to model estimated developments of:

1) The expected time to report these claims as a prediction for IBNR development from the NINR predictions based on features of each policyholder in each month.

2) The expected time to claim settlement, as an estimation of how much reserve the company should hold at any one time, by modelling the expected time to settlement based on the same characteristics in each month.

3) How changes in portfolio composition (with regard to characteristics) affect the expected ultimate claims.


If we were to predict the time until change of state as such, then for claims from any given policyholder we could estimate an amount of IBNR in each future month and an amount expected to be settled in each future month for different policyholder characteristics. This would allow us to forecast each of these metrics with a degree of precision attributable to individuals rather than aggregate sums. There is no precise format for doing this within this research but this would involve predicting the proportion of each NINR estimate expected to transition to each other state (IBNR, RBNS or Settlement). This could in theory be done using Markov state models for example. There is further research (uncompleted at the time of this paper) modelling the state transition using Poisson processes within Deloitte.

The research will focus on building the model to find NINR for different policyholders. In particular, we shall focus on attritional claims; these are claims of value below a certain threshold. They make up the majority of claims and are inherently predictable in a portfolio of mean claim expectations. The rationale behind modelling these claims specifically will be explained in the data section of this paper.

Structurally, in this paper we will begin by introducing an overview of the data we have available (Chapter 2), including how we have chosen a particular subset (Section 2.1.1) and what data (Section 2.3) we shall use to model, test and validate in the quantitative part of the research. We shall then describe the methodology (Chapter 3) and outline the techniques and mathematics behind the research. After this we shall analyze the results of the models and draw insight from them (Chapter 4). Finally we shall draw some conclusions and discuss points for further research (Chapter 5).


2 Data

Within this section we outline our motivation for the choice of data and determine which sections of the portfolio are particularly useful; this is how we shall focus our research and investigate the contents of the data. For the purposes of this research we must first discuss the format of the data as it is available, how we transform the data from representing annual to approximating monthly information, how subsets of this large monthly dataset should be transformed into two different data sets for producing different models, and finally we outline the contents of each of the policyholder details that we shall use as predictive covariates later in our modeling.

2.1 Choice of the data

Within this research we consider data from a large Dutch insurer, looking specifically at WA (Wettelijke Aansprakelijkheid) claims. WA coverage is the compulsory legal minimum required for car insurance in the Netherlands. This insurer has two databases, one of which collects payments, refunds and expert costs with the date of each cash flow; this is named the claims or "Schadebestand" dataset. The policy dataset or "Premiebestand" is much larger and contains information on start and end dates of each policy and also a large number of features relating to each policyholder, including the ultimate claim amount (labeled in the Premiebestand as SCH_LAST_WA).

As this research is centered around predicting claim amounts based on policyholder characteristics, vehicle characteristics and details of the relevant month of the claim for each individual, the principal approach requires analysis of both the total claim amount and the characteristics. These datasets are not centrally unified, thus this will require merging of the two datasets to understand which policies incur losses and how much this is likely to cost when they do. Both datasets contain information on the payments related to each accident and both include a reference to the date of the accident. This is crucial for our predictions as there are inconsistencies with how expert costs are assigned; these must be unified in order to make the response variable consistent.

Filtering out everything in the dataset not relevant to coverage and costs for WA claims, and partitioning the features attributable to each individual policy on a monthly level, we have a dataset of 26,740,951 rows of data points. How we partition the data into this monthly framework is explained in detail in section 2.2. There is one row for each month in which there is an accident, and for each month in which there is no accident but an active insurance policy for the individual at that time, with the covariates in each month comprised of policy details and claims information. Of this data, 0.298% or 79,820 data points relate to months containing details of accidents, from the beginning of 2011 to the end of 2017. This means that each month with an accident can be marked with the total claim cost and each month with no claim is marked as such. This is displayed visually in table 2.3.

For most of the claims, the net payments after refunds plus expert costs from the Schadebestand match the ultimate claim amount variable in the Premiebestand. Over 80% of these net payments match the ultimate claim amount in the Premiebestand exactly. Around 1.3% of the net payment details were less than the ultimate amount; this is attributable to claims only partially settled (partially RBNS claims). 175 of these payment details displayed net values greater than the recorded ultimate claim amount; these are claims for which additional costs were later added, resulting in a larger total claim amount whose final value had not been included in the Premiebestand. The remaining ultimate claim amounts have not had the expert costs included; these have been added to the net values, replacing the ultimate claim amount, as this is representative of the incurred payout for the insurance company. Correcting for these data inconsistencies, when the net payments in the Schadebestand exceed those of the Premiebestand, this new net value is taken as the ultimate value in the Premiebestand. Net payments that fall below the ultimate claim value registered in the Premiebestand will be replaced by this Premiebestand value, as these observations include RBNS claims and this is our best estimate of the ultimate claim amount from expert judgment. There is a reporting limit of 3 years for accidents, with the vast majority of these claims settled within 1 year; thus any claims pertaining to accidents before this cannot be filed and we have a full account of all data in our modelling set. As such our data up until 2015 is considered fully reported. Modifying the data in this way gives us a consistent valuation of the best known ultimate claim amount for each individual to use as our response variable.
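A minimal pandas sketch of this reconciliation, assuming a hypothetical merged table with columns net_payment (Schadebestand net cash flows including expert costs) and SCH_LAST_WA (Premiebestand ultimate); the thesis does not document its exact implementation:

```python
import pandas as pd

def reconcile_ultimate(claims: pd.DataFrame) -> pd.DataFrame:
    """Build a consistent ultimate claim amount as the response variable.

    Assumed columns:
      net_payment  - payments minus refunds, plus expert costs (Schadebestand)
      SCH_LAST_WA  - ultimate claim amount recorded in the Premiebestand
    """
    out = claims.copy()
    # Where the net payments exceed the recorded ultimate (e.g. costs added
    # later), take the net value as the new ultimate.
    higher = out["net_payment"] > out["SCH_LAST_WA"]
    out.loc[higher, "SCH_LAST_WA"] = out.loc[higher, "net_payment"]
    # Where the net payments fall below the recorded ultimate (partially
    # settled RBNS claims), the Premiebestand value is kept as the best
    # expert estimate, so no change is needed.
    return out
```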

Now we shall look at the total volume of data in terms of aggregated claim amount each year and outline which data we shall use and to what purpose. Figure 2.1 is a stacked histogram displaying the ultimate claim amount for claims within each month in a different colour, stacked across each year. There are very few claim values in 2017 and for this reason we have decided to exclude them from the study; the reason for the lower claim amount each year is that the portfolio size has decreased in each year.

Figure 2.1: Total claim value each year

2.1.1 Attritional claims

In predicting claims using machine learning techniques there are two sections, each requiring a different approach: the rare large claims and the frequent smaller claims, which we shall define as "attritional claims".


A study [8] within Deloitte has documented how machine learning can be used to predict claims with a value above a certain threshold, in the context of rare event prediction for high severity claims using extreme value theory. Similar to this study, these predictions are based on policyholder characteristics. These rare events are considered as each of the claims above a certain value threshold (of claim amount). The premise of this thesis is to predict the attritional claims, those that are below this threshold. The distribution of claim sizes has a large right tail, where the vast majority of claims are clustered around lower claim values.

Modelling distributions from extreme value theory is possible if the excess values over different thresholds fit a Generalized Pareto Distribution (GPD) [9]. More specifically, McNeil [10] shows that a GPD is a good approximation in the tail of insurance losses. Cebrian et al. [11] discuss EVT models for modelling data from an insurance portfolio, focusing on large claims using excess-over-threshold models. For a GPD fitted to the claim sizes, we can observe the minimum threshold value above which the fitted shape parameter of the GPD is constant. This is the requirement to fit an extreme value distribution to the excess loss beyond this threshold, thus we shall use everything below this threshold as our definition for attritional claims. Observing all of the observed ultimate claim amounts, we can fit a GPD to our portfolio and produce shape and scale parameters for GPD distributions above different thresholds, as shown in figure 2.2.

Figure 2.2: Scale and Shape for GPD thresholds
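A sketch of how such a threshold stability analysis can be produced with scipy; the claim sizes below are simulated from a heavy-tailed distribution purely for illustration, not drawn from the insurer's portfolio:

```python
import numpy as np
from scipy.stats import genpareto, lomax

rng = np.random.default_rng(0)
claims = lomax.rvs(c=2.5, scale=2000, size=50_000, random_state=rng)  # simulated claim sizes

# Fit a GPD to the excesses over a range of candidate thresholds and inspect
# the stability of the fitted shape parameter.
for threshold in (2000, 4000, 6000, 9000, 12000):
    excesses = claims[claims > threshold] - threshold
    shape, _, scale = genpareto.fit(excesses, floc=0)
    print(f"threshold = {threshold:>6}: shape = {shape:.3f}, scale = {scale:.1f}, n = {len(excesses)}")
```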

Based on figure 2.2 we choose €9000 as the threshold above which to fit extremal values. In order to fit attritional claims we shall remove the 2,623 observations that exceed this value from our dataset, as these would be more suited to a high severity claim model. Figure 2.3 shows the distribution of all claims and the point from which we truncate the dataset, removing claims above the €9000 threshold.

Figure 2.3: Claim amounts

This leaves us with 77,197 observations which are below this threshold. Displayed in figure 2.4 is a histogram of the ultimate claim amounts when only looking at claims below the €9000 threshold value. In the truncated dataset the right tail is less pronounced and, although the distribution is not symmetrical and shows much kurtosis, there should be sufficient data to predict each of the underlying claim sizes.

In table 2.1 we look at the breakdown of the number of claims in each year and the mean ultimate claim amount before and after excluding those values which exceed our threshold by the principles of extreme value theory. We also display a naive probability, which is simply the percentage of months containing policyholder information in which there is a claim.

Year | Policy months | Number of claims | Mean claim (€) | Number of claims (<€9000) | Mean claim (<€9000) (€) | Portfolio probability (%)
2016 | 3838996 | 11494 | 2210.67 | 10863 | 1461.49 | 0.282166213
2015 | 3798915 | 11374 | 1881.33 | 11029 | 1476.78 | 0.289479281
2014 | 3732449 | 11175 | 1835.98 | 10861 | 1468.61 | 0.290144243
2013 | 4069719 | 12183 | 1876.78 | 11840 | 1501.46 | 0.290080079
2012 | 4874860 | 14595 | 1849.57 | 14211 | 1507.75 | 0.290668693
2011 | 5574244 | 16688 | 1815.33 | 16260 | 1493.89 | 0.290850313

Table 2.1: Mean claim sizes and naive probability

From the breakdown of our claims data in each year we see that the 2016 data has a higher mean claim value when including those claims above €9000, and a comparably similar mean claim amount to the portfolio mean for attritional claims. We have also defined here a naive probability, the "Portfolio probability": this is simply the ratio of the number of months in which there was a claim to the total number of policy months. The mean ultimate claim size is €1918.87 for the full dataset, much larger than the mean claim size of €1483.11 found when only considering the claims below €9000.

2.2 Data framework: Converting annual data to monthly

The granularity of the data available to us can only offer the features of each policyholder on an annual basis, as we only have policy start and end dates with a typical duration of one year. By duplicating each annual row by the number of months a policyholder has an active policy in any calendar year, we can assign a 1 to each month as an indicator variable. Essentially this translates table 2.2 into table 2.3. Each of the time intervals is kept relatively homogeneous. This keeps the reporting concise and simple while giving a good mechanism to bin to a specific time frame a set of covariates composing the risk profile of each policyholder.

For example, a customer holds a contract from October of 2012 to October of 2013; they renew their contract in October of 2013, which is a new policy. For each policy in each year we are provided data with start dates, end dates and some policy information, therefore we create a row for each month representing the contract across both 2012 and 2013. This structure for the data gives us a full set of features, such as the age of the driver and the vehicle details, on a month by month, easy to update basis. If the policyholder changed policy every month, then the policy information sets are guaranteed to update for a change of policyholder characteristic or for a change in years, with the exception being missing data for time dependent covariates such as the age of the driver. An age change can occur at any point during which the policy is active.
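A minimal pandas sketch of this expansion, under the assumption of hypothetical column names start_date and end_date on the annual policy records (the thesis does not show its own code):

```python
import pandas as pd

def expand_to_months(policies: pd.DataFrame) -> pd.DataFrame:
    """Duplicate each annual policy row once per calendar month in which the
    policy is active and add month indicator columns, as in table 2.3."""
    rows = []
    for _, policy in policies.iterrows():
        for period in pd.period_range(policy["start_date"], policy["end_date"], freq="M"):
            row = policy.to_dict()
            row["year"], row["month"] = period.year, period.month
            rows.append(row)
    monthly = pd.DataFrame(rows)
    # One indicator column per month present in the data.
    indicators = pd.get_dummies(monthly["month"], prefix="month")
    return pd.concat([monthly, indicators], axis=1)

# Example: a policy running from 21/10/2012 to 20/10/2013 becomes 13 monthly rows.
example = pd.DataFrame([{"policy_id": 1, "start_date": "2012-10-21", "end_date": "2013-10-20"}])
print(expand_to_months(example)[["policy_id", "year", "month"]])
```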

The structure of the data available from the Premiebestand policyholder information is included in table 2.2. In the following case there are three policies from one policyholder, running from October to October in 2011 to 2012, 2012 to 2013, and 2013 to 2014 respectively. We only display here the data for the years 2012 and 2013 as an example of what we have available; in reality there will be more data provided in the Premiebestand, with an information set for the end of 2011 and one for the start of 2014.

The column in this table “Explanatory variables” is used as a placeholder for each of the predictive covariates in the tables 2.4 and 2.5 other than SCH_LAST_WA and Year (JAAR). These features are both included separately in table 2.2 as the full set of features at the monthly level is too large to display in one table with clarity in this paper. These two features are simply included as they are relevant in illustrating the structure of the data with respect to time and claim amount.

SCH_LAST_WA | Explanatory variables | Year | Start date | End date
0 | Policy information 1 in 2012 | 2012 | 21/10/2011 | 20/10/2012
0 | Policy information 2 in 2012 | 2012 | 21/10/2012 | 20/10/2013
2000 | Policy information 1 in 2013 | 2013 | 21/10/2012 | 20/10/2013
0 | Policy information 2 in 2013 | 2013 | 21/10/2013 | 20/10/2014

Table 2.2: Policy data available on annual level

Splitting this policy information into months will allow us to capture the impact that each month will have on the expected claims relative to other months. In this case the driver makes a claim for an accident occurring in April of 2013. Observing the policy data, the row for April will represent the ultimate claim amount for this claim.

This framework allows us to attribute each feature to a specific month rather effectively, and from this we can produce some good estimates of the claim probability and severity as a prediction in each month. The monthly policyholder information which we shall use for modelling is therefore displayed in table 2.3. Similarly, we have removed all of the explanatory variables other than the year and the ultimate claim amount.

SCH_LAST_WA | Explanatory variables | Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
0 | Policy information 1 in 2012 | 2012 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2012 | 2012 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2012 | 2012 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2012 | 2012 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2012 | 2012 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2012 | 2012 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2012 | 2012 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2012 | 2012 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2012 | 2012 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
0 | Policy information 2 in 2012 | 2012 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0
0 | Policy information 2 in 2012 | 2012 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
0 | Policy information 2 in 2012 | 2012 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
0 | Policy information 1 in 2013 | 2013 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2013 | 2013 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2013 | 2013 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2000 | Policy information 1 in 2013 | 2013 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2013 | 2013 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2013 | 2013 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2013 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2013 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
0 | Policy information 1 in 2013 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
0 | Policy information 1 in 2013 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0
0 | Policy information 2 in 2013 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
0 | Policy information 2 in 2013 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1

Table 2.3: Framework for monthly indicator variables

The response variable for each of the months in which there is an active policy and no claim will be 0, and the ultimate claim amount (the sum of payments minus any refunds and including expert costs) will be the response variable for each row representing a claim.

2.3 How this data is used for the response variable in each model

Due to the scarcity of non-zero claim amounts in the total dataset it is preferable to model the probability of making a claim and the size of such a claim separately. Utilising the above framework for binding our data to time intervals we can now look at the explanatory covariates and response used in the predictive model in our analysis. Here we outline how we split the dataset we have created into appropriate structures for using the ultimate claim amount in each month for each model.

2.3.1 How the data is used in a model for estimating the probability of making a claim

The first step is to predict whether or not a policyholder will make a claim. To do so we look at the full dataset of months containing claims and non-claims to try and predict, from the population, the probability that each policyholder will make a claim in each month. This will require modelling predictive covariates against a series of 1 and 0 values, where there are claims and no claims respectively, as a response variable.


This requires us to modify table 2.3, replacing each non-zero ultimate claim amount with a 1, implying there is exactly one accident in that month; if this is a 0, there are no accidents in that month.

From our data there are no examples where a driver has had two accidents within one month. Although this is technically possible, we have decided to ignore the possibility, as the probability of 2 (or more) accidents happening in such quick succession is so small that we shall consider the impact negligible. This is the rationale behind only modelling the probability of making exactly 1 or 0 claims.

2.3.2 How the data is used in a model for estimating the ultimate claim amount

The second step is to calculate the claim size given that a claim has occurred. This requires a model built only on the dataset of occurred claims. Rather than using an indicator variable to record whether there is a non-zero claim amount, as in the probability model, we now keep these non-zero claim amount values present in table 2.3 and discard all of the zero values, leaving us with a much smaller dataset of only non-zero claims.
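A brief sketch of how the monthly dataset can be split into the two response sets described in sections 2.3.1 and 2.3.2; the column name follows table 2.3 but the code is illustrative rather than the thesis pipeline:

```python
import pandas as pd

def build_responses(monthly: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (probability-model data, size-model data) from the monthly table,
    assuming SCH_LAST_WA holds the ultimate claim amount (0 if no accident)."""
    # Probability model: binary response over the full dataset.
    prob_data = monthly.copy()
    prob_data["claim_indicator"] = (prob_data["SCH_LAST_WA"] > 0).astype(int)

    # Size model: only the months with a claim; the response is the ultimate amount.
    size_data = monthly.loc[monthly["SCH_LAST_WA"] > 0].copy()
    return prob_data, size_data
```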


Figure 2.5: Total claim value each month

When investigating the monthly claim totals in our dataset, the overall pattern appears to follow a cyclical trend, with high peaks in the winter months and in May of each year, and low peaks in February and August.

2.4 Explanatory variables

We were offered a dataset including over 200 covariates. This is too many to use, and many of them were relatively subjective or at least heuristically would not be considered exceptionally useful for predicting car insurance claims. In particular, such a large set of covariates results in a very large data file which exceeds the maximum space available on the local machines or on any servers accessible during this research. As such we have chosen a subset of the covariates which one would intuitively assume to hold some predictive power. The initial shortlist of the explanatory variables is:


Explanatory Variables

Name (in data) | Name | Description
JAAR | Year | Year to which this policy information refers
SCH_LAST_WA | Ultimate claim amount | Sum of payments minus any discounts applied and including any expert costs
KILOMETRAGE | Kilometrage | The average number of kilometers this driver will travel in a calendar year
CATWAARDE | Vehicle value | Initial value of the vehicle
SVJ | Claim free years | This value rises by one each year but is somewhat of a misnomer; it will drop by a variable amount, capped at 5, after a claim
OUDERDOM | Age of vehicle | How many years old the vehicle is
REGIO | Region of country | The country is partitioned into 4 clustered regions
LOOPTIJD_JAREN | Years with policy | Number of years in an insurance contract with this company
LEEFTIJD | Age | Age of the driver of the vehicle in years
burgst | Relationship status | Relationship status of the main policyholder
kidzhh | Children | Does the policyholder have children, and are they living at home?
persauto_per_hh | Average cars | Average number of cars per household
stedelijkheid | Urbanization | How urban the area is that this policyholder lives in
eerste_kleur | Colour | Main colour of the vehicle
mmt_Deuren | Number of doors | Number of doors on the vehicle
mmt_Brandstof | Fuel | Type of fuel in the vehicle
mmt_Turbo | Turbo | Does the vehicle have a turbo injector?
mmt_CC | Cubic centimetres | Engine size in cubic centimetres
mmt_Automat | Automatic | Is the car automatic or not?
mmt_Gewicht | Weight | Weight of the vehicle

Table 2.4: Explanatory variables shortlist

Many of these variables proved to show high correlation with each other, or little predictive power, in initial rounds of modeling; as such we have elected to remove them from the dataset due to multicollinearity and lack of necessity. The final set of features we have decided to use in modelling with our data is:

Explanatory Variables

Name (in data) | Name | Description
JAAR | Year | Year to which this policy information refers
SCH_LAST_WA | Ultimate claim amount | Sum of payments minus any discounts applied and including any expert costs
KILOMETRAGE | Kilometrage | The average number of kilometers this driver will travel in a calendar year
SVJ | Claim free years | This value rises by one each year but is somewhat of a misnomer; it will drop by a variable amount, capped at a change of 5, after a claim
OUDERDOM | Age of vehicle | How many years old the vehicle is
REGIO | Region of country | The country is partitioned into 4 clustered regions
LOOPTIJD_JAREN | Years with policy | Number of years in an insurance contract with this company
LEEFTIJD | Age | Age of the driver of the vehicle in years
burgst | Relationship status | Relationship status of the main policyholder
kidzhh | Children | Does the policyholder have children, and are they living at home?
stedelijkheid | Urbanization | How urban the area is that this policyholder lives in
mmt_Brandstof | Fuel | Type of fuel in the vehicle
mmt_CC | Cubic centimetres | Engine size in cubic centimetres
mmt_Automat | Automatic | Is the car automatic or not?
mmt_Gewicht | Weight | Weight of the vehicle

Table 2.5: Explanatory variables final

Also included as covariates, as mentioned, are the indicator terms for each month of the year, where a 1 indicates the relevant month. Combining each of these predictors with the monthly indicators mentioned in table 2.3, we have a full set of predictive variables and one response variable. The non-zero values of the response are used both as an indicator for when a claim is made in the probability model and, through their size, directly as the response in the size model. SVJ is a variable that does not change by a fixed amount each year. This does not matter provided the procedure for determining the size of the reduction after future claims remains the same within this company.


Figures 2.6 and 2.7 display the densities of the data. Pictured here is the frequency at which each of the variables is present in our dataset.

Figure 2.6: Densities of the data (1) (panels: Kilometrage, Weight, SVJ, Power, Vehicle age, Policy age)


Figure 2.7: Densities of the data (2) (panels: Fuel, Automatic, Urbanity, Region, Age, Relationship status)

We have included in appendix A.1 a series of plots of empirical claim sizes and frequencies, created by fitting Generalised Additive Models (GAMs) modelling claim size and probability against each of the continuous covariates individually. This is to visually aid some understanding of how each explanatory variable might be useful in predicting claim size or the probability of making a claim. These will not be used for any of the modelling within this research, although they are useful in presenting how each covariate will have different interactions with our response variable in each model.

2.4.1 Missing values and data conversion

For the purposes of this research we have a mixture of discrete, categorical and continuous data types. We shall also have many missing values, not actually included in the data. This may be as a result of poor data quality for some individuals, where information about where they live or the weight of the vehicle is missing. We shall reference the categorical variables in a sparse matrix as our input. This will make use of a series of indicator variables indicating whether each categorical value exists for each row in the data, and a zero if it does not. This is exemplified in how we have transformed the monthly data into a sparse format: rather than have one column with 12 potential values, we have 12 columns with an indicator variable.
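A sketch of how such a sparse design matrix might be built; the column names follow tables 2.3 and 2.5, but the code itself is an illustration rather than the thesis implementation:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Covariate groups following table 2.5; the 12 month indicators of table 2.3
# are built in the same way from a single month column.
CATEGORICAL = ["REGIO", "burgst", "kidzhh", "stedelijkheid", "mmt_Brandstof", "mmt_Automat", "month"]
NUMERIC = ["KILOMETRAGE", "SVJ", "OUDERDOM", "LOOPTIJD_JAREN", "LEEFTIJD", "mmt_CC", "mmt_Gewicht"]

def to_sparse_design_matrix(monthly: pd.DataFrame) -> csr_matrix:
    """One-hot encode the categorical covariates into indicator columns and
    return a sparse design matrix; a missing categorical value simply yields
    all-zero indicators for that row."""
    encoded = pd.get_dummies(monthly[CATEGORICAL + NUMERIC], columns=CATEGORICAL)
    return csr_matrix(encoded.to_numpy(dtype=float))
```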

Missing data is not a problem in this context, as we shall use techniques which make an optimal prediction given the data limitations; this is further outlined in section 3.4.4.


3 Methodology

In this chapter we describe the methods and techniques used to perform the analysis in this paper. We have explained the data in detail in the previous chapter and now we shall elaborate upon the modeling concept based on said data. Here we shall give some mathematical justification for the concepts and models outlined in the last chapter. We shall also study how various machine learning techniques work, how they can be used in the prediction of claim size in regression models, and how probabilities can be formulated from the expected output of classification models. Finally we will look at XGBoost: why it is such a popular modeling tool, why we have chosen it to enact our analysis, and how exactly it is used to classify and predict within the context of this research.

3.1 The necessity for two models

In the data section we have outlined the framework within which we shall constrain our predictions of claims for each individual, that is predictions at the monthly level. Thus far we have only given a qualitative overview of the data and how we shall use it.

In this chapter we shall more formally define the usage of this framework. Using the ultimate claim amount variable SCH_LAST_WA as our response variable in both of our models, each row will have an attributable policyholder number and a set of $m$ features. Mathematically we denote this set of covariates $X$, where each policyholder in each month $i$ has a set of $m$ covariates, with the $i$th data point $x_i \in X$ of dimension $\mathbb{R}^m$. Each policyholder can have any number of rows of data (indicative of months with an active policy in this portfolio). Each data point can be written as $x_i = (x_{i1}, \dots, x_{im})$, where $x_{ij}$ is the $j$th covariate of the $i$th row.

This data set contains far more data on policyholders that do not make a claim than on policyholders that do make a claim, because accidents are quite a rare event. In terms of predicting the expected claim amount for each individual for each month, it is necessary to treat this research in two steps rather than as a single regression problem on one dataset. As there are relatively few months in which an accident occurs, we cannot properly capture the distribution of claim amount when considering the entirety of the data, as the model would be very heavily balanced towards zero values. Thus we model the probability of making a claim separately, as in compound or hurdle models. Frequency and severity models are not new conceptually within actuarial modelling [12]. The frequency model is typically built with a GLM using a probit or logit form for binary outcomes, such as in this case. For a conditional severity model it is common to use a Gamma or Inverse Gaussian distribution, often with a logarithmic link.

Modelling risk in this compound manner is covered rather extensively, but on the portfolio level, in Modern ART [13]; this book notes that a collective risk model turns out to be both computationally efficient and rather close to reality. Typically these distributions work on the assumption of a set of homogeneously distributed claim severities and frequencies and are thus not applicable in this case.

3.2 Model choices

Here we shall outline the approaches for each of the models for making predictions in each month: the probability of making a claim and the corresponding size given that a claim is made. Combining these two values gives us an expectation; this is the estimation of the expected claim size which we shall denote as the NINR for this month. For an individual we reference each month in which they have an active policy with an indicator variable of 1 in that month, and this is contained in the information of $x_i$.


Each data point will have an expected probability of making a claim, $\hat{p}_i$, and an expected conditional severity $\hat{s}_i$, giving an expected claim amount of $\hat{p}_i \cdot [\hat{y}_i \mid \text{claim occurrence}]$.

In principle, we require a model for the probability of each policyholder making a claim within a specific month, and a model for the conditional expectation of the claim size for the same individual within the same month.

3.2.1 A model for probability

The first step is to predict whether or not a policyholder will make a claim in a specific month. To do so we look at the full data set of claims and non-claims using an indicator variable, such that we re-purpose any non-zero values of our ultimate claim amount from $y_i$ to $y_i^*$ as:

$$y_i^* = \begin{cases} 0, & \text{if } y_i = 0 \\ 1, & \text{otherwise} \end{cases} \qquad (3.1)$$

To predict from the population the estimated probability of making a claim, based on making one claim versus making no claim within any given month, we shall use a classification model on this series of 1's and 0's. This probability will be representative of whether or not there is a claim within each month. The output here will be that each policyholder in each month, $x_i$, will have an accident with probability $p_i$, where we should find that each $p_i$ lies between 0 and 1.

We train our model over a large set of data, predicting which sets of covariates lead to a larger probability of claiming; seasonality is built into this model as each month is used as a variable. As the model is trained, different sets of covariates in combination will offer unique predictive values. These covariates will have different interactions and will produce a range of predictions based on their relative predictive power.

Logistic regression conventionally involves fitting a linear combination of predictive variables to the log-odds of making a claim, in the form

$$\ln\left(\frac{p_i}{1 - p_i}\right) = \eta_i = \beta_0 + \sum_{j=1}^{m} \beta_j x_{ij} \qquad (3.2)$$


From a machine learning perspective, rather than fit a linear model to the log-odds of making a claim, we fit a non-parametric function to provide an estimate for $\eta_i$ based on a series of decision trees (decision trees and the more complex models used to fit this are outlined below in section 3.3).

Just as in logistic regression, this means that our probability of making a claim can be empirically estimated for each policyholder in each month by estimating $\eta_i$ and converting the resultant estimate to the probability of making a claim as:

$$\hat{p}_i = \frac{\exp(\hat{\eta}_i)}{1 + \exp(\hat{\eta}_i)} = \frac{1}{1 + \exp(-\hat{\eta}_i)} \qquad (3.3)$$
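For illustration, with XGBoost's binary:logistic objective the raw margin plays the role of $\hat{\eta}_i$ and can be converted to $\hat{p}_i$ exactly as in equation 3.3. The snippet below is a sketch on random stand-in data, not the tuned model of chapter 4:

```python
import numpy as np
import xgboost as xgb

def margin_to_probability(eta_hat: np.ndarray) -> np.ndarray:
    """Equation (3.3): convert the raw score eta to a claim probability."""
    return 1.0 / (1.0 + np.exp(-eta_hat))

# Random stand-in data; in practice X would be the monthly design matrix and
# y the 0/1 claim indicator of equation (3.1).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=20)

eta_hat = booster.predict(xgb.DMatrix(X[:5]), output_margin=True)  # estimates of eta_i
print(margin_to_probability(eta_hat))  # equals booster.predict(xgb.DMatrix(X[:5]))
```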

3.2.2 A model for claim size

The second step is to calculate the claim size given that a claim has occurred. Using decision trees we can predict an expected ultimate severity for each policyholder. Rather than predicting a score $\eta_i$ as before, for training this model we shall directly predict $s_i$. We are modelling our prediction on the conditional value of $y_i$ by only looking at values of $s_i = [y_i \mid y_i > 0]$.

Looking at this from a machine learning perspective again, we shall fit an ensemble of weak learners in the form of a series of short regression trees in order to build a nonparametric model. This allows us to predict this value $s_i$ from the vector of covariates $x_i$. The model does so by incrementally minimizing a squared error loss function in order to iteratively converge to a prediction for a given input $x_i$. Modelling this for policyholders with identical characteristics and in the same month, we will not predict any of the high or low values observed for individual policyholders, but rather will produce the mean value based on the characteristic set available.

3.2.3 Combining the two models

With the two models fit together, our combined model is now more clearly defined for each individual in each month marked $i$; this joint expectation is built on the estimates as

$$\widehat{\text{NINR}}_i = \hat{p}_i \cdot \hat{s}_i \qquad (3.4)$$
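A sketch of how the two fitted models could be combined into an NINR per policyholder month and aggregated to a portfolio NINR per calendar month; the estimator objects and column names here are assumptions for illustration (any scikit-learn style classifier and regressor fits this pattern), not the exact pipeline of this thesis:

```python
import pandas as pd

def add_ninr(monthly: pd.DataFrame, prob_model, size_model, feature_cols) -> pd.DataFrame:
    """Combine the probability and severity models as in equation (3.4):
    NINR_i = p_hat_i * s_hat_i for each policyholder month."""
    out = monthly.copy()
    X = out[feature_cols]
    out["p_hat"] = prob_model.predict_proba(X)[:, 1]  # probability of a claim in the month
    out["s_hat"] = size_model.predict(X)              # expected severity given a claim
    out["NINR"] = out["p_hat"] * out["s_hat"]
    return out

def portfolio_ninr(monthly_with_ninr: pd.DataFrame) -> pd.Series:
    """Portfolio NINR: sum the individual predictions over all active policies per month."""
    return monthly_with_ninr.groupby(["year", "month"])["NINR"].sum()
```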


3.3 Machine Learning

In this section we build up a repository of machine learning definitions and techniques in order to demonstrate which will be useful in the prediction of NINR. Our ultimate focus will be on extreme gradient boosting (XGBoost [14]) as an ensemble method [15] for combining weak classifiers into strong classifiers. The weak classifiers under consideration here are a series of short regression trees, the average of which allows us to form a regression. The reasoning behind this tooling choice we shall outline in this section. We use the terminology of a weak classifier throughout this research; these are classifiers of data which perform only slightly better in classification than random chance. One simple example is a decision stump, a one-level decision tree (covered in section 3.3.1).

Our aim is to use machine learning to approximate an underlying distribution non-parametrically, capturing a large range of feature combinations. Many combinations of features have no observed claim amount in training; building an ensemble model allows us to estimate the response for these combinations from the interactions of the features. We use machine learning to iteratively minimise the residuals between fitted and observed values. There is a trade-off between the complexity of the model and the quality of its fit, requiring us to balance the two: in practice we can fit the training data too precisely, such that the fit does not predict future observations reliably. We use complexity controls in our model to mitigate this.

3.3.1 Decision Functions and Trees

Simply put, decision rules are used to draw conclusions about an item's target value or attributes based on observations about that item. The goal is to split the target variable up in an efficient way, such that its value can be approximated by analysis of the input parameters. Individually, one of these splits is referred to as a decision rule. This can be done using both numerical and categorical data.

By building on each split with multiple input variables iteratively, we minimise appropriate loss functions without assuming linear or generalised linear (non-normal error distributions) combinations of parameters in our model. In fact we do not prescribe a relationship between different covariates in our model, allowing it to minimise losses non-parametrically in a flexible, data-driven manner.

Prediction of categorical data is known as classification, and prediction of continuous data is known as regression. Recursive partitioning of the training set for either of these tasks using such decision functions is known as a decision tree.

3.3.2 CART: Classification And Regression Trees

The principle for splitting two groups at any particular point is loss minimisation. Classification and regression trees contain nodes of information; a full explanation is available in the work of Breiman et al. [16]. We start from a root node $N_P$, which contains information on all variables and observations. From this root node, there is some condition by which the data is split. The algorithm tests all possible splits among all variables $X_j$.

The set of splits $S$ depends on the value of one variable $X_j$. If this variable is numeric, we consider splits of the type $\{x_{ij} \le e\} \subset S$. If $X_j$ is categorical, we consider all splits of the type $\{x_{ij} \in A\} \subset S$, with $A$ a non-empty proper subset of the potential values of $x_{ij}$.

How well a splitting function works is measured by an impurity function; for classification trees this is typically the Gini index. An impurity measure quantifies the mix of observations from each class within the same node. The Gini impurity function is calculated as

$$I_G(N_P) = 1 - \pi_{P,-1}^2 - \pi_{P,1}^2 \qquad (3.5)$$

where $\pi_{P,-1}$ and $\pi_{P,1}$ represent the proportions of the observations in each class.

The split in the data is determined along the split points maximising the Gini gain when splitting the data into a left and a right node, $N_L$ and $N_R$. The gain in Gini score is maximised as

$$\max\{I_G(N_P) - I_G(X_j)\} \qquad (3.6)$$

where

$$I_G(X_j) = \pi_L I_G(N_L) + \pi_R I_G(N_R) \qquad (3.7)$$

with $\pi_L$ and $\pi_R$ the relative fractions of observations assigned to each child node.

This is repeated until either the node is pure (contains observations of only one class), the minimum prescribed node size is reached or there is insufficient Gini gain achieved from another split.
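A small sketch of how the impurity and gain of equations (3.5)–(3.7) can be computed for a candidate split (illustrative helper functions, not part of any particular library):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of equation (3.5): one minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    """Gini gain of equations (3.6)-(3.7): parent impurity minus the
    weighted impurity of the two child nodes."""
    n = len(parent)
    i_split = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - i_split

# A perfectly separating split of a balanced two-class node gives the maximum gain of 0.5
parent = np.array([1, 1, 1, -1, -1, -1])
print(gini_gain(parent, parent[:3], parent[3:]))
```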

Regression trees predict values instead of classes as a classification tree does. Again, starting with a root node $N_P$, we have a response vector $s_i$ containing all the values in our training data, and we approximate the impurity function here by minimising a loss function, typically the residual sum of squares.

First we calculate the error of the root node as the average squared deviation of the observations in node $N_P$:

$$I(N_P) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad (3.8)$$

The best split is found by maximising the decrease in the least-squares criterion when splitting the data into left and right nodes, $N_L$ and $N_R$, based on the optimal split on $X_j$.
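For a single numeric variable, this least-squares split can be illustrated with a short helper (again illustrative only): every candidate threshold is tried, and the one minimising the children's residual sum of squares is kept.

```python
import numpy as np

def best_split(x, s):
    """Return the threshold e minimising the residual sum of squares of the
    two child nodes {x <= e} and {x > e}, i.e. maximising the decrease from
    the root-node error of equation (3.8)."""
    best_e, best_rss = None, np.inf
    for e in np.unique(x)[:-1]:                  # candidate thresholds x_ij <= e
        left, right = s[x <= e], s[x > e]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_e, best_rss = e, rss
    return best_e, best_rss
```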

For a regression problem we minimise either the absolute error loss, $L(y, F) = |y - F|$, or the squared error loss, $L(y, F) = (y - F)^2$; by minimising a loss function we aim to minimise the sum of the differences between predicted and observed values. Splitting the data and minimising the difference between the mean response and the observed responses, as a method of approximating a distribution, is demonstrated visually in figure 3.1. How an ensemble of decision trees works is explained in Chen's paper on XGBoost [14]. How precisely regression trees can emulate a distribution is itself an abstract concept. To demonstrate this for the reader, we have produced a more classical Generalised Additive Model (GAM) on a single variable that models nonparametrically the empirical claim size for different ages of policyholder. The value of presenting a GAM here is that it provides an easily interpretable underlying model as a visual aid, against which we can see step-wise how recursive splits of a regression tree on one variable approximate this function. GAMs are covered in appendix A.1 and are not considered in our modelling.

The horizontal line in the first graph represents the mean claim amount in the root node of a decision tree. We then split this group into two smaller subgroups: those drivers above the age of 50 and those below. This is the optimal split in age to minimise the residuals of the predictions at this step. We then split the group below 50 again, producing another split at 33 years old. The more often this process is repeated, the better it becomes at empirically predicting the underlying distribution; it is already evident how this technique can approximate a function after just three splits in the data. This concept is the core of the modelling process on which the analysis in this paper is based. By using splits of this nature, and with many more variables on which to split, we capture more complex interactions and produce a model that can provide an estimate for every combination of input features.

Figure 3.1: Recursive splits in a decision tree, with panels showing the fit after 0 splits, 1 split, 2 splits and 3 splits.
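The step-wise approximation in figure 3.1 can be reproduced in spirit on synthetic data (a sketch with an assumed toy severity curve, not the thesis data): regression trees with one, two and three splits give ever finer step functions around the overall mean.

```python
# Illustrative sketch of figure 3.1 on synthetic data: step functions from 0 to 3 splits.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
age = rng.uniform(18, 90, size=2000).reshape(-1, 1)                          # synthetic driver ages
claim = 2000 + 1.5 * (age.ravel() - 50) ** 2 + rng.normal(0, 500, 2000)      # toy severity curve

for n_splits in [0, 1, 2, 3]:
    if n_splits == 0:
        pred = np.full_like(claim, claim.mean())                             # root node: overall mean
    else:
        tree = DecisionTreeRegressor(max_leaf_nodes=n_splits + 1).fit(age, claim)
        pred = tree.predict(age)
    print(n_splits, "splits ->", sorted(np.unique(pred).round(0)))           # the step-function levels
```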

3.3.3 Gradient Boosting

Gradient boosting is a prediction model wherein the loss function is minimised as in a regression tree. Gradient boosting uses an ensemble of the weak classifiers described above to build a model stage-wise, and generalises the approach by enabling optimisation of an arbitrary differentiable loss function; the goal is to maximise the descent of the loss function through the choice of parameters.

By taking these gradient descent steps incrementally, we gradually produce models that fit the structure of the training data and use this fit to make predictions for other sets of characteristics. The gradient boosting machine is described more formally in the original text by Friedman [17].

The model combines the weak learners fitted to our response in an additive manner as:

$$F(x) = \sum_{k=1}^{K} \gamma_k h_k(x) + \text{const.} \qquad (3.9)$$

where $h_k(x)$ is a base (weak) learner.

The method tries to find an approximation that minimises the average value of the loss function on the training set. It does so by starting with a simple model, a split on a single variable or a small tree of variables, chosen such that the loss function is minimised. From this point we approximate the gradient of the loss function with respect to the predictive function, evaluated at the previous best estimate of that function. Gradient boosting then fits weak learners to the pseudo-residuals, the differences between the observations and the current strong learner's predictions, and the contribution of each weak learner is applied through a gradient descent optimisation process.

Mathematically, the gradient descent direction is obtained by differentiating the loss function of our predictions with respect to the differentiable predictive function $f(x_i)$, evaluated at the optimal estimate from the previous iteration $\hat{f}^{(m-1)}(x_i)$:

$$-\hat{g}_m(x_i) = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i) = \hat{f}^{(m-1)}(x_i)} \qquad (3.10)$$

This gradient of the loss function is only defined at the observed points, so to generalise the step the function takes next we need to approximate the negative gradient using a restricted set of possible functions. We therefore constrain the set of possible solutions to a set of basis functions $\Phi$, and choose the basis function with the closest correlation to the negative gradient over the points $x_i$. At iteration $k$ of the algorithm we seek the highest possible correlation between $\{-\hat{g}_k(x_i)\}_{i=1}^{n}$ and $\{\phi_k(x_i)\}_{i=1}^{n}$. This is obtained by optimising

$$\hat{\phi}_k = \operatorname*{argmin}_{\phi \in \Phi,\, \beta} \sum_{i=1}^{n}\left[-\hat{g}_k(x_i) - \beta\,\phi(x_i)\right]^2 \qquad (3.11)$$

to most closely mirror the direction of the gradient of our loss function. The length of such a step, $\rho_k$ at iteration $k$, is determined using a procedure called line search, optimised at each iteration as

$$\hat{\rho}_k = \operatorname*{argmin}_{\rho} \sum_{i=1}^{n} L\!\left(y_i,\; \hat{f}_{k-1}(x_i) + \rho\,\hat{\phi}_k(x_i)\right) \qquad (3.12)$$

We optimise our loss function $L$ with a shrinkage or learning rate $\eta \in (0, 1]$, the factor by which each subsequent tree's step size is scaled. This shrinks the step taken at each iteration, and its purpose is to prevent over-fitting.

Collating all of this information and optimising accordingly by choosing the best splits and direction at each step, we apply the learning rate to produce an optimal function at step $k$:

$$\hat{f}_k(x_i) = \eta\,\hat{\rho}_k\,\hat{\phi}_k(x_i) \qquad (3.13)$$

Iterating through this process $K$ times, we build the model as an ensemble of predictions for the variable $\hat{y}_i$ as:

$$\hat{y}_i = \hat{f}(x_i) = \hat{f}^{(K)}(x_i) = \sum_{k=0}^{K} \hat{f}_k(x_i) \qquad (3.14)$$
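For the squared error loss $L = \tfrac{1}{2}(y - F)^2$, the negative gradient in equation (3.10) is simply the residual $y - F$, so the boosting loop above can be sketched in a few lines (an illustrative from-scratch version, with the line-search step absorbed into the trees' leaf values):

```python
# Illustrative gradient boosting loop for squared error: trees fitted to residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, eta=0.1, max_depth=2):
    f = np.full(len(y), y.mean())          # f_0: start from the mean prediction
    trees = []
    for _ in range(n_rounds):
        residual = y - f                   # negative gradient of 0.5 * (y - f)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        f += eta * tree.predict(X)         # shrunken update, cf. equation (3.13)
    return f, trees                        # fitted values and the ensemble of weak learners
```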

3.4 XGBoost

XGBoost is a highly effective machine learning method [18]. In recent years XGBoost has become more and more widely adopted due to its simplicity of implementation, speed, scalability and predictive accuracy. XGBoost is an extension of the gradient boosting model that can use more computational power, and improve accuracy, by running multiple simultaneous threads and employing robust regularisation. XGBoost has regularisation and complexity control built in, implemented through the tuning parameters outlined in section 3.4.2, which enables us to tune a model with more precision. The engine behind XGBoost is described in detail in Chen [14]. The complexity of the model can be defined as:

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \qquad (3.15)$$

where $\gamma$ is the minimum loss reduction required to make a further partition on a leaf node (larger values make the algorithm more conservative), $T$ is the total number of leaves the tree splits into, and the remaining term is an L2 (Euclidean) regularisation of the leaf weights $w_j$.

The traditional gradient boosting machine chooses the direction of the step in gradient descent and then approximates the size of such a step, whereas XGBoost solves for this directly as:

$$\frac{\partial L\!\left(y,\; f^{(m-1)}(x) + f_m(x)\right)}{\partial f_m(x)} = 0 \qquad (3.16)$$

using a second-order approximation, with the Hessian (the second partial derivative of the loss) denoted $h_i$. We can then take a Newton expansion to select our basis function from a restricted set of functions. Solving for the optimal function, we find that we can treat this as a weighted least squares optimisation problem:

$$\hat{\phi}_k = \operatorname*{argmin}_{\phi \in \Phi} \sum_{i=1}^{n}\left[\hat{g}_k(x_i)\,\phi(x_i) + \frac{1}{2}\,\hat{h}_k(x_i)\,\phi(x_i)^2\right] \qquad (3.17)$$

thus, rearranging, we have:

$$\hat{\phi}_k = \operatorname*{argmin}_{\phi \in \Phi} \sum_{i=1}^{n} \frac{1}{2}\,\hat{h}_k(x_i)\left[\left(-\frac{\hat{g}_k(x_i)}{\hat{h}_k(x_i)}\right) - \phi(x_i)\right]^2 \qquad (3.18)$$

which allows us to frame our optimal function as

$$\hat{f}_k(x_i) = \eta\,\hat{\phi}_k(x_i) \qquad (3.19)$$

where $\eta$ (eta) is a learning rate, a constant factor by which each step is multiplied; this is one of the tuning parameters (described in section 3.4.2). Again, by iterating $K$ times we can formulate the model as

$$\hat{f}(x_i) = \hat{f}^{(K)}(x_i) = \sum_{k=0}^{K} \hat{f}_k(x_i) \qquad (3.20)$$

As XGBoost includes this second-order approximation of the optimisation, empirically it learns better tree structures than plain gradient boosting. We can also use deeper trees to capture higher-dimensional interactions in the covariates; this allows us to model the effects of multiple parameters and how their combined interaction can be used to forecast the response we wish to predict. XGBoost is particularly useful in that the tree depth is not fixed as in other boosting techniques, so different combinations of parameters, and different numbers of parameters in combination, can be used, provided they produce some predictive benefit.

Defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, we have an optimal weight in each leaf of $w_j^{*} = -\frac{G_j}{H_j + \lambda}$, and the objective value of the tree structure is $-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T$.

We can collect all of our predictive and regularisation terms with this objective. As there can be infinitely many possible tree structures, we find the split maximising the gain

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] \qquad (3.21)$$
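The optimal leaf weight and the gain of equation (3.21) are simple functions of the gradient and Hessian sums, which a short sketch makes explicit (illustrative helpers; a split is only worthwhile when the gain exceeds the penalty $\gamma$ of equation (3.15)):

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam):
    """Gain of equation (3.21) for a candidate left/right split."""
    return 0.5 * (G_L ** 2 / (H_L + lam)
                  + G_R ** 2 / (H_R + lam)
                  - (G_L + G_R) ** 2 / (H_L + H_R + lam))

# Example gradient/Hessian sums for the two candidate child nodes
print(leaf_weight(-10.0, 25.0, lam=1.0))
print(split_gain(-6.0, 12.0, -4.0, 13.0, lam=1.0))   # compare this gain against gamma
```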

In our research, XGBoost will empirically build a predictive function on a set of features $x_i$ by this process; on a different set of features in the same format, we can then use these functions to predict claim probability and claim size as response variables in the context of section 3.2.

Boosted models are already popular as predictive models due to their flexibility and accuracy, but XGBoost has further advantages over other machine learning techniques, which has made it widely popular in data analysis competitions thanks to its speed and robustness. A major advantage is that XGBoost can implement parallel processing and thus train on multiple cores over several threads simultaneously, making the run time for each split much shorter. If information is missing from the data, XGBoost makes a best prediction by assuming an average score for a continuous variable, or by assuming whichever direction adds the most predictive accuracy. Thus, when given a sparse matrix, XGBoost handles missing data as a non-observation and still chooses the best prediction path given the information available.

Traditional boosted models keep splitting until a split no longer reduces the loss, whereas XGBoost makes splits up to the maximum tree depth provided and then works backwards to determine where to cut off the prediction. This means that if a split with negative loss reduction at the second level of a tree is followed by a split with a larger positive reduction, where a classical gradient boosting machine would simply stop at the first split, XGBoost can produce a more complex and flexible prediction by accounting for the net effect of both.
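A minimal sketch of XGBoost's native interface (reusing the hypothetical `X` and `y` from the sketches in section 3.2) shows how missing values and parallel threads are handled through the DMatrix and the training parameters:

```python
# Hypothetical sketch of native XGBoost training with missing data and multiple threads.
import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)   # np.nan entries are treated as missing
params = {"objective": "binary:logistic",
          "eta": 0.1,          # learning rate
          "max_depth": 4,      # depth of each tree
          "gamma": 0.1,        # minimum loss reduction to keep a split
          "lambda": 1.0,       # L2 penalty on leaf weights, equation (3.15)
          "nthread": 4}        # parallel threads
bst = xgb.train(params, dtrain, num_boost_round=200)
```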

3.4.1 Feature importance

An issue with many machine learning techniques is the extent to which they are considered black-box models. By this we mean the difficulty in determining which features are most important as predictive covariates for the expected response. In a GLM or GAM, the impact each variable has in predicting the response is immediately clear: larger optimised coefficients $\beta_j$ will, for positive input variables, produce larger positive responses, and a combination of these weights can be added to produce an expected response for any set of variables.
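For a fitted booster such as the hypothetical `bst` above, split-based importances can be extracted directly, which partially opens the black box:

```python
# Average gain of the splits using each feature, sorted from most to least important.
importance = bst.get_score(importance_type="gain")
for feature, score in sorted(importance.items(), key=lambda kv: -kv[1])[:10]:
    print(feature, round(score, 2))
```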
