Optimization of dynamic pricing strategies for automobile insurance through an on-line sales channel

Meindert van Dijk

Master’s Thesis to obtain the degree in Actuarial Science and Mathematical Finance
University of Amsterdam
Faculty of Economics and Business
Amsterdam School of Economics

Author: Meindert van Dijk

Student nr: 10164693

Email: meindertvd@yahoo.com

Date: July 5, 2015

Supervisor: Dr. Sami Umut Can


Abstract

In this thesis, advanced statistical learning techniques are applied to the problem of predicting the price sensitivity of customers buying car insurance. A data set from an on-line agent provided quote and sales information for all quotes for car insurance placed on its system during a period of three months, thus providing a close-to-complete picture of its on-line insurance market. Using this data set, predictions were made for the price sensitivity of four different labels of one insurance company, using several statistical learning techniques: logistic regression, regularized regression, bagging, boosting, and random forests. The data set was coupled to external marketing data, in order to assess the relevance of such data for the purchase behavior of customers.

The main conclusion of the study is that price is by far the most important variable in predicting the purchase behavior for the system studied here. Other variables had a limited contribution, and the relevance of the external data sources was limited to none. Comparison of the statistical learning techniques revealed an advantage of regularized regression techniques, mainly due to their ease of use. For decision-tree techniques, predictive power improved over simple trees when more complex tree-based techniques were applied, while the possibility of inference can be maintained through variable-importance measurements.


Contents

1 Introduction
  1.1 Dynamic Pricing
  1.2 Research question
  1.3 Description of the on-line system of Agent A
  1.4 Statistical language and supporting packages
  1.5 Data description
2 Statistical techniques
  2.1 Model selection techniques
  2.2 Model-performance measures in classification
  2.3 Logistic regression
  2.4 Regularized regression
  2.5 Decision Trees
3 Results
  3.1 Data exploration
  3.2 Logistic regression
  3.3 Decision trees
  3.4 Model comparison
  3.5 Price-sensitivity curves
4 Conclusion
  4.1 Summary of findings
  4.2 Recommendations for further research
  4.3 Acknowledgments
Bibliography


Chapter 1

Introduction

1.1 Dynamic Pricing

Pricing of insurance products has traditionally been a two-step process. First, an actuarial premium (or pure premium) is derived from the expected loss, increased with risk loadings. The second step is an adjustment of the pure premium to arrive at the commercial premium for which the product is offered to customers. The adjustments could be driven by, e.g., market conditions, the desire to write a certain volume, or incentives for cross-selling of other products.

More and more, insurers are revising their pricing strategies to adopt a more dynamic framework. Dynamic pricing is an optimization strategy for pricing of products that aims to optimize supply and demand in a different way from traditional pricing. A well-known example is the pricing of travel tickets, which can vary day-to-day based on the expected demand for a ticket. In transitioning towards dynamic pricing strategies, insurers are factoring in wider arrays of data that are available for their customers. These strategies are typically aimed at improving insight into the risk profile of customers. A classic example is the correlation between credit rating and claims behavior in automobile insurance, which was discovered in the 1990s [1].

One step further would be to address the profitability of the business as a whole through the commercial premium. This requires a better understanding of the sensitivity of customers to the different prices offered by insurers that are active in the market. An important metric that is considered for such studies is Customer Lifetime Value ([2]), which looks at the potential value of a customer over her entire lifetime, including potential churn, up-sell, or cross-sell, rather than just taking the current contract into account. The advancement of data analytics and predictive modelling could be an important factor in incorporating these insights into the pricing strategy [3].

Capital constraints could also be factored into pricing considerations. Rather than pure volume-driven sales targets, this would factor in return metrics that are corrected for the risk that is inherent in each product. This could be addressed through a metric such as RAROC (Risk-Adjusted Return on Capital), in which the required return on investments is corrected for the capital consumption of a product and/or customer [4, 5].

The first published attempt at determining the optimal pricing strategy in a competitive market was made by Taylor in 1986 [6]. The utility of wealth (discounted expected profit or capital) is maximized over a finite time horizon, using the premium rate as a control variable. A demand function is included, which depends on the price difference with the average market premium and on a set of all other variables considered to be relevant (labeled θ). Taylor applies deterministic projections, as well as the assumption that θ can be neglected. Emms et al. [7] expanded Taylor's work to a stochastic framework. Pantelous et al. [8] also use a stochastic framework, but allow for the inclusion of θ, which they calibrate with historical data on a per-company basis, without specifying the origins of the variables impacting θ.

A slightly different approach was taken by Dutang et al. [4], who use a game-theoretic approach to model the problem of optimal pricing in a competitive market, and allow for the inclusion of capital constraints. A simulation is run in which several insurance companies set price levels, after which the impact on profitability and capital is calculated. The model then seeks to find the premium rate that optimizes profitability, under the constraint that capital should remain positive. The model includes explicit functions for lapse and loss distributions. Lapse is modeled as a multinomial logit model, which depends on a price-sensitivity function. Loss is modeled through a frequency and severity model.

1.2 Research question

The purpose of this thesis is to improve optimization of the commercial premium margins, by applying advanced statistical learning techniques and data analytics solutions to understand and calibrate the relevant parameters that drive price sensitivity of insurance customers.

The main question that is studied is to what extent customers are driven by price, and to what extent by other factors, such as age, income, or value of the insured vehicle. In addition, the added value of external data sources on the predictive power for purchasing behaviour is studied. For this, two separate data sets are coupled to data available internally to an insurer. One data set is accurate on a post code level, the second is available on a household level. Understanding their impact is important, as it can influence the design of the pricing model, as well as the amount of personal data acquired from a customer in the quotation phase. For the household data to be used in setting the premium, the acquisition process should be changed, since currently the house number is not acquired at that stage.

The data for this study is a data set provided by a Dutch insurance agent (referred to as “Agent A” for reasons of confidentiality), that operates on-line by providing extensive price comparisons based on a limited set of customer-provided input. The data is coupled to external data sources with potentially explanatory variables for customer behavior. In order to derive a model for price sensitivity, several statistical-learning techniques will be evaluated, such as (regularized) generalized linear models, and decision-tree methods such as random forests. See Chapter 2 for a detailed discussion of these techniques, as well as a discussion of methods to assess their predictive power.

The study aims to improve insight into the underlying mechanics of the quote system and the behaviour of customers using the system. In particular, the purchase behaviour with regard to automobile insurance products of four different labels of one Dutch insurer (referred to as “Insurer I” for reasons of confidentiality) is studied.

1.3 Description of the on-line system of Agent A

Agent A operates one of the largest websites in the Netherlands for price comparison of non-life and health insurance policies, which can be used for requesting quotes for a wide range of insurance products. This study, however, is limited to automobile insurance.

After providing a basic set of personal information, which is used to make a risk assessment and includes e.g. age and license plate number, a wide range of quotes from several insurers is displayed. The quotes are presented in a ranked order, which is based on price, service ratings, and an analysis of policy conditions. This order is referred to as “Beste Koop”, or in English, “Best Buy”.

Initially, only the top-3 ranked quotes are shown, and the customer needs to click to the next page to see the other quotes. In case the cheapest quote does not fall within the top 3, it is also shown on the first page as a fourth option. In this study, a quote is classified as top 3 if it is shown on the first page, i.e., the fourth option is included.

For this study, all quotes for automobile insurance in the website of Agent A were provided for a period of three months (September through November 2014). These quotes were compared to information from each policy sold during the same period. If a policy was sold to one of the labels of Insurer I, this information was provided. If a policy was sold to a different insurer, the specific name or label was not disclosed.

Linkage of data to external sources was carried out by Agent A, so that no information was shared through which it would be possible to identify individual customers of Agent A or Insurer I.

1.4 Statistical language and supporting packages

All analysis was carried out using the statistical language R [9]. The following packages (all available from R’s CRAN repository) have been used as part of the model: gridExtra [10], dummies [11], hexbin [12], glmnet [13, 14], arm [15], stargazer [16], randomForest [17], RColorBrewer [18], psych [19], tree [20], pROC [21], MASS [22], stats [9], corrplot [23], bibtex [24], and gbm [25].

1.5 Data description

The study is based on quote and sales data from Agent A, specified for Insurer I, from September through November 2014. The data was modified to synthesize and derive meaningful variables to study the price sensitivity of customers.

1.5.1 Quote data

For the quotes, the following information is provided.

• A unique quote identifier

• The date of the quote

• A list of premiums offered by all insurers in the market. The label that quoted the premium is only identified if it is a label of Insurer I. Labels of other insurers are anonymized.

• For each quoted premium, the position in the “Best Buy” ranking

• For each quoted premium, the position in the “Best Price” ranking

For each label of Insurer I, for each quote the following information has been extracted from this data set (the variable name in the model is given in parentheses).

• The premium quoted by the label (premie.quote)

• The lowest premium in the quote (pmin.pr)

• The premium belonging to the first-ranked quote in the “Best Buy” ranking (pmin.bk)

• The price difference with the lowest premium in the quote (pdiff.pr)

• The price difference with the first-ranked quote in the “Best Buy” ranking (pdiff.bk)

• The price difference with the mean of all quoted premiums (pdiff.pm)


• The position in the “Best Buy” ranking of the premium quoted by the label (pos.bk).

• The position in the “Best Price” ranking of the premium quoted by the label (pos.prijs).

1.5.2 Sales data

For the sold policies, the following information is provided.

• A unique policy identifier

• The status of the policy (only status “Polis” is used, since that indicates that the policy is final)

• Premium at which the policy was sold

• Label that sold the policy. The label is only identified if it is a label of Insurer I. Labels of other insurers are anonymized.

• Information about the customer (such as age, (4-digit) postal code, number of years since obtaining a driver’s license)

• Information about the insured vehicle (such as catalog value, current value, value of accessories, make and model, annual distance driven)

In addition, a list that links the unique quote identifier to the unique policy identifier was also provided, for the policies with a final status.

1.5.3 External data

The data is linked to external marketing data provided by a company called Bisnode, through which additional information on the customers is provided, such as income, education level, and property value. The external data is both available at postcode level (6-digits), and at household level (6-digit post code plus house number). In order to ensure the privacy of its customers, Agent A linked the external data to the quote identifier (for the post-code-level data) and policy identifier (for the household-level data).

The Bisnode data set contained 299 predictive variables. A qualitative reduction was carried out, to avoid duplication of variables (e.g., two different variables were: owns a cat yes/no, and owns a cat extremely little, very little, (...), very often, extremely often; in that case, only the former variable was included), and to avoid excessive amounts of dummy variables for unordered categorical variables, such as city.

Data was provided for four separate labels of Insurer I (denoted as Label A, Label B, Label C, and Label D). As far as possible, the analysis was carried out for each label separately, in order to capture the specificities of each label.

1.5.4 Response variable

The data set contains in total more than 879,000 quotes, and 43,206 final sales. The quote data is likely to contain multiple quotes requested by the same customer, as well as quotes requested by customers that decided not to buy insurance, or that only used Agent A’s website for price comparison and eventually bought the policy through a different channel. To remove such uncertainties from the analysis, the analysis was carried out only on the sold policies, thereby limiting the market to those customers that bought insurance through Agent A between 1 September 2014 and 1 December 2014. The differentiating factor that determines the response variable is whether the policy was bought from one of the labels of Insurer I, or from a different insurer.


Note that for practical reasons, for each label the data set is further reduced to include only those sales for which the label offered a quote. Otherwise, predictive variables such as the price difference between quoted and offered price would be meaningless for part of the data set.

1.5.5 Data dictionary

A description of the data fields and their source (Agent A or external data provider) is provided in Table 1.1. The descriptions are only available in Dutch. This table only shows the input fields of the model. Synthesized variables are specified earlier in this section.

Table 1.1: Description and source of data fields (in Dutch) used in the study.

Field Description Source

1 pos.bk Positie label in beste koop Agent A Quote Data

2 pos.prijs Positie label in beste prijs Agent A Quote Data

3 pos.top3 Positie label in top 3 beste koop Agent A Quote Data

4 premie.quote Quote door label Agent A Quote Data

5 pmin.pr Premie beste prijs Agent A Quote Data

6 pmin.bk Premie beste koop Agent A Quote Data

7 pdiff.pr Prijsverschil label en beste prijs Agent A Quote Data

8 pdiff.bk Prijsverschil label en beste koop Agent A Quote Data

9 pdiff.pm Prijsverschil label en gemiddelde premie (alle quotes met rating) Agent A Quote Data

10 pdiff.p3 Prijsverschil label met top 3 Agent A Quote Data

11 pdrel.pm Relatief prijsverschil met gemiddelde premie Agent A Quote Data

12 pdrel.p3 Relatief prijsverschil met top 3 Agent A Quote Data

13 leeftijd Leeftijd primaire klant Agent A Sales Data

14 premie Premie afgesloten Agent A Sales Data

15 rbw.jaar Aantal jaren rijbewijs Agent A Sales Data

16 verz.jaar Aantal jaren verzekerd Agent A Sales Data

17 schadevrij.jaar Aantal schadevrije jaren Agent A Sales Data

18 auto.cataloguswaarde Cataloguswaarde Agent A Sales Data

19 auto.dagwaarde Dagwaarde Agent A Sales Data

20 auto.gebruik Gebruik Agent A Sales Data

21 auto.gewicht Gewicht Agent A Sales Data

22 auto.jaarkilometrage Jaarkilometrage Agent A Sales Data

23 auto.waarde.acc Waarde accessoires Agent A Sales Data

24 auto.waarde.audio Waarde Audio Agent A Sales Data

25 positie.quote Positie in zoekresultaat Agent A Sales Data

26 hh 4geotyp Geotype External Household

27 hh inkom2 Inkomen External Household

28 hh tweevd Tweeverdieners External Household

29 hh opleid Opleiding External Household

30 hh lvnsfs2 Levensfase External Household

31 hh apershh Aantal personen in huishouden External Household

32 hh hond Bezit hond External Household

33 hh kat Bezit kat External Household

34 hh won typ Type woning External Household

35 hh won eig Eigendom woning External Household


37 hh won lst Woonlasten External Household

38 hh b beleg Financieel: beleggen External Household

39 hh b lenen Financieel: lenen External Household

40 hh b spaar Financieel: sparen External Household

41 hh credcrd Bezit creditcard External Household

42 hh verzkl3 Aantal verzekeringen External Household

43 hh fintype Fintype External Household

44 hh a auto Aantal autos External Household

45 hh tydinet Tijd op internet External Household

46 URB Urbanisatie External Post code

47 INKOMEN Inkomen External Post code

48 SOCKLASSE Sociale klasse External Post code

49 OPLEIDING Opleiding External Post code

50 PRI WONIN3 Primair woningtype External Post code

51 K WON EIG2 Eigendom woning External Post code

52 WOONLAST Woonlasten External Post code

53 WOZ WON WOZ waarde woning External Post code

54 LEVENSFAS2 Levensfase External Post code

55 K GESLACH2 Geslacht External Post code

56 BELEGGEN Beleggers External Post code

57 LENEN Leners External Post code

58 SPAREN Spaarders External Post code

59 OUDEDAG Oudedagsvoorziening External Post code

60 ZKTEKN SW Switchgevoeligheid ziektekosten External Post code

61 MERKTROUW Merkentrouw External Post code

62 BESTKRUID Besteding dagelijkse boodschappen External Post code

63 GEOTYPE4 GeoType External Post code

64 AFS WIN10 Afstand tot klein winkelcentrum External Post code

65 AFS WIN100 Afstand tot groot winkelcentrum External Post code

66 AFS BANKFI Afstand tot bankfiliaal External Post code

67 AFS SUPERM Afstand tot supermarkt External Post code

68 WSM1 ZWANG Zwangerschapsindex External Post code

69 FINTYPE FinType External Post code

70 KL WONING Beoordeling woningkwaliteit External Post code

71 KL VEILIG Beoordeling veiligheid External Post code

72 KL SOCSAM Beoordeling sociale samenhang External Post code

73 KL VERKEER Beoordeling verkeersoverlast External Post code

74 KL BUURT Beoordeling van de buurt External Post code


Chapter 2

Statistical techniques

This chapter describes the statistical techniques used in this study. The problem under study is predicting which customers will buy insurance. In essence, this is a classification issue, i.e., a qualitative outcome is predicted: buy or not buy. There are several different techniques that can be used for this problem, such as logistic regression, linear discriminant analysis, generalized additive models, and decision trees (including bagging, boosting, and random forests). Books on statistical learning by James et al. [26], Hastie et al. [27], and Kuhn and Johnson [28] are recommended reads on these techniques. This chapter is limited to the techniques used in the study: logistic regression, penalized logistic regression, and decision trees.

Classification can be done for a response variable with two levels (e.g., yes or no, buy or don’t buy), or multiple levels (e.g. blue, brown, or green eyes). For reasons of consistency with the subject of this thesis, the remainder of this chapter is limited to classification of two-level response variables.

This chapter is organized as follows. In order to properly discuss the different techniques used in this study, the techniques and metrics used to assess the performance of models are discussed first, in Section 2.1 and Section 2.2. Following that, the regression-based and tree-based techniques are discussed in Section 2.3, Section 2.4, and Section 2.5.

2.1 Model selection techniques

2.1.1 Hold-out sample

When assessing the performance of any given model, it is important to not only consider the performance on the data set that was used in calibrating the model, but rather assess its ability to provide meaningful predictions on different data sets. One way to do this is to randomly split the data set in two: one training set that is used to calibrate the model, and one test set that is used to validate its performance. When quantitatively assessing the model performance, it is usually the error measured on the test set, the test error, rather than the training error, that is the most meaningful parameter, and typically, it is higher.
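As an illustration, a minimal R sketch of such a split is given below; the data frame dat and its binary response column gekocht are hypothetical placeholder names, not names from the thesis data set.

```r
# Hold-out validation sketch: random 70/30 split into a training and a test set.
# `dat` and `gekocht` are hypothetical names for the data and the 0/1 response.
set.seed(1)
n <- nrow(dat)
train.idx <- sample(n, size = round(0.7 * n))
train <- dat[train.idx, ]
test  <- dat[-train.idx, ]

fit <- glm(gekocht ~ ., data = train, family = binomial)   # calibrate on the training set
p.hat <- predict(fit, newdata = test, type = "response")   # predict on the test set
test.error <- mean((p.hat > 0.5) != (test$gekocht == 1))   # misclassification rate
```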

2.1.2 k-fold Cross Validation

Taking this one step further, the process of separating the test and training set can easily be repeated multiple times, in order to improve the accuracy of the test error.

In k-fold cross validation (a typical choice for k is k = 10), the data set is randomly divided into k sections. In turn, the model is calibrated using k − 1 sections of the data set, and validated using the remaining section. By repeating this process k times, each section takes a turn as the validation set.

The test error can now be determined as the average test error over the k sections, and additionally, a measure of uncertainty for the test error is acquired as the standard error of the k test-error computations.

The disadvantage of this technique obviously lies in computing effort, since the process of calibrating the model needs to be repeated multiple times. See Figure 3.15 for an example of the application of 10-fold cross validation.
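A compact sketch of 10-fold cross validation for a logistic model is shown below; as before, dat and the 0/1 response gekocht are hypothetical names.

```r
# 10-fold cross validation sketch: average test error and its standard error.
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))   # random fold assignment
cv.err <- numeric(k)
for (i in 1:k) {
  fit <- glm(gekocht ~ ., data = dat[folds != i, ], family = binomial)
  p.hat <- predict(fit, newdata = dat[folds == i, ], type = "response")
  cv.err[i] <- mean((p.hat > 0.5) != (dat$gekocht[folds == i] == 1))
}
mean(cv.err)           # cross-validated test error
sd(cv.err) / sqrt(k)   # standard error over the k folds
```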

2.1.3 Bootstrap

In bootstrap sampling [29], data subsets are generated by sampling from the original data set with replacement. This means that individual elements from the original data set can be duplicated in the subsets used for calibrating. The subsets used for calibrating have the same size as the original data set, and validation is done only using samples that were left out of the training set (so-called out-of-bag samples). This technique is particularly powerful as it can be used to obtain multiple random subsets of an original set where only limited data is available, for example, historical equity performance. The disadvantage is that it could lead to biased predictions, since the calibration would be blind to events that are not in the original data set, for example, a market crash with a severity that has not been observed before.
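A single bootstrap sample and its out-of-bag observations can be generated as in the sketch below (hypothetical data frame dat).

```r
# Bootstrap resampling sketch: calibration sample drawn with replacement,
# validation on the out-of-bag observations.
set.seed(1)
idx <- sample(nrow(dat), replace = TRUE)    # resampled row indices, same size as dat
oob <- setdiff(seq_len(nrow(dat)), idx)     # rows never drawn: out-of-bag sample
boot.sample <- dat[idx, ]                   # used for calibration
oob.sample  <- dat[oob, ]                   # used for validation
```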

2.2 Model-performance measures in classification

2.2.1 Classification error

A classification model ultimately aims to predict a qualitative response variable, for example "yes" or "no". The most straightforward way to measure the performance of such a model is simply to count the number of misclassifications on a test set, i.e., how often the model predicted "yes" on the test set while the true value is "no", and vice versa.

The output of a classification model is typically a probability p that, given the values of the predictors for a given data point in a set, the resulting classification is ”yes”, rather than ”no”. The modeler could then classify each resulting prediction with a probability p > 50% as yes, and then compare the predictions of ”yes” and ”no” to the true values in the test set. This will result in a number of correct predictions, as well as a number of false positives and false negatives. These results are usually presented in the form of a confusion table, an example of which is shown in Table 2.1.

Table 2.1: Example of a confusion table, showing the occurrence of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) classifications

                 Predicted class
True class       Yes      No
Yes              TP       FN
No               FP       TN

The error rate is defined as the ratio of the false predictions (off-diagonal in the table) to the total number of predictions. The true-positive rate of the model is called the sensitivity, while the true-negative rate is the specificity.

Depending on the context of the model, the modeler can also decide to set the classification threshold at a different value than 50%, e.g., 25% or 75%. This will affect the true-positive and false-positive rate, as well as the overall error rate. Especially when the data set is very imbalanced, i.e., the occurrence of ”yes” strongly outweighs the ”no” or vice versa, the overall meaningfulness of the model may improve at a different threshold, even though the overall error rate could be higher.
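The sketch below illustrates how such a confusion table and the derived rates can be computed for a chosen threshold; the vectors p.hat (predicted probabilities) and y (observed 0/1 outcomes) are hypothetical.

```r
# Confusion table and derived rates at a chosen classification threshold.
threshold <- 0.5
pred <- ifelse(p.hat > threshold, "Yes", "No")
obs  <- ifelse(y == 1, "Yes", "No")
conf <- table(True = obs, Predicted = pred)

TP <- conf["Yes", "Yes"]; FN <- conf["Yes", "No"]
FP <- conf["No", "Yes"];  TN <- conf["No", "No"]
error.rate  <- (FP + FN) / sum(conf)   # off-diagonal counts over the total
sensitivity <- TP / (TP + FN)          # true-positive rate
specificity <- TN / (TN + FP)          # true-negative rate
```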


2.2.2 Area Under the Curve

Model performance independent of the choice of classification threshold can be assessed through a Receiver Operating Characteristic curve, or ROC curve. An ROC curve plots the sensitivity of a model against the specificity, for all possible choices of thresholds (see Fig. 3.21 for an example). If the threshold is set at p = 0, all model output will be classified as "yes", leading to a true positive rate of 1, and a true negative rate of 0. At p = 1, all model output is classified as "no", so the true positive rate is 0, and the true negative rate is 1. Obviously, both results are meaningless, and the optimal threshold will be somewhere in between. A perfectly predicting model would have a specificity of 1 and a sensitivity of 1. Walking through the curve, the optimum is found where the curve gets closest to the "perfect" outcome. The Area Under the Curve (AUC) can be used to characterize the performance of the model in a single number [30, 31]. A perfect model would have an AUC of 1, while a random predictor has an AUC of 0.5.
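Since the pROC package is among the packages listed in Section 1.4, a minimal sketch of an ROC/AUC computation could look as follows (the vectors y and p.hat are hypothetical).

```r
# ROC curve and AUC with the pROC package.
library(pROC)
roc.obj <- roc(response = y, predictor = p.hat)
plot(roc.obj)             # sensitivity against specificity over all thresholds
auc(roc.obj)              # area under the curve
coords(roc.obj, "best")   # threshold closest to the perfect (1, 1) corner
```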

2.3 Logistic regression

The problem of regressing a binary classification variable is to convert a binary response variable (0 or 1) to a predicted probability in the range [0, 1]. A commonly applied model is the logistic function.

p(X) = \frac{e^{\beta X}}{1 + e^{\beta X}} \qquad (2.1)

where β is the vector of fit parameters, which has the same dimension as X, the vector of model variables. With some manipulation, it can be shown that the logit

\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta X \qquad (2.2)

is linear with respect to X.

Logistic regression is a widely used classification technique, and is relatively straightforward to implement in R, using a Generalized Linear Model [26]. However, the usefulness of the technique is limited if the number of variables is large. For a large number of variables, the modeler has to remove model variables that do not contribute significantly to the prediction, as well as deal with collinearity, or correlation between variables.
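A minimal sketch of such a fit is shown below, using a few predictor names from the data dictionary in Table 1.1; the data frames train and test and the binary response gekocht are hypothetical names.

```r
# Logistic regression as a Generalized Linear Model with a logit link.
fit <- glm(gekocht ~ pdrel.pm + leeftijd + auto.cataloguswaarde,
           data = train, family = binomial(link = "logit"))
summary(fit)                                              # coefficients on the logit scale
p.hat <- predict(fit, newdata = test, type = "response")  # predicted purchase probabilities
```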

One way of eliminating variables that do not contribute significantly is to use forward or backward stepwise selection. In backward stepwise selection, the modeler starts with the full model, and one by one eliminates the least significant variable (i.e., the one with the highest p-value) from the model. The procedure is repeated until all variables except the intercept are eliminated, and the best model, based on an established model selection criterion (e.g., AUC, or cross-validation test error), is selected. Forward stepwise selection follows the procedure in the other direction, starting from the NULL model (i.e., a model with only the intercept and no variables) and adding variables in a stepwise manner. This means that first all variables are tested in a univariate approach, the variable with the lowest p-value is selected, and the process starts over with the remaining variables to select the second one, until the model is complete. Obviously, both forward and backward stepwise selection are very laborious and time-consuming procedures.
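An automated variant of backward selection is available through stepAIC in the MASS package (listed in Section 1.4); note that it drops terms based on AIC rather than on individual p-values. The full model fit.full below is a hypothetical starting point.

```r
# Backward stepwise selection by AIC, starting from the full model.
library(MASS)
fit.full <- glm(gekocht ~ ., data = train, family = binomial)
fit.back <- stepAIC(fit.full, direction = "backward", trace = FALSE)
summary(fit.back)   # the model retained at the minimum AIC
```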

2.4 Regularized regression

Regularized regression techniques greatly alleviate the process of optimizing the fit result for a (logistic) regression problem. The two most commonly known methods for penalized regression are ridge regression [32] and lasso [33]. Both methods rely on the same principle, which is to add a regularization term to the error function of the fit optimization specification. The regularization term scales with the sum of the fit parameters, multiplied by a tuning parameter. Therefore, by increasing the tuning parameter, the fitting routine optimizes towards a reduced version of the full model.

The difference between ridge regression and lasso is the scaling of the sum of fit parameters β with the tuning parameter λ. Ridge regression applies a quadratic sum in the regularization term (the L2-norm): \lambda \sum_i \beta_i^2, while lasso applies the sum of the absolute values of the fit parameters (the L1-norm): \lambda \sum_i |\beta_i|.

This means, as will be shown in more detail in the subsections below, that the regularization procedure for ridge regression is linear, while the absolute-value operation in the lasso term introduces a nonlinearity. As a result, when the tuning parameter λ is increased in ridge regression, the fit parameters are shrunk, but never exactly to 0, and the total number of variables remains the same. The nonlinearity in lasso causes fit parameters to be reduced to 0, thereby eliminating variables from the fit result as the tuning parameter is increased.

In terms of interpretation of results, lasso obviously outperforms ridge regression, since the latter keeps non-contributing variables as part of the fit result. An important disadvantage of lasso, however, is how it deals with correlated variables. Correlated variables would be expected to be regularized towards equal fit parameters, and thus be grouped. Ridge regression tends to handle correlated variables in that way. Lasso, however, tends to randomly favor one variable in the group over the others, which may be an undesirable result.

A third regularization technique, the elastic net [34], aims to solve the issue of correlated variables in lasso. The regularization term in the elastic net is a combination of the ridge and lasso terms, and thus introduces an additional tuning parameter α. The regularization term is described as \lambda\left(\alpha \sum_i |\beta_i| + (1 - \alpha) \sum_i \beta_i^2\right).

Mathematically, there are two equivalent ways of describing the ridge and lasso optimization functions. Using the regularization terms as described earlier in this section, the optimization function for the fit parameters β under ridge regression becomes

\hat{\beta}_{ridge} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} \qquad (2.3)

while for lasso, the optimization function is specified as

\hat{\beta}_{lasso} = \arg\min_{\beta} \left\{ \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \qquad (2.4)

Equivalently, the optimization functions can be written as

\hat{\beta}_{ridge/lasso} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \qquad (2.5)

for ridge subject to the constraint

\sum_{j=1}^{p} \beta_j^2 \leq \lambda \qquad (2.6)

and for lasso subject to the constraint

\sum_{j=1}^{p} |\beta_j| \leq \lambda \qquad (2.7)

The shape of the constraint functions can be used to explain why the lasso cuts variables off at 0 [27]. In the case of two variables, β1 and β2 (the argument applies equally to multiple dimensions), the constraint function for ridge, Eq. (2.6), is a circle, while the constraint function for lasso, Eq. (2.7), is a diamond, with sharp edges at the axes β1 = 0 and β2 = 0. A solution is found when the residual sum of squares coincides with the constraint function. Due to the sharp edges for the lasso, this can occur at an edge on one of the axes, where one of the variables is equal to 0. For ridge this is not possible.

The constraint function for the elastic net can be visualized as a diamond with sharp edges, where the lines connecting the edges are curved outwards. As a result, the optimization behaves as a mixture of ridge and lasso, shrinking correlated variables towards equal values, but maintaining the variable selection feature of the lasso [27].
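A sketch of how the three penalties can be fitted with the glmnet package (listed in Section 1.4) is given below; the training data frame train and response gekocht are hypothetical, and λ is chosen by cross validation.

```r
# Ridge, lasso, and elastic net fits with cross-validated lambda.
library(glmnet)
x <- model.matrix(gekocht ~ . - 1, data = train)   # numeric matrix of predictors
y <- train$gekocht

cv.ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)     # L2 penalty
cv.lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)     # L1 penalty
cv.enet  <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)   # mixture of both

coef(cv.lasso, s = "lambda.min")   # coefficients at the cross-validated lambda
```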

2.5 Decision Trees

Whereas statistical learning through regression aims at fitting a model to a data set and subsequently deriving conclusions, a decision tree infers conclusions by dividing the data into subsets. A very simple example would be to estimate salary in a data set of employees based on education: if education = university, then salary > 30000. This could be expanded by using more variables (e.g., education level and work experience), and estimating more levels of salary. A regression tree results in an estimated quantitative value for the target variable (although, by definition of the method, the resulting estimate will always be discrete). A classification tree is used for prediction of a qualitative outcome (e.g., yes or no, or red, blue, yellow or green).

2.5.1 Decision tree basics

The idea behind a regression tree is to divide the feature space defined by the variables into distinct regions. The estimate of the response variable is the mean response in each region. The objective of the optimization is then to choose the regions such that the residual sum of squares (RSS) is minimized. This is done in a stepwise process called recursive binary splitting [26]. The process is recursive, and each step splits the feature space into two, until the RSS cannot be decreased further. This means that the process cycles through each predictor in the data set, chooses for each a cut point for which the RSS is minimized, and then splits the tree according to the predictor and cut point for which the RSS is minimal. This leads to a tree with one node and two leaves (end points), and the process is repeated for each leaf, until no further reduction of the RSS can be achieved.

For classification trees, it is not possible to use the RSS as a model criterion, since the response variable is discrete. Alternatives are the classification error rate (i.e., the fraction of observations that do not belong to the predicted class), or the Gini index, which is a measure of the purity of the node,

G = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}) \qquad (2.8)

Here, k designates the class of the response variable (e.g., k = 1 for yes, k = 2 for no), while m designates the node in the tree. \hat{p}_{mk} is the fraction of observations in the mth node that are of class k.

For example, if we split a binary classification tree (the response has two classes, k = 1 and k = 2) into two nodes (m = 1 and m = 2) in such a way that the resulting classification is perfect, i.e., in node m = 1 all observations are k = 1 and in node m = 2 all observations are k = 2, the resulting Gini index is G = 0 for each of the two nodes. However, if the classification were random, i.e., half of the observations in both m = 1 and m = 2 are k = 1 and half are k = 2, the Gini index for each node is G = 0.5.


The method described above is likely to lead to overfitting, and to an overly complex tree. Several techniques exist to tune the tree-fitting procedure so that the test error, rather than the training error, is optimized. One such method is pruning, which is the reduction of the fully grown tree to a subtree with minimal test error (e.g., estimated through cross validation). Other techniques (bagging, random forests, and boosting) rely on aggregation of many randomly grown trees, e.g., through bootstrapping.
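A sketch of a single classification tree with cost-complexity pruning, using the tree package listed in Section 1.4, could look as follows (train and the response gekocht are hypothetical).

```r
# Classification tree grown on the training set, then pruned by cross validation.
library(tree)
fit <- tree(factor(gekocht) ~ ., data = train)
cv  <- cv.tree(fit, FUN = prune.misclass)         # CV error for each subtree size
best.size <- cv$size[which.min(cv$dev)]
pruned <- prune.misclass(fit, best = best.size)   # subtree with the lowest CV error
plot(pruned); text(pruned)
```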

2.5.2 Bagging

A familiar issue with the technique described in Section 2.5.1 is overfitting. The tree performs well on the training set that was used in the tree construction procedure, but performance on a test set is much poorer. This variance in performance can be reduced by growing trees on many different training sets, and using the average result of all those trees as the resulting model.

A way to generate many different training sets is through bootstrapping: repeated samples from the original training set. In other words, from our original data set of n observations of k predictors, we select a total of B subsets, each containing resampled observations from the original data set. It can be shown that averaging the outcomes of the resulting B decision trees can dramatically improve the test error of the model. This technique is called bootstrap aggregation, or in short, bagging. The value of B is not critical; if it is too high, it will not lead to overfitting, but it needs to be large enough to allow for sufficient variance between the sampled trees.

2.5.3 Random Forests

Random forests give an improvement over bagging by reducing the correlation between the trees in the bootstrap sample. The bagging procedure can lead to the trees in the sample being very similar, especially if one or more variables dominate the others. This can be prevented by limiting the predictors that can be selected in each node of the split to a subset of the full predictor space. In other words, in each split of the tree, the procedure can only select from a subset m of all k variables. Typically, m is chosen as m = √k.
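The randomForest package (listed in Section 1.4) covers both approaches: setting mtry to the full number of predictors corresponds to bagging, while a smaller mtry gives a random forest. The sketch below uses the hypothetical training frame train with response gekocht.

```r
# Bagging versus random forest via the mtry argument.
library(randomForest)
p <- ncol(train) - 1                                     # number of predictors
bag <- randomForest(factor(gekocht) ~ ., data = train,
                    mtry = p, ntree = 500)               # bagging: all predictors per split
rf  <- randomForest(factor(gekocht) ~ ., data = train,
                    mtry = floor(sqrt(p)), ntree = 500)  # random forest: sqrt(p) per split
rf                                                       # out-of-bag error summary
```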

2.5.4 Boosting

Boosting applies a different approach to the classification problem, and does not rely on bootstrapping as bagging and random forests do. Boosting applies sequential steps of complexity to the observations, by adjusting the data in order to optimize the prediction [27].

Initially, a weak classifier is fitted to the data, e.g. by limiting the depth of a decision tree to only one node. Subsequently, the data set is modified, by applying a weight to each observation at each subsequent step in the fitting process. The weighting is increased if the observation is misclassified, and decreased if the observation is classified correctly. By applying this process recursively, a fitting routine is created in which each subsequent classifier gives more weight to the observations that are hardest to classify. The resulting model is the sum of all classifiers.
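The gbm package listed in Section 1.4 implements boosting in a gradient-boosting formulation, which generalizes the reweighting scheme described above; a sketch with shallow trees as weak classifiers is given below (train, test, and the 0/1 response gekocht are hypothetical).

```r
# Boosting with shallow trees (stumps) as weak learners.
library(gbm)
boost <- gbm(gekocht ~ ., data = train, distribution = "bernoulli",
             n.trees = 2000, interaction.depth = 1,   # depth-1 trees as weak classifiers
             shrinkage = 0.01, cv.folds = 5)
best.iter <- gbm.perf(boost, method = "cv")           # CV-optimal number of trees
p.hat <- predict(boost, newdata = test, n.trees = best.iter, type = "response")
```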

2.5.5 Variable importance

While a decision tree is relatively easy to interpret, it is not possible to represent the outcome of bagging, boosting, or random forests as a single tree. As a result, all three techniques can behave as a black box, from which it is difficult to extract which of the original predictors contribute most to the model.

However, variable importance can be extracted by calculating for each predictor how much the Gini index in Eq. (2.8) (or the RSS in a regression setting) is reduced on average from splits over that predictor.
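For the random-forest sketch above, for example, such importance measures can be extracted directly (assuming the hypothetical fit rf from the earlier sketch).

```r
# Variable importance as the mean decrease in the Gini index per predictor.
library(randomForest)
importance(rf)    # table with a MeanDecreaseGini column
varImpPlot(rf)    # graphical ranking of the predictors
```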


Chapter 3

Results

This chapter provides an overview of the model results. In order to answer the research questions posed in Section 1.2, firstly a qualitative data exploration has been carried out (Section 3.1), including histogram and correlation plot analysis. The second step was to perform several logistic-regression (Section 3.2) and decision-tree techniques (Section 3.3), in order to determine which models perform best in predicting the buying behavior of customers, and to determine which predictive variables are most relevant. The performance of all models is compared in Section 3.4. The final step is an analysis of price-elasticity curves in Section 3.5, intended to both visualize and quantify the price sensitivity of customers.

Section 1.3 provides a description of the quote and sales system that was studied, and Section 1.5 provides a description of the data set, variables used, and data modifications.

3.1 Data exploration

3.1.1 Histograms

To obtain a better understanding of the data variables as well as their coherence, several multidimensional histograms were analyzed. These were mostly based on the internal variables (i.e. variables from data sources provided by Agent A), since these proved to be the most relevant variables for the final results.

Correlation between price and ranking

It could be expected that the Best Buy ranking has a strong correlation with the price that is offered, since price is an important ingredient for determining the ranking. This correlation is indeed evident from the slope in the plots in Fig. 3.1.

Interestingly, the cloud of points for Label D seems to be slightly shifted upwards and to the left, in comparison to Labels A and C. This could indicate that corrected for price differences, Label D is consistently rated higher by Agent A than the other labels, i.e., Label D makes the top 3 with a premium that is on average 29.8% cheaper (standard error 0.2%) than the market average, while Label A and C have to offer a premium that is 35% cheaper (standard error 0.2%) than the market to obtain the same ranking. Label B does not appear to be very competitive, as it hardly appears in the top 3 ranking results.

Age dependence of ranking and price difference

Fig. 3.2 shows the correlation of the price difference and age. This shows a different age dependence of price for each of the labels, which could be a result of different pricing strategies. Label A strongly focuses on younger age groups (below 25). Labels B and C appear to target the ages of 25 to 30. The age dependence of the quote prices of Label D is less clear, though a slightly decreasing price trend can be observed for older ages.

Figure 3.1: Multidimensional histograms showing the coincidence of the relative price difference with the mean for all ranked quotes and the position in the best buy ranking for all four labels.

Figure 3.2: Multidimensional histograms showing the coincidence of age and the relative price difference with the mean for all ranked quotes for all four labels.

Figure 3.3: Multidimensional histograms showing the coincidence of age and the position in the best buy ranking for all four labels.

The impact of these pricing strategies on the Best Buy ranking can be seen in Fig. 3.3. Label A consistently reaches top-3 rankings for the youngest customers, while lower rankings for Label C mostly occur in the age range of 25 to 30. Label D shows a broad distribution of rankings for younger ages, with a relatively higher representation of lower rankings for ages above 40 than the other labels.

Car-value dependence of ranking and price difference

Fig. 3.4 shows the correlation of the price difference and car catalog value. From the graph, no clear trend can be observed, nor does a trend appear in the histogram of Best Buy ranking and car catalog value in Fig. 3.5.

3.1.2 Correlation plots

To understand the correlation within the data set, correlation matrices were analyzed for several subsets of the data, and plotted visually for each label in Figs. 3.6, 3.7, 3.8, and 3.9.

Before calculating the correlation matrix, the data was standardized by scaling to the mean and standard deviation of the observed distribution.

The correlation plots were determined with the R package corrplot. The variables are grouped by hierarchical clustering [26, 27], leading to a representation in which strongly correlated parameters are grouped together. The ordering of variables is determined with hierarchical clustering for Label A, in order to maintain a consistent ordering for all labels.

The correlation largely behaves as expected. The price and ranking variables are strongly correlated, since the ranking strongly depends on price. The top-3 indicator (pos.top3) is 1 if a quote is in the top 3, and 0 otherwise, so a negative correlation with the price and ranking variables is expected.

Figure 3.4: Multidimensional histograms showing the coincidence of car catalog value and the relative price difference with the mean for all ranked quotes for all four labels.

Figure 3.5: Multidimensional histograms showing the coincidence of car catalog value and the position in the best buy ranking for all four labels.

Figure 3.6: Visual representation of correlation of internal variables for Label A.

Age (leeftijd) and years of owning a driver's license (rbw.jaar) are strongly correlated, which is expected, since most people obtain a driver's license at around age 18. Years of having insurance (verz.jaar) and years of driving without accidents (schadevrij.jaar) are correlated with age to a lesser extent.

For the variables relating to the insured vehicle, the original car value, or catalogue value (auto.cataloguswaarde), is strongly correlated with the weight of the car (auto.gewicht), and mildly correlated with the current value of the car (auto.dagwaarde), since the latter also depends on the age and mileage of the car, both of which are unknown. The value of accessories (auto.waarde.acc) and the value of audio (auto.waarde.audio) are strongly correlated with the current car value. The annually driven mileage (auto.jaarkilometrage) is not correlated with any other variable. This variable is entirely based on customer input, and might not be very accurate.

More interesting are the correlations of the age-related variables and the car-related variables with the price-related variables. For Label A (Fig. 3.6), there is a slightly positive correlation between age and price difference (0.3 for age and price difference with the mean of all quotes, pdrel.pm), nearly no correlation between car catalog value and price difference (-0.07), and a slightly positive correlation between car current value and price difference (0.12). This picture is similar for Labels B (Fig. 3.7) and C (Fig. 3.8), except that for Label B a higher correlation is observed for age and price difference (0.6) and for Label C a higher correlation is observed for car catalog value and price difference (0.3). For Label D (Fig. 3.9), however, some correlations are inverted, and slightly negative for age (-0.13) and car current value (-0.12). This could indicate different pricing strategies, which were also observed in the histogram analysis in Section 3.1.1.

3.2 Logistic regression

One important goal of this study is to predict the buying behavior of customers, given both their personal characteristics and the quoted premium. In practice, a pricing analyst would be interested to understand the change in the conversion rate for specific customer segments as a result of a price change.

Figure 3.7: Visual representation of correlation of internal variables for Label B.

Figure 3.8: Visual representation of correlation of internal variables for Label C.

Figure 3.9: Visual representation of correlation of internal variables for Label D.

One other goal is interpretation of the buying behavior of customers: to what extent are they purely driven by price, and to what extent do factors such as brand loyalty play a role? Obviously, those two objectives may require different modeling approaches. The model that is best at predicting may well be a black box when it comes to interpretation of the results.

The task at hand is classification, in which a binary outcome should be predicted: buying a policy at one of the labels of Insurer I, or buying a different policy. Several different models are compared, and for each model an optimal set of variables is selected. This section will focus on methods based on logistic regression, while the next section focuses on methods based on decision trees.

Following the qualitative reduction of variables performed on the data set (see Section 1.5), 60 predictive variables remain, 27 of which are (unordered) categorical variables of various levels. In order to use the categorical variables, these should be expanded to dummy indicators for each level of the variable, which expands the total number of predictive variables to 204. In order to make the best possible prediction and the best possible interpretation, a subset of these 204 variables should be found that is reduced to contain only the variables that have a relevant impact on the response.
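The expansion itself can be done with the dummies package listed in Section 1.4, or with base R's model.matrix, as in the sketch below (hypothetical data frame dat).

```r
# Expand unordered categorical variables into dummy indicator columns.
x <- model.matrix(~ . - 1, data = dat)   # factors become indicator columns
dim(x)                                   # number of predictors after expansion
```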

3.2.1 Logistic regression using all available variables

A first step for predictive modeling of buying behavior is to perform logistic regression without any parameter tuning or selection. All variables are used, without any nonlinear or interaction terms. Numerical variables are scaled to a mean of 0 and standard deviation of 1.

To get a first idea of the impact of each predictive variable, Tables 3.1, 3.2, 3.3, and 3.4 show the parameters with p-value below 0.05. From each table, it is obvious that the price parameter (pdrel.pm, the relative price difference with the mean of all quotes) dominates the others, both in magnitude and significance of the coefficient.

For each label, several external variables are returned in the tables. However, their significance is typically very small, and the p-value is rarely below 0.01.

Obviously, regressing on all parameters and looking at those with p-value below 0.05 is not a good way to tune the model, though it does give a sense of direction. In practice, statistically more sound methods for calibration are needed (see Section 2.1, and [27]).

Table 3.1: Parameters with p-value below 5%, resulting from logistic regression with all variables on the probability to buy a policy, for Label A.

                        Estimate   Std. Error   z value   Pr(>|z|)
pdrel.pm                  -2.346       0.059     -39.579    0
verz.jaar                  0.415       0.073       5.713    0
schadevrij.jaar           -0.568       0.078      -7.287    0
auto.cataloguswaarde       0.498       0.062       7.976    0
auto.jaarkilometrage      -0.151       0.038      -4.007    0.0001
hh inkom2                  0.165       0.078       2.111    0.035
hh lvnsfs27                1.710       0.841       2.033    0.042
hh lvnsfs28                1.772       0.836       2.120    0.034
hh lvnsfs29                2.076       0.841       2.468    0.014
hh apershh2               -0.919       0.406      -2.261    0.024
hh tydinet                 0.134       0.058       2.286    0.022
kl verkeer                -0.128       0.054      -2.375    0.018

Table 3.2: Parameters with p-value below 5%, resulting from logistic regression with all variables on the probability to buy a policy, for Label B.

                  Estimate   Std. Error   z value   Pr(>|z|)
pdrel.pm           -4.942       0.459     -10.770    0
auto.dagwaarde      0.389       0.180       2.157    0.031
auto.waarde.acc     0.351       0.150       2.338    0.019
hh fintype6         3.810       1.353       2.816    0.005
levensfas26        -4.125       1.883      -2.191    0.028
levensfas28        -5.270       2.459      -2.143    0.032
k geslach22         3.572       1.651       2.164    0.030
afs win100         -0.629       0.209      -3.006    0.003
bs afs             -0.407       0.183      -2.219    0.026

Another way of assessing model performance (in a qualitative way) is by using a lift curve. For a lift curve, the model results are compared to a test set, which was not used in calibration of the model. First, a response is predicted for each customer in the test set. The customers are then ranked on probability, and divided into 10 deciles. For each decile, the actual conversion rate is calculated from the observations in the test set.
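A sketch of this decile computation is given below; the predicted probabilities p.hat and the test-set response test$gekocht are hypothetical names.

```r
# Lift computation: observed conversion rate per decile of the predicted score.
decile <- cut(rank(-p.hat, ties.method = "first"),
              breaks = 10, labels = FALSE)     # decile 1 = highest predicted score
lift <- tapply(test$gekocht, decile, mean)     # observed conversion rate per decile
lift / mean(test$gekocht)                      # relative lift versus the global mean
```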

The lift curves for Labels A-D are shown in Fig. 3.10 for the actual probability per decile, and in Fig. 3.11 for the relative difference with the global mean. For Labels A, C, and D, the conversion rate averaged over all customers varies between 4.5% and 8%. For Label B, the conversion rate is much lower, around 0.7%. This value represents the market share for each label for policies closed by Agent A.

For each label, a clear lift effect is observed, meaning that the higher estimated conversion rate for the higher-ranked customers is consistent with observations in the test set. For a random prediction, each decile would have a result equal to the global mean conversion rate.


Table 3.3: Parameters with p-value below 5%, resulting from logistic regression with all variables on the probability to buy a policy, for Label C.

                     Estimate   Std. Error   z value   Pr(>|z|)
pdrel.pm              -3.991       0.130     -30.743    0
leeftijd              -0.437       0.166      -2.623    0.009
auto.waarde.audio      0.210       0.071       2.980    0.003
hh won eig1            0.649       0.313       2.072    0.038
merktrouw              0.151       0.073       2.065    0.039
geotype41             -1.682       0.513      -3.281    0.001
geotype44             -1.041       0.339      -3.071    0.002
fintype2              -0.560       0.224      -2.504    0.012

Table 3.4: Parameters with p-value below 5%, resulting from logistic regression with all variables on the probability to buy a policy, for Label D.

Estimate Std. Error z value Pr(>|z|)

pdrel.pm              -2.364  0.084  -28.242  0
schadevrij.jaar        0.227  0.056    4.028  0.0001
auto.cataloguswaarde   0.243  0.078    3.131  0.002
auto.gewicht          -0.160  0.075   -2.132  0.033
auto.waarde.acc        0.184  0.058    3.190  0.001
auto.waarde.audio      0.154  0.068    2.269  0.023
hh lvnsfs22            0.610  0.301    2.028  0.043
hh hond1               0.262  0.118    2.218  0.027
hh won typ6           -1.132  0.565   -2.004  0.045
hh won typ14          -2.072  1.024   -2.024  0.043
hh tydinet            -0.149  0.066   -2.247  0.025
inkomen1              -0.986  0.432   -2.282  0.022
inkomen3              -0.890  0.425   -2.095  0.036
afs bankfi             0.104  0.052    1.998  0.046



Figure 3.10: Lift curve of actual conversion rates in the test set plotted against decile predictions from the model that was fitted to the training set. The horizontal line represents the global mean conversion rate.

The relative performance (Fig. 3.11) is similar for each label, including Label B, despite its lower overall conversion. Note, however, that the lift effect strongly depends on the highest-ranked decile, most likely due to the dominance of the price variable.

3.2.2 Regularized regression

As mentioned in Section 3.2.1, a statistically more sound method of variable selection is needed. One such method is backward stepwise selection, in which a series of regressions is run, starting from the full model and iteratively removing the parameter with the highest p-value. The model that attains the minimum of a fit criterion (e.g., deviance or AIC [27]) is ultimately selected. Naturally, this process is slow, as it requires a large number of iterations.
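In R, such a procedure can be sketched with the built-in step() function, reusing the hypothetical full-model object fit from the sketch in Section 3.2.1; note that step() drops terms based on the AIC rather than on the largest p-value, but the idea is the same.

# Backward stepwise selection starting from the full logistic regression model.
# At each step, the variable whose removal improves the AIC the most is dropped,
# until no removal leads to a further improvement.
reduced <- step(fit, direction = "backward", trace = FALSE)
summary(reduced)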

Another method for variable selection is to use regularized regression, in which the loss function is augmented with an additional term that increases with the size of the parameters. In this way, the model optimizes by balancing the residual error against the size of the parameters. See Section 2.4 for a more elaborate explanation. There are three flavors of regularized regression: Ridge, which applies an L2-norm penalty; Lasso, which applies an L1-norm penalty; and Elastic Net, which applies a linear combination of the L1-norm and L2-norm penalties. A tuning parameter λ controls the amount of penalization, and a second tuning parameter α controls the mixture of the L1-norm and L2-norm penalty terms in the Elastic Net regression (with α = 0 leading to Ridge regression, and α = 1 leading to Lasso regression).

For a least-squares regression, the penalized loss function takes the following form.

L(\beta) = \lVert y - X\beta \rVert^2 + \lambda \left( \alpha \lVert \beta \rVert_1 + (1-\alpha) \lVert \beta \rVert_2^2 \right) \qquad (3.1)



Figure 3.11: Lift curve of relative conversion rate compared to the global mean in the test set plotted against decile predictions from the model that was fitted to the training set.

In contrast to Ridge regression, the Lasso performs actual variable selection, in which more variables are reduced to exactly zero as λ increases. Ridge regression shrinks variables, but all variables remain part of the model (see also Section 2.4). The disadvantage of this is that the model result is more difficult to interpret than the outcome of a Lasso regression, which contains fewer parameters. The disadvantage of the Lasso regression is that it deals less well with correlated variables, typically selecting one of a group of correlated variables at random and removing the others, where Ridge regression would shrink the whole group towards similar coefficients. Elastic Net aims to combine the best of both worlds as a combination of Ridge and Lasso regression.

The optimal value of λ is selected through 10-fold cross-validation (see also Section 2.1.2), an out-of-sample re-sampling method in which the optimal model is selected by minimizing the binomial deviance on a hold-out sample. By repeating this for 10 different hold-out samples, a confidence interval around the deviance can be established. The simplest model (i.e., the highest λ, which means fewer variables) within 1 standard error of the minimum deviance is selected.

Tuning α requires both quantitative and qualitative arguments. A modeler could run the tuning procedure for λ for a series of α values, and thus obtain the best-predicting model in the two-dimensional parameter space. However, considerations of interpretability and of dealing with correlated variables should also be taken into account when choosing the value of α.
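A minimal sketch of this tuning procedure with the glmnet package is shown below; the numeric model matrix x and the 0/1 response vector y are assumed to have been prepared beforehand (the required dummy coding of categorical variables is not shown), and the sketch is illustrative rather than a reproduction of the actual implementation.

library(glmnet)

# 10-fold cross-validation of the regularization path for one value of alpha
# (alpha = 1: lasso, alpha = 0: ridge, alpha = 0.5: elastic net).
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5,
                   type.measure = "deviance", nfolds = 10)

# Lambda minimizing the binomial deviance, and the 1-standard-error value
# used for model selection in this chapter.
cvfit$lambda.min
cvfit$lambda.1se

# A simple grid search over alpha; the folds are kept fixed across alpha values
# so that the cross-validated deviances remain comparable.
foldid <- sample(rep(1:10, length.out = nrow(x)))
for (a in c(0, 0.25, 0.5, 0.75, 1)) {
  cv_a <- cv.glmnet(x, y, family = "binomial", alpha = a, foldid = foldid)
  cat("alpha =", a, "deviance at lambda.1se =",
      cv_a$cvm[cv_a$lambda == cv_a$lambda.1se], "\n")
}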

Results for Ridge, Lasso, and Elastic Net

In Figures 3.12, 3.13, and 3.14, the fitted coefficients are plotted against the tuning parameter λ. As λ increases, the coefficients shrink, and are cut off at 0 for Lasso and Elastic Net. The plots all look similar, with one variable consistently dominating the others. The dominant variable is the relative price difference with the mean (pdrel.pm).
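Coefficient paths of this kind can be produced directly from a glmnet fit; as a sketch (same hypothetical x and y as above):

# Fit the full regularization path and plot the coefficients against log(lambda),
# analogous to Figures 3.12-3.14 (alpha = 1 for lasso; 0.5 and 0 for elastic net and ridge).
path <- glmnet(x, y, family = "binomial", alpha = 1)
plot(path, xvar = "lambda", label = TRUE)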



Figure 3.12: Development of the coefficient estimates for the lasso regression along the fitted λ sequence. The top axis indicates the degrees of freedom at each value of λ. The two vertical lines indicate the λ value for which the binomial deviance is minimized (left) and the 1 standard-error value above (right).


Figure 3.13: Development of the coefficient estimates for the elastic net regression (α = 0.5) along the fitted λ sequence. The top axis indicates the degrees of freedom at each value of λ. The two vertical lines indicate the λ value for which the binomial deviance is minimized (left) and the 1 standard-error value above (right)



Figure 3.14: Development of the coefficient estimates for the ridge regression along the fitted λ sequence. The top axis indicates the degrees of freedom at each value of λ. The two vertical lines indicate the λ value for which the binomial deviance is minimized (left) and the 1 standard-error value above (right)

In order to tune the parameter λ, a 10-fold cross-validation was carried out for each label and for each of the three regression types: lasso, elastic net, and ridge. The results are shown in Figures 3.15, 3.16, and 3.17.

The results per label are similar for lasso, elastic net, and ridge. Only Label B shows a significant improvement in deviance for higher values of λ. For the other labels, the deviance does not decrease as variables are removed from the full model. As long as the deviance does not increase, however, increasing λ still constitutes a model improvement, since it is best practice to select the simplest model (i.e., the smallest number of variables) with the lowest deviance.

From lasso (Fig. 3.15) to elastic net (Fig. 3.16) to ridge (Fig. 3.17), the number of nonzero variables at the optimal λ value chosen by cross-validation (at 1 standard error above the minimum deviance) increases. For ridge, by definition no variables are removed. For lasso, the model is reduced to 2 variables for Label A, and only 1 variable for Label B. For elastic net, the number of variables varies between 3 and 20. An overview of the nonzero coefficients at the λ1se point is given in Table 3.5 for lasso and in Table 3.6 for elastic net.
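As a sketch, the nonzero coefficients at the λ1se point can be extracted from the cross-validated fit as follows (reusing the hypothetical cvfit object from the earlier sketch):

# Coefficients at the 1-standard-error lambda; glmnet returns a sparse column matrix.
b <- coef(cvfit, s = "lambda.1se")

# Keep the nonzero entries, excluding the intercept, as reported in Tables 3.5 and 3.6.
nz <- b[b[, 1] != 0, , drop = FALSE]
nz[rownames(nz) != "(Intercept)", , drop = FALSE]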

Evaluating the results of the (regularized) logistic regression analysis, it is clear that the price variable stands out as dominant. Other variables do make a significant contribution; however, their contribution is at least an order of magnitude smaller. Among the regularized regression results, the ridge regression is by construction hard to interpret, while there is no considerable difference between the elastic net and lasso results in terms of improvement in binomial deviance. This leads to the selection of the lasso model, as the most parsimonious model with the lowest deviance. Further model comparison is done in Section 3.4.



Figure 3.15: Cross-validation result for the lasso regression along the fitted λ sequence. The red dots indicate the average binomial deviance for 10 hold-out samples, while the error bars give the 95% confidence interval. The top axis indicates the number of nonzero variables at each value of λ. The two vertical lines indicate the λ value for which the binomial deviance is minimized (left) and the 1 standard-error value above (right)

Table 3.5: Nonzero coefficients for each label at the λ1se point for the lasso regression.

A B C D
pdrel.pm              -1.9964  -1.5467  -3.1455  -1.7837
schadevrij.jaar        0.1764
auto.cataloguswaarde   0.1465
auto.dagwaarde         0.0183   0.0825
auto.waarde.acc        0.0732
auto.waarde.audio      0.0475   0.1323
hh inkom2             -0.0481
hh woz won            -0.0456
beleggen              -0.0272
lenen                  0.0707
oudedag                0.0152
kl woning             -0.0027
kl verkeer            -0.0394
kl buurt              -0.0201



Figure 3.16: Cross-validation result for the elastic net regression (α = 0.5) along the fitted λ sequence. The red dots indicate the average binomial deviance for 10 hold-out samples, while the error bars give the 95% confidence interval. The top axis indicates the number of nonzero variables at each value of λ. The two vertical lines indicate the λ value for which the binomial deviance is minimized (left) and the 1 standard-error value above (right)
