Master's Thesis

An Analysis of Classification Models for Lapse Risk

Author: Elles van Sark (S3115615)
Supervisor: Dr. Kesina

August 27, 2020

Master's Thesis Econometrics, Operations Research and Actuarial Studies
Supervisor: Dr. Kesina
Second assessor: Dr. Vullings

Master's Thesis

An Analysis of Classification Models for Lapse Risk

Elles van Sark

Abstract

A high lapse ratio has several negative consequences for an insurance company. Namely, it could result in a liquidity problem. Furthermore, the insurance company loses the ability to make future profits on the lapsed policies and its reputation could be damaged. To mitigate these consequences it is crucial that insurance companies are able to predict which policies will lapse. Moreover, if insurers could accurately predict which policies will lapse, strategies could be employed to retain these policies. In this thesis six classification models are compared on their out-of-sample predictive ability for lapse. The six models examined are the logit model, the complementary log-log model, the k-nearest neighbour algorithm, a random forest, a neural network and the extreme gradient boosting algorithm. Car insurance data from a Dutch insurance company is used. The out-of-sample predictive ability of the models is evaluated with five performance metrics, namely the accuracy, the F-score, the area under the receiver operating characteristic curve, Matthews correlation coefficient and Theil's U. Based on these performance metrics the random forest and the extreme gradient boosting algorithm have a greater out-of-sample predictive ability than the other models. Furthermore, the extreme gradient boosting algorithm performs marginally better than the random forest.

Contents

1 Introduction
  1.1 Context
  1.2 Problem Statement
  1.3 Structure of the Thesis

2 Literature Review
  2.1 Lapse Research
    2.1.1 Economic and Company Based Research
    2.1.2 Policy and Policyholder Based Research
  2.2 Classification Models
  2.3 Contribution to the Literature

3 Methods
  3.1 Models
    3.1.1 Generalized Linear Models
      Logit Model
      Complementary Log-Log Model
    3.1.2 k-Nearest Neighbour Algorithm
    3.1.3 Random Forest
    3.1.4 Neural Network
      Neural Network Structure
      Neural Network Optimization
      Regularization
      Relevance for this Thesis
    3.1.5 Extreme Gradient Boosting
      Extreme Gradient Boosting Framework
      Additional Extreme Gradient Boosting Features
      Loss Function
      Relevance for this Thesis
  3.2 Performance Metrics
    3.2.1 Accuracy
    3.2.2 F-score
    3.2.3 Area Under the Receiver Operating Characteristic Curve
    3.2.4 Matthews Correlation Coefficient
    3.2.5 Theil's U

4 Data
  4.1 Explanatory Variables
    4.1.1 Characteristics of the Policy
    4.1.2 Characteristics of the Policyholder
  4.2 Data Preparation
  4.3 Descriptive Statistics

5 Implementation and Estimation
  5.1 Generalized Linear Models
    5.1.1 Logit Model
    5.1.2 Complementary Log-Log Model
  5.2 k-Nearest Neighbour Algorithm
  5.3 Random Forest
  5.4 Neural Network
  5.5 Extreme Gradient Boosting

6 Implementation and Estimation with a Balanced Data Set
  6.1 Generalized Linear Models
    6.1.1 Logit Model
    6.1.2 Complementary Log-Log Model
  6.2 k-Nearest Neighbour Algorithm
  6.3 Random Forest
  6.4 Neural Network
  6.5 Extreme Gradient Boosting

7 Results of the Out-of-sample Analysis

8 Conclusion and Discussion

A Appendix A

B Appendix B

List of Figures

3.1 An example of the 3-NN majority vote procedure.
3.2 An example of a problematic majority vote.
3.3 An example of a tree on the left and the corresponding regression surface on the right (Efron and Hastie, 2016).
3.4 A schematic representation of a neural network with three hidden layers, p input units and nine output units (Efron and Hastie, 2016).
4.1 Histograms of lapsed and non-lapsed policies according to the change in four product categories.
4.2 Correlation matrix of the regressors.
5.1 Cross-validation estimates of accuracy and kappa for different values of k for the k-NN algorithm.
6.1 Cross-validation estimates of accuracy and kappa for different values of k for the k-NN algorithm using the balanced data set.
A.1 Histogram of lapse with comprehensive coverage.
A.2 Histogram of lapse with accident coverage.
A.3 Scatter plots of hyperparameters of the neural network and the accuracy on the validation set.
A.4 Scatter plots of hyperparameters in the XGBoost algorithm and the accuracy on the validation set.
A.5 Scatter plots of hyperparameters of the neural network and the accuracy on the validation set using the balanced dataset.
A.6 Scatter plots of hyperparameters in the XGBoost algorithm and the accuracy on the validation set using the balanced dataset.

List of Tables

2.1 Literature on Lapse considering Policy and Policyholder Characteristics.
4.1 Summary statistics.
5.1 Final logit model.
5.2 Final complementary log-log model.
5.3 Out-of-bag estimate of the error rate of the random forest for different values of the number of classification trees and the number of explanatory variables considered at each split.
5.4 Ranges for the hyperparameters of the neural network.
5.5 Ranges for the hyperparameters of the XGBoost algorithm.
6.1 Final logit model using the balanced data set.
6.2 Final complementary log-log model using the balanced data set.
6.3 Out-of-bag estimate of the error rate of the random forest for different values of the number of classification trees and the number of explanatory variables considered at each split using the balanced data set.
7.1 Performance metrics for the models using the out-of-sample data set.
7.2 Performance metrics for the models using the balanced out-of-sample data set.
A.1 Optimal threshold value for the logit model with the training data.
A.2 Optimal threshold value for the complementary log-log model with the training data.
A.3 The hyperparameter values and accuracy on the validation set of the ten best performing neural network configurations.
A.4 The hyperparameter values and accuracy on the validation set of the ten best performing XGBoost configurations.
A.5 Performance metrics for the models using the training dataset.
A.6 Summary statistics of the balanced dataset.
A.7 Optimal threshold value for the logit model using the balanced training data.
A.8 Optimal threshold value for the complementary log-log model using the balanced training data.
A.9 The hyperparameter values and accuracy on the validation set of the ten best performing neural network configurations using the balanced dataset.
A.10 The hyperparameter values and accuracy on the validation set of the ten best performing XGBoost configurations using the balanced dataset.
A.11 Performance metrics for the models using the balanced training dataset.
B.1 Performance metrics.


Acronyms

k-NN k-Nearest Neighbour.

AC Accident Coverage.

AIC Akaike Information Criterion.

AUC Area Under the receiver operating characteristic Curve.

CART Classification And Regression Tree.

CC Comprehensive Coverage.

CLL Complementary Log-Log.

FN False Negative.

FP False Positive.

LSTM Long Short-Term Memory.

MCC Matthews Correlation Coefficient.

NCDL No-claim Discount Lost.

ROC Receiver Operating Characteristic curve.

SVM Support Vector Machine.

TN True Negative.

TP True Positive.

XGBoost EXtreme Gradient Boosting.


1 Introduction

The aim of this thesis is to identify the optimal classification model to predict lapse of insurance policies based on several performance metrics. First, a general introduction into this topic is provided. Afterwards, the research question is introduced. Lastly, the structure of this thesis is outlined.

1.1 Context

Insurance companies have a crucial function. They provide a way in which individuals and companies are able to exchange risk, often associated with an uncertain financial consequence, for regular fixed premium payments (Swain and Swallow, 2015). A bankruptcy of an insurance company could have detrimental effects on the financial system and the economy (Swain and Swallow, 2015). Therefore, an insurance company should be aware of and manage all types of risk it faces. To properly supervise insurers, a regulatory framework called Solvency II was implemented across Europe in 2016. Solvency II is based on three pillars, namely quantitative requirements, qualitative supervision and transparency. Under Solvency II the solvency capital requirement of an insurance company depends on risks taken by the insurance company. If the insurance company takes more risk, the solvency capital requirement increases. Different types of risk are considered in the computations, such as interest rate risk, spread risk and lapse risk. This thesis focuses on lapse risk.

Under Solvency II lapse risk accounts for all risk stemming from potential policyholder options to "fully or partly terminate, surrender, decrease, restrict or suspend the insurance cover, but also the right to fully or partially establish, renew, increase, extend or resume the insurance cover" (Burkhart, 2018). An insurance policy is surrendered if the policyholder terminates the policy before the original end date (Dickson et al., 2009). The policyholder could receive a cash payment, called the surrender value. Studies concerning lapse frequently do not distinguish between a lapse and a surrender and use the terms interchangeably. Occasionally, the distinction is made that in case of a lapse the policyholder does not receive a cash payment, while this does occur in case of a surrender. However, in this thesis the definition as in the Solvency II framework is used.

Next, the potential consequences of a lapse for the insurance company are examined. A high lapse ratio could entail several adverse consequences for an insurance company. The lapse ratio is defined as the number of lapsed policies in a given time period divided by the total number of policies. Eling and Kochanski (2013) describe four of these consequences. Firstly, an insurance company could face a liquidity problem should many policies lapse, since it might be obliged to pay the surrender values to the policyholders. A company faces liquidity risk when it is not able to easily convert its assets into cash in order to meet financial obligations. Liquidity risk forces the sale of assets at a reduced price or entails paying interest on loans taken to meet payments (Kamau and Nejru, 2016). Hence, there is a negative relationship between liquidity risk and the return on equity of an insurer. Secondly, the insurer loses the ability to make future profits on the policy. Particularly if lapse occurs shortly after the policy is established, the insurer might not recover initial expenses, such as the acquisition and underwriting costs. Thirdly, the ability to lapse increases the risk of adverse selection, which will decrease the profitability of the insurer. Lastly, a high lapse rate could diminish the insurer's reputation. This could cause even more lapse events and is detrimental for acquiring new customers. Kuo et al. (2003) identify similar consequences. Furthermore, they find that insurers select a more liquid investment portfolio, which normally yields lower returns, to reduce the liquidity risk associated with lapse. This could lead to an increase in premiums, which further deteriorates the probability of obtaining new customers. A study about lapse rates of Chinese life insurance products confirms that high lapse rates could harm the financial status of an insurer and the ability to acquire new customers (Yu et al., 2019). Thus, lapse affects multiple aspects of an insurance company, ranging from policy design and pricing to how risk is managed (Eling and Kochanski, 2013).

1.2 Problem Statement

The above implications of lapse demonstrate the importance of predicting lapse for insurance companies. This allows them to diminish the negative effects of lapse. To identify which policies are at risk of lapsing, a classification model could be used.

If insurers could accurately predict which policies will lapse, strategies could be employed to retain these policies. Furthermore, identification of important lapse drivers allows an insurance company to target the preferred demographic group. A classification model predicts the status of a policy based on certain characteristics. In this thesis several classification models are used to identify policies at risk of lapsing.

There are numerous classification models available, such as generalized linear models. Due to progress in electronic computation, it became feasible to estimate newer approaches such as neural networks and random forests. Of course, insurers should be aware of the advantages and disadvantages of each of these models. An extensive comparison of classification models is therefore useful for insurers. A crucial feature of these models is their out-of-sample predictive ability, which is assessed by several performance metrics. The aim of this thesis is to identify which model has a superior out-of-sample predictive ability when applied to lapse data.

From the problem statement the research question follows:

Which model has the best out-of-sample predictive ability for lapse risk based on several performance metrics?

The models included in the comparison are:

1. Logit model,
2. Complementary Log-Log (CLL) model,
3. k-Nearest Neighbour (k-NN) algorithm,
4. Random forest,
5. Neural network,
6. EXtreme Gradient Boosting (XGBoost) algorithm.

Furthermore, to compare the models the following performance metrics are used:

1. Accuracy,
2. F-score,
3. Area Under the receiver operating characteristic Curve (AUC),
4. Matthews Correlation Coefficient (MCC),
5. Theil's U.

Before the performance metrics are computed, the optimal configuration for each model is found. For most models this involves determining good values for their hyperparameters. Furthermore, the company that provided the data is interested in the effect of a lapse in another product category on lapse in the main product category of this thesis, which is car insurance. Often, policyholders purchase more than one insurance policy from an insurer. For example, a policyholder could own a home insurance policy and a car insurance policy from the same insurance company. It is interesting to examine whether a lapse of a policy in another product category affects the likelihood of lapse of the car insurance policy. Therefore, this will receive special attention in the remainder of this thesis.

1.3 Structure of the Thesis

The remainder of this thesis is structured as follows. In Chapter 2 the existing literature on lapse prediction is outlined. In Chapter 3 the models and performance metrics used in this thesis are discussed. Subsequently, Chapter 4 provides a description of the data used. Afterwards, in Chapter 5 the implementation and estimation of the final models are discussed. In Chapter 6 the models are estimated with an artificially created data set, which is more balanced with respect to the response variable. This allows for an additional comparison. In Chapter 7 the out-of-sample results are discussed. In Chapter 8 the main conclusion, discussion points and recommendations for future research are provided. The appendix and bibliography are found at the end of this thesis.

2 Literature Review

The potential causes of lapse have been of interest for a long time. This chapter provides an overview of relevant academic literature about lapse risk. Furthermore, papers that apply classification models to other problems are also reviewed. These are examined since there might be good classification methods which have not yet been applied to predict lapse.

2.1 Lapse Research

The existing empirical literature could typically be divided into two classes (Eling and Kochanski, 2013). These classes differ with respect to the types of explanatory variables examined. The first class considers characteristics of the insurance company and macroeconomic variables. The second class examines characteristics of the policy and policyholder. Although the above division is convenient, the difference between classes is not absolute. Research that focuses on macroeconomic variables occasionally includes policy characteristics and vice versa. This thesis is part of the second class of models. However, since the distinction between the two classes is not absolute, the first class is discussed as well for completeness. Moreover, this allows the reader to place this thesis in a broader context.

2.1.1 Economic and Company Based Research

The first class focuses on company characteristics and macroeconomic variables. Originally, research concentrated on the interest rate and the unemployment rate. These are used to investigate the interest rate hypothesis and the emergency fund hypothesis, respectively. The interest rate hypothesis assumes that the lapse rate increases when the interest rate rises. The emergency fund hypothesis presumes that the lapse rate increases in times of economic downturn. Policyholders use the surrender value as an emergency fund or can no longer afford to pay the premiums. However, no general consensus exists. Outreville (1990) finds evidence to support the emergency fund hypothesis, but not the interest rate hypothesis. Kuo et al. (2003) confirm the explanatory power of the unemployment rate, but also find that, in the long run, the interest rate affects the lapse rate.

Several papers expand the above analysis, either by considering macroeconomic variables besides the interest rate and unemployment rate or by considering company and policy characteristics. Kim (2005) examines the lapse rate of a Korean based life insurance company by a logit model, a CLL model and an arctangent model. He studies both macroeconomic variables and a policy characteristic, namely the policy age since inception. He finds that this wider range of explanatory variables does affect the lapse rate. Furthermore, the logit model and CLL model generally outperform the arctangent model. Yu et al. (2019) consider several macroeconomic variables, such as the unemployment rate and migration, and firm characteristics like firm size and reputation. They confirm the benefit of including additional explanatory variables in the analysis of lapse. Barsotti et al. (2016) incorporate the effects of macroeconomic variables and lapse among other policyholders in their study. Policyholders' decisions could be highly correlated when facing adverse economic scenarios. Therefore, taking this effect explicitly into account provides more precise modeling of lapse.

2.1.2 Policy and Policyholder Based Research

The second class focuses on characteristics of individual policies and policyholders.

Studies typically include variables such as the payment frequency and gender of the policyholder. The number of papers belonging to the second class is limited. This is in part due to the confidential nature of the data needed. This thesis is part of the second class; therefore, these papers are discussed in detail.

Renshaw and Haberman (1986) examine life insurance lapses from several Scottish life insurance companies. Their explanatory variables are the age at entry of the policyholder, duration of the policy, the office and the type of policy. They estimate both a linear model with normal errors and a logit model. All four explanatory variables have a significant effect on lapse.

Kagraoka (2005) uses a negative binomial model to examine the number of insurance lapses for a Japanese insurance company. In addition to the change in the unemployment rate, several policyholder and policy characteristics are considered. He finds that the change in the unemployment rate and contract duration are important determinants of lapse.

Cerchiara et al. (2009) examine lapse for an Italian life insurance policy by using generalized linear models, in particular the Poisson model. They determine that policy duration, calendar year of exposure, product class and policyholder characteristics are important drivers of lapse.

Milhaud et al. (2011) use a logit model and a Classification And Regression Tree (CART) model to examine lapse rates for endowment products. Their explanatory variables include the duration of the policy and the policyholder's age at entry. The most important drivers of lapse are the type of contract and the duration. If the policy has reached the period in which the policyholder can terminate the policy without a financial penalty, the probability of lapse increases substantially. Furthermore, the logit model has a higher sensitivity than the CART model.

Pinquet et al. (2011) analyse lapse behavior in long-term insurance by a proportional hazard model. Explanatory variables considered include the policyholder's age at entry and the calendar year. They find that younger policyholders are prone to lapse since they have lower wealth, their insurance needs are more likely to change and the losses they face due to lapse are lower. It is concluded that inadequate knowledge about the insurance policy is the main driver behind lapse.

Eling and Kiesenbauer (2014) use generalized linear models and a proportional hazard model to determine causes of lapse in life insurance in Germany. All their explanatory variables have a significant effect on lapse. They consider variables such as policyholder's current age, policy age and premium payment frequency. The three main drivers of lapse are the calendar year, policy age and payment frequency of premiums.

Liefkens (2019) employs two generalized linear models, namely the logit model and the CLL model, and a random forest to analyse lapse in the Netherlands. She considers numerous explanatory variables, which can be found in Table 2.1. Yearly premium, payment frequency, expired duration, contract duration, sex, medical raise and the presence of a second policyholder are identified as important lapse drivers. Furthermore, the random forest outperforms the generalized linear models with respect to its predictive ability.

Xong and Kang (2019) consider a logit model, the k-NN algorithm, a neural network and a Support Vector Machine (SVM) to examine the life insurance lapse risk of a Malaysian based insurance company. Their explanatory variables include the policyholder's gender and age at entry. The four models are compared based on their classification accuracy and the AUC. They conclude that the neural network and the SVM outperform the logit model and the k-NN algorithm.

Table 2.1 provides an overview of the literature belonging to the second class. This leads to a few general observations. First, popular approaches to model lapse include generalized linear models, such as the logit model, the CLL model and the Poisson model. Moreover, survival analysis and machine learning techniques like a neural network and a random forest are frequently used. Furthermore, only three papers explicitly compare different methods to predict lapse, namely Milhaud et al. (2011), Liefkens (2019) and Xong and Kang (2019). They identify the logit model, a random forest, a neural network and a SVM as good classification methods for lapse. Therefore, it would be interesting to also compare these methods in this thesis. Additionally, the papers that do compare classification methods all use different performance metrics.

2.2 Classification Models

Classification models are used in many fields other than lapse research. Therefore, additional literature concerning classification models is examined. Min and Lee (2005) apply a SVM, multiple discriminant analysis, a logit model and a neural network to the bankruptcy prediction problem. The SVM outperforms the other methods. Various papers likewise identify the high classification accuracy of the SVM (Bellotti and Crook, 2009; Huang et al., 2008). However, Karaa and Krichene (2012) conclude that a multilayer neural network is superior to a SVM for classifying credit risk based on its prediction accuracy and reduction of type I error. Furthermore, a neural network is an effective classifier, which can easily be retrained to accommodate changes in the environment (Wójcicka, 2017).

Brown and Mues (2012) compare several classification algorithms such as a logit model, a neural network, a least squares SVM and a random forest for loan default prediction. They conclude that a logit model is quite competitive compared to more complicated methods such as the random forest, even for imbalanced data. This result is confirmed by Gouvêa and Gonçalves (2007), who identify the slight superiority of the logit model to a neural network for credit models. They also consider a genetic algorithm, however this model is outperformed by the logit model and the neural network.

The XGBoost algorithm is a newer classification method, which has not yet been used to predict lapse. It was introduced by Chen and Guestrin (2016). The XGBoost algorithm has a good predictive ability and is generally faster than comparable methods (Chen and Guestrin, 2016). Hence, it is interesting to compare this method to other classification methods.

Table 2.1: Literature on Lapse considering Policy and Policyholder Characteristics. This table is adjusted from Eling and Kochanski (2013) and Eling and Kiesenbauer (2014).

Renshaw and Haberman (1986). Country: Scotland. Time period: 1976.
  Explanatory variables: policyholder age at entry, contract age, office, product type.
  Models considered: linear model with normal errors, logit model.

Kagraoka (2005). Country: Japan. Time period: 1993-2001.
  Explanatory variables: policyholder gender, policyholder age at entry, seasonality, contract duration, change in the unemployment rate, heterogeneity.
  Models considered: negative binomial model, Poisson model.

Cerchiara et al. (2009). Country: Italy. Time period: 1991-2007.
  Explanatory variables: product, calendar year of exposure, contract duration, year of policy inception.
  Models considered: Poisson model.

Milhaud et al. (2011). Country: Spain. Time period: 1999-2007.
  Explanatory variables: contract type, policyholder age at entry, contract duration, premium frequency, saving premium, face amount.
  Performance metrics: sensitivity, specificity.
  Models considered: logit model, CART model. Preferred model: logit model.

Pinquet et al. (2011). Country: Spain. Time period: 1993-2006.
  Explanatory variables: policyholder age at entry, calendar year, contract age, health bonus-malus coefficient.
  Models considered: proportional hazard model.

Eling and Kiesenbauer (2014). Country: Germany. Time period: 2000-2010.
  Explanatory variables: policyholder current age, policyholder gender, policy age, product type, premium frequency, distribution channel, calendar year, supplementary cover.
  Models considered: Poisson model, binomial model, negative binomial model, proportional hazard model.

Liefkens (2019). Country: Netherlands. Time period: 2015-2017.
  Explanatory variables: provision, surrendered amount, yearly premium, death benefit, expired duration, lapse duration, contract duration, premium state, payment frequency, policyholder gender, age group, medical raise, indicator second policyholder.
  Performance metric: Theil's U.
  Models considered: logit model, CLL model, random forest. Preferred model: random forest.

Xong and Kang (2019). Country: Malaysia. Time period: 2017.
  Explanatory variables: premium frequency, policyholder age at entry, policy term, sum assured, policyholder gender.
  Performance metrics: accuracy, AUC.
  Models considered: logit model, k-NN algorithm, neural network, SVM. Preferred models: neural network, SVM.

2.3 Contribution to the Literature

In general, this thesis contributes to the literature in two ways. First, this thesis makes a practical contribution. Insurance companies can use the findings of this thesis to select which models they use to predict lapse. Second, this thesis makes a theoretical contribution. The existing literature that focuses on characteristics of individual policies and policyholders is limited. Therefore, additional research on lapse prediction with these characteristics is valuable. Moreover, from Section 2.1 it follows that most of the papers on lapse risk identify lapse drivers, but do not explicitly compare the out-of-sample predictive ability of different models. Accurate prediction of lapse is important since, as discussed in Chapter 1, a lapse has multiple negative consequences for an insurance company. Furthermore, comparing different models is not only useful for lapse research, but also for other classification problems.

There are three papers in the literature that do compare the predictive ability of classification models, namely Milhaud et al. (2011), Xong and Kang (2019) and Liefkens (2019). This thesis extends their work in several ways. First, more classification methods are compared in this thesis. In previous research only a limited number of models are compared. For example, there is no comparison between a random forest, a neural network and the XGBoost algorithm. These models are identified as good classification methods in the literature. Second, more performance metrics are used in this thesis. In this thesis five performance metrics are examined to determine which model is superior. In the existing literature classification methods are compared using only one or two performance metrics. Some performance metrics could indicate promising results, while the model might not be satisfactory. This occurs especially if the data is imbalanced. Lapse data is often imbalanced since most policies will not lapse. Therefore, these fallacies should be taken into account. Insurance companies often have thousands of policies, so even though a lapse does not occur frequently, it is still essential to examine.

Lastly, the effect of a lapse in another insurance category on lapse in the main insurance category of this thesis, car insurance, is investigated. This has not been examined in the existing literature. However, it is important for insurance companies, as it could affect how changes across different insurance categories are implemented. For example, suppose a lapse of a home insurance policy increases the probability that a policyholder also lapses on his or her car insurance policy. In this case, an insurance company should take this effect into account when changing the price of their home insurance policies.

3 Methods

From the literature review in Chapter 2 it follows that popular approaches to model lapse are generalized linear models and machine learning techniques. In this chapter the methods used in this thesis are discussed. First, the different models are outlined. All the models can be used to assess the relationship between a binary response variable and explanatory variables. Afterwards, the performance metrics are provided.

3.1 Models

The selection of the models is based on the literature review and more recent classification methods. Previous papers have also used count data models (the Poisson model and the negative binomial model) and survival analysis (the proportional hazard model). In this thesis these types of models are not considered. This is due to the fact that the focus of this thesis is the occurrence of a lapse of an individual policy. Count data models focus on the total number of lapsed policies and survival analysis examines the time until a lapse occurs. Furthermore, different data is required to estimate these models.

All models used in this thesis can be characterized as machine learning methods. Two main areas of machine learning are unsupervised and supervised learning. In unsupervised learning the training set is not labeled in advance. The objective of unsupervised learning is to discover structures in or features of the data (Sutton and Barto, 2018). In supervised learning a training data set with labeled examples is available. The principal objective of supervised learning is to obtain an algorithm that is able to find a relationship between the input and output, such that the algorithm performs adequately with out-of-sample data (Sutton and Barto, 2018). In this thesis the actual status of the policy, lapsed or not lapsed, is known. Therefore, supervised learning is used in this thesis.

3.1.1 Generalized Linear Models

Two generalized linear models are considered in this thesis, namely the logit model and the CLL model. Therefore, this section starts with a short introduction into generalized linear models. Thereafter, the two specific models are discussed.

The normal distribution has played a central role in statistical analysis. A larger class of probability distributions, the exponential family, has certain properties in common with the normal distribution. These properties include straightforward expressions for the score statistic and the information matrix. This realization led to the introduction of the generalized linear model (Nelder and Wedderburn, 1972). The generalized linear model expands upon ordinary linear regression since it allows for a response variable with an error distribution different from the normal distribution. Furthermore, in a generalized linear model the response variable is related to a function of the explanatory variables.

A generalized linear model consists of three components (Dobson and Barnett, 2008). Firstly, the response variables Y_1, ..., Y_N, which are assumed to be independent and identically distributed according to a distribution from the exponential family. Secondly, a set of parameters β and corresponding explanatory variables x_1, ..., x_N, all of dimension p:

X = \begin{pmatrix} x_1^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}.   (3.1)

Thirdly, a monotone, differentiable transformation function F(·) that relates the conditional mean of the response variable to the explanatory variables:

E[Y_i | Ω_i] = F(x_i^T β),   (3.2)

where Ω_i denotes the information set, consisting of explanatory and predetermined variables, for i = 1, ..., N.¹

In this thesis the response variable is binary and indicates whether or not a policy is lapsed:

Y_i = \begin{cases} 0 & \text{if the policy is not lapsed,} \\ 1 & \text{if the policy is lapsed.} \end{cases}   (3.3)

Let π_i denote the probability that Y_i = 1 conditional on the information set Ω_i. Then 1 − π_i denotes the probability that Y_i = 0 conditional on the information set Ω_i. This implies that π_i = Pr[Y_i = 1 | Ω_i] = E[Y_i | Ω_i]. Assuming independence, the joint probability function of Y_1, ..., Y_N is given by

\prod_{i=1}^{N} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i},   (3.4)

which is a member of the exponential family. Therefore, a generalized linear model could be used. In the most elementary case, the probabilities π_i can be modelled using the identity function as F(·), which yields

\pi_i = E[Y_i \mid \Omega_i] = F(x_i^T \beta) = x_i^T \beta.   (3.5)

However, this could lead to values of π_i larger than one or smaller than zero. As π_i is a probability this is not reasonable. In order to ensure that π_i is in the interval [0, 1], some conditions should be placed on F(·) (Davidson and MacKinnon, 2004). The function F(·) has the following properties:

\lim_{x \to -\infty} F(x) = 0, \quad \lim_{x \to \infty} F(x) = 1, \quad \text{and} \quad f(x) \equiv \frac{dF(x)}{dx} > 0.   (3.6)

These properties are the defining characteristics of a cumulative distribution function. They guarantee that even though x_i^T β could be any real number, 0 ≤ F(x_i^T β) ≤ 1. These properties additionally guarantee that F(·) is a nonlinear function. Therefore, changes in the value of an element of x_i affect π_i differently depending on the magnitude of x_i. The type of generalized linear model arises from the choice of F(·).

¹ In some textbooks, including Dobson and Barnett (2008), a link function g(·) is used, such that g(E[Y_i | Ω_i]) = x_i^T β. However, this expression is equivalent to the one used in this thesis with g(·) = F^{-1}(·).

Furthermore, there cannot be multicollinearity among explanatory variables, which occurs when variables are highly correlated.

Logit Model

For the logit model the function F(x) is the standard logistic function

\Lambda(x) = \frac{e^x}{1 + e^x},   (3.7)

with first derivative

\lambda(x) = \frac{e^x}{(1 + e^x)^2} = \Lambda(x)\Lambda(-x).   (3.8)

This first derivative is symmetric around zero (Davidson and MacKinnon, 2004). Using Λ(x_i^T β) as the transformation function F(x_i^T β) yields

\pi_i = \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} \iff \log\left(\frac{\pi_i}{1 - \pi_i}\right) = x_i^T \beta.   (3.9)

Thus, the logarithm of the odds is a linear function of the explanatory variables.

The logit model is considered for two reasons. Firstly, from the literature review it follows that the logit model is popular in research concerning lapse. Secondly, the logit model is computationally easier than the related probit model (Dobson and Barnett, 2008). The probit model is another commonly used generalized linear model. In the probit model the function F(x) is the cumulative standard normal distribution.
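To make the estimation concrete, the following is a minimal sketch of fitting a logit model of the form of Equation (3.9) in Python with statsmodels. The thesis data is confidential, so the data frame and its columns here are hypothetical stand-ins; this illustrates the model class, not the thesis's actual implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical stand-in for the confidential car insurance data: two
# regressors and a rare binary lapse indicator generated from a logit model.
rng = np.random.default_rng(0)
policies = pd.DataFrame({
    "yearly_premium": rng.gamma(2.0, 300.0, 1000),
    "contract_duration": rng.integers(1, 20, 1000),
})
eta = -3.0 + 0.001 * policies["yearly_premium"] - 0.05 * policies["contract_duration"]
policies["lapsed"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# Fit via GLM; the default link of the Binomial family is the logit, so the
# fitted model satisfies log(pi_i / (1 - pi_i)) = x_i' beta as in (3.9).
X = sm.add_constant(policies[["yearly_premium", "contract_duration"]])
logit_fit = sm.GLM(policies["lapsed"], X, family=sm.families.Binomial()).fit()
print(logit_fit.summary())
pi_hat = logit_fit.predict(X)  # predicted lapse probabilities pi_i
```

Because the log-odds are linear in the regressors, each coefficient can be read as the change in log(π_i/(1 − π_i)) for a one-unit change in the corresponding variable.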

Complementary Log-Log Model

The CLL model is closely related to the logit model. For the CLL model the function F(x) is given by the cumulative extreme value distribution function

\Theta(x) = 1 - e^{-e^x},   (3.10)

with first derivative

\theta(x) = e^x e^{-e^x}.   (3.11)

The first derivative is not symmetric around zero. The CLL model is useful when the probability of the occurrence of an event is either small or large. The CLL model and the logit model are comparable for values of π_i close to 0.5, but differ for values of π_i around zero or one (Dobson and Barnett, 2008).

Using Θ(x_i^T β) as the transformation function F(x_i^T β) yields

\pi_i = 1 - e^{-e^{x_i^T \beta}} \iff \log(-\log(1 - \pi_i)) = x_i^T \beta.   (3.12)

Thus, the complementary log-log of π_i is a linear function of the explanatory variables.

The CLL model is considered in this thesis since it was used in other research concerning lapse (Liefkens, 2019). Furthermore, a lapse does not occur frequently. Therefore, the asymmetry of the CLL model is suitable.


3.1.2 k-Nearest Neighbour Algorithm

The k-NN algorithm is a non-parametric, instance-based learning method for classification and regression. Since the algorithm is non-parametric, no assumption about the underlying distribution of the data is made. Instance-based learning algorithms compare new observations to stored training observations. Since this thesis is concerned with the classification of lapse, the k-NN algorithm for classification is discussed in this section.

In the k-NN algorithm, a new observation is assigned to the class to which the majority of its k nearest training observations belong. This is the majority vote procedure (Coomans and Massart, 1982). Note that k is a positive integer, generally small. If there is a tie in the majority vote, the classification occurs randomly. However, to avoid ties, k is often chosen as an odd number. The majority vote procedure is illustrated in Figure 3.1 for k equal to three. In this example a new observation can either be classified as a square or a triangle. The blue circle is the new observation to be classified. The blue circle will be classified as a square since two of its three nearest training observations are in this class.

Figure 3.1: An example of the 3-NN majority vote procedure in a two-dimensional space, where the squares and triangles are training observations and the circle is the new observation.

The value of the hyperparameter k is important. The value of a hyperparameter is determined before the estimation starts. There is an important trade-off to consider when choosing the optimal k (Xong and Kang, 2019). If k is decreased, the prediction becomes less stable since fewer training observations are used for prediction. However, larger values of k result in an unclear separation between classes. Generally, k increases as the number of training observations increases (Abdelmoula, 2015). The optimal choice of k depends on the data used. A popular method to find the optimal k is cross-validation (Wong, 2015). In particular, the k-fold cross-validation approach is often used. Note that the k in k-fold cross-validation does not indicate the same variable as the k in the k-NN algorithm. To avoid confusion, the k of k-fold cross-validation is indicated with k_cv from now on. The k_cv-fold cross-validation method randomly splits the data into k_cv groups. In each iteration, one group is used as the test set, while the rest are used as training observations. This procedure is repeated for different values of the hyperparameter k. The k that on average yields the best results based on a specified performance metric, such as the accuracy, is chosen as the optimal k.


Another essential consideration is the distance metric used to determine the k nearest observations. The most commonly used distance metric is the Euclidean distance. The Euclidean distance between two column vectors u, w ∈ R^p for p ∈ N could be written as

d_E(u, w) = \sqrt{(u - w)^T (u - w)}.   (3.13)

The k-NN algorithm is summarized as follows. Suppose a training data set contains observations of a response variable Y_i and p explanatory variables x_i = (x_{i1}, x_{i2}, ..., x_{ip}) for i = 1, ..., N. Each observation is defined as (Y_i, x_i). The k-NN algorithm to classify a new observation x* = (x*_1, x*_2, ..., x*_p) is then defined as follows:

1. Specify the parameter k.

2. Calculate d_i = d_E(x_i, x*) for each i = 1, ..., N.

3. Create the set C_k which contains the Y_i of the observations with the k smallest values of d_i.

4. Assign the new observation to the class which is most common in C_k.

The k-NN algorithm is straightforward and easy to implement. However, the algorithm becomes increasingly computationally expensive as the number of training observations grows (Harrison, 2018). Furthermore, the k-NN algorithm is not able to deal with all categorical variables. If a categorical variable is binary or has an intrinsic ordering, it can be represented by integers. However, categorical variables with no intrinsic ordering cannot be included in the analysis. The algorithm is also sensitive to irrelevant explanatory variables and the scale of the explanatory variables (Bronshtein, 2017; Wettschereck and Aha, 1995). If one explanatory variable is on a larger scale than the others, it could heavily influence the distance between two observations. These variables should therefore be standardized.

The majority voting procedure is problematic if two overlapping classes have a substantial size difference (Coomans and Massart, 1982). The more frequent class highly influences the classification of a new observation. Due to the abundance of observations from the more frequent class, they are likely among the nearest neighbours of any observation. Figure 3.2 illustrates this problem. A new observation is either classified as a square or as a triangle. The new observation is indicated by a circle. The square class is more frequent than the triangle class and the two classes overlap in the area of the blue circle. For every value of k the blue circle is classified as a square. However, if the class frequencies were more balanced, this might not be the case.

The frequency of lapsed policies is often minimal compared to the frequency of non-lapsed policies. Therefore, the issue with the majority voting procedure should be taken into account. It could be partially resolved by using a weighted k-NN algorithm (Bicego and Loog, 2016). The k nearest training observations are given weights which are inversely related to their distance from the new observation. The new observation is assigned to the class for which the sum of weights of the training observations belonging to that class is the highest.

The k-NN algorithm is included in this thesis since it is used in the literature (Xong and Kang, 2019). Furthermore, the k-NN algorithm is intuitive and straightforward to implement (Cunningham and Delany, 2020). Thus, it serves as a starting point to compare to more advanced methods.


Figure 3.2: An example of a problematic majority vote in a two-dimensional space, where the squares and triangles are training observations and the circle is the new observation.
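The following is a minimal scikit-learn sketch of the procedure described above, on hypothetical synthetic stand-in data: standardization is fitted inside each cross-validation fold to avoid leakage, the hyperparameter k is chosen by k_cv-fold cross-validation on accuracy, and weights="distance" corresponds to the weighted k-NN variant of Bicego and Loog (2016).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numeric regressors on very different scales and an
# imbalanced binary lapse indicator, mimicking the situation in the text.
rng = np.random.default_rng(0)
X = np.column_stack([rng.gamma(2.0, 300.0, 1000),   # e.g. yearly premium
                     rng.integers(1, 20, 1000)])    # e.g. contract duration
y = (rng.random(1000) < 0.1).astype(int)            # roughly 10% lapsed

# The scaler is part of the pipeline so each fold standardizes on its own
# training split; odd k values avoid ties in the majority vote.
knn = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(
    knn,
    param_grid={"knn__n_neighbors": [3, 5, 7, 9, 11],
                "knn__weights": ["uniform", "distance"]},
    cv=5,                # k_cv = 5 folds
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```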

3.1.3 Random Forest

A random forest is a non-parametric, ensemble classification and regression method based on classification or regression trees. As discussed in Section 3.1.2 a non-parametric method does not make assumptions about the distribution of the data. An ensemble method uses multiple algorithms simultaneously to attain a superior predictive ability compared to each of the algorithms on their own. As the classification of lapse is the objective of this thesis, the random forest method for classification is discussed in this section.

Before the random forest algorithm is discussed, the crucial component of a random forest, a tree, is considered. A classification tree uses recursive partitioning to classify new observations. In principle, a classification tree is a sequence of questions that assigns an observation to a class. Every classification tree has a similar structure (Lemon et al., 2003). A classification tree starts with one parent node, which contains the entire data set. The data is partitioned according to a selected explanatory variable. The parent node splits into two child nodes. At these child nodes another partition could take place. The node at which the partitioning stops is called a terminal node. Each terminal node has a connected class prediction.

An example of a tree based on two explanatory variables and the corresponding regression surface is given in Figure 3.3. The test observations are split at the parent node according to their value of X_1. The observations for which X_1 is smaller than or equal to t_1 move to the left child node. The other observations move to the right child node. Ultimately, all observations end up in one of the five terminal nodes of this tree, R_1 up to R_5, at which point the observations are classified.

There are multiple methods to determine which partition variable and threshold are chosen at each node of the classification tree. The most common method yields two groups that are as different from one another as possible with respect to the response variable. This method functions as follows (Efron and Hastie, 2016). As before, suppose a training data set contains observations of a response variable Y_i and p explanatory variables x_i = (x_{i1}, x_{i2}, ..., x_{ip}) for i = 1, ..., N. Each observation is defined as (Y_i, x_i). Assume that at the kth partition, group_k, consisting of N_k training observations, is going to be split. The mean and sum of squares of group_k are given by

m_k = \frac{1}{N_k} \sum_{i \in \text{group}_k} Y_i \quad \text{and} \quad s_k^2 = \sum_{i \in \text{group}_k} (Y_i - m_k)^2.   (3.14)

Figure 3.3: An example of a tree on the left and the corresponding regression surface on the right (Efron and Hastie, 2016).

The partitioning yields two new groups, namely group_{k1} and group_{k2}. The partitioning variable and threshold are chosen such that the sum of s²_{k1} and s²_{k2} is minimized. This ensures that the two groups are as different as possible with respect to the response variable. The partitioning stops when the average classification error estimated by holdout observations or cross-validation is minimized. As discussed before, with cross-validation the training data is sampled randomly multiple times and split into data used for estimation and validation (Kana, 2020). The classification error for a single observation is equal to zero if the observation is correctly classified and one otherwise. In the ideal case a terminal node is pure. A terminal node is called pure if it only contains observations that belong to the same class. The classification prediction of every terminal node is determined by the majority class of training data in that node. For completeness it is important to note that several additional stopping criteria are possible. For example, the partitioning could stop if the number of observations in a node is lower than a specified minimum or the maximum number of terminal nodes allowed is reached.
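As an illustration of the split rule in Equation (3.14), the following hypothetical sketch scans a single explanatory variable for the threshold that minimizes s²_{k1} + s²_{k2}; an actual tree repeats this scan over all candidate variables (or over m randomly selected ones in a random forest) at every node.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, s2_left + s2_right) minimizing the within-group
    sums of squares of Equation (3.14) after splitting on x."""
    best_t, best_ss = None, np.inf
    for t in np.unique(x)[:-1]:          # candidate thresholds
        left, right = y[x <= t], y[x > t]
        ss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if ss < best_ss:
            best_t, best_ss = t, ss
    return best_t, best_ss

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = (x > 6).astype(float)                # binary response, separable at 6
print(best_split(x, y))                  # threshold near 6, sum of squares 0
```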

Relatively small classification trees are easy to interpret. Furthermore, explanatory variables could be numeric or categorical. There is an important trade-off between the bias and variance of a classification tree (Efron and Hastie, 2016). A deep classification tree often has a low bias, meaning that the classification error of the training observations is low. A tree is called deep if it has a large depth. The depth of a tree is measured by taking the largest distance between the parent node and one of the terminal nodes. However, a deep classification tree has a high variance as the out-of-sample classification error is high. Thus, an individual classification tree is not a good predictor. This led to the introduction of a random forest (Breiman, 2001). In a random forest the high variance of deep individual classification trees is reduced by averaging them (Hastie et al., 2001). The random forest algorithm classifies by assigning the class that the majority of the trees predict. The various classification trees are constructed using randomized versions of the training data (Efron and Hastie, 2016). The randomization could occur in multiple forms, such as bootstrapping or subsampling the training observations and subsampling explanatory variables. The randomization ensures that the various trees are not too highly correlated, which is crucial for variance reduction (Breiman, 2001).

The random forest algorithm is defined as follows (Efron and Hastie, 2016):

1. Given the training data (Y_i, x_i) for i = 1, ..., N, fix an m ≤ p and a B, which is the number of classification trees.

2. For each b = 1, ..., B:

   (a) Use random sampling with replacement from the training data to generate a bootstrapped data set with N observations.

   (b) Grow a maximal-depth tree with the bootstrapped data set. At each partition randomly select m of the p explanatory variables. Consider only these m explanatory variables for partitioning.

   (c) Save the tree and the corresponding sampling frequencies that indicate how often each training observation is used in the bootstrapped data set.

3. Assign a new observation x* to the class the majority of the classification trees predict.

4. Estimate the classification error rate by calculating the mean out-of-bag error of all observations in the training data. The out-of-bag error of an observation (Y_i, x_i) is equal to the average prediction error of the classification trees for which the observation was not part of the bootstrapped data set.

From the above algorithm it follows that the random forest has two key hyperparameters. The first is the number of trees in the random forest, B. The second is the number of explanatory variables that are considered for partitioning, m. It is important that good values for these two hyperparameters are determined because they influence the performance of the random forest.

The random forest method is considered in this thesis since it is used in the literature concerning lapse (Liefkens, 2019). Furthermore, it has shown superior results to a logit model and a CLL model (Couronné et al., 2018; Liefkens, 2019). Additionally, a random forest can be seen as an intricate version of a k-NN algorithm since a random forest can choose which explanatory variables to use (Efron and Hastie, 2016). Thus, it is interesting to examine whether or not the random forest will outperform the k-NN algorithm.
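A minimal scikit-learn sketch of the algorithm above on hypothetical synthetic data follows; n_estimators and max_features play the roles of B and m, and oob_score=True produces the out-of-bag error estimate of step 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: p = 5 regressors and an imbalanced lapse indicator.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)

forest = RandomForestClassifier(
    n_estimators=500,   # B, the number of classification trees
    max_features=2,     # m <= p, variables considered at each split
    oob_score=True,     # step 4: out-of-bag estimate of the error rate
    random_state=0,
)
forest.fit(X, y)
print("Out-of-bag error rate:", round(1 - forest.oob_score_, 3))
```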

3.1.4 Neural Network

A neural network is a method that could be used for regression or classification. The structure of a neural network is loosely based upon the structure of the nervous system. The term 'loosely' is used as the nervous system is too complex to be captured in a comprehensible model. In this thesis the input of the neural network consists of the explanatory variables and the output is a prediction about whether or not a policy is lapsed. The literature on neural networks is extensive. There are numerous types of neural networks and different algorithms available to train these neural networks. In this section only the concepts essential for this thesis are dealt with. This section is mainly based on Efron and Hastie (2016) and Nielsen (2015).


Neural Network Structure

The structure of a neural network is determined by the memory units or neurons and the connections between these neurons. The neurons are arranged in different layers. The first layer is referred to as the input layer, while the last layer is called the output layer. Typically, one or more hidden layers are placed between the input layer and the output layer. The hidden layers allow the neural network to capture nonlinear relationships between its input and output. Different types of layers exist. However, in this thesis only fully connected layers are considered, as this layer type is most relevant for the application in this thesis. Every neuron in a fully connected layer receives a certain input from every neuron in the previous layer. In Figure 3.4 a schematic representation of a feed-forward neural network with three hidden layers is displayed. In a feed-forward network the neurons are connected such that the output of a neuron does not affect its input (Sutton and Barto, 2018). In recurrent neural networks such loops are present. However, to properly use a recurrent neural network for prediction, longitudinal data is required. Since this is not available, a recurrent neural network is not considered in this thesis.

Figure 3.4: A schematic representation of a neural network with three hidden layers, p input units and nine output units (Efron and Hastie, 2016).

Every connection between two neurons, represented by arrows in Figure 3.4, has a real-valued weight. A neuron obtains the output of the neurons of the previous layer and multiplies these with their corresponding weights. The weights could either magnify or lessen the input received. The neuron then adds a bias term and performs a transformation on the value obtained. This constitutes the output of the neuron, which it will pass on to the next layer. The function used for this transformation is called the activation function. Typically, the activation function is a nonlinear function. The composition of linear functions is again a linear function. Thus, if all activation functions are linear, the neural network is simply a generalized linear model (Efron and Hastie, 2016). Popular choices for the activation function are the sigmoid function, the hyperbolic tangent function and the rectified linear function,

which are given respectively by:

\text{sigmoid}(z) = \frac{1}{1 + e^{-z}},   (3.15)

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}},   (3.16)

\text{relu}(z) = \max\{0, z\}.   (3.17)

A neural network could thus be perceived as an intricate function, f(s; W), depending on the input, s, and with the weights and biases as its parameters, W (Efron and Hastie, 2016). In this thesis the input of the neural network is the explanatory variables, previously denoted by x_i. The training of the neural network refers to the iterative process in which its weights and biases are updated. Before it is discussed how the weights and biases of the neural network are updated exactly, the mathematical framework used to describe the neural network is introduced (Efron and Hastie, 2016). Consider a neural network with M layers. For m = 1, ..., M, the transition from layer m − 1 to neuron l in layer m is described by:

z_l^{(m)} = w_{l0}^{(m-1)} + \sum_{j=1}^{p_{m-1}} w_{lj}^{(m-1)} a_j^{(m-1)},   (3.18)

a_l^{(m)} = g^{(m)}\left(z_l^{(m)}\right),   (3.19)

where w_{l0}^{(m-1)} is the bias of neuron l in layer m, p_{m-1} denotes the number of neurons in layer m − 1, w_{lj}^{(m-1)} is the weight associated with the connection between neuron j in layer m − 1 and neuron l in layer m, a_j^{(m-1)} is the output of neuron j in layer m − 1 and g^{(m)}(·) denotes the activation function used in layer m. Furthermore, for the input layer the input of the lth neuron is defined as a_l^{(1)} = x_l and the number of neurons is given by p_1 = p, where x_l denotes the lth explanatory variable and p denotes the number of explanatory variables. Equation (3.18) and Equation (3.19) mathematically describe the structure of the neural network. The term z_l^{(m)} in Equation (3.18) is the sum of the bias of neuron l in layer m and the linear combination of the output of each neuron in layer m − 1 and the corresponding weight. The term a_l^{(m)} in Equation (3.19) denotes the output of neuron l in layer m, which is calculated by using the activation function of layer m and z_l^{(m)}.

Equation (3.18) and Equation (3.19) could be rewritten using vector notation as follows:

z^{(m)} = W^{(m-1)} a^{(m-1)},   (3.20)

a^{(m)} = g^{(m)}\left(z^{(m)}\right),   (3.21)

where W^{(m-1)} denotes the matrix of weights and biases for the connections between layer m − 1 and layer m, a^{(m-1)} denotes a vector which has a one as its first element, followed by the outputs of layer m − 1, and g^{(m)}(·) operates elementwise.
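As an illustration of Equations (3.20) and (3.21), the following numpy sketch computes the forward pass of a small fully connected network. Prepending a one to a^{(m−1)} lets the bias sit in the first column of W^{(m−1)}, matching the vector notation above; the layer sizes and weights are hypothetical.

```python
import numpy as np

def forward(x, weights, activations):
    """Compute a^{(M)} from a^{(1)} = x through fully connected layers."""
    a = np.asarray(x, dtype=float)
    for W, g in zip(weights, activations):
        a_aug = np.concatenate(([1.0], a))  # [1, a^{(m-1)}] absorbs the bias
        z = W @ a_aug                       # z^{(m)} = W^{(m-1)} a^{(m-1)}
        a = g(z)                            # a^{(m)} = g^{(m)}(z^{(m)})
    return a

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
weights = [rng.normal(size=(4, 3)),  # 2 inputs (+ bias) -> 4 hidden neurons
           rng.normal(size=(1, 5))]  # 4 hidden (+ bias) -> 1 output neuron
print(forward([0.5, -1.2], weights, [relu, sigmoid]))
```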


Neural Network Optimization

Assuming that the activation functions are chosen such that f(x_i; W) is differentiable, the parameters of the neural network could be determined by minimizing a loss function, which could typically be written as:

\min_W \left( \frac{1}{N} \sum_{i=1}^{N} L(Y_i, f(x_i; W)) \right),   (3.22)

where N denotes the number of training observations (Y_i, x_i) and L(·) denotes the loss function for a single observation. The loss function indicates how the neural network with given parameter values is performing. In this thesis the loss function is the binary cross-entropy since the neural network is used as a binary classifier. The binary cross-entropy function is given by (Nielsen, 2015):

L(Y_i, f(x_i; W)) = -[Y_i \log(f(x_i; W)) + (1 - Y_i) \log(1 - f(x_i; W))].   (3.23)

The loss function is typically convex in f(x_i; W), but not in the weights and biases W (Efron and Hastie, 2016). Therefore, finding the optimal weights and biases is not a straightforward task.
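To make this concrete, the following hypothetical sketch trains a small feed-forward binary classifier by minimizing the binary cross-entropy of Equation (3.23) with a gradient-based optimizer. Keras (TensorFlow) is an assumed dependency, and the data and architecture are illustrative, not the tuned configuration the thesis arrives at in Chapter 5.

```python
import numpy as np
from tensorflow import keras

# Hypothetical synthetic data with a binary response.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(8, activation="relu"),     # one hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # output pi_i in (0, 1)
])
# "binary_crossentropy" is exactly the loss in Equation (3.23), averaged
# over the mini-batch as in Equation (3.22).
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))
```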

The most popular algorithms used to optimize neural networks are forms of gradient descent (Ruder, 2016). However, before gradient descent is discussed, the method used to find the gradient of the loss function, backpropagation, is explained. Backpropagation uses the chain rule for differentiation to differentiate the loss function with respect to the weights and biases. In order to use backpropagation, two assumptions have to be made about the loss function (Nielsen, 2015). Firstly, one should be able to write the loss function as the mean of the loss functions for single observations. This assumption is necessary since with backpropagation one finds the partial derivatives of the loss function with respect to the weights and biases by taking the average of the partial derivatives of the individual loss function for each observation. In this thesis the loss function considered is of the form of Equation (3.22), therefore this assumption is satisfied. Secondly, the loss function should be a function of the output of the neural network. Again, as the loss function used is of the form of Equation (3.22), this assumption is also satisfied. The backpropagation algorithm to compute the gradient of the loss function for a single observation (Y_i, x_i) is summarized as follows (Efron and Hastie, 2016; Nielsen, 2015):

1. Given an observation (Y_i, x_i), for the input layer set a^{(1)} = [1, x_i].

2. For layer m = 2, ..., M compute z^{(m)} and a^{(m)} = g^{(m)}(z^{(m)}) using the current weights and biases.

3. Compute the error vector of the last layer as follows:

\delta^{(M)} = \frac{\partial L(Y_i, f(x_i; W))}{\partial a^{(M)}} \circ \dot{g}^{(M)}\left(z^{(M)}\right),   (3.24)

where ∘ denotes the Hadamard product.

4. For layer m = M − 1, ..., 2 backpropagate the error as follows:

\delta^{(m)} = \left(\left(W^{(m)}\right)^T \delta^{(m+1)}\right) \circ \dot{g}^{(m)}\left(z^{(m)}\right).   (3.25)
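The following numpy sketch, a hypothetical illustration rather than the thesis's implementation, traces steps 1 to 4 for a small network with sigmoid activations and the binary cross-entropy loss of Equation (3.23). The bias-in-first-column convention of the earlier forward-pass sketch is reused, and the bias column is dropped when backpropagating, since the constant input unit receives no error.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def backprop_errors(x, y, weights):
    """Return the error vectors delta^{(m)} of Equations (3.24)-(3.25)."""
    # Steps 1-2: forward pass, storing z^{(m)} for every layer.
    a, zs = np.asarray(x, dtype=float), []
    for W in weights:
        z = W @ np.concatenate(([1.0], a))
        zs.append(z)
        a = sigmoid(z)
    # Step 3: dL/da^{(M)} for binary cross-entropy, Hadamard with g'(z^{(M)}).
    f = a                                    # network output f(x; W)
    dL_da = -(y / f - (1.0 - y) / (1.0 - f))
    delta = dL_da * dsigmoid(zs[-1])
    deltas = [delta]
    # Step 4: backpropagate through layers m = M-1, ..., 2.
    for W, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        delta = (W[:, 1:].T @ delta) * dsigmoid(z)  # drop the bias column
        deltas.append(delta)
    return list(reversed(deltas))

rng = np.random.default_rng(4)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 5))]
print(backprop_errors([0.5, -1.2], 1.0, weights))
```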
