Master’s Thesis
An Analysis of Classification Models for Lapse Risk
Author: Elles van Sark (S3115615)
Supervisor: Dr. Kesina
August 27, 2020
Master’s Thesis Econometrics, Operations Research and Actuarial Studies
Supervisor: Dr. Kesina
Second assessor: Dr. Vullings
Abstract
A high lapse ratio has several negative consequences for an insurance company. First, it could result in a liquidity problem. Furthermore, the insurance company loses the ability to make future profits on the lapsed policies and its reputation could be damaged. To mitigate these consequences it is crucial that insurance companies are able to predict which policies will lapse. Moreover, if insurers could accurately predict which policies will lapse, strategies could be employed to retain these policies. In this thesis six classification models are compared on their out-of-sample predictive ability for lapse. The six models examined are the logit model, the complementary log-log model, the k-nearest neighbour algorithm, a random forest, a neural network and the extreme gradient boosting algorithm. Car insurance data from a Dutch insurance company is used. The out-of-sample predictive ability of the models is evaluated with five performance metrics, namely the accuracy, the F-score, the area under the receiver operating characteristic curve, the Matthews correlation coefficient and Theil’s U. Based on the performance metrics, the random forest and the extreme gradient boosting algorithm have a greater out-of-sample predictive ability than the other models. Furthermore, the extreme gradient boosting algorithm performs marginally better than the random forest.
Contents
1 Introduction
  1.1 Context
  1.2 Problem Statement
  1.3 Structure of the Thesis
2 Literature Review
  2.1 Lapse Research
    2.1.1 Economic and Company Based Research
    2.1.2 Policy and Policyholder Based Research
  2.2 Classification Models
  2.3 Contribution to the Literature
3 Methods
  3.1 Models
    3.1.1 Generalized Linear Models
      Logit Model
      Complementary Log-Log Model
    3.1.2 k-Nearest Neighbour Algorithm
    3.1.3 Random Forest
    3.1.4 Neural Network
      Neural Network Structure
      Neural Network Optimization
      Regularization
      Relevance for this Thesis
    3.1.5 Extreme Gradient Boosting
      Extreme Gradient Boosting Framework
      Additional Extreme Gradient Boosting Features
      Loss Function
      Relevance for this Thesis
  3.2 Performance Metrics
    3.2.1 Accuracy
    3.2.2 F-score
    3.2.3 Area Under the Receiver Operating Characteristic Curve
    3.2.4 Matthews Correlation Coefficient
    3.2.5 Theil’s U
4 Data
  4.1 Explanatory Variables
    4.1.1 Characteristics of the Policy
    4.1.2 Characteristics of the Policyholder
  4.2 Data Preparation
  4.3 Descriptive Statistics
5 Implementation and Estimation
  5.1 Generalized Linear Models
    5.1.1 Logit Model
    5.1.2 Complementary Log-Log Model
  5.2 k-Nearest Neighbour Algorithm
  5.3 Random Forest
  5.4 Neural Network
  5.5 Extreme Gradient Boosting
6 Implementation and Estimation with a Balanced data set
  6.1 Generalized Linear Models
    6.1.1 Logit Model
    6.1.2 Complementary Log-Log Model
  6.2 k-Nearest Neighbour Algorithm
  6.3 Random Forest
  6.4 Neural Network
  6.5 Extreme Gradient Boosting
7 Results of the Out-of-sample Analysis
8 Conclusion and Discussion
A Appendix A
B Appendix B
List of Figures
3.1 An example of the 3-NN majority vote procedure.
3.2 An example of a problematic majority vote.
3.3 An example of a tree on the left and the corresponding regression surface on the right (Efron and Hastie, 2016).
3.4 A schematic representation of a neural network with three hidden layers, p input units and nine output units (Efron and Hastie, 2016).
4.1 Histograms of lapsed and non-lapsed policies according to the change in four product categories.
4.2 Correlation matrix of the regressors.
5.1 Cross-validation estimates of accuracy and kappa for different values of k for the k-NN algorithm.
6.1 Cross-validation estimates of accuracy and kappa for different values of k for the k-NN algorithm using the balanced data set.
A.1 Histogram of lapse with comprehensive coverage.
A.2 Histogram of lapse with accident coverage.
A.3 Scatter plots of hyperparameters of the neural network and the accuracy on the validation set.
A.4 Scatter plots of hyperparameters in the XGBoost algorithm and the accuracy on the validation set.
A.5 Scatter plots of hyperparameters of the neural network and the accuracy on the validation set using the balanced dataset.
A.6 Scatter plots of hyperparameters in the XGBoost algorithm and the accuracy on the validation set using the balanced dataset.
List of Tables
2.1 Literature on Lapse considering Policy and Policyholder Characteristics.
4.1 Summary statistics.
5.1 Final logit model.
5.2 Final complementary log-log model.
5.3 Out-of-bag estimate of the error rate of the random forest for different values of the number of classification trees and the number of explanatory variables considered at each split.
5.4 Ranges for the hyperparameters of the neural network.
5.5 Ranges for the hyperparameters of the XGBoost algorithm.
6.1 Final logit model using the balanced data set.
6.2 Final complementary log-log model using the balanced data set.
6.3 Out-of-bag estimate of the error rate of the random forest for different values of the number of classification trees and the number of explanatory variables considered at each split using the balanced data set.
7.1 Performance metrics for the models using the out-of-sample data set.
7.2 Performance metrics for the models using the balanced out-of-sample data set.
A.1 Optimal threshold value for the logit model with the training data.
A.2 Optimal threshold value for the complementary log-log model with the training data.
A.3 The hyperparameter values and accuracy on the validation set of the ten best performing neural network configurations.
A.4 The hyperparameter values and accuracy on the validation set of the ten best performing XGBoost configurations.
A.5 Performance metrics for the models using the training dataset.
A.6 Summary statistics of the balanced dataset.
A.7 Optimal threshold value for the logit model using the balanced training data.
A.8 Optimal threshold value for the complementary log-log model using the balanced training data.
A.9 The hyperparameter values and accuracy on the validation set of the ten best performing neural network configurations using the balanced dataset.
A.10 The hyperparameter values and accuracy on the validation set of the ten best performing XGBoost configurations using the balanced dataset.
A.11 Performance metrics for the models using the balanced training dataset.
B.1 Performance metrics.
Acronyms
k-NN k-Nearest Neighbour.
AC Accident Coverage.
AIC Akaike Information Criterion.
AUC Area Under the receiver operating characteristic Curve.
CART Classification And Regression Tree.
CC Comprehensive Coverage.
CLL Complementary Log-Log.
FN False Negative.
FP False Positive.
LSTM Long Short-Term Memory.
MCC Matthews Correlation Coefficient.
NCDL No-claim Discount Lost.
ROC Receiver Operating Characteristic.
SVM Support Vector Machine.
TN True Negative.
TP True Positive.
XGBoost EXtreme Gradient Boosting.
1 Introduction
The aim of this thesis is to identify the optimal classification model to predict lapse of insurance policies based on several performance metrics. First, a general introduction to this topic is provided. Afterwards, the research question is introduced. Lastly, the structure of this thesis is outlined.
1.1 Context
Insurance companies have a crucial function. They provide a way in which individuals and companies are able to exchange risk, often associated with an uncertain financial consequence, for regular fixed premium payments (Swain and Swallow, 2015). A bankruptcy of an insurance company could have detrimental effects on the financial system and the economy (Swain and Swallow, 2015). Therefore, an insurance company should be aware of and manage all types of risk it faces. To properly supervise insurers, a regulatory framework called Solvency II was implemented across Europe in 2016. Solvency II is based on three pillars, namely quantitative requirements, qualitative supervision and transparency. Under Solvency II the solvency capital requirement of an insurance company depends on the risks taken by the insurance company. If the insurance company takes more risk, the solvency capital requirement increases. Different types of risk are considered in the computations, such as interest rate risk, spread risk and lapse risk. This thesis focuses on lapse risk.
Under Solvency II lapse risk accounts for all risk stemming from potential policyholder options to "fully or partly terminate, surrender, decrease, restrict or suspend the insurance cover, but also the right to fully or partially establish, renew, increase, extend or resume the insurance cover" (Burkhart, 2018). An insurance policy is surrendered if the policyholder terminates the policy before the original end date (Dickson et al., 2009). The policyholder could receive a cash payment, called the surrender value. Studies concerning lapse frequently do not distinguish between a lapse and a surrender and use the terms interchangeably. Occasionally, the distinction is made that in case of a lapse the policyholder does not receive a cash payment, while this does occur in case of a surrender. However, in this thesis the definition as in the Solvency II framework is used.
Next, the potential consequences of a lapse for the insurance company are examined. A high lapse ratio could entail several adverse consequences for an insurance company. The lapse ratio is defined as the number of lapsed policies in a given time period divided by the total number of policies. Eling and Kochanski (2013) describe four of these consequences. Firstly, an insurance company could face a liquidity problem should many policies lapse, since it might be obliged to pay the surrender values to the policyholders. A company faces liquidity risk when it is not able to easily convert its assets into cash in order to meet financial obligations. Liquidity risk forces the sale of assets at a reduced price or entails paying interest on loans taken to meet payments (Kamau and Nejru, 2016). Hence, there is a negative relationship between liquidity risk and the return on equity of an insurer. Secondly, the insurer loses the ability to make future profits on the policy. Particularly if lapse occurs shortly after the policy is established, the insurer might not recover initial expenses, such as the acquisition and underwriting costs. Thirdly, the ability to lapse increases the risk of adverse selection, which will decrease the profitability of the insurer. Lastly, a high lapse rate could diminish the insurer’s reputation. This could cause even more lapse events and is detrimental for acquiring new customers. Kuo et al. (2003) identify similar consequences. Furthermore, they find that insurers select a more liquid investment portfolio, which normally yields lower returns, to reduce the liquidity risk associated with lapse. This could lead to an increase in premiums, which further reduces the probability of obtaining new customers. A study about lapse rates of Chinese life insurance products confirms that high lapse rates could harm the financial status of an insurer and the ability to acquire new customers (Yu et al., 2019). Thus, lapse affects multiple aspects of an insurance company, ranging from policy design and pricing to how risk is managed (Eling and Kochanski, 2013).
1.2 Problem Statement
The above implications of lapse demonstrate the importance of predicting lapse for insurance companies. This allows them to diminish the negative effects of lapse. To identify which policies are at risk of lapsing, a classification model could be used.
If insurers could accurately predict which policies will lapse, strategies could be employed to retain these policies. Furthermore, identification of important lapse drivers allows an insurance company to target the preferred demographic group. A classification model predicts the status of a policy based on certain characteristics. In this thesis several classification models are used to identify policies at risk of lapsing.
There are numerous classification models available, such as generalized linear models. Due to progress in electronic computation, it has become feasible to estimate newer approaches such as neural networks and random forests. Insurers should be aware of the advantages and disadvantages of each of these models. An extensive comparison of classification models is therefore useful for insurers. A crucial feature of these models is their out-of-sample predictive ability, which is assessed by several performance metrics. The aim of this thesis is to identify which model has a superior out-of-sample predictive ability when applied to lapse data.
From the problem statement the following research question arises:
Which model has the best out-of-sample predictive ability for lapse risk based on several performance metrics?
The models included in the comparison are
1. Logit model,
2. Complementary Log-Log (CLL) model,
3. k-Nearest Neighbour (k-NN) algorithm,
4. Random forest,
5. Neural network,
6. EXtreme Gradient Boosting (XGBoost) algorithm.
Furthermore, to compare the models the performance metrics used are given by
1. Accuracy,
2. F-score,
3. Area Under the receiver operating characteristic Curve (AUC),
4. Matthews Correlation Coefficient (MCC),
5. Theil’s U.
Before the performance metrics are computed, the optimal configuration for each model is found. For most models this involves determining good values for their hyperparameters. Furthermore, the company that provided the data is interested in the effect of a lapse in another product category on lapse in car insurance, the main product category in this thesis. Policyholders often purchase more than one insurance policy from an insurer. For example, a policyholder could own a home insurance policy and a car insurance policy from the same insurance company. It is interesting to examine whether a lapse of a policy in another product category affects the likelihood of lapse of the car insurance policy. Therefore, this will receive special attention in the remainder of this thesis.
1.3 Structure of the Thesis
The remainder of this thesis is structured as follows. In Chapter 2 the existing literature on lapse prediction is outlined. In Chapter 3 the models and performance metrics used in this thesis are discussed. Subsequently, Chapter 4 provides a description of the data used. Afterwards, in Chapter 5 the implementation and estimation of the final models are discussed. In Chapter 6 the models are estimated with an artificially created data set, which is more balanced with respect to the response variable. This allows for an additional comparison. In Chapter 7 the out-of-sample results are discussed. In Chapter 8 the main conclusion, discussion points and recommendations for future research are provided. The appendix and bibliography are found at the end of this thesis.
2 Literature Review
The potential causes of lapse have been of interest for a long time. This chapter provides an overview of relevant academic literature about lapse risk. Furthermore, papers that apply classification models to other problems are also reviewed. These are examined since there might be good classification methods which have not yet been applied to predict lapse.
2.1 Lapse Research
The existing empirical literature can typically be divided into two classes (Eling and Kochanski, 2013). These classes differ with respect to the types of explanatory variables examined. The first class considers characteristics of the insurance company and macroeconomic variables. The second class examines characteristics of the policy and policyholder. Although the above division is convenient, the difference between classes is not absolute. Research that focuses on macroeconomic variables occasionally includes policy characteristics and vice versa. This thesis belongs to the second class. However, since the distinction between the two classes is not absolute, the first class is discussed as well for completeness. Moreover, this allows the reader to place this thesis in a broader context.
2.1.1 Economic and Company Based Research
The first class focuses on company characteristics and macroeconomic variables.
Originally, research concentrated on the interest rate and unemployment rate. These variables are used to investigate the interest rate hypothesis and the emergency fund hypothesis respectively. The interest rate hypothesis assumes that the lapse rate increases when the interest rate rises. The emergency fund hypothesis presumes that the lapse rate increases in times of economic downturn, since policyholders use the surrender value as an emergency fund or can no longer afford to pay the premiums. However, no general consensus exists. Outreville (1990) finds evidence to support the emergency fund hypothesis, but not the interest rate hypothesis. Kuo et al. (2003) confirm the explanatory power of the unemployment rate, but also find that, in the long run, the interest rate affects the lapse rate.
Several papers expand the above analysis, either by considering macroeconomic variables besides the interest rate and unemployment rate or by considering company and policy characteristics. Kim (2005) examines the lapse rate of a Korean-based life insurance company by a logit model, a CLL model and an arctangent model. He studies both macroeconomic variables and a policy characteristic, namely the policy age since inception. He finds that this wider range of explanatory variables does affect the lapse rate. Furthermore, the logit model and CLL model generally outperform the arctangent model. Yu et al. (2019) consider several macroeconomic variables, such as the unemployment rate and migration, and firm characteristics like firm size and reputation. They confirm the benefit of including additional explanatory variables in the analysis of lapse. Barsotti et al. (2016) incorporate the effects of macroeconomic variables and lapse among other policyholders in their study. Policyholders’ decisions could be highly correlated when facing adverse economic scenarios. Therefore, taking this effect explicitly into account provides more precise modeling of lapse.
2.1.2 Policy and Policyholder Based Research
The second class focuses on characteristics of individual policies and policyholders.
Studies typically include variables such as the payment frequency and gender of the policyholder. The number of papers belonging to the second class is limited. This is in part due to the confidential nature of the data needed. Since this thesis belongs to the second class, these papers are discussed in detail.
Renshaw and Haberman (1986) examine life insurance lapses from several Scottish life insurance companies. Their explanatory variables are the age at entry of the policyholder, duration of the policy, the office and the type of policy. They estimate both a linear model with normal errors and a logit model. All four explanatory variables have a significant effect on lapse.
Kagraoka (2005) uses a negative binomial model to examine the number of insurance lapses for a Japanese insurance company. In addition to the change in the unemployment rate, several policyholder and policy characteristics are considered. He finds that the change in the unemployment rate and contract duration are important determinants of lapse.
Cerchiara et al. (2009) examine lapse for an Italian life insurance policy by using generalized linear models, in particular the Poisson model. They determine that policy duration, calendar year of exposure, product class and policyholder age are important drivers of lapse.
Milhaud et al. (2011) use a logit model and a Classification And Regression Tree (CART) model to examine lapse rates for endowment products. Their explanatory variables include the duration of the policy and the policyholder’s age at entry. The most important drivers of lapse are the type of contract and duration. If the policy has reached the period in which the policyholder can terminate the policy without a financial penalty, the probability of lapse increases substantially. Furthermore, the logit model has a higher sensitivity than the CART model.
Pinquet et al. (2011) analyse lapse behavior in long-term insurance with a proportional hazard model. Explanatory variables considered include the policyholder’s age at entry and calendar year. They find that younger policyholders are prone to lapse since they have lower wealth, their insurance needs are more likely to change and the losses they face due to lapse are lower. They conclude that inadequate knowledge about the insurance policy is the main driver behind lapse.
Eling and Kiesenbauer (2014) use generalized linear models and a proportional hazard model to determine causes of lapse in life insurance in Germany. All their explanatory variables have a significant effect on lapse. They consider variables such as policyholder’s current age, policy age and premium payment frequency. The three main drivers of lapse are the calendar year, policy age and payment frequency of premiums.
Liefkens (2019) employs two generalized linear models, namely the logit model and the CLL model, and a random forest to analyse lapse in the Netherlands. She considers numerous explanatory variables, which can be found in Table 2.1. Yearly premium, payment frequency, expired duration, contract duration, sex, medical raise and the presence of a second policyholder are identified as important lapse drivers. Furthermore, the random forest outperforms the generalized linear models with respect to its predictive ability.
Xong and Kang (2019) consider a logit model, the k-NN algorithm, a neural network and a Support Vector Machine (SVM) to examine life insurance lapse risk of a Malaysian-based insurance company. Their explanatory variables include the policyholder’s gender and age at entry. The four models are compared based on their classification accuracy and the AUC. They conclude that the neural network and the SVM outperform the logit model and the k-NN algorithm.
Table 2.1 provides an overview of the literature belonging to the second class. This leads to a few general observations. First, popular approaches to model lapse include generalized linear models, such as the logit model, CLL model and the Poisson model. Moreover, survival analysis and machine learning techniques like a neural network and a random forest are frequently used. Furthermore, only three papers explicitly compare different methods to predict lapse, namely Milhaud et al. (2011), Liefkens (2019) and Xong and Kang (2019). They identify the logit model, a random forest, a neural network and an SVM as good classification methods for lapse. Therefore, it is worthwhile to also compare these methods in this thesis. Additionally, the papers that do compare classification methods all use different performance metrics.
2.2 Classification Models
Classification models are used in many fields other than lapse research. Therefore, additional literature concerning classification models is examined. Min and Lee (2005) apply an SVM, multiple discriminant analysis, a logit model and a neural network to the bankruptcy prediction problem. The SVM outperforms the other methods. Various papers likewise identify the high classification accuracy of the SVM (Bellotti and Crook, 2009; Huang et al., 2008). However, Karaa and Krichene (2012) conclude that a multilayer neural network is superior to an SVM for classifying credit risk based on its prediction accuracy and reduction of type I error. Furthermore, a neural network is an effective classifier, which can easily be retrained to accommodate changes in the environment (Wójcicka, 2017).
Brown and Mues (2012) compare several classification algorithms such as a logit model, a neural network, a least squares SVM and a random forest for loan default prediction. They conclude that a logit model is quite competitive compared to more complicated methods such as the random forest, even for imbalanced data. This result is confirmed by Gouvêa and Gonçalves (2007), who identify the slight superiority of the logit model to a neural network for credit models. They also consider a genetic algorithm; however, this model is outperformed by the logit model and the neural network.
The XGBoost algorithm is a newer classification method, which has not yet been used to predict lapse. It was introduced by Chen and Guestrin (2016). The XGBoost algorithm has a good predictive ability and is generally faster than comparable methods (Chen and Guestrin, 2016). Hence, it is worthwhile to compare this method to other classification methods.
Table 2.1: Literature on Lapse considering Policy and Policyholder Characteristics.

Renshaw and Haberman (1986)
  Country: Scotland
  Time period: 1976
  Explanatory variables: Policyholder age at entry; Contract age; Office; Product type
  Performance metric(s): -
  Model(s) considered: Linear model with normal errors; Logit model
  Preferred model(s): -

Kagraoka (2005)
  Country: Japan
  Time period: 1993-2001
  Explanatory variables: Policyholder gender; Policyholder age at entry; Seasonality; Contract duration; Change in the unemployment rate; Heterogeneity
  Performance metric(s): -
  Model(s) considered: Negative binomial model; Poisson model
  Preferred model(s): -

Cerchiara et al. (2009)
  Country: Italy
  Time period: 1991-2007
  Explanatory variables: Product; Calendar year of exposure; Contract duration; Year of policy inception
  Performance metric(s): -
  Model(s) considered: Poisson model
  Preferred model(s): -

Milhaud et al. (2011)
  Country: Spain
  Time period: 1999-2007
  Explanatory variables: Contract type; Policyholder age at entry; Contract duration; Premium frequency; Saving premium; Face amount
  Performance metric(s): Sensitivity; Specificity
  Model(s) considered: Logit model; CART model
  Preferred model(s): Logit model

Pinquet et al. (2011)
  Country: Spain
  Time period: 1993-2006
  Explanatory variables: Policyholder age at entry; Calendar year; Contract age; Health bonus-malus coefficient
  Performance metric(s): -
  Model(s) considered: Proportional hazard model
  Preferred model(s): -

Eling and Kiesenbauer (2014)
  Country: Germany
  Time period: 2000-2010
  Explanatory variables: Policyholder current age; Policyholder gender; Policy age; Product type; Premium frequency; Distribution channel; Calendar year; Supplementary cover
  Performance metric(s): -
  Model(s) considered: Poisson model; Binomial model; Negative binomial model; Proportional hazard model
  Preferred model(s): -

Liefkens (2019)
  Country: Netherlands
  Time period: 2015-2017
  Explanatory variables: Provision; Surrendered amount; Yearly premium; Death benefit; Expired duration; Lapse duration; Contract duration; Premium state; Payment frequency; Policyholder gender; Age group; Medical raise; Indicator Second Policyholder
  Performance metric(s): Theil’s U
  Model(s) considered: Logit model; CLL model; Random forest
  Preferred model(s): Random forest

Xong and Kang (2019)
  Country: Malaysia
  Time period: 2017
  Explanatory variables: Premium frequency; Policyholder age at entry; Policy term; Sum assured; Policyholder gender
  Performance metric(s): Accuracy; AUC
  Model(s) considered: Logit model; k-NN algorithm; Neural network; SVM
  Preferred model(s): Neural network; SVM

This table is adjusted from Eling and Kochanski (2013) and Eling and Kiesenbauer (2014).
2.3 Contribution to the Literature
In general, this thesis contributes to the literature in two ways. First, this thesis makes a practical contribution. Insurance companies can use the findings of this thesis to select which models they use to predict lapse. Second, this thesis makes a theoretical contribution. The existing literature that focuses on characteristics of individual policies and policyholders is limited. Therefore, additional research on lapse prediction with these characteristics is valuable. Moreover, from Section 2.1 it follows that most of the papers on lapse risk identify lapse drivers, but do not explicitly compare the out-of-sample predictive ability of different models. Accurate prediction of lapse is important since, as discussed in Chapter 1, a lapse has multiple negative consequences for an insurance company. Furthermore, comparing different models is not only useful for lapse research, but also for other classification problems.
There are three papers in the literature that do compare the predictive ability of classification models, namely Milhaud et al. (2011), Xong and Kang (2019) and Liefkens (2019). This thesis extends their work in several ways. First, more classification methods are compared in this thesis. In previous research only a limited number of models are compared. For example, there is no comparison between a random forest, a neural network and the XGBoost algorithm. These models are identified as good classification methods in the literature. Second, more performance metrics are used in this thesis. In this thesis five performance metrics are examined to determine which model is superior. In the existing literature classification methods are compared using only one or two performance metrics. Some performance metrics could indicate promising results, while the model might not be satisfactory. This occurs especially if the data is imbalanced. Lapse data is often imbalanced since most policies will not lapse. Therefore, these pitfalls should be taken into account. Insurance companies often have thousands of policies, so even though a lapse does not occur frequently, it is still essential to examine.
Lastly, the effect of a lapse in another insurance category on lapse in car insurance, the main insurance category of this thesis, is investigated. This has not been examined in the existing literature. However, this is important for insurance companies, as it could affect how changes across different insurance categories are implemented. For example, suppose a lapse of a home insurance policy increases the probability that a policyholder also lapses on his or her car insurance policy. In this case, an insurance company should take this effect into account when changing the price of their home insurance policies.
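The remark in this section that a performance metric can look promising on imbalanced data while the model is unsatisfactory can be illustrated with a small sketch. The labels below are synthetic and are not the thesis data. The F-score is assumed to be the usual F1 score, and the convention of returning 0 when a denominator is zero is an assumption made for illustration only.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with lapse coded as 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f_score(tp, tn, fp, fn):
    # F1 score; defined as 0 here when there are no positives at all (assumption).
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient; set to 0 when undefined (assumption).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Synthetic portfolio: 1000 policies of which 50 lapse (5% lapse ratio).
y_true = [1] * 50 + [0] * 950

# Classifier A never predicts a lapse.
y_never = [0] * 1000
# Classifier B finds 30 of the 50 lapses and raises 20 false alarms.
y_some = [1] * 30 + [0] * 20 + [1] * 20 + [0] * 930

never = confusion_counts(y_true, y_never)
some = confusion_counts(y_true, y_some)

for name, counts in [("never lapse", never), ("finds 30 of 50", some)]:
    print(name, accuracy(*counts), f_score(*counts), mcc(*counts))
```

In this synthetic setting the two classifiers differ by only one percentage point in accuracy, whereas the F-score and the MCC separate them clearly, which is the motivation for reporting several metrics side by side.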
3 Methods
From the literature review in Chapter 2 it follows that popular approaches to model lapse are generalized linear models and machine learning techniques. In this chapter the methods used in this thesis are discussed. First, the different models are outlined. All the models can be used to assess the relationship between a binary response variable and explanatory variables. Afterwards, the performance metrics are provided.
3.1 Models
The selection of the models is based on the literature review and more recent classification methods. Previous papers have also used count data models (Poisson model and negative binomial model) and survival analysis (proportional hazard model).
In this thesis these types of models are not considered. This is due to the fact that the focus of this thesis is the occurrence of a lapse of an individual policy. Count data models focus on the total number of lapsed policies and survival analysis examines the time until a lapse occurs. Furthermore, different data is required to estimate these models.
All models used in this thesis can be characterized as machine learning methods.
Two main areas of machine learning are unsupervised and supervised learning. In unsupervised learning the training set is not labeled in advance. The objective of unsupervised learning is to discover structures in or features of the data (Sutton and Barto, 2018). In supervised learning a training data set with labeled examples is available. The principal objective of supervised learning is to obtain an algorithm that is able to find a relationship between the input and output, such that the algorithm performs adequately with out-of-sample data (Sutton and Barto, 2018). In this thesis the actual status of the policy, lapsed or not lapsed, is known. Therefore, supervised learning is used in this thesis.
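The supervised learning setting described above can be sketched as follows: a rule is fitted on labeled training data and then judged on data it has not seen. The data, the single feature (a premium increase in percent), the made-up relationship between feature and lapse, and the simple cut-off rule are all hypothetical choices for illustration and are not taken from this thesis.

```python
import random

def split_holdout(rows, test_fraction=0.25, seed=42):
    """Shuffle labeled rows and split them into a training and a test part."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(rows, cut):
    """Share of rows whose label matches the rule 'predict lapse if x >= cut'."""
    return sum(int(x >= cut) == y for x, y in rows) / len(rows)

def fit_threshold(train):
    """Choose the cut-off on the feature that maximises training accuracy."""
    best_cut, best_acc = None, -1.0
    for cut in sorted({x for x, _ in train}):
        acc = accuracy(train, cut)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut

# Synthetic labeled data (NOT the thesis data): (premium increase in %, lapsed?).
rng = random.Random(0)
rows = []
for _ in range(200):
    increase = rng.uniform(0.0, 20.0)
    lapse_prob = 0.05 + 0.03 * increase  # made-up relationship
    rows.append((increase, int(rng.random() < lapse_prob)))

train, test = split_holdout(rows)
cut = fit_threshold(train)  # the labels of the test rows are never used here
print("in-sample accuracy:    ", accuracy(train, cut))
print("out-of-sample accuracy:", accuracy(test, cut))
```

The out-of-sample figure is the relevant one for comparing models, since the in-sample figure is computed on the same rows that were used to choose the cut-off.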
3.1.1 Generalized Linear Models
Two generalized linear models are considered in this thesis, namely the logit model and the CLL model. Therefore, this section starts with a short introduction to generalized linear models. Thereafter, the two specific models are discussed.
The normal distribution has played a central role in statistical analysis. A larger class of probability distributions, the exponential family, has certain properties in common with the normal distribution. These properties include straightforward expressions for the score statistic and the information matrix. This realization led to the introduction of the generalized linear model (Nelder and Wedderburn, 1972). The generalized linear model expands upon ordinary linear regression since it allows for a response variable with an error distribution different from the normal distribution. Furthermore, in a generalized linear model the response variable is related to a function of the explanatory variables.
A generalized linear model consists of three components (Dobson and Barnett, 2008).
Firstly, the response variables, $Y_1, \dots, Y_N$, which are assumed to be independent and identically distributed according to a distribution from the exponential family. Secondly, a set of parameters, $\beta$, and corresponding explanatory variables, $x_1, \dots, x_N$, all of dimension $p$:
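\[
X =
\begin{bmatrix}
x_1^T \\
\vdots \\
x_N^T
\end{bmatrix}
=
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
\vdots & \vdots & & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{Np}
\end{bmatrix}. \tag{3.1}
\]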
Thirdly, a monotone, differentiable transformation function $F(\cdot)$ that relates the conditional mean of the response variable to the explanatory variables
\[
E[Y_i \mid \Omega_i] = F(x_i^T \beta), \tag{3.2}
\]
where $\Omega_i$ denotes the information set, consisting of explanatory and predetermined variables for $i = 1, \dots, N$.¹
In this thesis the response variable is binary and indicates whether or not a policy is lapsed:
\[
Y_i =
\begin{cases}
0 & \text{if the policy is not lapsed,} \\
1 & \text{if the policy is lapsed.}
\end{cases} \tag{3.3}
\]
Let $\pi_i$ denote the probability that $Y_i = 1$ conditional on the information set $\Omega_i$. Then, $1 - \pi_i$ denotes the probability that $Y_i = 0$ conditional on the information set $\Omega_i$. This implies that $\pi_i = \Pr[Y_i = 1 \mid \Omega_i] = E[Y_i \mid \Omega_i]$. Assuming independence, the joint probability function of $Y_1, \dots, Y_N$ is given by
\[
\prod_{i=1}^{N} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}, \tag{3.4}
\]
which is a member of the exponential family. Therefore, a generalized linear model could be used. In the most elementary case, the probabilities $\pi_i$ can be modelled using the identity function as $F(\cdot)$, which yields
\[
\pi_i = E[Y_i \mid \Omega_i] = F(x_i^T \beta) = x_i^T \beta. \tag{3.5}
\]
However, this could lead to values of $\pi_i$ larger than one or smaller than zero. As $\pi_i$ is a probability, this is not reasonable. In order to ensure that $\pi_i$ is in the interval $[0, 1]$, some conditions should be placed on $F(\cdot)$ (Davidson and MacKinnon, 2004).
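The problem with the identity function can be illustrated with a small numerical sketch. The data below are simulated and purely hypothetical (they are not the insurance data used in this thesis), and the coefficients, sample size and variable names are assumptions chosen for illustration. The model with the identity function as F is estimated by ordinary least squares, after which the fitted values are checked against the interval [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated, hypothetical data: an intercept and one explanatory variable.
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Binary response drawn with an assumed (illustrative) success probability.
true_prob = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x)))
y = (rng.uniform(size=n) < true_prob).astype(float)

# Identity function as F: pi_i = x_i^T beta, estimated by least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

# Count fitted "probabilities" that fall outside the unit interval.
n_below = int(np.sum(fitted < 0.0))
n_above = int(np.sum(fitted > 1.0))
print("fitted values below 0:", n_below)
print("fitted values above 1:", n_above)
print("range of fitted values:", fitted.min(), fitted.max())
```

In a sample like this one, some fitted values are expected to lie outside [0, 1], which is why the identity function is not a suitable choice for F when the response is binary.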
Specifically, the function $F(\cdot)$ is required to have the following properties:
\[
\lim_{x \to -\infty} F(x) = 0, \quad \lim_{x \to \infty} F(x) = 1, \quad \text{and} \quad f(x) \equiv \frac{dF(x)}{dx} > 0. \tag{3.6}
\]
These properties are the defining characteristics of a cumulative distribution function. They guarantee that even though $x_i^T \beta$ could be any real number, $0 \leq F(x_i^T \beta) \leq 1$. These properties additionally guarantee that $F(\cdot)$ is a nonlinear function. Therefore, changes in the value of an element of $x_i$ affect $\pi_i$ differently depending on the value of $x_i^T \beta$. The type of generalized linear model arises from the choice of $F(\cdot)$.
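As a minimal sketch of these properties, the code below uses the logistic cumulative distribution function as one possible choice of F. This choice, the grid of values and the coefficient `beta_j` are assumptions made for illustration only. By the chain rule, the effect of a change in the j-th explanatory variable on the probability is f(x_i^T β) β_j, so the same coefficient produces a different change in probability at different values of x_i^T β.

```python
import numpy as np

def F(z):
    """Logistic CDF, used here only as an illustrative choice of F."""
    return 1.0 / (1.0 + np.exp(-z))

def f(z):
    """Derivative of the logistic CDF: f(z) = F(z) * (1 - F(z))."""
    return F(z) * (1.0 - F(z))

# Grid of values for the linear index x_i^T beta.
z = np.linspace(-10.0, 10.0, 2001)

# Numerical check of the properties in (3.6) on the grid.
print("F at the left end:", F(z[0]))      # close to 0
print("F at the right end:", F(z[-1]))    # close to 1
print("f > 0 everywhere on the grid:", bool(np.all(f(z) > 0.0)))

# Marginal effect of the j-th explanatory variable on pi_i:
# d pi_i / d x_ij = f(x_i^T beta) * beta_j
beta_j = 0.5  # hypothetical coefficient
for index in (-4.0, 0.0, 4.0):
    print("index:", index, "marginal effect:", f(index) * beta_j)
```

The marginal effect is largest when the index is near zero and shrinks as the index moves into either tail, which reflects the nonlinearity of F.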
1