
Modelling of Lapse: An Explanatory and Predictive Analysis


Academic year: 2021



Modelling of Lapse:

An Explanatory and Predictive Analysis


Master’s Thesis Econometrics, Operations Research and Actuarial Studies

Supervisor: Prof. dr. L. Spierdijk


University of Groningen

Modelling of Lapse:

An Explanatory and Predictive Analysis

Author:

Eveline Liefkens, S2718456

Abstract

Better modelling of lapse is useful for insurance companies to avoid liquidity problems and loss of potential future profits on assets. Furthermore, analysis of lapse data can improve the retention policies of insurers. In this thesis, lapse data of a large Dutch insurance company is used to provide answers to two research questions. The explanatory analysis answers the first question: What is the difference in the impact of the risk drivers on lapse between the logit and the complementary log-log model in terms of size, sign and significance of the estimated coefficients? The two generalized linear models provide similar results; hence there is almost no difference between the two models. Both models show that provision, yearly premium, payment frequency, expired duration, contract duration, sex, medical raise and the presence of a second policyholder are significant risk drivers. The predictive analysis examines the second research question: What is the difference in out-of-sample predictive ability between the logit model, the complementary log-log model and the random forest approach based on the forecasting measure Theil’s U? The generalized linear models are used in this analysis as well, together with the random forest approach. The most parsimonious logit and complementary log-log models are selected based on the AIC. The predictive ability of the random forest approach is better than that of the two generalized linear models; hence the random forest approach is recommended to insurance companies.


1 Introduction

1.1 Definition of Lapse

The portfolio of an insurer evolves through the entry of new policies, the expiration of existing policies, mortality, and the premature termination of existing policies. The latter is also referred to as lapse (Haberman and Renshaw, 1996). However, “lapse” has different definitions throughout the literature, which are highlighted by the research of Eling and Kochanski (2013). For example, another definition in their article states that lapse is the premature termination of an insurance policy and the loss of coverage due to failure of the policyholder to pay premiums. This definition can be expanded such that lapse also includes terminated policies whereby the policyholder receives a surrender value (Eling and Kochanski, 2013).

In Solvency II, lapse risk is defined as “the risk of loss or adverse change in liabilities due to a change in the expected exercise rates of policyholder options”. These legal or contractual options are to fully or partly terminate, surrender, decrease, restrict or suspend the insurance cover, as well as to fully or partly establish, renew, increase, extend or resume the insurance cover. Solvency II uses the term “lapse” for all these policyholder options (CEIOPS, 2010). Hence, the definitions of lapse are open to interpretation, implying that lapse remains a difficult variable to measure.

1.2 Consequences of Lapse

In the case of life insurance, premiums are paid over a long period by the policyholders. Eventually, the insurance company is obligated to pay claims to policyholders. In order to pay these claims, insurers set aside reserves, which are considered a liability on their balance sheet. These reserves are backed by assets and may generate profits. Since premiums are paid long before a claim occurs, an insurer often invests in long-term assets. In this way, the duration of the assets matches the duration of the liabilities (Paulson et al., 2012).


In case of a mass lapse event, the insurer may even have to sell assets; otherwise its liquidity can be threatened (Eling and Kochanski, 2013).

As mentioned, premiums are partly invested in assets, which may result in profits. However, when a policyholder terminates the payment of premiums, the potential future profits on those assets are lost. Furthermore, if there are more premature terminations of policies than entries of new policies, the insurer will obtain a smaller market share. Moreover, high acquisition costs make new policies less profitable than keeping current insurance policies (Visser, 2007). The uncertainty in the estimation of lapse rates also affects other calculations, for example the capital requirements stated in Solvency II. When the estimates of the lapse rates are too high, the insurer holds more regulatory capital than actually necessary, which could otherwise be used to take on more investment risk (Michorius, 2011).

The observed lapse rates are one of the main indicators for customers of the product and service quality of life insurers. High lapse rates of an insurance company could indicate that the prices of the insurance products are relatively high or that competitors offer better services (Kiesenbauer, 2012). Since policyholders use lapse rates as a qualitative decision variable for choosing the right insurance company, it is beneficial for insurers to prevent high lapse rates.

1.3 Relevance of Lapse Modelling

Because of all these challenges caused by lapse, it is of great importance to analyze lapse appropriately. Better modelling of lapse rates is also necessary to avoid the aforementioned liquidity problems and loss of potential future profits on assets. Furthermore, analysis of lapse data may improve the retention policy of insurers, because better insights into the causes of the lapse behaviour of policyholders can be applied to insurance products or to targeting the right customers (Visser, 2008).

1.4 Research Questions

The focus of this thesis will be on modelling lapse and determining relevant risk drivers of lapse. The economic interpretation of the effect of the risk drivers on lapse is interesting for insurers in order to adjust their policies.


In more recent literature, machine learning algorithms are used (Babaoglu et al., 2017). One of these machine learning techniques is the random forest method (Milhaud et al., 2011; Babaoglu et al., 2017).

The research question of this thesis relates to the risk drivers, using a dataset provided by a large Dutch insurance company. The research question consists of two parts:

1. Explanatory part

What is the difference in the impact of the risk drivers on lapse between the logit and the complementary log-log model in terms of size, sign and significance of the estimated coefficients?

2. Predictive part

What is the difference in out-of-sample predictive ability between the logit model, the complementary log-log model and the random forest approach based on the forecasting measure Theil’s U?

1.5 Structure of this Thesis


2 Literature Review

2.1 Lapse Models

According to Eling and Kochanski (2013), lapse models can be divided into two categories, namely deterministic lapse models and dynamic lapse models. Deterministic models are based on deterministic lapsation, which is not scenario specific. This means that deterministic lapse depends on external factors, for example age, sex or product type. Scenario-specific lapse is called dynamic lapse, leading to dynamic lapse models. Dynamic lapse is the result of an internal decision process, which is based on stochastic variables (Eling and Kochanski, 2013).

2.1.1 Deterministic Models

In the academic literature, different types of deterministic models are discussed. Popular models are logit, tobit and complementary log-log models (Kim, 2005; Cox and Lin, 2006). Other models are standard time series regressions and cointegration techniques, which are often applied to macroeconomic variables, such as unemployment rates and interest rates. Time series models are often used for these macroeconomic variables because observations of these variables are easy to collect over many years (Dar and Dodds, 1989; Kuo et al., 2003). Furthermore, other econometric models that have been used for lapse modelling are (negative) binomial models and Poisson models (Eling and Kochanski, 2013). However, all these deterministic models fall into the broad class of generalized linear models.

2.1.2 Dynamic Models


The concluding remark of De Giovanni (2010) states that the valuation of the surrender option can be accomplished by a deterministic model or a dynamic model. In the deterministic model, the observed lapse data is used, which is an advantage in terms of the goodness of fit of the model, while the use of American option pricing theory provides a powerful pricing tool but is not adequate in explaining the observed data (Eling and Kochanski, 2013).

2.1.3 Machine Learning Methods

The research of Eling and Kochanski (2013) provided a review of more than 50 theoretical and empirical papers on lapse modelling. However, this review does not mention newer modelling techniques such as machine learning algorithms. Machine learning algorithms can be useful for predicting lapse, but also for identifying lapse risk drivers (Jamal, 2017). In research on lapse, the random forest algorithm has been used for predictive modelling and appears to perform better than the logistic regression model and another machine learning algorithm in terms of specificity and sensitivity measures (Babaoglu et al., 2017).

2.2 Risk Drivers of Lapse

Qualitative research, such as interviews or surveys, has shown that policyholders often do not like to terminate their policies. However, they lapse because of changed personal circumstances in their residential or work situation. Another reason is that policyholders lapse because their financial situation has deteriorated. For the latter reason, policyholders may end up with payment problems for which they need immediate money. This money can be received in the form of a surrender value. A small part of the policyholders lapses because of a better offer from a competitor (Visser, 2008).


In previous research on lapse, the focus was mainly on macroeconomic variables, in particular interest rates and unemployment rates. Research concerning these variables examined the interest rate hypothesis and the emergency fund hypothesis. The interest rate hypothesis conjectures that life insurance savings depend on rates of return. Policyholders may lapse to benefit from higher interest rates or from lower premiums in the market in case of rising market interest rates. Hence, when policyholders own life insurance, increasing interest rates can act as opportunity costs (Kuo et al., 2003). The emergency fund hypothesis assumes that impairment of the personal financial situation leads to lapse of policyholders, because they need the cash surrender value, which is then used as an emergency fund (Outreville, 1990). Furthermore, macroeconomic variables such as Gross Domestic Product (GDP), inflation, the reference market rate and the return on the stock market have significant value in lapse modelling, although the effect of these variables is also dependent on the duration of policies (Michorius, 2011).

Besides macroeconomic variables, characteristics of the insurance company are also useful for lapse modelling. Both groups of variables are considered in the study of Kiesenbauer (2012), where the determinants of lapse in the German life insurance industry are examined. His dataset contains company characteristics of 133 German life insurers between 1997 and 2009. Five different insurance product categories were considered, namely endowment, annuity, term life, group and other. The “other” category contained mainly unit-linked products. The relevant macroeconomic variables were buyer confidence, current yield and GDP development. The deterministic company characteristics for lapse prediction consist of distributional focus, age of the company and the participation rate spread.


3 Models

As mentioned in Section 1.4, generalized linear models are popular due to the easy interpretation of the variables and outcomes (Michorius, 2011). Since the logit model and the complementary log-log model have been used in the literature for predictive lapse modelling, these will be used to answer the research questions (Eling and Kochanski, 2013). Another predictive modelling approach is a machine learning algorithm, namely the random forest. The choice of this algorithm is based on research on lapse, where the random forest has also been used for predictive modelling and appears to perform better than the logistic regression model and another machine learning technique (Babaoglu et al., 2017).

3.1 Introduction to Generalized Linear Models

Most analyses are based on linear models of the form

E(Y_i) = μ_i = x_i′β,  Y_i ∼ N(μ_i, σ²),  (1)

where Y_i, i = 1, ..., N, are independent random variables, x_i is the vector of risk drivers and β is the vector of corresponding coefficients. Furthermore, μ_i is the expected value of Y_i conditional on the observed risk drivers. However, this model can be generalized, since response variables Y_i may have a distribution other than the Normal distribution. Besides continuous values, response variables can also consist of categorical values. As in Equation (1), the response variable is often based on the Normal distribution because of the convenient properties of this distribution. However, these convenient properties are shared by a broader class of distributions, referred to as the exponential family. Therefore, when the response variable follows another distribution from the exponential family, it retains the same convenient properties as under the Normal distribution. The generalized linear model is based on these properties.

Furthermore, Equation (1) has a simple linear form, but the relation between response variables and covariates may take more complicated forms. Instead of the simple linear relation, the expected value of the response variable can be related in a non-linear way to the linear component x_i′β, namely by

g(μ_i) = x_i′β,  where E(Y_i) = μ_i,  (2)

with g(·) as link function.


Given a distribution from the exponential family and a link function, the generalized linear model can be defined by three components. The first component is the assumption that the response variables Y_i follow a distribution from the exponential family. The second component is a set of parameters β and explanatory variables X, defined as the matrix whose rows are x_1′, ..., x_N′:

X = (x_1′, ..., x_N′)′.

The third component is the link function g, which is monotone in generalized linear models (Dobson and Barnett, 2008; Rossi, 2018).

3.2 Choice of Generalized Linear Models

The choice of an appropriate generalized linear model depends on the scale of the variables. The dependent variable lapse falls under the scale of nominal variables, because an individual can lapse or not. In other words, lapse is a binary variable. Other variables in the dataset are continuous or categorical. Hence, the model should fit a binary dependent variable with continuous and categorical explanatory variables. A generalized linear model which deals with these variables is the logistic regression model. Furthermore, when modelling a binary dependent variable with continuous explanatory variables, the probit model and other dose-response models suffice.

A binary random variable is defined as

Z = 1 if the outcome is a success, and Z = 0 if the outcome is a failure,

with corresponding probabilities Pr(Z = 1) = π and Pr(Z = 0) = 1 − π. Applying this definition to lapse, a success corresponds with ‘lapse’ and a failure corresponds with ‘no lapse’. This random variable has the Bernoulli distribution. When there are n independent random variables Z_1, ..., Z_n with probabilities such that Pr(Z_j = 1) = π_j, j = 1, ..., n, the joint distribution of these random variables is a member of the exponential family. In case all π_j are the same, the sum of the Z_j follows the Binomial distribution. This sum is defined as Y = Σ_{j=1}^n Z_j, which results in Y ∼ Bin(n, π). Hence, since binary random variables follow a distribution in the exponential family, a generalized linear model can be applied to binary random variables.


Since the expected value of the binomial variable Y is E(Y) = nπ, the expected value of the proportion of successes is E(P) = π (Dobson and Barnett, 2008). These probabilities can be modelled as

g(π) = x′β,

whereof the simplest case is

π = x′β.

This simple case, whereby g(π) = π, is also called the identity link function (Rossi, 2018). However, using a linear model results in fitted values that may be greater than one or less than zero, which are not valid values for probabilities. Therefore, the probability π is restricted to the interval [0, 1] by using a cumulative probability distribution

F(s) = ∫_{−∞}^{s} f(u) du,

with f(u) ≥ 0 and ∫_{−∞}^{∞} f(u) du = 1. By using a cumulative probability function with these constraints, different kinds of generalized linear models arise, such as the logit model and the complementary log-log model. These are defined as:

1. Logit model: log(π / (1 − π)) = x′β,

2. Complementary log-log model: log[− log(1 − π)] = x′β.

Hence, these models are suitable for modelling binary dependent variables (Dobson and Barnett, 2008). The logit link function assumes that the underlying constrained cumulative probability distribution is logistic. The logistic distribution has heavier tails than the normal distribution (Pregibon, 1980). The complementary log-log link function is based on the extreme value distribution. This link function is, in contrast to the logit function, asymmetric. Therefore, the complementary log-log function is often used if the probability of an event is very small or very large (Chu et al., 2010). These two generalized linear models will first be used for an explanatory analysis of the risk drivers, after which they are used for predictive modelling of lapse.
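To make the two link functions concrete, the following Python sketch (a hypothetical illustration, not part of the thesis) evaluates their inverse links, which map the linear predictor η = x′β into (0, 1), and shows the symmetry of the logit link against the asymmetry of the complementary log-log link:

```python
import math

# Inverse link functions: map the linear predictor eta = x'beta into (0, 1).
def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv_cloglog(eta):
    # log(-log(1 - pi)) = eta  =>  pi = 1 - exp(-exp(eta))
    return 1.0 - math.exp(-math.exp(eta))

# The logit link is symmetric around eta = 0, where pi = 0.5;
# the complementary log-log link is not.
print(inv_logit(0.0))                                    # 0.5
print(round(inv_cloglog(0.0), 3))                        # 1 - exp(-1) = 0.632
print(round(inv_logit(2.0) + inv_logit(-2.0), 6))        # 1.0: symmetric
print(round(inv_cloglog(2.0) + inv_cloglog(-2.0), 3))    # not 1: asymmetric
```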

3.3 Random Forest


The random forest is a non-parametric classifier, which means that it makes no assumptions about the data distribution (Babaoglu et al., 2017).

The random forest algorithm uses the decision tree as its building block. Decision trees are known for being easy to build, use and interpret. However, decision trees are not ideal for predictive learning, since they are not accurate. In other words, decision trees work well with the data used to construct them, but they are not flexible when used for out-of-sample classification (Hastie et al., 2001). Therefore, the random forest is useful, since it combines the simplicity of decision trees with flexibility to improve the accuracy.

In general, a decision tree is defined by a series of questions leading to the class label of the dependent variable. These questions take the form of nodes. The starting node is called the “root node”. At the end of the decision tree are the “leaf nodes”, which give the classification of a sample. Nodes between the root node and the leaf nodes are internal nodes. An example of a decision tree is shown in Figure 1 (Sá et al., 2016). However, to create a decision tree, the nodes have to be defined. Therefore, the problem is how to grow a decision tree.

Figure 1: Structure of a Decision Tree


The variable that makes the optimal decision should be used at a node. The optimal decision is based on an impurity measurement. When all the samples at a node belong to one class, the node is called pure. Therefore, the optimal decision has the lowest impurity score. There are different ways to measure impurity, for example the Gini impurity and the information gain. The Gini impurity is often used for classification trees (Venables and Ripley, 2002). To decide which variables should be used for nodes, the impurity scores of all the variables have to be calculated. A node becomes a leaf node if the impurity score of the node itself is lower than the impurity scores of the variables. When a variable has a lower impurity score than the node, separating the data results in an improvement. This procedure grows a decision tree (Hastie et al., 2001).

As mentioned, the random forest combines the simplicity of the decision tree with flexibility to improve the accuracy of the predictive model. The random forest uses ‘bagging’, short for bootstrap aggregation, to reduce the variance of the decision trees.
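The Gini impurity described above can be sketched in a few lines of Python; the class labels are hypothetical and only illustrate how a candidate split would be scored:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(left, right):
    """Weighted average impurity of the two child nodes of a candidate split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# A pure node has impurity 0; an evenly mixed binary node has impurity 0.5.
print(gini_impurity(["lapse"] * 4))                      # 0.0
print(gini_impurity(["lapse", "lapse", "no", "no"]))     # 0.5
# A split separating the classes perfectly scores 0, so it would be chosen.
print(split_impurity(["lapse", "lapse"], ["no", "no"]))  # 0.0
```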

Bootstrap aggregation consists of two steps. The first step is to create a bootstrapped dataset based on the original dataset. Bootstrapping is randomly selecting samples from the original dataset with replacement. Replacement means that it is allowed to pick the same sample more than once. The second step is to build a decision tree based on the bootstrapped dataset, but only using a randomly chosen subset of variables at each decision node. This step is repeated many times (Liaw and Wiener, 2002). Hence, the random forest consists of many decision trees based on many bootstrapped datasets.

Suppose a dataset contains N observations, where each observation consists of p explanatory variables and a dependent variable. This is defined as (x_i, y_i) with explanatory variables x_i = (x_i1, x_i2, ..., x_ip) and dependent variable y_i, i = 1, ..., N. The random forest can be summarized in an algorithm according to Hastie et al. (2001), defined as follows:

1. For b = 1, ..., B, with B a large number:

(a) Create a bootstrap dataset of the same size as the original dataset.

(b) Based on the bootstrap dataset, grow a random-forest tree T_b. The tree grows by recursively repeating the subsequent steps for each terminal node of the tree, until there is no improvement when splitting the node.

i. Select randomly m variables of all p variables.


2. Give the collection of trees {T_b}_{b=1}^B.

In case of a new point x*, a prediction can be made based on the random forest. Let Ĉ_b(x*) be the class prediction of the b-th random-forest tree. Then

Ĉ_rf^B(x*) = majority vote {Ĉ_b(x*)}_{b=1}^B.  (3)
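The two ingredients above, bootstrap resampling (step 1(a)) and the majority vote of Equation (3), can be sketched in a few lines of Python; the data and the tree predictions are hypothetical, not taken from the thesis:

```python
import random
from collections import Counter

random.seed(42)

# Step 1(a): a bootstrap sample of the same size as the original dataset,
# drawn with replacement (toy data, not the thesis dataset).
data = list(range(10))
bootstrap = [random.choice(data) for _ in data]  # some rows repeat, some are left out

# Equation (3): the forest's class prediction for a new point x* is the
# majority vote over the predictions of the B individual trees.
def majority_vote(tree_predictions):
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions (1 = lapse, 0 = no lapse) from B = 5 trees:
print(majority_vote([1, 0, 1, 1, 0]))   # 1
```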


4 Model Selection and Estimation

The explanatory analysis of the risk drivers is done by estimating a logit model and a complementary log-log model on the dataset. After the influence of the risk drivers is explained, the models can be optimized in terms of predictive ability. First, the most parsimonious logit model is selected based on the AIC. Then, the AIC is also used for selecting the most parsimonious complementary log-log model. Subsequently, the resulting logit model and complementary log-log model are compared with the random forest approach based on Theil’s U criterion. Theil’s U criterion is suitable for non-nested models; the statistic lies between zero and one, with zero indicating a perfect forecast (Bliemel, 1973).

4.1 Model Criterion for Generalized Linear Models

In the case that generalized linear models differ only in terms of covariates, the models can be compared using the Akaike information criterion (AIC). The AIC uses the likelihood function to measure the goodness of fit, but penalizes the use of too many parameters. Therefore, the AIC rewards the simplicity of a model.

The formula of the AIC is defined as

AIC = −2 · log L̂ + 2 · n_par,  (4)

where L̂ is the value of the likelihood function at the optimal parameter values and n_par is the number of parameters used in the fitted model. The model with the minimum AIC value is the most parsimonious model (Dobson and Barnett, 2008).
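Equation (4) is straightforward to compute; the following Python sketch uses hypothetical log-likelihood values (not fitted results from the thesis) to show how the parameter penalty can overturn a small gain in fit:

```python
def aic(log_likelihood, n_params):
    """Equation (4): AIC = -2 * log L_hat + 2 * n_par; lower is better."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical fitted models: the richer model has a slightly higher
# log-likelihood, but the parameter penalty makes the smaller model win.
aic_full = aic(log_likelihood=-500.0, n_params=12)   # 1024.0
aic_small = aic(log_likelihood=-502.0, n_params=6)   # 1016.0
print(aic_full, aic_small)
print(aic_small < aic_full)   # True: the smaller model is more parsimonious
```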

4.2 Predictive Ability Measurement

When the most parsimonious model within a probability distribution of the exponential family is chosen based on the AIC, the models based on different exponential-family distributions and the random forest method can be compared in terms of predictive ability. This comparison is based on Theil’s U, which is suitable for non-nested models.

The formula of Theil’s U is defined as

U = RMSE(Model) / RMSE(Naive model),  (5)

where RMSE stands for the root mean squared error, defined as

RMSE = √( (1/N) Σ_{i=1}^{N} (F_i − A_i)² ),

where N is the number of observations, A_i is the actual value and F_i is the predicted value, i = 1, ..., N. Equation (5) shows the simple construction of the Theil’s U statistic (Hauptmeier et al., 2009).
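A minimal Python sketch of this computation, using a naive constant forecast as the benchmark in the denominator (the numbers are illustrative, not the thesis results):

```python
import math

def rmse(forecasts, actuals):
    """Root mean squared error over N paired observations."""
    n = len(forecasts)
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / n)

# Theil's U as a ratio of RMSEs: the candidate model against a benchmark.
actual = [0, 1, 0, 0, 1, 1, 0, 1]                  # observed lapse indicators
model = [0.1, 0.8, 0.2, 0.1, 0.9, 0.7, 0.3, 0.6]   # hypothetical model forecasts
naive = [0.5] * len(actual)                        # constant benchmark forecast

U = rmse(model, actual) / rmse(naive, actual)
print(round(U, 3))   # below 1: the model forecasts better than the benchmark
```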


5 Data

The dataset used in this thesis is provided by a large Dutch insurance company and combines different kinds of portfolios for the years 2015, 2016 and 2017, containing about 900,000 observations. In this dataset, characteristics of insurance products and characteristics of policyholders are given. The analysis of variables relating to lapse will be based on one large product group of the dataset, since insurance products differ in characteristics. Hence, some variables of a certain insurance product will not be relevant for another insurance product. The considered product group is term life insurance, since this group is an active product group in the Netherlands.

In this thesis the dependent variable is lapse, which is a binary variable. The literature uses a rather general definition of lapse, but in this research lapse is defined as policies that are surrendered or expelled. This definition is based on the variable “Policy State” in the given dataset, which consists of levels such as “Active”, “Expired”, “Surrendered”, etc. Hence, a subset of the original dataset is taken, whereby only active and lapsed policies are selected in order to predict lapse.

5.1 Variables in the Dataset


Figure 2: Frequency in dataset per insurance product

The variables in the dataset can be divided into two groups, namely the characteristics of insurance products and the characteristics of policyholders. The characteristics of insurance products differ between insurance products.

5.1.1 Characteristics of Term Life Insurance

Since term life insurance is examined in this thesis, the characteristics of this product group are defined as follows:

1. Provision

Policyholders pay premiums to the insurer, which result in reserves. The amount of premiums paid by the policyholder is the provision, which is shown on the balance sheet of the insurer.

2. Surrendered Amount


The surrendered amount is only available for lapsed policyholders. Therefore, this variable will not be used in the models, but it gives more information about lapse.

3. Yearly Premium

Every year the policyholder pays a premium, which is indicated by the variable yearly premium.

4. Death Benefit

Death benefit is the amount paid by the insurer in case the policyholder dies, which is stated in the policy.

5. Expired Duration

The expired duration is the difference between the starting date of the policy and the current date. Hence, this duration is the time the policy has existed.

6. Lapse Duration

Lapse duration is the difference between the starting date of the policy and the date that the policy lapsed. This variable is also left censored.

7. Contract Duration

The contract duration is the difference between the starting date of the policy and the contractual end date of the policy.

8. Premium State

Premium state indicates whether a policyholder pays a premium, a lump sum or nothing. The first option is that the policyholder pays premium on a regular basis, which is indicated with 1 in the dataset. The second option is that the policyholder pays premium with an additional lump sum, which is indicated with 2. The third option is that the policyholder pays no premium at all, indicated with 3. This option refers to a non-contributory insurance policy. The fourth option is that the policyholder only pays a lump sum, indicated with 4. Hence, this variable contains four categories. For each category of the variable, a dummy variable is created when estimating the models.
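The dummy coding described above can be sketched as follows; the premium-state values are illustrative records, not rows from the thesis dataset:

```python
# One indicator column per category of the premium-state variable (codes 1-4).
categories = [1, 2, 3, 4]
premium_state = [1, 1, 3, 2, 4, 1]   # hypothetical records

dummies = [[int(value == c) for c in categories] for value in premium_state]
for row in dummies:
    print(row)
# Each row contains exactly one 1; in a model with an intercept, one of the
# four dummy columns would be dropped as the reference category.
```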

9. Payment Frequency


5.1.2 Characteristics of the Policyholder

1. Sex

Sex indicates whether the policyholder is male or female.

2. Age Group

Based on the date of birth, an age variable is created. This age variable divides policyholders into age groups. The age variable contains the following age groups:

1. Pre-war generation: 1910-1930
2. Silent generation: 1930-1940
3. Protest generation: 1940-1955
4. Lost generation: 1955-1970
5. Pragmatic generation: 1970-1985
6. Generation Y: 1985-2000
7. Generation Z: 2000-2015

The seven groups are defined based on generations (Becker, 1992). This variable is ordinal, so for each age group a dummy variable is created when estimating the models.
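The mapping from birth year to generation group can be sketched with a simple binning rule; the boundary convention (a boundary year belongs to the later generation) is an assumption made for illustration, since the thesis does not state it:

```python
import bisect

# Map a birth year to one of the seven generation groups listed above,
# using the interval boundaries between consecutive generations.
boundaries = [1930, 1940, 1955, 1970, 1985, 2000]   # group 1 starts before 1930

def age_group(birth_year):
    # bisect_right counts how many boundaries lie at or before the birth year,
    # so adding 1 yields the group numbers 1-7.
    return bisect.bisect_right(boundaries, birth_year) + 1

# Illustrative birth years covering six of the seven generations:
print([age_group(y) for y in [1925, 1948, 1960, 1979, 1992, 2005]])
# [1, 3, 4, 5, 6, 7]
```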

3. Medical Raise

The indicator of medical raise states whether the policyholder had to pay an additional raise based on medical history. An additional raise has to be paid if the policyholder has risk of medical complications.

4. Indicator Policyholder 2

The indicator of policyholder 2 is a dummy variable, which indicates whether a second policyholder is present or not. When the second policyholder is present, the value is 1, and when a second policyholder is not present, the value is 0. A second policyholder means that the policy holds for two lives. When one of the policyholders dies, the death benefit will be paid out to the remaining policyholder, after which the cover expires.

5.2 Data Preparation


A subset of the original dataset was taken and only the active and lapsed policies were selected.

The given data originally consists of two datasets. The first set contains poli-cies with corresponding information for the years 2015 and 2016. The second set contains policies with corresponding information for the years 2016 and 2017. The policies in the two sets are not entirely the same, hence the datasets do not fully match, but have a large overlap.

The values of some variables of lapsed policies in the product group of term life insurance are set to zero. These variables are provision and death benefit. However, setting the values of these variables to zero would present a distorted relation between lapsed policies and these variables. Therefore, in case the lapsed policy was active in the previous year, the values of provision and death benefit in the previous year are assumed to be the actual values of provision and death benefit of the lapsed policy. Lapsed policies that do not have information about their active state in previous years are deleted. In this way, the variables provision and death benefit have been imputed for all policies. Using the observation of the previous year for a missing data point is called Last Observation Carried Forward (LOCF) and is a form of imputation. When applying this form of imputation, the assumption is made that the data stays constant at the last observation (Kenward and Molenberghs, 2009).
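LOCF imputation for a single variable can be sketched as follows; the provision values are hypothetical, not taken from the thesis dataset:

```python
# Last Observation Carried Forward (LOCF) for one variable observed per year:
# a missing value (None) is replaced by the most recent earlier observation.
def locf(values):
    filled = []
    last = None
    for v in values:
        if v is None:
            v = last      # carry the previous year's value forward
        filled.append(v)
        last = v
    return filled

# Hypothetical provision values of one policy over three years, where the
# value in the lapse year was recorded as missing:
print(locf([119.24, 130.50, None]))   # [119.24, 130.5, 130.5]
```

A leading missing value stays missing, matching the thesis's choice to delete lapsed policies without information on earlier active years.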

After combining the data of all three years, duplicate policy numbers were deleted to prevent duplicates. By deleting the duplicate policy numbers, the remaining set contains only policies that have lapsed once or have been active in all years. Furthermore, only the term life insurance group is selected from the dataset. Since some life insurance products are based on two insured lives, characteristics of both partners are available. The dates of birth and death of both policyholders are given in the dataset. By converting the date of death to a dummy variable indicating whether an individual is dead or not, some mistakes in the dataset can be recognized. There are six active policies whereby the policyholder is dead and a second policyholder is not present. These remarkable policies are deleted, because their policy states are most likely not correct.


The assumption is made that a premium state indicated with a zero means that the policyholder does not pay premium. This is the same definition as option 3 of the variable premium state. Hence, the policies indicated with a zero are now indicated with 3 in the dataset.

5.3 Description of Term Life Insurance

In case of a term life insurance, the insurer pays out a pre-arranged amount upon death of the policyholder. This amount is paid to the assigned heirs of the policyholder or is used for repayment of the mortgage. In this way, the policyholder covers the financial risks at death. The pre-arranged amount is also called the death benefit. The death benefit and the duration of the insurance are determined by the policyholder. Until death, the policyholder pays premium on a regular basis.

There are 83,136 policies available in the term life insurance group. This product group contains 5,583 lapsed policies, hence 77,553 policies are active. The summary statistics of the term life insurance policies are shown in table 1. As can be observed, the yearly premium is mostly paid once per month. The minimum value of the yearly premium is zero, which indicates a non-contributory policy. Furthermore, lump sums are not common for term life insurance, so the main premium state options are regular premium or no premium at all.


Table 1: Summary Statistics of Term Life Insurance Policies

Statistic            Mean         St. Dev.    Min           Median    Max
Provision            328.667      772.274     −42,963.580   119.240   48,618.760
Surrendered Amount   −8.740       109.381     −8,331        0         848
Yearly Premium       305.923      308.606     0             211.660   7,270.320
Death Benefit        113,796.300  86,088.820  0             100,000   3,791,667
Expired Duration     2.863        4.135       0             2         41
Lapse Duration       3.861        3.872       0             3.000     36.000
Contract Duration    22.044       7.152       0             23        59

Premium State:         1: 81,916   2: 1       3: 1,219   4: 0
Payment Frequency:     0: 1,214    1: 5,556   2: 182     4: 862      12: 75,322
Sex:                   M: 67,523   F: 15,613
Age Group:             1: 0        2: 30      3: 4,061   4: 29,762   5: 34,237   6: 15,046   7: 0
Medical Raise:         N: 81,862   Y: 1,274
Ind. Policyholder 2:   0: 42,266   1: 40,870

This table displays summary statistics of the numeric risk drivers and frequency distributions of the categorical risk drivers.

The observations of the surrendered amount and lapse duration are only available for lapsed policies. Hence, these variables will not be used for lapse modelling. Furthermore, the premium state can also be derived from the variable payment frequency. To prevent multicollinearity, only payment frequency will be used in the predictive models and premium state is not included.

5.4 Expected Impact of Risk Drivers on Lapse

The risk drivers used for lapse modelling may have a significant impact on lapse. This impact will be tested in the explanatory analysis. After omitting the left censored or strongly correlated variables, a selection of risk drivers remains. The risk driver provision is expected to have a negative impact on the risk of lapse. Provision is the total amount of premium paid by the policyholder. A large provision means that the policyholder has paid a large amount of total premium, which will probably cause a large commitment to the policy. When a policyholder has paid a large amount of premiums, the policyholder may feel like passing the point of no return. This is referred to as the sunk cost fallacy (Arkes and Ayton, 1999). This feeling may prevent a policyholder from lapsation.


As discussed in the Literature Review, one of the reasons that policyholders lapse is that their financial situation has deteriorated (Visser, 2008). The yearly premium may be too high for a policyholder with a deteriorated financial situation, which may result in lapse. Furthermore, a large yearly premium may demotivate policyholders to keep the policy, since paying premium does not have a direct reward for the policyholder; the reward is only paid upon death of the policyholder.

Death benefit is the reward paid upon death of the policyholder. The impact of death benefit on lapse has not been examined in the recent literature. However, one may expect that death benefit has a negative impact on the risk of lapse: if the death benefit is large, the policy may be very important for the policyholder, whereas a small death benefit may imply that the policy is not very essential for the policyholder.

The risk driver expired duration is expected to decrease the risk of lapse. This expectation is based on the phenomenon of buyer's remorse: lapse rates are highest in the first years of insurance policies (Visser, 2008). After these first years, the risk of lapse will probably decrease. Explanations could be that the remaining policyholders are truly interested in the insurance product or that they have forgotten that they have the insurance policy. The latter could also be a reason that the risk driver contract duration may decrease the risk of lapse. Furthermore, a long-term policy contract often indicates a term life insurance that is required when closing a mortgage. A Dutch mortgage is typically closed for 30 years, which will also be the duration of the term life insurance. Hence, a policyholder with a mortgage is required to keep this long-term policy, which prevents the policyholder from lapsing.

The risk driver payment frequency is an ordinal variable. When estimating the models, dummy variables are created for each category of payment frequency. When the payment frequency of zero is the base category, the higher frequencies are compared with this payment frequency. There is no literature about the impact of payment frequency on lapse. However, the expectation is that paying premium increases the risk of lapse. Furthermore, frequently paying a premium reminds the policyholder of the insurance policy. This may demotivate the policyholder to keep the insurance policy, since the policyholder has to pay premium more often and this may feel like a large expenditure.


The risk driver sex is expected to have an impact on lapse: a male policyholder is expected to have a higher probability of lapse than a female policyholder. An explanation may be that females have a higher risk aversion in financial matters than males. Females are less willing to buy an insurance product if they do not understand the product or if they are not sure whether they can pay all future premiums (Eling and Kiesenbauer, 2013).

The risk driver age group is expected to increase the probability of lapse, so young policyholders have a higher probability of lapse than older policyholders, when the base category is age group 2, the silent generation. This relation between age and lapse is also found in empirical research (Haberman and Renshaw, 1996). Term life insurance is probably more interesting for older policyholders, since the insurance pays out a pre-arranged amount upon death of the policyholder. Young policyholders may be less interested in financial matters after death than elderly policyholders, and a reduced interest in the insurance product may lead to lapse.

There is no empirical research on the impact of medical raise on lapse. However, the expectation is that the probability of lapse will decrease when a policyholder has to pay a medical raise. This raise has to be paid if the policyholder has an increased risk of medical complications. The medical raise may refrain policyholders from lapsing: a policyholder with an increased risk of medical complications may see his or her health deteriorate over time, and a policyholder who lapses now and wants to purchase a policy from another insurance company in the future may then face an even higher medical raise.


6 Explanatory Analysis

In the explanatory analysis the first research question will be answered. The coefficients of the logit model and the complementary log-log model are estimated by maximum likelihood. The maximum likelihood estimates are found using the iteratively reweighted least squares (IWLS) method. The estimated coefficients are shown in table 2. Using these results, the difference in the impact of the risk drivers on lapse between the logit and the complementary log-log model in terms of size, sign and significance of the estimated coefficients can be examined.

Provision, yearly premium and death benefit contain large values in the given dataset. To keep the estimated coefficients readable, the values of these variables are divided by a thousand when estimating the models. Furthermore, one dummy variable of each categorical or ordinal variable is omitted to avoid multicollinearity. Some dummy variables, such as age group 1 and age group 7, do not contain observations in the term life insurance product group; these dummy variables are also omitted. The variables that are used for estimating the models can be observed in table 2.
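As an illustration of the IWLS (Newton-Raphson) iteration, the sketch below fits a logit model with an intercept and a single risk driver. The data and the function are toy assumptions for exposition, not the thesis estimation code:

```python
import math

def irls_logit(x, y, iters=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by IRLS (Newton-Raphson)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)        # IRLS weight p(1 - p)
            g0 += yi - p             # score vector X'(y - p)
            g1 += (yi - p) * xi
            h00 += w                 # Fisher information X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += ( h11 * g0 - h01 * g1) / det  # Newton step: solve (X'WX) d = g
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1
```

Each iteration is equivalent to a weighted least squares fit with weights p(1 − p), which is why the procedure is called iteratively reweighted least squares.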

The estimated coefficients of the logit model and the complementary log-log model seem to be similar at first sight. All the signs of the risk drivers are the same for both models. Furthermore, both models provide the same statistically significant risk drivers at a 0.01 significance level. Only provision is an exception, since this risk driver is significant at a 0.01 significance level in the logit model, while it is significant at a 0.05 significance level in the complementary log-log model.


Table 2: Logit model and Complementary log-log model

                               Lapse
                               Logit                  Complementary log-log
Provision ·10³                 −0.064∗∗∗ (0.017)      −0.035∗∗  (0.014)
Yearly Premium ·10³             0.677∗∗∗ (0.061)       0.548∗∗∗ (0.055)
Payment Frequency 1             1.870∗∗∗ (0.248)       1.768∗∗∗ (0.244)
Payment Frequency 2             1.470∗∗∗ (0.392)       1.402∗∗∗ (0.378)
Payment Frequency 4             1.688∗∗∗ (0.276)       1.615∗∗∗ (0.269)
Payment Frequency 12            2.230∗∗∗ (0.245)       2.112∗∗∗ (0.241)
Expired Duration                0.116∗∗∗ (0.003)       0.104∗∗∗ (0.003)
Contract Duration              −0.032∗∗∗ (0.003)      −0.030∗∗∗ (0.003)
Sex Female                      0.194∗∗∗ (0.036)       0.186∗∗∗ (0.034)
Age Group 3                     7.675    (59.232)      8.691    (93.346)
Age Group 4                     8.160    (59.232)      9.111    (93.345)
Age Group 5                     8.886    (59.232)      9.791    (93.346)
Age Group 6                     9.362    (59.232)     10.238    (93.346)
Medical Raise Yes              −1.324∗∗∗ (0.194)      −1.246∗∗∗ (0.191)
Death Benefit ·10³             −0.0001   (0.0002)     −0.0001   (0.0002)
Ind. Policyholder 2            −0.099∗∗∗ (0.031)      −0.081∗∗∗ (0.029)
Constant                      −13.343    (59.231)    −14.161    (93.345)

Observations                   83,136                 83,136
Log Likelihood                −19,691.030            −19,717.370
Akaike Inf. Crit.              39,416.070             39,468.750

This table displays the risk drivers and their estimated coefficients based on the logit model (left) and the complementary log-log model (right). The significance levels of estimated coefficients are indicated by: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01. Standard errors of the estimates are in parentheses.

6.1 Interpretation of the Risk Drivers


For example, the estimated coefficient of Sex Female in the logit model means that if the policyholder is a female, the log-odds of lapse increase by 0.194. In the complementary log-log model, the estimate means that if the policyholder is a female, the complementary log-log transform of the lapse probability, log(−log(1−p)), increases by 0.186.

Although the exact interpretation of the estimated coefficients differs between the models, the overall effect of the risk drivers is similar. For example, the effect of being a female on lapse is positive in both models, and in both models the effect of the risk driver payment frequency on lapse is larger than the effect of sex. Hence, the difference in the impact of the risk drivers on lapse between the logit and the complementary log-log model is small in terms of size and significance, and there is no difference between the two models in terms of sign. Only provision is an exception, since the significance level and the size of its estimated coefficient differ between the models.
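A multiplicative reading may make the comparison more concrete. In the logit model exp(β) is an odds ratio, and under a discrete-time survival reading of the complementary log-log model exp(β) scales the hazard of lapse; the sketch below applies this to the Sex Female estimates from table 2:

```python
import math

# Sex Female estimates from table 2 (logit and complementary log-log model).
beta_logit, beta_cloglog = 0.194, 0.186

odds_ratio = math.exp(beta_logit)      # odds of lapse, females vs. males
hazard_ratio = math.exp(beta_cloglog)  # hazard of lapse, females vs. males

print(round(odds_ratio, 3), round(hazard_ratio, 3))
```

So under both links the effect of being female is an increase of roughly 20 percent on the respective multiplicative scale.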

The two models test whether the impact of the risk drivers on lapse is as expected or not. It turns out that most of the expectations about the sign of the impact of the risk drivers on lapse hold. Only the expected signs of expired duration and sex differ from the estimated signs. Furthermore, the risk drivers death benefit and age group appear to be insignificant.

An explanation for the positive impact of expired duration on the probability of lapse may be that the policyholder loses interest in the policy after a certain time.


7 Predictive Analysis

In the predictive analysis the difference in out-of-sample predictive ability between the logit model, the complementary log-log model and the random forest approach will be examined. The predictive ability will be measured by the forecasting measure Theil's U. The logit model and the complementary log-log model were already used in the explanatory analysis, but they were not optimized in terms of predictive ability. Therefore, the first step in the predictive analysis is to optimize the logit model and the complementary log-log model in terms of predictive ability. Subsequently, the random forest approach will be applied to the data in order to predict lapse, after which the predictive ability of the three models will be compared based on the forecasting measure Theil's U.

The dataset used for the explanatory analysis is split in two sets, namely a training set and a test set. The training set contains 75% of the original dataset and the test set contains the remaining 25%. All the models are fitted on the training set and their predictive ability is measured using the test set.
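A minimal sketch of such a reproducible 75/25 split; the seed is an arbitrary assumption:

```python
import random

def train_test_split(rows, train_frac=0.75, seed=42):
    """Shuffle reproducibly and split into a training and a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded, hence reproducible
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

train, test = train_test_split(range(83136))
print(len(train), len(test))
```

Note that 75% of the 83,136 policies is 62,352, which matches the number of observations reported for the fitted models in tables 3 and 4.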

7.1 Interaction Terms

Some risk drivers may interact with each other, which results in interaction effects. Therefore, interaction terms are included in the models to capture these effects. Note that in a generalized linear model the coefficient of an interaction term does not directly equal its marginal effect (Ai and Norton, 2003).

An interaction term can be created for every pair of risk drivers. However, ten risk drivers with all possible interaction terms would yield an enormous number of variable combinations that could be used for estimating models, and choosing an optimal model from all these combinations is a cumbersome task. Therefore, only a small selection of interaction terms is considered for inclusion in the regressions. This selection is based on expected interaction effects.

The selection of interaction terms is as follows:

1. Yearly Premium × Payment Frequency:


2. Yearly Premium × Contract Duration:

The combination of a high premium and a long contract duration may demotivate the policyholder to keep the policy. Therefore, this interaction may be relevant.

3. Expired Duration × Contract Duration:

The expired duration is proportional to the contract duration. A short contract duration makes the expired duration relatively longer than a long contract duration. Hence, the expired duration interacts with the contract duration.

4. Sex × Age Group:

Males and females behave differently over time. For example, the mortality rate of male adolescents is higher than that of female adolescents, because male adolescents may behave more carelessly. Hence, there exists an interaction between age and sex.

5. Death Benefit × Ind. Policyholder 2:

In case of a joint life policy, the death benefit will be paid out to the second policyholder if the first policyholder has passed away. Therefore, the death benefit may be more important when there are two policyholders, because one of them will receive the pre-arranged amount. Hence, the death benefit may interact with the presence of a second policyholder.
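Constructing these five global interaction terms can be sketched as below. The field names are hypothetical, and for the categorical drivers (payment frequency, age group) the products would in practice be formed per dummy variable; single numeric codes are used here for brevity:

```python
def add_interactions(row):
    """Extend one policy's risk-driver dict with the five product terms
    of section 7.1. Field names are illustrative, not the real schema."""
    out = dict(row)
    out["premium_x_freq"] = row["yearly_premium"] * row["payment_frequency"]
    out["premium_x_contract"] = row["yearly_premium"] * row["contract_duration"]
    out["expired_x_contract"] = row["expired_duration"] * row["contract_duration"]
    out["sex_x_age"] = row["sex_female"] * row["age_group"]
    out["benefit_x_ph2"] = row["death_benefit"] * row["ind_policyholder2"]
    return out
```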

7.2 Results of the Generalized Linear Models and Random Forest Approach


The most parsimonious complementary log-log model is selected using the same method as for the logit model selection. For every combination of the risk drivers and the interaction terms complementary log-log models are fitted on the training set. The complementary log-log model with the lowest AIC is selected and is the most parsimonious complementary log-log model. This model is shown in table 4. Just like the most parsimonious logit model, the complementary log-log model contains sixteen parameters.

In tables 3 and 4, the risk driver age group is an insignificant variable. However, removing this risk driver from the models does not lead to a lower AIC. An insignificant risk driver in the two tables means that the p-value is higher than 0.10, implying that the null hypothesis, which states that the coefficient of the risk driver is equal to zero, cannot be rejected at this level. In other words, the statement that the coefficients of the age groups are equal to zero cannot be rejected. The p-value tests the effect of the risk driver, while AIC measures the goodness of fit of a model. Hence, the model with the best AIC may contain insignificant variables.
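The AIC-based subset search can be sketched as follows; `fit` is a hypothetical stand-in for fitting a logit or complementary log-log model on the training set and returning its log-likelihood. As a consistency check, AIC = 2k − 2·logL reproduces table 2's logit value: with logL = −19,691.030 and 17 parameters it gives approximately 39,416.06.

```python
from itertools import combinations

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*logL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def select_by_aic(candidates, fit):
    """Try every non-empty subset of `candidates`; fit(subset) -> logL.
    Returns (best_aic, best_subset). Exhaustive, so only feasible for the
    small candidate sets used here."""
    best = None
    names = list(candidates)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            score = aic(fit(subset), len(subset) + 1)  # +1 for the intercept
            if best is None or score < best[0]:
                best = (score, subset)
    return best
```

For example, with a stand-in `fit` in which only `x1` improves the log-likelihood substantially, the search keeps `x1` and drops `x2`, since the small gain from `x2` does not offset its AIC penalty.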

The interaction terms are not included in the random forest approach. Interaction terms are modelled by decision trees, and since the random forest consists of decision trees, interaction terms are modelled automatically. Hence, adding interaction terms is not necessary when using the random forest approach. However, the interaction terms in the random forest are local instead of global. A global interaction term is the multiplication of two variables, such as yearly premium · contract duration. A local interaction term is, for example, I(yearly premium > 500) × I(contract duration > 2), where I(·) is an indicator function. This means that a local interaction term describes a region in the dataset.

The selected models and the random forest approach predict the probability of lapse for each observation of the dataset. These predicted values are compared with the simple naive model using the forecasting measure Theil's U. The models are fitted on the training set, hence predictions based on the training set are called in-sample forecasts. Predictions based on the test set are called out-of-sample forecasts.
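Several variants of Theil's U exist (Bliemel, 1973); the sketch below assumes the ratio form, the root mean squared error of the model's forecasts divided by that of the naive forecasts, so that U < 1 means the model beats the naive benchmark:

```python
import math

def theils_u(actual, predicted, naive):
    """Ratio form of Theil's U: RMSE of the model over RMSE of the naive
    benchmark. Values below 1 favour the model over the naive forecast."""
    num = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    den = sum((n - a) ** 2 for n, a in zip(naive, actual))
    return math.sqrt(num / den)
```

With lapse indicators as `actual`, predicted lapse probabilities as `predicted`, and a constant naive probability as `naive`, this yields a single number per model per sample, as reported in table 5.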


The difference between the in-sample and out-of-sample values of Theil's U is small for the generalized linear models, while this difference is relatively large for the random forest approach. Focusing on the random forest approach, Theil's U takes an in-sample value of 0.236, while out-of-sample it equals 0.614. This difference is caused by overfitting the model on the training set.

Based on the out-of-sample forecasts, the values of Theil's U for the logit model and the complementary log-log model are equal, namely 0.707. Hence, the random forest approach has a better out-of-sample predictive ability than the generalized linear models, since its Theil's U is lower.

Generalized linear models aim to find a linear separation of classes based on the given risk drivers. For example, in the logit model the log-odds are a linear combination of the risk drivers, and the same idea holds for the complementary log-log model. Hence, if the data were linearly separable, the generalized linear models would be suitable. The random forest is a non-linear classifier. Since the random forest approach performs better than the generalized linear models in terms of predictive ability, the data may not be linearly separable.


Table 3: Most parsimonious logit model

                                         Lapse
Payment Frequency 1                       2.127∗∗∗ (0.275)
Payment Frequency 2                       1.311∗∗∗ (0.479)
Payment Frequency 4                       1.855∗∗∗ (0.307)
Payment Frequency 12                      2.448∗∗∗ (0.270)
Contract Duration                        −0.011∗∗∗ (0.004)
Yearly Premium                            0.667∗∗∗ (0.061)
Expired Duration                          0.309∗∗∗ (0.016)
Provision                                −0.057∗∗∗ (0.020)
Sex Female                                0.214∗∗∗ (0.042)
Medical Raise Yes                        −1.362∗∗∗ (0.224)
Age Group 3                               8.234    (69.056)
Age Group 4                               8.781    (69.056)
Age Group 5                               9.503    (69.056)
Age Group 6                               9.934    (69.056)
Ind. Policyholder 2                      −0.052    (0.035)
Contract Duration:Expired Duration       −0.007∗∗∗ (0.001)
Constant                                −14.750    (69.056)

Observations                             62,352
Log Likelihood                          −14,674.520
Akaike Inf. Crit.                        29,383.040


Table 4: Most parsimonious complementary log-log model

                                         Lapse
Payment Frequency 1                       1.984∗∗∗ (0.269)
Payment Frequency 2                       1.241∗∗∗ (0.464)
Payment Frequency 4                       1.740∗∗∗ (0.299)
Payment Frequency 12                      2.278∗∗∗ (0.265)
Contract Duration                        −0.015∗∗∗ (0.004)
Yearly Premium                            0.379∗∗∗ (0.109)
Expired Duration                          0.266∗∗∗ (0.014)
Provision                                −0.035∗∗  (0.016)
Sex Female                                0.220∗∗∗ (0.038)
Medical Raise Yes                        −1.282∗∗∗ (0.220)
Age Group 3                               9.223    (108.807)
Age Group 4                               9.675    (108.806)
Age Group 5                              10.347    (108.806)
Age Group 6                              10.767    (108.807)
Contract Duration:Yearly Premium          0.010∗   (0.006)
Contract Duration:Expired Duration       −0.006∗∗∗ (0.001)
Constant                                −15.361    (108.806)

Observations                             62,352
Log Likelihood                          −14,700.990
Akaike Inf. Crit.                        29,435.990

This table displays the selected risk drivers and their estimated coefficients based on the most parsimonious complementary log-log model. The significance levels of estimated coefficients are indicated by: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01. Standard errors of the estimates are in parentheses.

Table 5: Theil's U measurement per model

                                  Theil's U
                                  In-Sample    Out-of-Sample
Logit model                       0.700        0.707
Complementary log-log model       0.700        0.707
Random Forest approach            0.236        0.614


8 Discussion and Conclusion

In this thesis the focus was on the impact of risk drivers on lapse and the predictive modelling of lapse. In the explanatory analysis two generalized linear models were used to examine the impact of the risk drivers on lapse, namely the logit model and the complementary log-log model. One of the risk drivers with a significant impact on lapse is payment frequency. The payment frequency of zero is the base category, which is also referred to as a non-contributory policy. A policyholder with a non-contributory policy has either paid a lump sum or has paid regular premiums in the past. For a term life insurance policy, the latter option is the most common. In the regressions the higher frequencies are compared with the payment frequency of zero. The results show that paying premium at higher frequencies than the base category increases the risk of lapse.

A consequence of changing to a non-contributory policy is that the policyholder no longer pays the future premiums to the insurance company. The same holds for lapsed policies. This consequence may lead to threatened liquidity or a loss of potential future profits. Since the base category of the risk driver payment frequency has the same undesirable consequence as lapse, payment frequency may be a questionable risk driver.

However, an insurance company prefers a change to a non-contributory policy over lapse. In case of a non-contributory policy, the policy is still active and the death benefit of the policyholder will be paid, although the death benefit is often lowered because the policyholder stops the premium payments. In case of lapse, a policyholder often receives a surrender value, which is direct money, and the policy is terminated. The amount of the surrender value cannot be invested in assets, so the insurer misses out on potential future profits. Hence, lapse has a more negative effect for the insurance company than a change to a non-contributory policy.

Other significant risk drivers are provision, yearly premium, expired duration, contract duration, sex, medical raise and the presence of a second policyholder. This thesis focuses on term life insurance. For other insurance products the significant risk drivers may differ, because those products have other characteristics. Therefore, it is important that an insurance company distinguishes between insurance products when examining lapse.


The retention policy of an insurer can also be improved by focusing on market segmentation. An insurer may promote joint life policies, since the presence of a second policyholder decreases the risk of lapse. Furthermore, policyholders who have to pay a medical raise have a decreased risk of lapse, so an insurer may want to target this group of policyholders with special premiums, although this target group carries a higher mortality risk. Moreover, a small yearly premium has a lower risk of lapse than a large yearly premium. In case of a repayment mortgage with a required term life insurance policy, the insurer can decrease the death benefit in line with the outstanding mortgage, which leads to a reduced yearly premium. With these measures, the retention policy of an insurance company can be improved using the results of the explanatory analysis.

A remarkable result in the explanatory and predictive analyses is that the expired duration has a positive effect on the probability of lapse. This effect is not in line with the current literature (Visser, 2008), which states that lapse rates are highest in the first years of insurance policies and decrease afterwards. The opposite effect of expired duration in this thesis could be caused by false values in the given dataset. This possible explanation can be tested by applying the same models to another dataset and observing the estimated coefficient of expired duration.

The use of generalized linear models is popular in the literature on lapse. However, in the predictive analysis the random forest approach has a better predictive ability than the generalized linear models based on Theil's U. This result is in line with recent literature, where the random forest approach outperforms the logistic regression (Babaoglu et al., 2017). Therefore, the random forest approach can be recommended to insurance companies for predictive modelling. The algorithm of the random forest is implemented in many programming languages, such as R, Java and Python. The 'randomForest' package in R also contains a function that selects the most important risk drivers for prediction (Liaw and Wiener, 2018). One drawback is that the impact of these most important risk drivers on lapse is not explained, but the impact can be explained by a generalized linear model as in the explanatory analysis.


A model with this selection of risk drivers and a reasonable predictive ability is very interesting for an insurance company, since this model is easier to interpret and easier to implement.

In this thesis only one forecasting measure is used, namely Theil's U. Another forecasting measure is the receiver operating characteristic (ROC) curve, which shows the predictive ability of a binary classifier for different discrimination thresholds (Obuchowski, 2003). Since the models in this thesis predict probabilities for the binary classes, the ROC curve would be an interesting complement precisely because of this discrimination threshold. Furthermore, an extra forecasting measure gives more insight into the predictive ability of the models.
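As an illustration, the area under the ROC curve can be computed with the rank (Mann-Whitney) formulation: the probability that a randomly chosen lapsed policy receives a higher predicted probability than a randomly chosen active policy. This is a generic sketch, not thesis code:

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparisons.
    labels: 1 for lapsed, 0 for active; scores: predicted lapse probabilities.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of lapsed and active policies, which makes it a convenient single-number companion to Theil's U.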

The dataset used in this thesis is provided by a large Dutch insurance company and contains about 900,000 observations. During the data preparation a lot of unusual observations were noticed. These unusual observations were probably caused by manual input. Obvious mistakes were removed from the dataset, but some mistakes can also look like realistic observations. Hence, the dataset might be contaminated with invalid data.

Panel data is often used in the literature to estimate lapse using time series models (Dar and Dodds, 1989; Kuo et al., 2003). Furthermore, panel data is useful to detect the presence of concept drift, the tendency of the preferences of lapsing and non-lapsing policyholders to change over time (Babaoglu et al., 2017). A variable that shows concept drift should not be included in the model. Although the given dataset consists of three consecutive years, it does not contain the same group of policyholders in all three years. Hence, time series regressions are not possible on the given dataset.


References

Ai, C. and Norton, E. (2003). Interaction terms in logit and probit models. Eco-nomics Letters, 80(1):123–129.

Arkes, H. and Ayton, P. (1999). The sunk cost and Concorde effects: Are humans less rational than lower animals? Psychological Bulletin, 125(5):591–600. Babaoglu, C., Ahmad, U., Durrani, A., and Bener, A. (2017). Predictive modeling

of lapse risk: An international financial services case study. 2017 IEEE Inter-national Conference on Systems, Man, and Cybernetics (SMC), pages 16–21. Becker, H. (1992). Generaties en hun kansen. Meulenhoff editie. Meulenhoff,

Amsterdam, The Netherlands.

Bliemel, F. (1973). Theil’s forecast accuracy coefficient: A clarification. Journal of Marketing Research, 10(4):444–446.

Brooks, C. (2002). Introductory Econometrics for Finance. Cambridge University Press, Cambridge, UK.

Cederkvist, H., Aastveit, A., and Naes, T. (2007). The importance of functional marginality in model building — A case study. Chemometrics and Intelligent Laboratory Systems, 87(1):72–80.

CEIOPS (2010). QIS5 Technical Specifications. Technical report, CEIOPS, Brus-sels, Belgium.

Chu, H., Guo, H., and Zhou, Y. (2010). Bivariate random effects meta-analysis of diagnostic studies using generalized linear mixed models. Medical Decision Making, 30(4):499–508.

Cox, S. and Lin, Y. (2006). Annuity lapse rate modeling: Tobit or not tobit? Technical report, Society of Actuaris, Schaumburg, US.

Dar, A. and Dodds, C. (1989). Interest rates, the emergency fund hypothesis and saving through endowment policies: some empirical evidence for the UK. Journal of Risk and Insurance, 56(3):415–433.

(41)

Eling, M. and Kiesenbauer, D. (2013). What policy features determine life in-surance lapse? An analysis of the German market. The Journal of Risk and Insurance, 81(2):241–269.

Eling, M. and Kochanski, M. (2013). Research on lapse in life insurance: what has been done and what needs to be done? The Journal of Risk Finance, 14(4):392–413.

Fang, H. and Kung, E. (2012). Why do life insurance policyholders lapse? The roles of income, health and bequest motive shocks. NBER Working Paper, (17899).

Giovanni, D. D. (2010). Lapse rate modeling: a rational expectation approach. Scandinavian Actuarial Journal, 2010(1):56–67.

Haberman, S. and Renshaw, A. (1996). Generalized linear models and actuarial science. Journal of the Royal Statistical Society. Series D (The Statistician), 45(4):407–436.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, US. Hauptmeier, S., Heinemann, F., Kappler, M., Kraus, M., Schrimpf, A., Trautwein, H., and Wang, Q. (2009). Projecting Potential Output: Methods and Problems. ZEW Economic Studies. Physica-Verlag Heidelberg, Mannheim, Germany. Jamal, S. (2017). Lapse risk modeling with machine learning techniques: An

application to structural lapse drivers. Bulletin Fran¸cais d’Actuariat, 17(33):27– 91.

Kenward, M. and Molenberghs, G. (2009). Last observation carried forward: A crystal ball? Journal of Biopharmaceutical Statistics, 19(5):872–888.

Kiesenbauer, D. (2012). Main determinants of lapse in the German life insurance industry. North American Actuarial Journal, 16(1):52–73.

Kim, C. (2005). Modeling surrender and lapse rate with economic variables. North American Actuarial Journal, 9(4):56–70.

Kuo, W., Tsai, C., and Chen, W. (2003). An empirical study on the lapse rate: The cointegration approach. Journal of Risk and Insurance, 70(3):489–508. Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest.

(42)

Liaw, A. and Wiener, M. (2018). Breiman and cutler’s random forests for classifi-cation and regression. Technical report.

Michorius, C. (2011). Modeling lapse rates: Investigating the variables that drive lapse rates. Master’s thesis, University of Twente, Faculty of Management and Governance.

Milhaud, X., Loisel, S., and Maume-Deschamps, V. (2011). Surrender triggers in life insurance: what main features affect the surrender behavior in a classical economic context? Bulletin Fran¸cais d’Actuariat, 11(22):5–48.

Muchlinski, D., Siroky, D., He, J., and Kocher, M. (2015). Comparing random forest with logistic regression for predicting class-imbalanced civil war onset data. Political Analysis, 24(1):87––103.

Obuchowski, N. (2003). Receiver operating characteristic curves and their use in radiology. Radiology, 229(1):3–8.

Outreville, J. (1990). Whole-life insurance lapse rates and the emergency fund hypothesis. Insurance: Mathematics and Economics, 9(4):249–255.

Paulson, A., Rosen, R., Mohey-Deen, Z., and McMenamin, R. (2012). How liquid are U.S. life insurance liabilities? Chicago Fed Letter, (302):1–4.

Pregibon, D. (1980). Goodness of link tests for generalized linear models. Journal of the Royal Statistical Society. Series C (Applied Statistics), 29(1):15–24.

Renshaw, A. and Haberman, S. (1986). Statistical analysis of life assurance lapses. Journal of the Institute of Actuaries, 113(3):459–497.

Rossi, R. (2018). Mathematical Statistics: An Introduction to Likelihood Based Inference. John Wiley & Sons, Hoboken, US.

Sá, J., Almeida, A., Rocha, B. D., Mota, M., Souza, J. D., and Dentel, L. (2016). Lightning forecast using data mining techniques on hourly evolution of the convective available potential energy. 10th Brazilian Congress on Computational Intelligence, pages 1–5.

The Geneva Association (2012). Surrenders in the life insurance industry and their impact on liquidity. Technical report, The Geneva Association: The International Association for the Study of Insurance Economics.


Visser, M. (2007). Afkoop: een ondergewaardeerd onderwerp. Master’s thesis, Universiteit van Amsterdam, Faculteit Economie en Bedrijfskunde.
