
Master’s Thesis Actuarial Studies Modelling long term illness using a multiple state duration model incorporating frailty


Academic year: 2021



Abstract

For an insurance company it is important to have insight into the future claims development of clients. Therefore it is of interest, when a client is in long term illness, to model how long this illness will last. In this research, long term illness is modeled based on health care insurance claims. The research question that will be answered is: "What is the influence of observed and unobserved variables on the hazard rate to go from long term illness to deceased, and the hazard rate to go from long term illness to healthy?"

The question will be answered by estimating a multiple state proportional hazard model. An algorithm has been developed to model long term illness using claims data so that durations of long term illness can be observed. The proportional hazard model includes a discrete frailty term.


Table of Contents

1 Introduction
    1.1 Problem
    1.2 Research question
    1.3 Literature review
2 Methodology
    2.1 Introduction to survival analysis
    2.2 The mixed proportional hazard model
    2.3 Multiple state models
    2.4 Establishing long term illness
3 Data
    3.1 Dataset 1: claims data
    3.2 Dataset 2: monthly insurance information
    3.3 Dataset 3: annual client information
    3.4 Estimating the MPH model
4 Results
    4.1 Healthy → long term illness
    4.2 Long term illness → healthy
    4.3 Healthy → deceased
    4.4 Long term illness → deceased
    4.5 Sensitivity analysis
5 Conclusion and Discussion
6 Bibliography
7 Appendix A: Supplement to the Data section
    7.1 Excluded claims
    7.2 Seasonality
8 Appendix B: Supplement to the Results section
    8.1 Results
    8.2 Sensitivity analysis: different frailties
    8.3 Sensitivity analysis: different thresholds in clustering algorithm

1 Introduction

Over the years, worldwide health care costs have been rising significantly faster than GDP. The Netherlands is no exception, with health care costs as a percentage of GDP growing even faster than the worldwide average (World Health Organization, 2019). Given the ageing population in the Netherlands and the increased medical possibilities and complexity, this trend is expected to continue (Ministerie van Volksgezondheid, Welzijn en Sport, 2004).

In order to keep health care affordable and accessible for everyone, a new health care insurance framework called "De Zorgverzekeringswet" was enacted by the Dutch government in 2006. This law ensures that everyone is free to choose their preferred health care insurance company, and that everyone is entitled to coverage for a selection of treatments set by the government. This insurance is called "De Basisverzekering", i.e. basic insurance. Besides this type of insurance, most insurance companies also offer additional (optional) coverage for treatments that are not part of the basic insurance coverage, such as dental care. Unlike the system before 2006 (Staatscourant, 2019), the law now obliges every Dutch citizen to be insured.

This research will focus exclusively on basic insurance, and the remarks about insurance that are made below are made in the context of this type of insurance. Health care insurance companies compete in a free market, which gives them an incentive to work efficiently and reduce costs. For its basic insurance, a health care insurer is not allowed to refuse a client and should charge the same premium to everyone (Ministerie van Volksgezondheid, Welzijn en Sport, 2019c). To avoid risk selection, insurance companies pay or receive a financial compensation from a national fund for individual clients, based on their characteristics such as age or medical conditions. In theory this compensation should exactly offset the gains or losses from having a portfolio of clients that is different from the national average. However, this system is not perfect and is subject to continuous adaptation (Van Veen, 2016).

Like all European insurers, health care insurers are subject to the Solvency II directives. This implies that health care insurance companies should have sufficient capital to be able to meet their liabilities at any time, with a very high certainty (Hull, 2015). Health care insurers should therefore estimate their liabilities and corresponding uncertainties as accurately as possible. To reach this goal, a health care insurer should have a fundamental understanding of the claims process.

In this research we will focus on long term illness. We will investigate to what extent observable and unobservable characteristics influence the duration of long term illness. This may lead to better insight into future claim payments for individual clients. We will focus solely on the occurrence and duration of long term illness; the claim amount will not be modeled in this study. The research will be based on data provided by Menzis, which is a health care insurance company operating in the Netherlands with a market share of around 13% (Vektis, 2018).

1.1 Problem

Health care insurers are informed annually about the number of clients in their portfolio with certain illnesses. This information is provided by the compensation fund and concerns illnesses for which health care insurers are compensated. It is not specified which clients this concerns. Changes in illness status during the year are also not reported to a health care insurer. It is therefore difficult to establish which clients are in long term illness. However, a lot of information is provided to a health care insurer through claims. Claims contain information such as the claim amount and the treatment date. The first goal of this research is to establish, from this information, in which health state each individual has been in the past. Hence, a measure is needed that establishes this state. In this way a timeline can be established of, for instance, the monthly health status of an individual. The robustness of this measure can be studied by applying different measures for long term illness to the data and observing the variation in the resulting timeline.

A natural choice to model the duration of a state, such as long term illness, is to utilize a model from survival analysis. The advantage of a survival model is that the hazard to go to another state (the transition hazard) depends on the current state. It can be expected that someone who is currently healthy has a lower probability of being long term ill in the next month than someone who is already in long term illness. Since the possibility that the client passes away must also be considered, there are three possible states in the context of long term illness: healthy, long term illness and deceased. Hence we will focus on a multiple state duration model to model long term illness. The states and the feasible transitions are visualized in Figure 1.

An advantage of duration models is that they can be adapted relatively easily to account for unobserved heterogeneity (Cameron and Trivedi, 2005). Unobserved heterogeneity means that there is a variation amongst sample members that cannot be explained by the observed variables that are included in the model. Blossfeld and Hamerle (1992), Bijwaard (2014) and Cameron and Trivedi (2005) all show that not accounting for unobserved heterogeneity can lead to wrong conclusions about duration dependence and transition hazards. Especially in the context of health, unobserved heterogeneity is an important factor. Individuals with the same observed characteristics such as age and gender can still differ significantly in health due to differences in unobserved factors such as lifestyle and genetics. Empirical studies confirm this. In Govindarajulu et al. (2011) two examples are included that investigate the role of frailty (which is the factor that accounts for unobserved heterogeneity) in child mortality and age at death. Both studies show that frailty has a significant effect. Yu (2008) models hospital admission amongst people that were treated for colorectal cancer and also finds that frailty has a significant effect.

[Figure 1: diagram of the three states Healthy (0), Long term illness (1) and Deceased (2), connected by arrows labelled λ01(t), λ10(t), λ12(t) and λ02(t).]

Figure 1: Multiple state model representing the three possible states of a client. The arrows represent the feasible transitions. λij(t) is the transition hazard from state i to j, which is proportional to the instantaneous probability to have a transition from i to j, conditional on having no transition up to time t.

Multiple state survival models with a heterogeneity term have been applied to insurance claims before, by Spierdijk and Koning (2013) and Bijwaard (2014). However, no literature was found in which this type of model is applied specifically to health care insurance. In this research we will apply this model to health care insurance data to investigate the influence of observed and unobserved variables on the duration of long term illness.

1.2 Research question

1.3 Literature review

Concerning the duration of long term illness, the most relevant study that was found in the literature is the article of Czado and Rudolph (2002). They consider a multiple state Cox proportional hazard model with three states including death as an absorbing state. This is similar to the setup in Figure 1. However, Czado and Rudolph (2002) consider the duration of long term care for people who need care on a daily basis instead of long term illness in general. Their model also does not account for unobserved heterogeneity.

Incorporating frailty in duration models is a topic that has been discussed extensively, for example in Cameron and Trivedi (2005, Chapter 18) and Lancaster (1990, Chapter 4). The most common model is the so-called mixed proportional hazard (MPH) model, which is a proportional hazard (PH) model multiplied with a random frailty term (Van den Berg, 2001). Duration models are often used to model a single event. In this research, for a given individual, transitions can occur multiple times, for any possible transition in Figure 1. Hence we are interested in recurrent events and categorize individuals in states according to their health status derived from claims. A comprehensive overview of multiple state duration models can be found in Hougaard (2000). The dynamics of multiple state duration models can be modeled with the hazard rates of transitions between states.

Including frailty in multiple state duration models is a relatively new topic and has been applied by Bijwaard (2014) and Spierdijk and Koning (2013). Bijwaard (2014) provides an overview of multiple state duration models incorporating unobserved heterogeneity. The author discusses several possible frailty distributions and their effect on the hazard, e.g. the Gamma frailty distribution, the log-normal frailty distribution and the discrete frailty distribution. Also an empirical study is included where multiple state transitions concerning employment behaviour of Dutch migrants are modeled using three methods: a PH model without a frailty term, an uncorrelated MPH model and a correlated MPH model. The correlation in the last model is between the frailties of the different transitions. The article concludes that it is important to include frailty and to let these frailties be correlated.


other insurance markets and products where unobserved risk factors are a potential issue, such as health, long-term care..”. Both Bijwaard (2014) and Spierdijk and Koning (2013) use a Cox PH model with an individual specific frailty term that is transition specific.

2 Methodology

The aim of this research is to model the duration of long term illness through the hazard rates in the three state system: healthy, long term illness and deceased. As can be seen in Figure 1 there are four feasible transitions for which the hazard rate can be estimated. The model for the hazard rate will be a multiple state MPH model. In this section the multiple state MPH model will be discussed. Furthermore, it will be explained how long term illness is established in this research.

We will start with an introduction to survival analysis, where concepts such as the survivor function, the hazard rate and censoring will be discussed. Subsequently, the mixed proportional hazard model will be defined, and it will be shown how survival models can be applied to transitions between multiple states. Finally, it will be shown how long term illness is established with an algorithm we developed, using claims information as input.

2.1 Introduction to survival analysis

Survival models concern the time to some event. The classical example is the time until death. Survival models can be used to model virtually any event, such as the time until the occurrence of some disease, the time until some factory part breaks down or the time until an insurance claim arrives (Hougaard, 2000). Often the event is the transition from one state to another, such as from healthy to ill. However, this is not a necessity. For instance, when studying the time until the arrival of an insurance claim it is unnatural to think about states.

The time to event is a random variable, which we denote by T. The probability density function of T is denoted by f(t), which has non-negative support. The corresponding cumulative distribution function is F(t). Hence we have

$$\Pr[T < t] = \int_0^t f(u)\,du = F(t). \tag{1}$$

The complement of F(t) is the survivor function S(t), which is defined in equation 2. It is the probability that the event has not yet occurred up to time t.

$$S(t) = \Pr[T \geq t] = 1 - F(t). \tag{2}$$

The hazard rate λ(t) is defined as

$$\lambda(t) = \lim_{\Delta t \downarrow 0} \frac{\Pr[t \leq T < t + \Delta t \mid T \geq t]}{\Delta t} = \frac{f(t)}{S(t)}, \tag{3}$$

which is the probability that the event happens during the time interval [t, t + ∆t) conditional on survival up to time t, divided by the width of the interval ∆t (otherwise this probability would be zero). By taking the limit of ∆t to zero we obtain an instantaneous rate of occurrence. If it is assumed that, locally, the hazard rate is nearly constant, it can be interpreted as the probability to have an event between time t and t + 1, given that there was no event up to time t.

The hazard function makes it easy to extend survival analysis to the modeling of multiple events. The interpretation of the hazard rate is then slightly different: if the hazard rate is constant, it can be interpreted as the expected number of events between time t and t + 1.

With some insight one can derive from equation 3 the expression

$$\lambda(t) = -\frac{d \ln S(t)}{dt}. \tag{4}$$
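The step from the hazard definition λ(t) = f(t)/S(t) to equation 4 can be spelled out in two lines. The short derivation below is a worked illustration that uses only the relation S(t) = 1 − F(t) between the survivor function and the cumulative distribution function:

```latex
\begin{aligned}
S(t) &= 1 - F(t) \quad\Longrightarrow\quad \frac{dS(t)}{dt} = -f(t), \\
\lambda(t) &= \frac{f(t)}{S(t)} = -\frac{1}{S(t)}\,\frac{dS(t)}{dt} = -\frac{d \ln S(t)}{dt}.
\end{aligned}
```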

After integrating we can rewrite this as

$$S(t) = \exp\left(-\int_0^t \lambda(u)\,du\right). \tag{5}$$

As can be seen from equations 4 and 5, the hazard rate λ(t) implies a survivor function S(t) and vice versa. Moreover, S(t) is linked to f(t), as can be seen in equations 1 and 2. When modeling a hazard rate one must therefore choose an appropriate distribution f(t) for the duration. Well known distributions in this context include the exponential distribution and the Weibull distribution. When durations are observed one can estimate the parameters of the chosen distribution for f(t). From this, a hazard rate follows.
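The link between the hazard rate and the survivor function can be checked numerically. The sketch below is an illustration only: it assumes a Weibull duration with hypothetical shape and scale values, integrates the hazard with a simple midpoint rule, and compares the result with the closed-form Weibull survivor function. The function names are chosen for this example.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate: (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survivor(t, shape, scale):
    """Closed-form Weibull survivor function: exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def survivor_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral of the hazard from 0 to t), midpoint rule."""
    width = t / steps
    cumulative_hazard = sum(hazard((i + 0.5) * width) for i in range(steps)) * width
    return math.exp(-cumulative_hazard)


if __name__ == "__main__":
    shape, scale, t = 1.5, 2.0, 3.0  # hypothetical parameter values
    numeric = survivor_from_hazard(lambda u: weibull_hazard(u, shape, scale), t)
    print(numeric, weibull_survivor(t, shape, scale))
```

With shape equal to 1 the Weibull hazard is constant and the model reduces to the exponential distribution.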

It is also possible to estimate the survivor function and the hazard rate nonparametrically, using the so-called Kaplan-Meier and Nelson-Aalen estimators. For an outline of these methods see Cameron and Trivedi (2005, Chapter 17).
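As an illustration of the nonparametric approach, a minimal Kaplan-Meier estimator can be written in a few lines. This is a sketch with hypothetical names and made-up durations, not the procedure used in this research: at every distinct event time the running survival estimate is multiplied by one minus the fraction of individuals at risk that experience the event.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survivor function.

    durations: observed or censored durations.
    observed:  1 if the event was observed, 0 if the duration is right censored.
    Returns a list of (event_time, estimated_survival) pairs.
    """
    event_times = sorted({t for t, d in zip(durations, observed) if d})
    survival = 1.0
    curve = []
    for t in event_times:
        # Individuals still under observation just before time t.
        at_risk = sum(1 for u in durations if u >= t)
        # Events that happen exactly at time t.
        events = sum(1 for u, d in zip(durations, observed) if d and u == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


if __name__ == "__main__":
    # Made-up durations in months; 0 marks a right censored duration.
    for time, surv in kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]):
        print(time, round(surv, 3))
```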

In survival analysis, durations of states are often not fully observed, which gives rise to censoring. This can be due to a limited observation time, or (as will be explained later) because a competing event occurs. In both cases the exact time of the event of interest is not observed.

Let $t_i^*$ be the time at which the event of interest occurs, and let $c_i^*$ be the time at which the observation period ended, which we call the censoring time. If

$$c_i^* < t_i^* \tag{6}$$

then $t_i^*$ cannot be fully observed. It is then only known that $t_i^*$ is at least the censoring time. This is right censoring (Cameron and Trivedi, 2005, Chapter 17).

It is also possible to have left censoring, which means that only the upper limit of a duration is known. This happens, for instance, when the event of interest for an individual has already occurred when the observation started.

Finally, it is possible to have interval censoring which is a combination of left and right censoring. When a duration is interval censored, the upper limit and the lower limit of the duration are observed but the exact duration is unknown.

For completeness we also mention truncation, which occurs when some durations are not observed at all, because the value is too small or too large to be observed. Contrary to censoring, this is a missing data problem instead of an incomplete data problem. In this research truncation is not present, and will therefore not be considered further.

When estimating the distribution of the duration, the contribution to the likelihood of an observed duration depends on the censoring type. For each censoring type the likelihood contribution is given in Table 1. These expressions are only valid if the censoring mechanism is non-informative. This means that the distribution of the censoring time $c_i^*$ must be independent of the parameters of the distribution of the duration $t_i^*$ (Cameron and Trivedi, 2005, Chapter 17).

| Censoring type | Likelihood contribution |
|---|---|
| No censoring, duration $t_i$ | $f(t_i)$ |
| Right censored at $t = c_i^r$ | $S(c_i^r)$ |
| Left censored at $t = c_i^l$ | $1 - S(c_i^l)$ |
| Interval censored between $c_i^l$ and $c_i^r$ | $S(c_i^l) - S(c_i^r)$ |

Table 1: The likelihood contribution for every censoring type
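The contributions in Table 1 can be combined into a log-likelihood. The sketch below assumes, purely for illustration, an exponential duration distribution with a hypothetical rate parameter; the function names and the way observations are encoded are choices made for this example.

```python
import math


def density(t, rate):
    """Exponential density f(t)."""
    return rate * math.exp(-rate * t)


def survivor(t, rate):
    """Exponential survivor function S(t)."""
    return math.exp(-rate * t)


def contribution(kind, times, rate):
    """Likelihood contribution of one observation, following Table 1."""
    if kind == "exact":      # fully observed duration t_i
        return density(times[0], rate)
    if kind == "right":      # only known that the duration exceeds times[0]
        return survivor(times[0], rate)
    if kind == "left":       # only known that the duration is below times[0]
        return 1.0 - survivor(times[0], rate)
    if kind == "interval":   # duration lies between times[0] and times[1]
        return survivor(times[0], rate) - survivor(times[1], rate)
    raise ValueError(f"unknown censoring type: {kind}")


def log_likelihood(observations, rate):
    """Sum of the log contributions over all observations."""
    return sum(math.log(contribution(kind, times, rate)) for kind, times in observations)


if __name__ == "__main__":
    data = [("exact", (2.0,)), ("right", (3.0,)), ("left", (1.0,)), ("interval", (1.0, 3.0))]
    print(log_likelihood(data, rate=0.5))
```

Maximizing this function over the rate would give the maximum likelihood estimate under the assumed exponential distribution.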

2.2 The mixed proportional hazard model

proportional hazard (MPH) model which has the basic form (Van den Berg, 2001)

$$\lambda(t \mid x, v) = \lambda_0(t) \cdot \theta(x) \cdot v \tag{7}$$

where it is common to specify

$$\theta(x) = \exp(x'\beta). \tag{8}$$

λ0(t) is called the baseline hazard, which specifies the shape of the hazard function and the distribution type of the duration. v is the frailty term, which can account for unobserved heterogeneity. λ0(t) and v are both assumed to be non-negative. The exponential form of θ(x) is convenient because it ensures that all outcomes for λ(t|x, v) are non-negative, given that the aforementioned assumptions are satisfied. Note that due to the exponential form of θ(x), the effect of the regressors is multiplicative. For instance, when the first component of x increases by ∆x1, λ(t|x, v) changes by a factor of exp(β1 · ∆x1).
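The multiplicative effect of the regressors and of the frailty term can be illustrated with a small numerical sketch. All values below (baseline hazard, coefficients, covariates) are hypothetical and serve only to show the mechanics of the MPH form.

```python
import math


def mph_hazard(baseline, x, beta, frailty=1.0):
    """MPH hazard at a fixed time: baseline * exp(x'beta) * frailty."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline * math.exp(linear_predictor) * frailty


if __name__ == "__main__":
    beta = [0.3, -0.2]                       # hypothetical coefficients
    low = mph_hazard(0.1, [1.0, 1.0], beta)
    high = mph_hazard(0.1, [2.0, 1.0], beta)
    # Raising the first regressor by 1 multiplies the hazard by exp(0.3).
    print(high / low, math.exp(0.3))
    # Doubling the frailty doubles the hazard.
    print(mph_hazard(0.1, [1.0, 1.0], beta, frailty=2.0) / low)
```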

When we take v = 1 the MPH model reduces to the well known PH model developed by Cox and Walker (1972). However, often not all factors that influence the hazard rate can be observed. This gives rise to unobserved heterogeneity. Especially for health sciences, unobserved heterogeneity amongst individuals can be significant (Fitzmaurice, Laird, and James, 2011).

It is assumed that the true model for the hazard rate has the form

$$\lambda(t \mid x, v) = \lambda_0(t) \cdot \exp(x'\beta + y'\gamma), \tag{9}$$

where the vector y contains the unobserved variables, and γ is the vector of the corresponding coefficients. Since y cannot be observed, a random multiplicative frailty term v is introduced of the form

$$v = \exp(y'\gamma). \tag{10}$$

Combining equations 9 and 10 gives

$$\lambda(t \mid x, v) = \lambda_0(t) \cdot v \cdot \exp(x'\beta) \tag{11}$$

and spurious duration dependence due to frailty.

By definition, frailty cannot be observed. It is however possible to estimate the distribution of the frailty. In order to do this, a distribution type for the frailty must be chosen. In this study we will use the discrete distribution, following the method of Spierdijk and Koning (2013). In econometrics, discrete frailty models are the most general models (Bijwaard, 2014). Moreover, Gasperoni et al. (2018) state that discrete frailty with an unspecified number of mass points can be seen as a nonparametric frailty model.

A discrete frailty implies that there is a discrete number of latent subpopulations, where individuals within a given subpopulation share the same frailty (Bijwaard, 2014). For instance, in the context of illness and health, one can imagine two different groups of individuals: group 1, with people that have an unobserved unhealthy lifestyle, and group 2, which consists of people having an unobserved healthy lifestyle.

The number of latent subpopulations and therefore the number of points in the support of the discrete frailty distribution is unknown. The optimal number of points in the support can be established using information criteria, such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). These criteria optimize between the goodness of fit and simplicity of the model.
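The selection of the number of support points by information criteria can be sketched as follows. The formulas are the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, with k the number of estimated parameters and n the number of observations. The log-likelihood values and parameter counts in the example are made up for illustration and are not results from this research.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_lik


def select_support_size(candidates, n_obs, criterion="bic"):
    """Return the number of support points K with the lowest criterion value.

    candidates maps K to a (log-likelihood, number of parameters) pair.
    """
    def score(item):
        log_lik, n_params = item[1]
        return aic(log_lik, n_params) if criterion == "aic" else bic(log_lik, n_params, n_obs)

    return min(candidates.items(), key=score)[0]


if __name__ == "__main__":
    # Hypothetical fits for K = 1, 2, 3 support points.
    fits = {1: (-1250.0, 3), 2: (-1230.0, 5), 3: (-1229.0, 7)}
    print(select_support_size(fits, n_obs=500, criterion="aic"))
    print(select_support_size(fits, n_obs=500, criterion="bic"))
```

In this made-up example the third support point improves the log-likelihood too little to justify the two extra parameters, so both criteria prefer K = 2.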

The MPH model can be estimated with maximum likelihood. Let θ be the set of parameters to be estimated. Let T be the vector containing all observed durations $t_i$, and let X be the matrix containing the observed regressors $x_i$ for each individual. If we assume that all durations $t_i$ are fully observed, the full likelihood of the MPH distribution for n observed durations is

$$L_{full}(\theta \mid X, T) = \prod_{i=1}^{n} f(t_i, x_i \mid \theta) = \prod_{i=1}^{n} \lambda(t_i, x_i \mid \theta) \cdot \exp\left(-\int_0^{t_i} \lambda(u, x_i \mid \theta)\,du\right), \tag{12}$$

where the second expression follows from equations 3 and 5.

When $\lambda(t_i, x_i \mid \theta)$ is made explicit by substituting the hazard from equations 9 and 10, we can obtain the full likelihood for the MPH model. Let $v_1, \ldots, v_K$ be the points that form the support of the frailty distribution v, and let $\pi_1, \ldots, \pi_K$ be the corresponding probabilities. Hence θ consists of β, $v_1, \ldots, v_K$ and $\pi_1, \ldots, \pi_K$. The full likelihood of the MPH model is then

$$L_{full}(\theta \mid X, T) = \prod_{k=1}^{K} \pi_k \prod_{i=1}^{n} \left[ \lambda_0(t_i) \exp(x_i'\beta)\, v_k \cdot \exp\left(-\exp(x_i'\beta)\, v_k \int_0^{t_i} \lambda_0(u)\,du\right) \right]^{\delta_{ik}}, \tag{13}$$

be solved with the expectation maximization (EM) algorithm.

Gasperoni et al. (2018) show how the likelihood functions with a discrete frailty distribution can be solved for β, v1, .., vK and π1, ...πK, without having to assume an explicit baseline hazard. The authors do this for a so called shared frailty model, where it is assumed that certain subgroups share the same frailty. In this study the approach of Gasperoni et al. (2018) is used for estimating the model. Since there are no observed subpopulations that can be assumed to share the same frailty, every individual is taken as a subpopulation. Note that this does not mean that individuals cannot share the same frailty. It is however not specified in the model which groups share the same frailty.
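To make the role of the support points and their probabilities concrete, the sketch below evaluates a discrete-frailty likelihood in its marginal (mixture) form, where each individual's contribution is summed over the support points. This is a simplified illustration and not the approach of Gasperoni et al. (2018): it assumes a constant baseline hazard and fully observed durations, and all names and values are hypothetical. The function `posterior_weights` shows the kind of quantity an EM algorithm updates in its expectation step.

```python
import math


def individual_likelihood(t, x, beta, baseline_rate, frailty):
    """Density of duration t given covariates x and a fixed frailty value.

    Assumes a constant baseline hazard, so the cumulative baseline is baseline_rate * t.
    """
    risk = math.exp(sum(b * xi for b, xi in zip(beta, x))) * frailty
    return baseline_rate * risk * math.exp(-risk * baseline_rate * t)


def posterior_weights(t, x, beta, baseline_rate, support, probs):
    """Probability that an individual belongs to each latent group, given the data."""
    joint = [p * individual_likelihood(t, x, beta, baseline_rate, v)
             for v, p in zip(support, probs)]
    total = sum(joint)
    return [j / total for j in joint]


def marginal_log_likelihood(durations, covariates, beta, baseline_rate, support, probs):
    """Log-likelihood with the unobserved frailty summed out over its support."""
    total = 0.0
    for t, x in zip(durations, covariates):
        mixture = sum(p * individual_likelihood(t, x, beta, baseline_rate, v)
                      for v, p in zip(support, probs))
        total += math.log(mixture)
    return total


if __name__ == "__main__":
    # A long duration makes membership of the low-frailty group more likely.
    print(posterior_weights(10.0, [0.0], [0.0], 0.2, support=[0.5, 2.0], probs=[0.5, 0.5]))
```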

2.3 Multiple state models

The events modeled in survival analysis are often transitions from one state to another. The number of states that can be included in a model is not limited to only two states, but can be any number (Hougaard, 2000). The hazard rate of each transition in a system of states (the so called transition hazards), can be estimated by observing the length of time individuals spent in the different states and the transitions. In this research we will focus on a multiple state duration model with three states: healthy, long term ill and deceased. This model has been outlined in the introduction and in Figure 1.

Putter et al. (2007) note that multiple state duration models can be seen as a more general version of the competing risk model. Competing risks are events that result in leaving the current state, but are not the event of interest. Consider a medical study where one wants to estimate the hazard to die as a consequence of lung cancer among a certain group of patients. During the observation period, patients could also die from other causes, such as heart failure. Only one of the events can happen. Hence, death due to lung cancer and death due to heart failure can be regarded as events that are ’competing’ with each other. Since death due to lung cancer is the event of interest, death due to heart failure is considered as a competing event.

The usual way to handle competing events when estimating a hazard rate is to consider the time at which the competing event occurs as the censoring time (Putter et al., 2007). The underlying theory is that the event of interest could have happened later, if the competing event had not occurred. Hence the duration of the event of interest is considered as right censored.

is a competing event for every transition of interest, which is shown in Table 2.

Instead of estimating the frailties separately for every transition, it is also possible to incorporate correlation between frailties. Heckman and Walker (1990) suggest a model with a non-transition-specific individual frailty, together with factors that transform it into a transition-specific frailty.

The MPH model will be estimated in R, where the package "discfrail" developed by Gasperoni (2019) will be utilized. It estimates the coefficients of a proportional hazard model with a discrete frailty term, leaving the baseline hazard unspecified. The model is based on the approach of Gasperoni et al. (2018). The number of mass points of the discrete frailty distribution can be either specified or estimated. The estimation of the number of mass points can be based on the Akaike information criterion, the Bayesian information criterion or the criterion of Laird (1978).

| Transition of interest | Competing event transition |
|---|---|
| Healthy → Long term illness | Healthy → Deceased |
| Long term illness → Healthy | Long term illness → Deceased |
| Healthy → Deceased | Healthy → Long term illness |
| Long term illness → Deceased | Long term illness → Healthy |

Table 2: Every transition in the three state system gives rise to a competing event. When estimating the hazard rate for a certain transition of interest, the observed durations until the competing event are considered as right censored observations.
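The treatment of competing events in Table 2 can be sketched in code. The example below assumes a monthly state sequence per individual, coded as 0 (healthy), 1 (long term illness) and 2 (deceased); the coding, the function names and the example sequence are hypothetical. For a given transition of interest, a spell that ends with the competing transition, or that is still ongoing at the end of the observation period, is marked as right censored. Possible censoring of the first spell at the start of the observation period is ignored here.

```python
HEALTHY, ILL, DECEASED = 0, 1, 2


def spells(states):
    """Split a monthly state sequence into (state, duration, next_state) tuples.

    next_state is None when the spell is still ongoing at the end of the sequence.
    """
    result = []
    start = 0
    for i in range(1, len(states) + 1):
        if i == len(states) or states[i] != states[start]:
            next_state = states[i] if i < len(states) else None
            result.append((states[start], i - start, next_state))
            start = i
    return result


def durations_for(states, origin, destination):
    """(duration, event) pairs for the transition origin -> destination.

    event is 1 when the transition of interest is observed and 0 when the
    spell ends in another way, which is treated as right censoring.
    """
    return [(duration, 1 if next_state == destination else 0)
            for state, duration, next_state in spells(states)
            if state == origin]


if __name__ == "__main__":
    timeline = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2]
    print(durations_for(timeline, HEALTHY, ILL))       # [(3, 1), (4, 0)]
    print(durations_for(timeline, HEALTHY, DECEASED))  # [(3, 0), (4, 1)]
```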

2.4 Establishing long term illness

To estimate the parameters of the model with maximum likelihood, durations of states must be observed. Hence it must be established when a client is healthy, in long term illness or deceased. As explained in the problem section of the introduction, the most appropriate way to establish in which state a client was at a certain point in time, is to derive the states from claims information. An algorithm has been established that produces a timeline of clusters of monthly states for each individual based on claims. The algorithm will be explained below. In the remainder of this report this procedure will be termed: the ’cluster algorithm’.

We assume that there is an available dataset with monthly total claims amounts for every individual. Months where the total claims amount exceeds a certain threshold (that will be determined in the next chapter) are called high-claim months (HCM). In this way, the time series consists of HCM, non-HCM, and months where the client is deceased.

that it determines the duration of illnesses that last for a long time, without immediately assigning people with incidental treatments to the state 'long term illness'. Intuitively, the group of HCM has to extend over a significant number of months and the density of the HCM during this period has to be high enough. The US Department of Health and Human Services (2019) defines a chronic disease as a disease that lasts for three months or longer. Analogous to this definition, we define long term illness as an illness that lasts three months or more. To avoid the situation where the minimum length of HCM clusters differs from the minimum length of non-HCM clusters, we demand that the minimum length of non-HCM clusters is also three months. In this way the term 'cluster' has the same meaning for HCM and non-HCM spells.

Now we look at the implementation of the above specification. Long term illness periods are determined as follows. A window of five months moves along the time series of monthly claim amounts. When three, four or five HCM fall into this window these months are marked. This results in a sequence of marked months, possibly interspersed with non-HCM. In a subsequent step, in the series of marked HCM, non-HCM gaps of length one or two are replaced by marked HCM. In this way, clusters of HCM, with a length of three or more are created. As a result, every cluster has a duration of at least three months, and the density of HCM in the original time series falling in these clusters is at least 50%. As a last step, gaps between clusters are filled with non-HCM.

Of the resulting time series, the first two and the last two entries are removed. This is done because the states of the first two and the last two months cannot be established with certainty using the algorithm. This will be illustrated with an example: let an HCM be represented with 1 and a non-HCM be represented with 0. Consider the following true sequence of HCM and non-HCM months: ..1011[001011011].., where the sequence within the square brackets is observed. The resulting clustered sequence is: [001111111]. This suggests that the client was healthy at the start of the sampling period and became long term ill after two months. From the true sequence it can be seen that in reality the client was ill the whole period. By removing the first and last two months from the data set after applying the cluster algorithm, this problem is eliminated.
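The window, gap-filling and trimming steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this research: the function name `cluster_states` and its parameters are hypothetical, the input is assumed to be a list of 0/1 flags (1 for an HCM), and months in which the client is deceased are not handled.

```python
def cluster_states(hcm, window=5, min_hits=3, max_gap=2, trim=2):
    """Cluster high-claim months (HCM) into long term illness periods.

    hcm:      list of 0/1 flags per month, 1 meaning a high-claim month.
    window:   length of the moving window in months.
    min_hits: minimum number of HCM in a window for them to be marked.
    max_gap:  longest non-HCM gap between marked months that is filled.
    trim:     number of months removed at both ends of the result.
    """
    n = len(hcm)

    # Step 1: mark every HCM that lies in a window with enough HCM.
    marked = [0] * n
    for start in range(n - window + 1):
        months = range(start, start + window)
        if sum(hcm[i] for i in months) >= min_hits:
            for i in months:
                if hcm[i]:
                    marked[i] = 1

    # Step 2: fill short gaps that lie between two marked months.
    clustered = marked[:]
    previous = None
    for i, flag in enumerate(marked):
        if flag:
            if previous is not None and 1 <= i - previous - 1 <= max_gap:
                for j in range(previous + 1, i):
                    clustered[j] = 1
            previous = i

    # Step 3: all remaining months stay non-HCM; drop the uncertain edges.
    return clustered[trim:n - trim]


if __name__ == "__main__":
    observed = [0, 0, 1, 0, 1, 1, 0, 1, 1]
    print(cluster_states(observed, trim=0))  # [0, 0, 1, 1, 1, 1, 1, 1, 1]
    print(cluster_states(observed))          # [1, 1, 1, 1, 1]
```

Applied to the observed sequence 001011011 from the example, the sketch reproduces the clustered sequence 001111111 before trimming, and keeps only the five middle months once the first two and last two months are removed.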

[Figure 2: plot with horizontal axis "month" (0 to 50) and vertical axis "client nr" (0 to 5).]

Figure 2: Graphical representation of the process of clustering high-claim months to long-term illness periods, of five persons over 50 months. For every person the three phases of the clustering process are shown. Synthetic data is used here.

We will now make the clustering process more concrete by looking at Figure 2. In this figure, clustering is done for five persons, where for each person a synthetic time series of 50 months is used. Persons are numbered 1-5, months are numbered 0-49. Three rows of dots are shown for every person. The upper row represents the original time series, the second row represents marked months, and the third row is the final result of the clustering process. As can be seen in the first row of individual 3, this individual died in month 29, indicated by x-marks. This individual started with two non-HCM (in months 0 and 1), followed by three HCM. This is indicated by two green dots, followed by three red dots. In the second row only marked HCM are shown. In the third row, gaps of length one or two have been filled with HCM, and the remaining gaps are filled with non-HCM.

3 Data

To answer the research question we must estimate the parameters of the frailty distribution v and β, where we use the notation of equation 11. In this chapter we will describe how the data from Menzis is used to estimate the coefficients of this model. The data consists of three datasets that will be named dataset 1-3. In Figure 3 a flow diagram is given, which gives an overview of how the three datasets are used to estimate the model parameters. The datasets will be described in detail below. After the data description it will be explained how this data is used to estimate the coefficients of the model, and which assumptions are made in the estimation.

3.1 Dataset 1: claims data

This dataset contains over 258 million insurance claims from the period starting in 2012 up to and including 2018. It contains every claim that was paid by Menzis in this period for the mandatory basic health care insurance. For every claim, the amount, the treatment month, a personal number and the category in which the claim was made are given.

The category indicates the treatment type. In Table 3 the categories are shown. Not all categories are appropriate for this research. In Appendix A it is explained for each category why it was included or excluded.

| Claim category | Included | % of included claims |
| --- | --- | --- |
| General practitioner | No | |
| Pharmaceutical | Yes | 49.8% |
| Nursing | Yes | 2.7% |
| Dental | No | |
| Obstetric | No | |
| Medical specialist | Yes | 32.7% |
| Physiotherapy | Yes | 5.3% |
| Medical accessories | Yes | 9.4% |
| Transport | No | |
| Mental | No | |
| Geriatric rehabilitation | Yes | 0.1% |
| Maternity | No | |
| Other | No | |
| Cross-border | No | |

Table 3: The available claims categories, with for each category indicated whether it is included in the model. Furthermore the percentage claims count is given relative to


[Figure 3: flow diagram. Boxes, from input to output: Dataset 1: Insurance claims; Dataset 2: Monthly insurance information; Dataset 3: Annual client information; Table with monthly claims for each individual; List of individuals that were insured during the whole period and not in a care home; Cluster algorithm; Matrix with monthly states for each individual; Four tables with durations & censoring indicator relevant for each specific transition; Four tables with durations, censoring indicator and client information; Estimate MPH model.]


As can be seen in Figure 3 the claims data is converted into a table with monthly claims amounts per individual. Hence the claims amounts for every individual are aggregated per treatment month. This is then used as input for the cluster algorithm that has been discussed in the previous chapter.

For several reasons, some claims in dataset 1 need to be excluded from the input of the cluster algorithm. In Appendix A it is explained which claims are excluded and why. Claims are only used as input in the cluster algorithm if they meet all of the following conditions:

• The treatment month of the claim lies in the interval [12-2012, 01-2017]

• The individual was insured at Menzis the whole period [01-2012 - 12-2017], or was insured the whole period until death.

• The individual was not living in a care home

• The claims category is included in the model according to Table 3
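The selection and aggregation described above can be sketched as follows. This is only an illustration under stated assumptions: the tuple layout of a claim, the function name `monthly_claims` and the example claims are hypothetical, months are represented as (year, month) pairs, and the set of eligible individuals (insured during the whole period and not in a care home) is assumed to be given. The included categories are taken from Table 3, and the window bounds are passed in as parameters.

```python
from collections import defaultdict

# Categories marked "Yes" in Table 3.
INCLUDED_CATEGORIES = {
    "Pharmaceutical", "Nursing", "Medical specialist",
    "Physiotherapy", "Medical accessories", "Geriatric rehabilitation",
}


def monthly_claims(claims, eligible_persons, first_month, last_month):
    """Sum the claim amounts per (person, treatment month) for all claims
    that meet the conditions listed above."""
    totals = defaultdict(float)
    for person, month, category, amount in claims:
        if not (first_month <= month <= last_month):
            continue  # treatment month outside the window
        if person not in eligible_persons:
            continue  # not insured the whole period, or in a care home
        if category not in INCLUDED_CATEGORIES:
            continue  # category excluded according to Table 3
        totals[(person, month)] += amount
    return dict(totals)


# Hypothetical claims: (person, (year, month), category, amount in euros).
claims = [
    (1, (2013, 5), "Pharmaceutical", 40.0),
    (1, (2013, 5), "Medical specialist", 85.0),
    (1, (2013, 6), "Dental", 60.0),          # excluded category
    (2, (2013, 5), "Nursing", 30.0),         # person not eligible
    (1, (2011, 12), "Physiotherapy", 25.0),  # outside the window
]
table = monthly_claims(claims, eligible_persons={1},
                       first_month=(2012, 12), last_month=(2017, 1))
```

Only the two claims of person 1 in May 2013 survive the filters, so the resulting table holds a single monthly amount of 125 euros.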

For all claims meeting the above criteria, a histogram is given visualizing the distribution of the monthly claims amount. See Figure 4. This figure is included to get a feeling for the monthly claims sizes and to determine the threshold value used in the clustering algorithm.

Figure 4: Histograms of the monthly claims amount for all included individuals and claims. The red histogram is the complete population; the blue histogram is the subset of individuals with an illness indication as explained in the text.

The threshold in the cluster algorithm must be set in such a way that the algorithm assigns as many ill people to the state long term illness as possible, while minimizing the number of healthy people assigned to the state long term illness. Intuitively, a vertical line must be drawn that maximizes the blue surface area on the right side, while minimizing the red surface area on the right side. Visually, a threshold of 100 euros seems appropriate; it is a round number that does not give a false sense of precision. The effect of the threshold on the parameter estimates will be tested in a sensitivity analysis.
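The trade-off behind the threshold can be made explicit by comparing, for a few candidate values, the share of monthly amounts at or above the threshold in both groups. The sketch below is illustrative only: the threshold in this research was chosen visually, the function name `share_at_or_above` and the amounts are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def share_at_or_above(amounts, threshold):
    """Fraction of monthly claim amounts that are at or above the threshold."""
    if not amounts:
        return 0.0
    return sum(a >= threshold for a in amounts) / len(amounts)


# Hypothetical monthly amounts in euros (not data from this research).
all_amounts = [5, 20, 35, 60, 80, 120, 150, 300, 10, 45]  # complete population
ill_amounts = [80, 120, 150, 300]                         # illness indication

# For each candidate threshold: (share of ill group, share of population).
shares = {
    t: (share_at_or_above(ill_amounts, t), share_at_or_above(all_amounts, t))
    for t in (50, 100, 200)
}
```

A suitable threshold keeps the first share high while keeping the second share low.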


3.2 Dataset 2: monthly insurance information

Dataset 2 contains monthly information for every individual on whether the individual was insured or not in that month. If a client died in the observation period, the month in which the client died is also given.

The dataset with monthly information contains the insurance status (insured or not) for each customer that was insured in some period on the interval [01-2012, 12-2018]. It also contains an indicator variable for living at an address that is known to be a care home and the month of death if applicable. This data is used to select the individuals that meet the criteria:

• The individual was insured at Menzis the whole period [01-2012 - 12-2017], or was insured the whole period until death.

• The individual was not living in a care home

The total dataset contains information about 3.0 million individuals. Of these, 1.5 million were insured the whole period [01-2012 - 12-2017], and 0.4% of them live at an address that is known to be a care home. According to Vektis (2019), about 2% of the Dutch population lives in a care home. Hence we can only filter out a fraction of all people living in a care home.

Of the 3.0 million clients, 117,000 died in the period [01-2012 - 12-2017], representing about 4% of the clients. This is in line with the national average.

3.3 Dataset 3: annual client information

Dataset 3 contains annual information on personal characteristics of each client, such as age, gender and whether the client has additional insurance. From these characteristics, the variables have been chosen that are relevant for the duration of long term illness and for the hazard to die. Some variables have been aggregated into larger groups to avoid too many regressors in the model, which may result in overfitting. We discuss the grouped variables that are considered in the model below. Table 4 shows how many individuals are contained in each group relative to the whole population of insured individuals in 2012.


of constant regressors more realistic and avoids overfitting.

Deprived area: For every insured individual the postal code is known. Hence it can be established if the individual lives in an area that is deprived, where the criteria of NZA (2019) are used.

Additional insurance: Besides basic insurance, Menzis also offers several types of additional insurance, in the form of ’dental’ insurance and ’extra’ insurance, for treatments that are not covered by the basic insurance. Contrary to basic health care insurance, these types of policies are not mandatory, and hence it is likely that there is adverse selection for additional insurance. Having additional insurance might therefore be informative about the health of an individual. We distinguish four categories with ascending level of additional insurance: no additional insurance, which means that the individual only has a basic insurance policy, and additional insurance 1-3. Dental insurance and extra insurance each have three ascending levels of coverage, and the category number is defined as max(level of dental insurance, level of extra insurance).

Voluntary no claim: Every client has a mandatory deductible of a certain amount set by the government, which means that all claims have to be paid by the client until the total amount reaches the no-claim amount. After the no-claim amount has been reached, the health insurer pays for the remaining amount. It is possible to voluntarily increase the deductible amount in exchange for a reduction of the premium. The additional deductible amount is at least 100 euro and at most 500 euro. Since the number of individuals with a voluntary deductible is relatively small, all individuals with a nonzero voluntary deductible are included in one group, which is named ’deductible’.

Pharmaceutical cost group, multi-year high cost group, diagnosed cost group and one person address are all (non exclusive) groups that are defined by the compensation fund (Van Veen, 2016). Health care insurers are informed about the number of clients contained in their portfolio that fall in each category. Which specific clients this concerns is not specified. Hence the aforementioned characteristics are indicated with a probability for each client and are often not certain. Sometimes it can be established with certainty. For instance: when it is known that there are multiple clients with a policy on a certain address one knows for sure that these clients do not fall into the category one person address. Another example: if it is known that a client claimed a certain type of medication in the previous year that is coupled to a pharmaceutical cost group indication, it can be established with certainty that this person falls into that category.

An individual belongs to the category pharmaceutical cost group if he/she used certain specified types of medication for a specified length of time in the previous year (Zorginstituut Nederland, 2019).

Individuals are categorized in the multi-year high cost group when they were part of the top 15% of clients with the highest annual total claims amount for at least the last three years (Zorginstituut Nederland, 2019).

The diagnosed cost group contains individuals that received a treatment coupled to a certain diagnosis in the previous year. The relevant so-called diagnosis treatment combinations are specified by the compensation fund (Zorginstituut Nederland, 2019).

| Personal characteristics | % of population in 2012 |
| --- | --- |
| Age group 0-17 | 22.9 |
| Age group 18-30 | 19.6 |
| Age group 31-50 | 27.7 |
| Age group 51-70 | 21.1 |
| Age group 71+ | 8.4 |
| Female | 49.3 |
| Deprived area | 6.1 |
| No additional insurance | 19.3 |
| Additional insurance 1 | 18.3 |
| Additional insurance 2 | 41.2 |
| Additional insurance 3 | 21.2 |
| Deductible | 4.2 |
| Pharmaceutical cost group | 19.0 |
| Multi-year high cost group | 8.0 |
| Diagnosed cost group | 3.0 |
| One person address | 14.3 |

Table 4: Percentage of individuals contained in each group relative to the whole population of insured individuals in 2012.

3.4 Estimating the MPH model


censoring is considered. Technically there is also interval censoring, since the states are discretized on a monthly basis while in reality every state can be entered and left at any continuous point in time. However, according to Cameron and Trivedi (2005, Chapter 17) it is usually assumed that the effect of interval censoring is sufficiently minor that it can be ignored. The duration of the state in the first month and the duration of the state in the 68th month are right censored.

Every cluster (apart from deceased) contains information that is relevant for the estimation of two of the four hazard rates. Consider observing a cluster of the state healthy that ends in death. Obviously, the observed duration of this state is relevant for estimating the transition hazard from healthy to death. However, the observed duration is also relevant for estimating the hazard from healthy to long term illness. Illness did not occur because of the competing event death (see Table 2). Hence this duration is considered as right censored for the estimation of the hazard rate from healthy to ill. Every cluster (apart from deceased) can end in one of the two states that compete with each other, and hence leads to an additional line in two of the four tables with durations that are mentioned in Figure 3.
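The way one cluster feeds two of the four duration tables can be sketched as follows. This is a minimal illustration under assumptions: the function name `spell_to_rows`, the tuple layout of a cluster and the example clusters are hypothetical, durations are counted in months, and a cluster without an observed exit is passed with `None` as its next state. The special treatment of the first and last observed month is not modelled here.

```python
# The two competing exits of each non-absorbing state.
COMPETING = {
    "healthy": ("long term illness", "deceased"),
    "long term illness": ("healthy", "deceased"),
}


def spell_to_rows(state, duration, next_state):
    """Return one (transition, duration, event) row per competing exit.
    event = 1 if that exit was observed, 0 if the duration is right censored."""
    rows = []
    for target in COMPETING[state]:
        event = 1 if next_state == target else 0
        rows.append(((state, target), duration, event))
    return rows


# Hypothetical clusters: (state, duration in months, state that followed).
spells = [
    ("healthy", 14, "deceased"),
    ("long term illness", 7, "healthy"),
    ("healthy", 30, None),  # no transition observed
]

tables = {}
for spell in spells:
    for transition, duration, event in spell_to_rows(*spell):
        tables.setdefault(transition, []).append((duration, event))
```

The healthy cluster that ends in death appears as an event in the table for healthy to deceased and as a right censored duration in the table for healthy to long term illness.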

After the four tables with the relevant durations and censoring indicator have been established, they will be combined with the client information. The client information that was known just before the start of 2012 will be used as regressors in the MPH model. Since this is before the start of the measurement period of the durations these covariates are exogenous. It is assumed that the covariates are fixed during the six year period. For some covariates this assumption is true. For instance, it is fairly safe to assume that the covariate gender is fixed over time. Many other regressors can change over time. But as Lancaster (1990) notes, if they change at a sufficiently slow pace relative to typical durations they can be treated as constant for practical purposes. Assuming constant regressors greatly simplifies the estimation and interpretation of the model.


4 Results

In this chapter the estimated coefficients of the MPH model for each transition are presented and discussed. Furthermore a sensitivity analysis is done to assess the robustness of the model. Different model choices that were made are changed to study the implications of the assumptions on the estimated coefficients.

The computations of the cluster algorithm and especially the estimation of the MPH model are too time consuming to include all data. Therefore a random subset of 10,000 individuals is taken for the estimation. By setting different seeds, different samples from the whole population (about 1.5 million individuals) can be drawn. For all transitions, 5 different samples are drawn, and hence 5 different sets of coefficients are computed for each transition. In Appendix B the results for each random draw are shown separately. In the sections below the average and standard deviation of the estimated coefficients are shown. This gives a bootstrap type estimate of the standard deviation of the estimated coefficients, which is similar to the procedure used in Heckman and Singer (1984a). Ideally, many samples should be drawn to get a reliable error term, but since the calculations are time consuming we limit ourselves to 5 random draws. Heckman and Singer (1984a) use a histogram of the parameter estimates obtained from the bootstraps to determine the distribution of the parameter estimates. Given the small number of bootstraps in this research, this is not possible. Hence we will not perform formal hypothesis tests to determine the significance of the estimated parameters. However, the ratio between the estimated standard deviation and the parameter value gives an indication of its significance and is therefore interesting to assess.
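The summary over the random draws can be sketched as follows. This is an illustration under assumptions, not the code used in this research: the function name `summarize_draws` and the regressor name `x` are hypothetical, the five values are made up, and the sample standard deviation (dividing by n - 1) is assumed.

```python
import statistics


def summarize_draws(draws):
    """Average and sample standard deviation of each coefficient across
    draws, plus the ratio |avg| / stdv as an informal indicator."""
    summary = {}
    for name in draws[0]:
        values = [draw[name] for draw in draws]
        avg = statistics.mean(values)
        stdv = statistics.stdev(values)  # assumption: n - 1 in the denominator
        summary[name] = {
            "avg": avg,
            "stdv": stdv,
            "ratio": abs(avg) / stdv if stdv > 0 else float("inf"),
        }
    return summary


# Hypothetical estimates of one coefficient from five random draws.
draws = [{"x": 0.2}, {"x": 0.4}, {"x": 0.3}, {"x": 0.5}, {"x": 0.1}]
summary = summarize_draws(draws)
```

A ratio well above 1 means that the coefficient is large compared to its spread across draws; with only five draws this remains an informal indication rather than a test.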

For all four transitions, the optimal number of mass points of the discrete frailty distribution according to the AIC and BIC was 1. This implies that the model reduces to the standard Cox proportional hazard model. However, AIC and BIC are not the standard procedures for determining the optimal number of mass points of the frailty distribution. The method of Heckman and Singer (1984a) is the usual procedure to estimate the optimal number of mass points. Given its complexity and the limited time we have not implemented the method of Heckman and Singer (1984a) but instead fixed the number of mass points of the frailty distribution to 2. In the sensitivity analysis the implications of this assumption will be assessed.


the model. In the thesis of Agosti (2017) it is also shown that patient specific frailty gives unreliable estimates for the frailties in the model of Gasperoni (2019). The large number of populations in relation to the small number of transitions is possibly the reason for this. Since the package of Gasperoni (2019) is the only R package that is available on discrete frailties in survival models, and time was too limited to develop our own algorithm, we decided to report the possibly wrong frailties. In the sensitivity analysis it will be investigated what the effect of frailty is on the estimated coefficients, and hence how large the impact of this error is.

For all tables below the coefficients of the proportional hazard model are shown relative to the reference group. The reference group consists of males in the age group 18-30, not living in a deprived area, with no additional insurance, no deductible, not in the pharmaceutical cost group, multi-year high cost group or diagnosed cost group, and living with multiple people at an address. Besides the β coefficients of the MPH model, the exponent of the coefficient is also given. Note that all regressors in the model are indicator variables for certain groups. The exponent of the coefficient can be interpreted as the ratio between the hazard rate of the group to which the coefficient belongs and that of the reference group. The standard deviation, indicated with ’stdv’, is the standard deviation of the β coefficient and not of its exponent.
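The step from a coefficient to a hazard ratio and a percentage difference can be written out as a small helper. The function names `hazard_ratio` and `percent_change` are hypothetical; the two β values are the coefficients reported for Female and Deductible in the transition healthy −→ long term illness.

```python
import math


def hazard_ratio(beta):
    """Ratio between the hazard of the group and that of the reference group."""
    return math.exp(beta)


def percent_change(beta):
    """Percentage difference in hazard relative to the reference group."""
    return 100.0 * (math.exp(beta) - 1.0)


female = hazard_ratio(0.128)        # about 1.137
deductible = hazard_ratio(-0.954)   # about 0.385
female_pct = percent_change(0.128)  # between 13% and 14% higher
```

A hazard ratio above 1 means a higher hazard than the reference group, and a ratio below 1 means a lower hazard.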

4.1 Healthy −→ long term illness

In Table 5 the estimated coefficients for the transition from healthy to long term illness are shown. It shows the effect of different personal characteristics on the hazard rate to become long term ill, where the effect may or may not be causal.

The signs of most coefficients are intuitive. Older individuals have a larger hazard to become long term ill relative to the reference group, and the effect gets larger and more significant as age increases.

Women are almost 14% more likely to become long term ill than men. In a study by CBS (2017) it was found that women are more likely to take sick leave from work. A possible reason is pregnancy related complications. Even though maternity care and obstetric care are not included in the estimation of the model, there can still be treatments related to pregnancy that are in one of the included claims categories.


An individual with a deductible is more than 60% less likely to become long term ill. This strong effect is probably also due to self selection. Only if an individual is not expecting to have high costs it is rational to take a deductible. Another effect is that it is financially unattractive for individuals with a deductible to see a specialist. Hence they are incentivised to avoid claims.

Pharmaceutical cost group and multi-year high cost group are both associated with a higher risk to become long term ill. This is not surprising since the individuals in those groups can be expected to be relatively unhealthy, and therefore have an increased probability to become long term ill.

Individuals in the diagnosed cost group are also more likely to become long term ill. For the same reason as with the pharmaceutical cost group and multi-year high cost group this is not surprising. However, the effect is not very strong compared to Pharmaceutical cost group and multi-year high cost group.

The coefficient for one person address does not seem to be significant. In Zorginstituut Nederland (2018) it can be seen that the compensation amounts from the compensation fund for individuals living on their own are small, which suggests that the effect on healthcare costs and therefore on long term illness is minor. Hence this is in line with expectations.

| | β | exp β | stdv |
| --- | --- | --- | --- |
| Age group 0-17 | -0.04 | 0.961 | 0.057 |
| Age group 31-50 | 0.29 | 1.336 | 0.097 |
| Age group 51-70 | 0.698 | 2.01 | 0.087 |
| Age group 71+ | 0.884 | 2.421 | 0.107 |
| Female | 0.128 | 1.137 | 0.035 |
| Deprived area | 0.09 | 1.094 | 0.133 |
| Additional insurance 1 | -0.054 | 0.947 | 0.123 |
| Additional insurance 2 | 0.098 | 1.103 | 0.081 |
| Additional insurance 3 | 0.23 | 1.258 | 0.107 |
| Deductible | -0.954 | 0.385 | 0.101 |
| Pharmaceutical cost group | 0.686 | 1.986 | 0.076 |
| Multi-year high cost group | 0.602 | 1.826 | 0.052 |
| Diagnosed cost group | 0.264 | 1.302 | 0.13 |
| One person address | 0.018 | 1.018 | 0.1 |

| frailty | avg | stdv |
| --- | --- | --- |
| p1 | 0.782 | 0.004 |
| p2 | 0.218 | 0.004 |
| v2/v1 | 1131.8 | 96.835 |


4.2 Long term illness −→ healthy

The coefficient estimates of this transition (shown in Table 6) are of special interest since this study is focused on the duration of long term illness. Hence it is of interest to analyze what personal characteristics improve the rate of recovery, i.e. the transition hazard from long term illness to healthy.

Again, most results are logical. An increasing age has a negative effect on the rate of recovery. Women have a higher rate of recovery than men although this effect does not seem to be significant.

Just like in the previous section living in a deprived area does not seem to have a significant effect.

Additional insurance is associated with a lower recovery rate, which makes sense because of self selection. But again, compared to the standard deviation the effect is small.

Individuals with a deductible do not seem to have a significantly different recovery rate than the reference group, based on the size of the standard deviation compared to the coefficient. This is an interesting result since there is a very strong effect of the deductible on the transition from healthy to long term illness. Hence it seems like individuals with a deductible are less inclined to become long term ill, but once they are long term ill they do not have a significantly higher recovery rate.

The pharmaceutical cost group, multi-year high cost group and diagnosed cost group give the strongest results. All categories have a significantly lower recovery rate than the reference group. This holds especially for the multi-year high cost group, where the recovery rate is more than 35% lower than in the reference group.


| | β | exp β | stdv |
| --- | --- | --- | --- |
| Age group 0-17 | 0.05 | 1.051 | 0.043 |
| Age group 31-50 | -0.004 | 0.996 | 0.053 |
| Age group 51-70 | -0.048 | 0.953 | 0.044 |
| Age group 71+ | -0.184 | 0.832 | 0.060 |
| Female | 0.022 | 1.022 | 0.028 |
| Deprived area | -0.056 | 0.946 | 0.051 |
| Additional insurance 1 | -0.02 | 0.98 | 0.043 |
| Additional insurance 2 | -0.02 | 0.98 | 0.072 |
| Additional insurance 3 | -0.028 | 0.972 | 0.074 |
| Deductible | 0.072 | 1.075 | 0.137 |
| Pharmaceutical cost group | -0.1625 | 0.850 | 0.024 |
| Multi-year high cost group | -0.438 | 0.645 | 0.071 |
| Diagnosed cost group | -0.144 | 0.868 | 0.034 |
| One person address | -0.01 | 0.99 | 0.051 |

| frailty | avg | stdv |
| --- | --- | --- |
| p1 | 0.572 | 0.004 |
| p2 | 0.428 | 0.004 |
| v2/v1 | 299.986 | 17.776 |

Table 6: MPH model coefficients for the transition: long term illness −→ healthy.

4.3 Healthy −→ deceased

As mentioned in the Data section, 4% of the population died during the observation period. Therefore, there are not many events in the data used to estimate the coefficients of the MPH model, leading to many censored observations. This may lead to inaccurate estimates. This is indeed what the standard errors suggest, which are for both the transition healthy −→ deceased, and long term illness −→ deceased, much larger than for the transitions in the previous sections.


Living in a deprived area increases the hazard to die, which is intuitive. But this coefficient is also of the same order of magnitude as the standard deviation.

An additional insurance is associated with a lower hazard to die, which is the opposite of what one would expect considering self selection. Possibly individuals with an additional insurance are more focused on their health and therefore more healthy. However, for all additional insurance classes the coefficient is smaller than the standard deviation.

A deductible is associated with a smaller hazard to die, which is in line with the theory that there is self selection amongst healthy individuals. The coefficient is large but the standard deviation is also large.

Individuals in pharmaceutical cost group, multi-year high cost group and diagnosed cost group all have an increased hazard to die. Since this concerns the hazard rate from healthy to deceased, this includes individuals that were not ill before death, but were diagnosed or treated before the observation period started in 2012. Hence it might be the case that these individuals were ill but not treated anymore just before they died and hence not classified as ill. Another possibility is that the individuals were ill but not classified as ill in the clustering algorithm because the treatment costs were too low or irregular.


| | β | exp β | stdv |
| --- | --- | --- | --- |
| Age group 0-17 | 2.636 | 13.957 | 11.461 |
| Age group 31-50 | 12.612 | 300138 | 5.58 |
| Age group 51-70 | 14.216 | 1492555 | 6.157 |
| Age group 71+ | 16.212 | 10984546 | 6.165 |
| Female | -0.276 | 0.759 | 0.227 |
| Deprived area | 0.364 | 1.439 | 0.337 |
| Additional insurance 1 | -0.116 | 0.890 | 0.594 |
| Additional insurance 2 | -0.29 | 0.748 | 0.647 |
| Additional insurance 3 | -0.184 | 0.832 | 0.412 |
| Deductible | -3.732 | 0.024 | 6.853 |
| Pharmaceutical cost group | 0.434 | 1.543 | 0.16 |
| Multi-year high cost group | 1.446 | 4.246 | 0.925 |
| Diagnosed cost group | 0.382 | 1.465 | 0.616 |
| One person address | 0.802 | 2.23 | 0.56 |

| frailty | avg | stdv |
| --- | --- | --- |
| p1 | 0.992 | 0.007 |
| p2 | 0.008 | 0.007 |
| v2/v1 | 1022.674 | 1139.158 |

Table 7: MPH model coefficients for the transition: healthy −→ deceased.

4.4 Long term illness −→ deceased

In Table 8 the parameter estimates for the transition long term illness −→ deceased are shown. As mentioned before, the standard errors are large for the parameter estimates in this transition. Hence most coefficients do not seem to be significant. Nevertheless, we will assess the coefficients to see if their sign and order of magnitude make sense. The age effect is again in line with CBS statline (2018), apart from the negative coefficient of age group 0-17. However, the standard deviation is larger than the coefficient so it is difficult to draw conclusions from this result.

Women in long term illness have a lower hazard rate to die than men, which is in line with the higher life expectancy of women. Moreover, as we have seen before, women have a larger hazard rate to become long term ill, which might be explained by pregnancy related complications. These complications are usually not terminal, so this might contribute to a lower hazard to die for women in long term illness.

Just like in the previous section, individuals living in a deprived area are more likely to die.


the coefficients this seems to be a structural effect.

There is again a negative coefficient for the deductible, which is logical because of self selection. The effect is less strong than in the previous section. For the transition long term ill −→ healthy, and healthy −→ long term ill, we also observed that the influence on the hazard rate is the most significant for individuals that were initially healthy.

Individuals that are long term ill and in pharmaceutical cost group, multi-year high cost group or diagnosed cost group do not seem to differ significantly from the reference group in terms of hazard to die.

Similar to the result in the previous section, individuals living on their own are twice as likely to die as individuals living with multiple people at one address. However, the standard error is considerably larger than in the previous section, which makes the result less significant.

| | β | exp β | stdv |
| --- | --- | --- | --- |
| Age group 0-17 | -3.738 | 0.024 | 5.544 |
| Age group 31-50 | 3.192 | 24.337 | 6.449 |
| Age group 51-70 | 5.376 | 216.156 | 6.049 |
| Age group 71+ | 6.462 | 640.34 | 6.037 |
| Female | -0.48 | 0.619 | 0.623 |
| Deprived area | 0.292 | 1.339 | 0.446 |
| Additional insurance 1 | -0.068 | 0.934 | 0.396 |
| Additional insurance 2 | -0.646 | 0.524 | 0.644 |
| Additional insurance 3 | -0.728 | 0.483 | 0.58 |
| Deductible | -0.472 | 0.624 | 0.992 |
| Pharmaceutical cost group | -0.152 | 0.859 | 0.408 |
| Multi-year high cost group | -0.074 | 0.929 | 0.458 |
| Diagnosed cost group | 0.13 | 1.139 | 0.723 |
| One person address | 0.69 | 1.994 | 0.946 |

| frailty | avg | stdv |
| --- | --- | --- |
| p1 | 0.996 | 0.005 |
| p2 | 0.004 | 0.005 |
| v2/v1 | 2386.498 | 1402.94 |

Table 8: MPH model coefficients for the transition: long term illness −→ deceased.

4.5 Sensitivity analysis


The results in the previous section are produced using a discrete frailty term with 2 mass points. First we will check the variation in results if we change the number of mass points to 4 or if we choose a gamma distribution instead of a discrete distribution for the frailty. The resulting coefficient estimates are shown in Tables 13-16. Heckman and Singer (1984b) show that different specifications of the frailty distribution can lead to significantly different results. However, the variation in outcomes in this research is not extremely large. Hence we conclude that the model is not very sensitive to the chosen frailty distribution.


5 Conclusion and Discussion

For all feasible transitions between the states healthy, long term ill and deceased, the influence of a selection of personal characteristics on the transition hazards has been studied. An attempt has been made to incorporate frailty and to assess its influence on the estimates.

The research question of this research is: What is the influence of observed and unobserved variables on the hazard rate to go from long term illness to deceased, and the hazard rate to go from long term ill to healthy? The question is answered by the coefficients in results sections 4.2 and 4.4. The results are mainly in line with what one might expect, as explained in the aforementioned sections. The parameters concerning the influence of personal characteristics on the hazard rate to go from long term illness to deceased have high standard deviations, which makes it difficult to draw conclusions about their influence. Possibly this problem can be solved by taking samples larger than 10,000 individuals.

The influence of frailty in the model (i.e. unobserved variables) seems to be small, since the choice of frailty distribution did not have a large influence on the parameter estimates according to the sensitivity analysis. Possibly, this is the result of the fact that the frailties were estimated independently between transitions. Heckman and Walker (1990) suggest a model with a non-transition specific individual frailty, with factors that transform the non-transition specific frailty into a transition specific frailty. This ensures that there is correlation between the frailties. Another factor that complicates the frailty estimation is the fact that there are not many events per individual. An interesting result in this research is that the β coefficient for the regressor deductible was large in the transitions healthy −→ long term illness and healthy −→ deceased. A logical explanation is self selection of healthy clients. Observing the deductible as an insurer reduces the information asymmetry. Spierdijk and Koning (2013) note that information asymmetry might lead to unobserved heterogeneity, and hence knowledge about the deductible may reduce the significance of a frailty term.


6 Bibliography

Agosti (2017). Applications of nonparametric frailty models for the analysis of long term survival in Heart Failure patients. https://www.politesi.polimi.it/bitstream/10589/140100/1/2018_04_Agosti.pdf. Online; accessed 28 June 2019.

Van den Berg, G.J. (1997). Association measures for durations in bivariate hazard rate models. Journal of Econometrics (79), 221–245.

Van den Berg, G.J. (2001). Duration models: specification, identification and multiple durations. in: J.J. Heckman & E.E. Leamer (ed.), Handbook of Econometrics, edition 1, volume 5, chapter 55 , 3381–3460.

Bijwaard, G.E. (2014). Unobserved heterogeneity in multiple-spell multiple-states duration models. Demographic Research (30), 1591–1620.

Blossfeld, H.P. and A. Hamerle (1992). Unobserved heterogeneity in event history models. Quality and Quantity (2), 157–168.

Cameron, A.C. and P.K. Trivedi (2005). Microeconometrics methods and applications. Cambridge.

CBS (2017). Verschillen in ziekteverzuim tussen bedrijfstakken. https://www.cbs.nl/nl-nl/achtergrond/2017/27/verschillen-in-ziekteverzuim-tussen-bedrijfstakken. Online; accessed 28 June 2019.

CBS statline (2018). Overledenen; geslacht, leeftijd, burgerlijke staat, regio. https://opendata.cbs.nl/statline/#/CBS/nl/dataset/03747/table?fromstatweb. Online; accessed 11 June 2019.

Cox, D.R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society (2), 187–220.

Czado, C. and F. Rudolph (2002). Application of survival analysis methods to long-term care insurance. Insurance: mathematics & economics (31), 395–413.

Elbers, C. and G. Ridder (1982). True and spurious duration dependence: The identifiability of the proportional hazard model. Review of Economic Studies (49), 403–409.

Fitzmaurice, G.M., N.M. Laird, and J.H. Ware (2011). Applied Longitudinal Analysis.


Gasperoni, F. (2019). Package ‘discfrail’. https://cran.r-project.org/web/packages/discfrail/discfrail.pdf. Online; accessed 23 May 2019.

Gasperoni, F., F. Ieva, A.M. Paganoni, C.H. Jackson, and L. Sharples (2018). Nonparametric frailty cox models for hierarchical time to event data. Biostatistics.

Govindarajulu, U.S., H. Lin, L. Lunetta, and R.B. D’Agostino (2011). Frailty models: Applications to biomedical and genetic studies. Statistics in medicine (30), 2754–2764.

Heckman, J. and B. Singer (1984a). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica (2), 271–320.

Heckman, J. and B. Singer (1984b). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica (52), 271–320.

Heckman, J.J. and J.R. Walker (1990). The relationship between wages and income and the timing and spacing of births: Evidence from swedish longitudinal data. Econometrica (58), 1411–1441.

Hougaard, P. (2000). Analysis of multivariate survival data. Springer.

Hull, J.C. (2015). Risk Management and Financial Institutions. John Wiley & sons.

Kaas, R., M. Goovaerts, J. Dhaene, and M. Denuit (2008). Modern Actuarial Risk Theory. Springer.

Laird, N. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association (364), 805–811.

Lancaster, T. (1990). The Econometric analysis of Transition Data. Cambridge.

Ministerie van Volksgezondheid, Welzijn en Sport (2004). Zorgverzekeringswet. https://zoek.officielebekendmakingen.nl/kst-29763-3.html. Online; accessed 27 February 2019.

Ministerie van Volksgezondheid, Welzijn en Sport (2019a). Verpleging en verzorging. https://www.informatielangdurigezorg.nl/volwassenen/verpleging-verzorging. Online; accessed 9 June 2019.

Ministerie van Volksgezondheid, Welzijn en Sport (2019b). Wet langdurige zorg. https://wetten.overheid.nl/BWBR0035917/2019-04-02. Online; accessed 9 June 2019.


//wetten.overheid.nl/BWBR0018450/2019-01-01. Online; accessed 27 February 2019.

NZA (2019). Prestatie- en tariefbeschikking huisartsenzorg en multidisciplinaire zorg 2019 - TB/REG-19619-03. https://puc.overheid.nl/nza/doc/PUC_247681_22/1/. Online; accessed 24 June 2019.

Putter, H., M. Fiocco, and R.B. Geskus (2007). Tutorial in biostatistics: Competing risks and multi-state models. Statistics in medicine (26), 2389–2430.

Spierdijk, L. and R.H. Koning (2013). Estimating outstanding claim liabilities: The role of unobserved risk factors. Journal of Risk & insurance (4), 803–830.

Staatscourant (2019). Vaststelling loongrens ziekenfondsverzekering. https://zoek.officielebekendmakingen.nl/stcrt-2004-209-p18-SC67117.html. Online; accessed 11 June 2019.

Stichting Farmaceutische Kengetallen (2018). Data en feiten 2018. https://www.sfk.nl/publicaties/data-en-feiten/data-en-feiten-2018. Online; accessed 9 June 2019.

US department of health and human services (2019). Summary Health Statistics for the U.S. Population: National Health Interview Survey, 2012. https://www.cdc.gov/nchs/data/series/sr_10/sr10_259.pdf. Online; accessed 3 June 2019.

Van Veen, S.H.C.M. (2016). Evaluating and Improving the Predictive Performance of Risk Equalization Models in Health Insurance Markets. Erasmus University Rotterdam.

Vektis (2018). Verzekerden in beeld 2018. https://www.vektis.nl/uploads/Publicaties/Zorgthermometer/Zorgthermometer%20Verzekerden%20in%20Beeld%202018.pdf. Online; accessed 21 May 2019.

Vektis (2019). Minder mensen maken gebruik van Wlz. https://www.vektis.nl/nieuws/minder-mensen-maken-gebruik-van-wlz. Online; accessed 11 June 2019.

World Health Organization (2019). Current health expenditure. https://data.worldbank.org/indicator/SH.XPD.CHEX.GD.ZS. Online; accessed 27 February 2019.

Yu, B. (2008). A frailty mixture cure model with application to hospital readmission data. Biomedical journal (50), 386–394.


//www.zorginstituutnederland.nl/financiering/publicaties/publicatie/2018/10/11/wor-930-rapport-normbedragen-2019. Online; accessed 28 June 2019.


7 Appendix A: Supplement to the Data section

7.1 Excluded claims

Below it is explained for each category of the claims data (dataset 1, in the Data chapter) why it was included or excluded.

• General practitioner costs: not included, because claims from a general practitioner are usually the result of a consultation or a minor treatment. Hence these costs are usually not related to long term illness.

• Pharmaceutical costs: included, because medication is often related to chronic diseases and long lasting conditions (Stichting Farmaceutische Kengetallen, 2018). As can be seen in Table 3, pharmaceutical claims are a very large proportion of all claims.

• Nursing costs: included. Clients only receive nursing if they are seriously ill or have a high risk of becoming seriously ill due to weakness (Ministerie van Volksgezondheid, Welzijn en Sport, 2019a).

• Dental/oral care: not included. For children all dental care is included in the basic insurance. For adults only specific care such as oral surgery is included in the basic insurance. This complicates the interpretation of claims in this category. Furthermore oral care is usually not related to long term illness.

• Obstetric care: not included because it is (obviously) not related to long term illness.

• Medical specialist costs: included, because they may indicate long term illness. After pharmaceutical costs, it is the category with the highest number of claims that are included in the estimation.

• Physiotherapy costs: included. It is important to note that the physiotherapy claims mentioned here only include claims that are covered by basic insurance. This means that the claims included generally originate from serious conditions.

• Medical accessories: included. Only certain medical accessories relating to medical conditions are covered, which suggests long term illness.


• Mental healthcare claims: not included. This research will focus on physical conditions only. In recent years the laws concerning the coverage of claims relating to mental illness have changed in such a way that it cannot be assumed that the data generating process has been constant over time in the dataset.

• Geriatric rehabilitation costs: included. This concerns care for elderly people in a medical rehabilitation center. This can be seen as part of an illness period.

• Maternity care: not included, for the same reason as obstetric care.

• Claims in the category ’other’: not included, since it is not clearly defined what this category contains. Also, the definition of the category has changed over the years.

• Cross-border care: excluded. We assume that most of the care that takes place outside of the Netherlands is due to incidents not relating to long term illness.

In addition to the claims from the excluded categories, certain other claims are excluded from this research. Below it is explained which claims this concerns and why they are excluded.

• Claims from individuals living in a care home: We exclude all claims from individuals living at an address that is known to be a care home. Care received in a care home is not paid for by a health care insurer, because it is funded directly by the government. This is a consequence of the ’Wet Langdurige Zorg’ (Ministerie van Volksgezondheid, Welzijn en Sport, 2019b). Hence an individual entering a care home will usually have fewer claims than just before entering the care home. Including the claims data of individuals entering a care home as input for the cluster algorithm might lead to a false observation of a transition from long term illness to healthy. One could of course define living in a care home as an extra state, which would imply a four state system. However, it is not known for every client whether they live in a care home or not, since the list of addresses of care homes is not comprehensive. Hence this state could not be established with certainty. After excluding the people that are known to live in a care home, there will still be people in the dataset that are living in a care home. Given the information that is accessible to a health care insurer, this cannot be avoided.


have not yet been reported to the insurance company (Kaas et al., 2008). When establishing the state of an individual in a certain month based on claims, it is important that all claims that originated in that month are known. Otherwise an individual might incorrectly be classified as healthy, while in reality there might be one or more IBNR claims that would have resulted in a long term illness classification. To avoid this problem, only claims that originated before 2018 are included in the model. The distribution of the delay time between occurrence and reporting is such that, for the purpose of this research, it can be assumed that all claims have arrived after one year. Hence the data that is used are claims from 2012 up to and including 2017.

• Claims from individuals that were not insured at Menzis the whole period [01-2012 - 12-2017], or were not insured the whole period until death: Only claims from individuals that were insured the whole period from 2012 up to and including 2017, or people that were insured in this period until death, will be included in the estimation. This is to reduce the number of short, incomplete observations. Since the data set is extremely large and not all observations can be included anyway, this selection is convenient and has few disadvantages. The underlying assumption is that the individuals in this subgroup do not significantly differ from the group that was excluded, apart from observed covariates.
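The selection rules of this section can be summarised in a short sketch. The Python fragment below is only an illustration under stated assumptions: the record layout (`Client`, `Claim`), the field names and the category labels are hypothetical and do not describe the actual data structure. Only the rules themselves (the included categories, treatment dates from 2012 up to and including 2017, and insurance during the whole period or until death) are taken from the text above. The care home exclusion is left out, because it depends on an address list that is not described here.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Categories that are included according to this section.
# The label strings are hypothetical; the real data may code them differently.
INCLUDED_CATEGORIES = {
    "pharmaceutical",
    "nursing",
    "medical_specialist",
    "physiotherapy",
    "medical_accessories",
    "geriatric_rehabilitation",
}

WINDOW_START = date(2012, 1, 1)   # first treatment date that is used
WINDOW_END = date(2017, 12, 31)   # last treatment date that is used (IBNR cut-off)


@dataclass
class Client:
    client_id: int
    insured_from: date
    insured_until: date                  # last day of coverage
    date_of_death: Optional[date] = None


@dataclass
class Claim:
    client_id: int
    category: str
    treatment_date: date
    amount: float


def is_eligible(client: Client) -> bool:
    """Insured during the whole window, or from its start until death."""
    if client.insured_from > WINDOW_START:
        return False
    if client.insured_until >= WINDOW_END:
        return True
    # Coverage ended early: only acceptable if it ended because of death.
    death = client.date_of_death
    return (
        death is not None
        and WINDOW_START <= death <= WINDOW_END
        and client.insured_until >= death
    )


def select_claims(claims: List[Claim], clients: List[Client]) -> List[Claim]:
    """Apply the eligibility, category and date-window rules to the claims."""
    eligible_ids = {c.client_id for c in clients if is_eligible(c)}
    return [
        cl for cl in claims
        if cl.client_id in eligible_ids
        and cl.category in INCLUDED_CATEGORIES
        and WINDOW_START <= cl.treatment_date <= WINDOW_END
    ]
```

Keeping the eligibility check on the client level and the category and date checks on the claim level mirrors the two kinds of exclusions discussed above: exclusions of whole categories and exclusions of specific claims or individuals.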

7.2 Seasonality

Figure 5 shows the average monthly claims amount from 2012 until 2018. This graph will be used to investigate seasonality. It is based on claims data up to and including 2018. There is a sharp drop in the claims amount for treatment months near the end of 2018, due to the unknown IBNR claims. It is also interesting to see that the health care costs are rising, which is consistent with the statements in the introduction about rising health care costs.
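A series such as the one in Figure 5 could be computed along the following lines. This is a minimal sketch and not the procedure that was actually used: the input format (pairs of treatment date and claim amount) and both function names are assumptions, and "average" is taken here to mean the average amount per claim within a treatment month, which may differ from the definition behind the figure.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def average_claim_by_month(
    claims: Iterable[Tuple[date, float]],
) -> Dict[Tuple[int, int], float]:
    """Average claim amount per (year, month) of treatment."""
    totals: Dict[Tuple[int, int], float] = defaultdict(float)
    counts: Dict[Tuple[int, int], int] = defaultdict(int)
    for treatment_date, amount in claims:
        key = (treatment_date.year, treatment_date.month)
        totals[key] += amount
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in sorted(totals)}


def seasonal_profile(monthly: Dict[Tuple[int, int], float]) -> Dict[int, float]:
    """Average the monthly values over the years, per calendar month."""
    sums: Dict[int, float] = defaultdict(float)
    years: Dict[int, int] = defaultdict(int)
    for (_, month), value in monthly.items():
        sums[month] += value
        years[month] += 1
    return {month: sums[month] / years[month] for month in sorted(sums)}
```

Comparing the values of `seasonal_profile` across calendar months gives a first impression of seasonality, although the rising trend in costs is not removed in this simple version.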
