(1)

MASTER THESIS

PERFORMANCE OF A HIERARCHICAL PROGNOSTIC MODEL IN CARDIAC SURGERY BASED ON ADMINISTRATIVE DATA

Alexander Scheffer

September 2nd, 2015


PERFORMANCE OF A HIERARCHICAL PROGNOSTIC MODEL IN CARDIAC SURGERY BASED ON ADMINISTRATIVE DATA

Student
Alexander Scheffer
Student ID: 5800978
E-mail: a.scheffer@performation.com

Mentor
W. van Dijk, MSc, X-IS

Tutor
Prof. Dr. A. Abu-Hanna
Head of department, professor
Department of Medical Informatics, AMC-UvA

Locations

Performation Healthcare Intelligence B.V.
Sweelincklaan 1

3723 JA Bilthoven

Department of Medical Informatics

Academic Medical Center - University of Amsterdam
Meibergdreef 15

1105 AZ Amsterdam
The Netherlands

Period


Contents

Summary
Samenvatting
1. Introduction
   1.1 Context
   1.2 Goal & approach
   1.3 Outline
2. Preliminaries
   2.1 Cardiac surgery
   2.2 Quality of care
   2.3 Prognostic models
       2.3.1 What are prognostic models?
       2.3.2 Development of prognostic models
       2.3.3 Performance measurement of prognostic models
       2.3.4 Validation
   2.4 Benchmarking
       2.4.1 Data and data quality
       2.4.2 Gaming
3. Methods
   3.1 Data
       3.1.1 Candidate predictive variables
   3.2 Model building and evaluation
4. Results
   4.1 Data
       4.1.1 Selection of candidate variables
   4.2 Prediction model
   4.3 Simulation
5. Discussion
   5.1 Main findings
   5.2 Strengths & weaknesses
   5.3 Implications of the study
   5.4 Future research
References
Appendix A – Definition of candidate variables
   Angina pectoris
   Cardiac arrhythmia
   Myocardial infarction
   Renal function
   Endocarditis
   Diabetes & insulin pump therapy
   COPD
   Heart failure
   Peripheral artery disease
   Cerebrovascular disease
   CVA and CVA timing
   Hypercholesterolemia
   Hypertension
   Previous PCA
   History of CABG operation
   History of valve operation
   History of cardiac operations
   History of AICD
   History of pacemaker
   Number of arteries
   Number of valves


Summary

Introduction

Benchmarking is used in quality of care assessment by comparing the actual outcomes of a care provider to the expected outcomes. A prognostic model predicts the expected outcomes using patient characteristics (case mix). However, no prognostic model for cardiac surgery based on Dutch administrative data, which are easily available, currently exists. In this thesis we develop and evaluate such a model.

Methods

We used the administrative data of three Dutch hospitals to create a multilevel logistic regression model. For the selection of candidate variables, we considered the Society of Thoracic Surgeons’ Data specifications used for their data collection and risk models.

The model was created using stepwise regression. We assessed the model's performance using the Area Under the ROC Curve (AUC), the p-value of the Hosmer-Lemeshow (HL) C-statistic, and the Brier Skill Score (BSS), while correcting for optimism using bootstrapping. Finally, we simulated the effect of changing patients' severity of illness on the outcomes.

Results

Our dataset contained 9729 patients, of whom 73 were excluded due to missing outcome values. The final model had 30 fixed effects. The model had an AUC of 0.79, a p-value of the HL C-test of 0.75, and a BSS of 0.07. The simulation indicated a stable corrected outcome for the three hospitals when increasing the patients' severity of illness, but an unstable corrected outcome for lower severity of illness.

Discussion

Our model performed well on discrimination. However, it added little in terms of accuracy, as measured by the BSS. Simulation showed that the poor accuracy was particularly present in patients with a low predicted probability of mortality. Due to the low accuracy, the model should be used for benchmarking purposes with caution. Further improvements to the model are advised, as well as a comparison with existing prediction models based on clinical data.


Samenvatting

Introduction

Benchmarking is used in the evaluation of the quality of care by comparing actual outcomes with expected outcomes. A prognostic model predicts the expected outcomes using patient characteristics (case mix). A prognostic model for case-mix correction in cardiac surgery, based on relatively easily obtainable Dutch administrative data, does not yet exist.

Methods

We used administrative data from three Dutch hospitals to create a multilevel logistic regression model. For the selection of candidate variables we used the Society of Thoracic Surgeons' data specifications, which they use for data collection and prognostic models.

The model was built using stepwise regression. We assessed the model's performance using the Area Under the ROC Curve (AUC), the p-value of the Hosmer-Lemeshow (HL) C-statistic and the Brier Skill Score (BSS), correcting for optimism by means of bootstrapping. Finally, we simulated the effect of a change in patient severity on the outcome.

Results

We included 9729 patients in our dataset; 73 patients were excluded because of missing outcomes. The final model had 30 explanatory variables. The model had an AUC of 0.79, a p-value of the HL C-statistic of 0.75 and a BSS of 0.07. The simulation indicated a stable corrected outcome for the three hospitals at increased patient severity, but an unstable corrected outcome at lower patient severity.

Discussion

Our model performed well on discrimination. However, the model added little in terms of accuracy, as shown by the BSS. The simulation showed that the low accuracy occurred mainly in patients with a low predicted probability of death. Because of the low accuracy, the model can only be used for benchmarking with caution. Further improvements to the model are advised, as well as a comparison with existing prognostic models based on clinical data.


1. Introduction

This thesis is part of, and the result of, the Scientific Research Project, a mandatory part of the master's program in Medical Informatics at the University of Amsterdam. During this research project, the student carries out a research project, writes a thesis, and defends the results. The research project of this thesis took place at the Academic Medical Center (AMC) in Amsterdam and at Tragpi (Performation as of 2013) in Bilthoven. The following paragraphs present the context, research question, and general approach of this research project and thesis.

1.1 Context

Prognostic models are mathematical models that predict a particular outcome. They are widely used in medicine today, for a wide range of purposes. One of these purposes is benchmarking, in which the performance of participating units is compared. The idea is that the model's predictions for a given unit are compared to the actual outcomes of that unit. In this sense the model represents the expected outcomes of these patients had they been treated by a virtual unit with average quality of care. Prognostic models may use clinical data, but the use of administrative data [1, 2, 3, 4] is more attractive, as administrative data is easier to obtain.

A prognostic model based on Dutch hospital administrative data to assist in case-mix correction for benchmarking cardiac surgery does not currently exist. Furthermore, most prognostic models do not correct for the clustering of patients within hospitals and specialists, making benchmarking of individual specialists unjustified [5].

1.2 Goal & approach

The objective of this thesis is to develop a prognostic model for mortality in conventional CABG and valve surgery using Dutch administrative data and assess the model’s predictive value. The main research question in this thesis is:

“What is the predictive value of a multilevel prediction model for mortality of conventional CABG and valve surgery based on Dutch administrative data (e.g. age, gender, diagnoses and care activities)?”

We have translated this research question into several sub questions:

1. How can we create a multilevel prediction model for mortality of conventional CABG and valve surgery based on Dutch administrative data?

2. What is the predictive value of this model?

3. What is the model’s performance for a change in patient severity?

To answer the first question, we looked at existing prediction models for cardiac surgery for relevant candidate variables. Data was gathered from the database of Performation, a company providing quality and cost management solutions in healthcare, and candidate variables were checked for usability. Finally, a model was created using the statistical environment R and a modified stepwise regression method. For the second question, three performance measures were used to assess the model's predictive value. To answer the last question, we carried out a simulation to see how the model would perform and what the hospitals' benchmark results could look like.

1.3 Outline

The following chapter will provide background information about the clinical setting, the assessment of quality of care, (the development of) prognostic models, and benchmarking. Next, we will describe the methods used (chapter 3) and the results found (chapter 4), and we will discuss the main findings, strengths and limitations of this study, and recommendations for future research in the discussion (chapter 5).


2. Preliminaries

In this chapter we will provide background information about the setting and methods relevant to this thesis. First, we will describe cardiac bypass operations and valve surgery. Next, we discuss the assessment of quality of care. We will then describe prognostic modeling to support the assessment of quality of care and finally, we will discuss benchmarking in the healthcare setting.

2.1 Cardiac surgery

In the Netherlands, cardiovascular disease is the number one cause of death for females and the number two for males [6]. It includes cardiac diseases (such as coronary heart disease, arrhythmias, heart infections and congenital diseases), vascular diseases of the brain and kidneys, and peripheral vascular disease [7]. Within this group, coronary artery disease is the most common.

Coronary artery disease is a disease in which the arteries of the heart become narrowed, leading to angina, loss of heart tissue and function, and death. To relieve a patient of angina and reduce the risk of death when the arteries become too narrow (stenosis), it is possible to use medication, perform a percutaneous coronary intervention (PCI), or perform a coronary artery bypass graft (CABG). In a CABG operation, arteries or veins are taken from the patient's body and grafted to the coronary arteries to bypass narrowed or blocked arteries and restore blood flow to the myocardium. Normally, the heart is stopped during the operation and circulation and oxygenation are taken over by a heart-lung machine, although it is nowadays also possible to operate on a beating heart. A PCI restores blood flow to the heart without open-heart surgery by inserting a catheter into a blood vessel and widening and/or stenting the narrowed artery, usually by inflating a small balloon and placing a stent.

When stenosis occurs at one or more of the heart valves, valve repair or replacement can become necessary. Using valvuloplasty, a stenotic valve can be widened using a balloon catheter. Another possibility is to make small incisions in the valve; this technique is called valvulotomy. If a valve becomes too loose and does not close properly (regurgitation), valve replacement may be necessary, using either a mechanical valve or a biological valve.

The first open heart surgeries were performed in the early 1950s. Dr. F. John Lewis performed the first open heart surgery using deep hypothermic arrest in 1952, and Dr. John Heysham Gibbon reported the first successful operation using a heart-lung machine in 1953 [8]. Open heart surgery in the Netherlands is currently performed in 16 hospitals [9]. Together, they performed 16128 open heart surgeries on adults in 2009, of which 69% involved CABG surgery and 39% involved valve surgery (combined operations count toward both).

2.2 Quality of care

Quality of care can be evaluated by structure, process or outcome indicators [10]. To evaluate the last two, process and outcome, there are five methods: three implicit and two explicit. In the first three methods, a healthcare professional reviews a medical record to assess whether the care process was adequate (first method), whether better care could have improved the outcome (second method), or whether the overall quality of care was sufficient (third method). The fourth method uses predefined criteria to assess the proportion of cases in which a criterion is met. The last method measures outcome proportions using predefined criteria, for example a 30-day mortality rate after an operation. However, since the characteristics of patient populations vary across providers, a validated prognostic model is needed to assess how the measured outcomes compare to the expected outcomes.


2.3 Prognostic models

In medicine, it can be very useful to be able to predict the outcome of a particular event or treatment, whether to predict a patient's chance of survival, to select the best treatment for a particular patient, or to compare hospitals or providers by (case-mix adjusted) mortality. To be able to make an objective prediction, we need a prognostic model.

2.3.1 What are prognostic models?

Prognostic (from the Greek 'pro', before, and 'gignōskein', to know) models are tools to predict a particular outcome. To make predictions, we normally use multiple variables associated with the outcome to determine the probable outcome. To acquire the probabilities of an outcome for a model using these variables, we use statistical methods like Bayesian analysis or logistic regression. Logistic regression is one of the most commonly used approaches in medicine and will be discussed here.

Logistic regression

In regression analysis, we try to estimate the effect of the value of an independent variable X on the dependent variable Y. The effect of an independent variable is captured by the unknown parameters of the model, which are also called the coefficients, or β values.

$$Y \approx f(X, \beta)$$

For example, we could estimate the strength, or weight, of the effect ($\beta$) of salt intake ($X$) on blood pressure ($Y$) and predict the blood pressure given the amount of salt intake. Of course, salt intake is not the only independent variable for blood pressure; for a better prediction, we can also include the variables age and gender. The term regression refers to predicting a continuous variable $Y$. The term linear regression refers to a regression problem in which the relationship between the coefficients and the outcome is linear. A linear model looks like $\alpha_0 + \beta_1 x_1 + \dots + \beta_n x_n$, where $\alpha_0$ is the intercept, the $x_i$'s are the various independent variables, and the $\beta_i$'s their coefficients.

However, when applying prediction models in healthcare, we often have a binary outcome, e.g. dead or alive. A linear regression model would not be appropriate, since this could result in an outcome greater than 1 (death) or lower than 0 (survival). Furthermore, in a healthcare setting it would be more useful to explicitly predict the chance of an event. We therefore use logistic regression models, which apply the logistic transformation function to $\alpha_0 + \beta_1 x_1 + \dots + \beta_n x_n$ to restrict its value to the interval (0,1):

$$p(y = 1) = \frac{e^{\alpha_0 + \beta_1 x_1 + \dots + \beta_n x_n}}{1 + e^{\alpha_0 + \beta_1 x_1 + \dots + \beta_n x_n}}$$

Here, $p$ is the probability estimated by the regression function. To get back $\alpha_0 + \beta_1 x_1 + \dots + \beta_n x_n$, which is called the linear predictor, one applies the logit function:

$$\text{logit}(P(y = 1)) = \ln\left(\frac{p}{1 - p}\right) = \alpha_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

If for an $x_i$ the corresponding $\beta_i$ is greater than zero, we have an increased risk for the event $y = 1$, and if $\beta_i$ is lower than zero, we have a decreased risk. To obtain the odds ratio of $x_i$ one exponentiates its coefficient, $\text{OR}_i = e^{\beta_i}$, so that an odds ratio greater than one corresponds to an increased risk.
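As a minimal illustration of the above, a logistic regression and its odds ratios can be obtained in R, the environment used later in this thesis. The data and variable names here are hypothetical, not the thesis model:

```r
# Hypothetical example: 30-day mortality predicted from age and COPD status.
set.seed(1)
n  <- 1000
df <- data.frame(age  = rnorm(n, mean = 67, sd = 10),
                 copd = rbinom(n, 1, 0.15))
lp      <- -8 + 0.07 * df$age + 0.7 * df$copd  # a known linear predictor
df$died <- rbinom(n, 1, plogis(lp))            # plogis() is the logistic function

fit <- glm(died ~ age + copd, family = binomial, data = df)
exp(coef(fit))                        # odds ratios: e^beta per predictor
head(predict(fit, type = "response")) # predicted probabilities p(y = 1)
```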


Hierarchical regression models

Using logistic regression to model the real world in healthcare results in a couple of problems. One of them is that patient outcomes are correlated within the natural hierarchy of patients treated by certain specialists within certain hospitals, and even more complex structures may exist [11]. Ignoring this hierarchy leads to a violation of the assumption of independence made when creating a model. This could, for example, artificially increase the influence of very large hospitals in comparison to small ones.

To account for repeated measurements within the same unit (many patients treated by the same surgeon, many surgeons working in the same hospital, etc.) one may resort to mixed effects models. For a prediction model, one is usually only interested in the fixed effects, i.e. the effects of comorbidities (e.g. COPD and diabetes) and treatment type on the outcome. However, the effects of these factors on the outcome can differ between the specialists treating their many patients due to other factors like the experience of the specialist, treatment volume or treatment process. These factors may not be of interest to the researcher or cannot easily be put in a model, but can affect the predictive value of the model if not accounted for. In addition, all patients of the same surgeon share that surgeon's term in the model, which means that repeated measurements (patients) within the same surgeon are accounted for. We can add one or more terms to account for these random effects. When adding random effects to a fixed effects model, we get a mixed effects model. One of the possible options is to add a random intercept to a logistic model:

$$\text{logit}(P(y_{ijk} = 1)) = \alpha_{jk} + \beta_1 x_{1,ijk} + \dots + \beta_n x_{n,ijk}$$

where $i$ is the patient, $j$ is the surgeon and $k$ is the hospital. While the coefficients $\beta$ are the same (fixed) for each patient, surgeon or hospital, the intercept $\alpha_{jk}$ differs between surgeons and hospitals. Random slopes (where the coefficients vary), group-level variables and cross-level interactions are also possible, but are not used in this thesis.
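A random-intercept model of this form can be fitted with the lme4 package [20] used later in this thesis. A sketch on simulated data with hypothetical hospital and surgeon columns:

```r
library(lme4)  # Bates et al. [20]
set.seed(1)
n  <- 2000
df <- data.frame(
  age      = rnorm(n, 67, 10),
  copd     = rbinom(n, 1, 0.15),
  hospital = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  surgeon  = factor(sample(1:12, n, replace = TRUE))
)
df$died <- rbinom(n, 1, plogis(-8 + 0.07 * df$age + 0.7 * df$copd))

# Random intercepts for hospital and for surgeon nested within hospital;
# the patient-level predictors enter as fixed effects. On data without true
# group effects (as simulated here), the variances are estimated near zero.
fit_ml <- glmer(died ~ age + copd + (1 | hospital / surgeon),
                family = binomial, data = df)
summary(fit_ml)  # fixed-effect coefficients plus random-intercept variances
```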

2.3.2 Development of prognostic models

After gathering data and checking its validity, one can start developing the prediction model. The two main development strategies are using the full model (and thus including every variable) or using a stepwise method to select the most important variables. No consensus currently exists about which of these strategies is best [12].

Stepwise selection

For stepwise regression, we start with a model and either add or remove specific variables if the addition or removal improves the model. For the selection of variables, Akaike's information criterion (AIC) is often used, which measures the relative goodness of fit of a model but also penalizes model complexity. This means that one strikes a balance between model complexity (which could lead to overfitting the data) and goodness of fit. Selecting variables, and thus reducing complexity, will generally lead to a more interpretable model with a lower variance of the estimates and a reduced chance of overfitting, at the cost of a small bias in the estimated coefficients. A backward selection approach is preferred, since it increases the chance of keeping correlated variables in the model and it is forced to test the effects of all variables simultaneously. While the stepwise regression method has the advantage of reducing the model size and thus decreasing its complexity, it, like other model selection methods, also has some drawbacks. Since we determine the model based on the patient characteristics in the sample, the selection of predictors for the model can be unstable; a slight change in the sample population could lead the stepwise regression method to select another set of predictors. This instability will be larger with a smaller patient sample or a greater number of candidate variables. Another problem is that selection methods introduce bias into the coefficient estimates. Since the coefficients of "removed" predictors are effectively set to 0, the coefficients of the remaining predictors are biased away from 0 and thus overestimated. A third problem is that we both formulate and test a model using the same dataset, which causes (model) selection bias [12].
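For an ordinary (non-mixed) logistic model, AIC-based stepwise selection is available in base R via step(); the thesis itself used a modified stepwise routine for mixed models [22]. A sketch on hypothetical data:

```r
set.seed(1)
n  <- 1000
df <- data.frame(age   = rnorm(n, 67, 10),
                 copd  = rbinom(n, 1, 0.15),
                 noise = rnorm(n))  # an irrelevant candidate variable
df$died <- rbinom(n, 1, plogis(-8 + 0.07 * df$age + 0.7 * df$copd))

full    <- glm(died ~ age + copd + noise, family = binomial, data = df)
reduced <- step(full, direction = "backward", trace = FALSE)
c(AIC(full), AIC(reduced))  # AIC = -2 log-likelihood + 2k; lower is better
formula(reduced)            # 'noise' is typically dropped
```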

2.3.3 Performance measurement of prognostic models

After the development of a prediction model, it is important to assess its quality; the model has to predict well on both current and future data. Many tests exist, but these can generally be categorized as tests for discrimination, precision and accuracy.

Discrimination

Discrimination of a prediction model, i.e. its ability to distinguish between patients with a positive or a negative outcome, can be measured by calculating the area under the Receiver Operating Characteristic (ROC) curve [12]. This area corresponds to the probability that the model assigns a randomly selected patient with the event a higher predicted probability than a randomly selected patient without the event. The ROC curve is a plot of the true positive rate (sensitivity) versus the false positive rate (1 − specificity) of the model, using consecutive cutoff points for the probabilities. For example: we start with a cutoff point at p = 0% and treat patients with a predicted probability of 0% or higher as positive. The sensitivity will be 100%, since all positive patients are marked as positive. The specificity will be 0%, since no patients are categorized as negative. If we move to the next cutoff point, we get a lower sensitivity and a higher specificity. A non-informative model, like a coin flip, would get a 45° line and an area under the curve (AUC) of 0.50, while a perfectly discriminative model would get a line with a 90° angle and an area under the curve of 1. Normally, the following classification is used: >0.90 excellent, >0.80 good, >0.70 fair, >0.60 poor, >0.50 fail.
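Computing the AUC in R; the pROC package is one common choice (an assumption — the thesis does not name its ROC routine). This reuses fit and df from the earlier logistic-regression sketch:

```r
library(pROC)

p_hat   <- predict(fit, type = "response")   # predicted probabilities
roc_obj <- roc(df$died, p_hat, quiet = TRUE)
auc(roc_obj)   # area under the ROC curve
plot(roc_obj)  # sensitivity vs. 1 - specificity over all cutoffs
```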

Precision (Calibration)

Precision is the degree of correspondence between the estimated and the true probabilities. Two tests for precision are the Hosmer-Lemeshow (HL) H- and C-statistics, which assess the model's calibration by comparing the predicted number of events to the observed number of events in groups of patients formed from the model's predictions. The predicted number of events can simply be calculated by summing the predicted probabilities. The H-statistic divides the patients into groups over intervals of estimated probability of equal width, resulting in unequal group sizes.

FIGURE 2.1 – Three ROC curves with their AUC values: a non-discriminative model (AUC = 0.5), a fair model (AUC = 0.75) and a perfect model (AUC = 1).


The C-statistic divides the patients into groups of equal size based on percentiles. A common number of groups is 10. The statistic is calculated as:

$$HL = \sum_{g=1}^{n} \frac{(O_g - N_g p_g)^2}{N_g p_g (1 - p_g)}$$

where $O_g$ is the observed number of events in group $g$, $N_g$ the number of patients, $p_g$ the mean predicted probability of an event in that group, and $n$ the number of groups. The p-values of both statistics can be calculated using the chi-squared distribution with $n - 2$ degrees of freedom for the training set. The larger the difference between the observed and expected number of events, the lower the p-value. If $p < 0.05$, the observed event rates do not match the predicted event rates. Another, visual, method to test calibration is the calibration plot, in which the smoothed observed mortality $y$ is plotted against the predicted mortality $p$; if they are equal, the model is perfectly calibrated. For binary outcomes like mortality, we can use a smoothing method like LOESS (LOcal regrESSion), in which a low-degree polynomial is fitted around each data point and its surrounding data points, giving data points closer to the measured data point a higher weight than data points further away, using weighted least squares.
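A minimal hand-rolled sketch of the HL C-statistic with decile groups, directly following the formula above (packaged implementations exist; p_hat and df come from the earlier sketches):

```r
hl_c_test <- function(obs, pred, groups = 10) {
  # Equal-sized groups from percentiles of the predicted probabilities.
  g <- cut(pred,
           breaks = quantile(pred, probs = seq(0, 1, length.out = groups + 1)),
           include.lowest = TRUE, labels = FALSE)
  O <- tapply(obs,  g, sum)    # observed events per group
  N <- tapply(obs,  g, length) # patients per group
  p <- tapply(pred, g, mean)   # mean predicted probability per group
  hl <- sum((O - N * p)^2 / (N * p * (1 - p)))
  # p-value from the chi-squared distribution with (groups - 2) df.
  c(statistic = hl, p.value = pchisq(hl, df = groups - 2, lower.tail = FALSE))
}
hl_c_test(df$died, p_hat)
```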

Other performance measurements

The Brier score [12] is a quadratic scoring rule and measures aspects of both the discrimination and the calibration of a model. It averages the squared differences between the predicted and the actual outcomes:

$$\text{Brier} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2$$

where $p_i$ is the predicted outcome, $o_i$ the actual outcome and $N$ the number of outcomes. The closer the Brier score is to 0, the better the model. The disadvantage of the Brier score is that its maximum, and thus its interpretation, depends on the incidence of the outcome. For example: if the mean mortality in a population is 50% and we predict that everyone has a 50% chance of dying, the Brier score will be 0.25, while the same non-informative model at an incidence and predicted probability of 25% scores 0.19. A simple solution is to scale the Brier score by its maximum value: the score of the non-informative model. This is called the Brier Skill score:

$$\text{Brier}_{\text{skill}} = 1 - \frac{\text{Brier}_{\text{model}}}{\text{Brier}_{\text{reference}}}$$

where $\text{Brier}_{\text{reference}}$ is the Brier score of the non-informative model with $p = \text{mean}(o)$. The Brier Skill score thus indicates the extent to which the model is better than a non-informative model.
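The Brier Skill Score is a one-liner once the predictions are available; a sketch reusing df and p_hat from the sketches above:

```r
brier_skill <- function(obs, pred) {
  brier_model <- mean((pred - obs)^2)       # Brier score of the model
  brier_ref   <- mean((mean(obs) - obs)^2)  # non-informative reference, p = mean(o)
  1 - brier_model / brier_ref               # > 0 means better than reference
}
brier_skill(df$died, p_hat)
```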

2.3.4 Validation

When a model is created for outcome prediction, its validity for new, unseen patients should be tested. Many validation approaches exist, which can be divided into three types: internal validation, temporal validation and external validation. Internal validation uses (a part of) the same dataset on which the model was built, using techniques such as cross-validation or bootstrapping. Temporal validation uses a new dataset from the same hospitals and/or specialists, or a part of the dataset that was not used to build the model, split by time. Finally, external validation uses new data from different patients and hospitals and/or specialists [12].

In some cases external validation is the preferred way of validation. However, since a new dataset for validation is not always available at the time of development, an internal validation method like bootstrapping can be appropriate.

Bootstrapping is a method to estimate the sampling distribution of any statistic by resampling with replacement from the original sample. In other words: it mimics the process of taking samples from a population. In bootstrap validation, a prediction model is developed on various bootstrap samples. The performance of this model on the bootstrap samples is then compared to the performance of the same model on the original dataset. The difference in performance measures is called the optimism. We can then subtract this optimism from the performance measures of the original model on the original dataset to acquire the optimism-corrected performance measures. This corrected measure provides an unbiased estimate of the model's performance on an unseen dataset originating from the same population. The following procedure is used to obtain the optimism-corrected performance measures:

1. Construct a model (M) using the original data (D) and determine the apparent performance (AP) of the model on the data.

2. Draw a sample (S) with replacement of the original data, using the same size.

3. Construct a new model (MB) using sample S and the same model specification steps as used in step 1.

4. Determine the performance of model MB on the sample S (PMBS).

5. Determine the performance of the same model MB on the original data D (PMBD).

6. Calculate the optimism (O) as the difference between the performance measures of model MB on S and D: O = PMBS − PMBD.

7. Repeat steps 2–6 N times. To obtain a stable estimate of the optimism, at least a hundred repeats are needed.

8. Subtract the mean optimism from the apparent performance to calculate the optimism-corrected performance estimate: $\text{Corrected performance} = AP - \frac{1}{N} \sum O$.
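A compact sketch of steps 1–8 using the AUC as the performance measure (a plain glm for brevity; the thesis repeated its full stepwise mixed-model procedure inside the loop). It reuses df and fit from the earlier sketches:

```r
library(pROC)
set.seed(2)

boot_auc <- function(model, newdata) {
  as.numeric(auc(roc(newdata$died,
                     predict(model, newdata = newdata, type = "response"),
                     quiet = TRUE)))
}

apparent <- boot_auc(fit, df)              # step 1: apparent performance AP
optimism <- replicate(100, {               # step 7: repeat N = 100 times
  S   <- df[sample(nrow(df), replace = TRUE), ]               # step 2
  m_b <- glm(died ~ age + copd, family = binomial, data = S)  # step 3
  boot_auc(m_b, S) - boot_auc(m_b, df)     # steps 4-6: O = PMBS - PMBD
})
apparent - mean(optimism)                  # step 8: optimism-corrected AUC
```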

The bootstrap method has various advantages over other internal validation techniques. First, the method uses the same number of patients as the development set, as opposed to cross validation. Furthermore, the bootstrap method is able to adjust for all sorts of model uncertainty, including variable selection [13], and performs better in automated variable selection methods than cross validation [12].

A disadvantage of the bootstrap method and other resampling methods like cross validation, is that the model’s selection method needs to be fully automated; one needs to use either a full model without selection, or an automated approach like stepwise regression. Manual modifications, such as manually regrouping categories, cannot be used.

2.4 Benchmarking

One of the uses of prognostic models, and the intended use of the prognostic model created in this thesis, is benchmarking. It allows outcome comparison between healthcare providers by correcting for case mix. However, choosing the right technique to benchmark might be just as important as choosing the right methodology to create and validate a prognostic model. Two popular benchmarking methods are funnel plots and CUSUM charts.

Funnel plots were initially developed to detect publication bias. In a funnel plot, an indicator like (case-mix adjusted) mortality can be plotted for each hospital against the number of patients in each center over a monitoring period. Confidence limits can be plotted to detect possible outliers. While funnel plots overcome problems associated with ranking, their main disadvantage is that they are not useful for detecting quick or small changes in a process.

To detect these smaller changes, CUSUM (CUmulative SUM) charts can be used for the continuous monitoring of binary events, like mortality or morbidity. There are different types of CUSUM charts, but all of them place single cases, like patients, on the x-axis, sorted by ascending date. One can use a non-risk-adjusted chart, where we take a fixed value for reward and penalty, and shift the line in the graph up or down according to the outcome of the next patient. Using a fixed value for reward and penalty can cause problems, especially in healthcare, where the chance of death or morbidity can differ substantially between patients, even within specific populations. We can use the same method with a different reward and penalty value per patient, based on e.g. a prognostic model. This way, the line will drop less on a negative outcome for a high-risk patient, while giving a higher reward when the same patient has a positive outcome. When monitoring mortality, this chart is often called a Variable Life Adjusted Display, or VLAD, showing lives saved on the y-axis.
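A risk-adjusted VLAD is simply the running sum of (predicted risk − observed outcome) over patients in date order; a sketch with a hypothetical operation-date column, reusing df and p_hat from the earlier sketches:

```r
# Hypothetical date column; each patient gets a random operation date.
df$op_date <- sample(seq(as.Date("2009-01-01"), as.Date("2011-12-31"),
                         by = "day"),
                     nrow(df), replace = TRUE)
ord  <- order(df$op_date)
vlad <- cumsum(p_hat[ord] - df$died[ord])  # rises on survival, drops on death,
                                           # weighted by each patient's risk
plot(vlad, type = "l",
     xlab = "Consecutive patients", ylab = "Cumulative lives saved")
```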

2.4.1 Data and data quality

Before assessing the quality of care, benchmarking, or making a prognostic model, one needs data. Dutch administrative medical data mainly consist of Diagnosis Treatment Combinations (DTCs, or DBCs in Dutch) and performed healthcare activities (Zorgactiviteiten). DTCs are a variation on Diagnosis-Related Groups (DRGs), with notable differences as described by Hasaart [14]. First, the DTC was introduced as a tool for managed market competition, whereas the DRG was mostly used for cost control in healthcare. Second, the structure of a DTC differs, as it combines both diagnosis and treatment as well as inpatient and outpatient care, whereas DRGs focus only on inpatient care. Third, as DTCs are more complex and detailed than DRGs, Hasaart found the risk of upcoding to be higher in the Dutch DTC system [14].

Data Quality in the Dutch DTC System

Lack of data quality, or of trust therein, is a problem when presenting benchmarking data [15]. To assess the data quality of the Dutch DTC system, Stijn performed a qualitative (process-based) study of the data quality in DIS (DBC Information System), which is the name both of the organization that governs the Dutch reimbursement data and of the database for the national reimbursement data itself [16]. Stijn found problems in population completeness, as only mandatory data relevant to declarations is stored. The history of changes in the declarable product, for example a change of treatment, was also not available, since the DIS data only contained DBC products after they were declared. Furthermore, some detail of the performed activities was lost when aggregating them to national activity codes [16].

2.4.2 Gaming

When data is used for benchmarking, a possibility exists for "gaming" of the reporting system. One can upcode preoperative comorbidities, change the operative class, or transfer critically ill patients to other (extended) care facilities.


Upcoding

Upcoding of preoperative comorbidities is a well-known form of gaming [17]. One can either add more preoperative comorbidities or change the severity of a specified comorbidity. Both adding comorbidities and increasing their severity result in a higher expected mortality, and thus in a lower case-mix adjusted mortality. Although excessive coding of comorbidities would be less likely in administrative databases, changing the severity of comorbidities is more rewarding, since there can also be a financial benefit.

Change of operative class

Another type of gaming is changing the operative class. It is possible to change from one type of operation to another through minor changes, either in the operating room or on paper, to move the patient to a different category. As with upcoding of comorbidities, changing the operative class could result in financial gain and in a different adjusted mortality.

Transferring critically ill patients

A third method of gaming is to transfer critically ill patients to external facilities postoperatively. As a result, the outcome status of the patient cannot always be measured correctly; when measuring in-hospital mortality, a patient could have a good outcome after a short admission, which could change shortly after the transfer. A solution for this type of gaming is to monitor the patient's outcome at a specified time interval, e.g. 30 days after surgery, regardless of location.


3. Methods

3.1 Data

To create the prediction model, administrative data from 2009 to 2011 from three Dutch hospitals, available at Performation Healthcare Intelligence B.V. (Performation), were used. These administrative data were normally used for cost price calculation and benchmarking, and included basic patient information such as gender and age, care packages (Diagnosis Treatment Combinations, or DBCs), healthcare activities (CTG/Zorgactiviteiten), an OR registration and an admission registration. Three other hospitals performing open heart surgery were excluded due to the unavailability of patient outcomes. We used the healthcare activities as the basis for patient inclusion. Patients with healthcare activity codes for CABG or valve operations were included in our dataset (see Table 3.1). A historic profile with comorbidities of the last few years was built using healthcare activities for performed procedures (e.g. implantation of a pacemaker) and DBCs for the patients' diagnoses (e.g. myocardial infarction or loss of kidney function) with a start date before the operation. We excluded patients who were below the age of 18 at the time of operation and patients with a former heart or lung transplantation.

Activity code   Definition
33080           Open commissurotomy with repair or replacement of a single valve
33081           Open commissurotomy with repair or replacement of two valves
33082           Open commissurotomy with repair or replacement of three valves
33083           Valve replacement in the same operation as another healthcare activity with extracorporeal circulation
33087           Combined valve and CABG operation
33100           Single coronary artery bypass graft
33101           Coronary artery bypass graft in the same operation as another healthcare activity with extracorporeal circulation
33102           Multiple coronary artery bypass grafts (two or three) or a single coronary artery bypass graft with endarterectomy
33103           Multiple coronary artery bypass grafts (four or more), or multiple coronary artery bypass grafts (two or three) together with a single coronary artery bypass graft with endarterectomy

TABLE 3.1 – Healthcare activities leading to inclusion of a patient in the dataset.

For the analysis, only the first conventional CABG or valve operation during an admission was used; reoperations after a failed percutaneous operation were also included, and the previous percutaneous valve or bypass operation was added as a possible binary predictor. Not all hospitals and patients had a complete admission registry. Also, a new admission could be registered when the patient moved from one location within the hospital to another. Therefore, we derived hospital admission periods from the registered healthcare activities and compared them to the registered admissions. The start and end of the hospital admission were determined by the first and last registered admission activities in the row of admission activities leading up to or following the operation date. The derived admission periods were validated using the available admissions in the admission registry. The outcome of the operation (alive or dead) was determined by the worst outcome of either the registered DBC or the registered admission(s) belonging to the care activity/activities of the operation, falling within the duration of the hospital admission with a maximum of 30 days starting from the day of operation.
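A sketch of the inclusion logic under an assumed table layout and column names (one row per registered healthcare activity); the actual Performation schema is not described in this thesis:

```r
# Dummy rows standing in for the (assumed) activities table.
activities <- data.frame(
  patient_id   = c(1, 1, 2, 3),
  admission_id = c(10, 10, 20, 30),
  code         = c("33100", "33102", "33087", "33080"),
  op_date      = as.Date(c("2009-03-01", "2009-03-05",
                           "2010-06-10", "2011-01-20")),
  age_at_operation = c(70, 70, 55, 17),
  prior_heart_or_lung_transplant = c(FALSE, FALSE, FALSE, FALSE)
)

# Activity codes from Table 3.1 that lead to inclusion.
inclusion_codes <- c("33080", "33081", "33082", "33083", "33087",
                     "33100", "33101", "33102", "33103")
ops <- subset(activities,
              code %in% inclusion_codes &
                age_at_operation >= 18 &
                !prior_heart_or_lung_transplant)
# Keep only the first conventional CABG/valve operation per admission.
ops       <- ops[order(ops$patient_id, ops$admission_id, ops$op_date), ]
first_ops <- ops[!duplicated(ops[, c("patient_id", "admission_id")]), ]
```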


3.1.1 Candidate predictive variables

For the selection of candidate variables, we considered the Society of Thoracic Surgeons' data specifications used for their data collection and risk models [18]. All variables contained in the STS data specification were checked for usability using the following rules:

1. The data used to construct a candidate variable had to be present in Dutch administrative data, or the administrative data had to match the candidate variable directly. For example, the severity of chronic lung disease cannot be measured, so instead the diagnosis "COPD" was used as a binary variable.

2. The timing of the candidate variable in relation to the operation should be apparent in the administrative data. A variable like “Resuscitation within one hour before surgery” could not be used, since performed care activities were only recorded by date and an accurate sequence of events could not be determined.

In addition, a candidate variable for history of a tumor was added. Tumor-related DBCs were grouped by location/specialty to form additional candidate variables. Age was added as a restricted cubic spline in order to accommodate a non-linear relationship between age and the log-odds of the outcome. Since we only had information about events in a patient's history that did happen, and not about events that did not, "missing" and "absent" were regarded as the same and combined into a single category. For example, a patient without a DBC for diabetes in the same hospital was given the value "unknown".
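Age as a restricted cubic spline can be expressed with splines::ns() (natural cubic splines are restricted cubic splines); df = 6 yields seven knots in total (five internal plus two boundary knots), matching the seven knots reported in the results. A sketch reusing the hypothetical df from the earlier examples:

```r
library(splines)

# ns(age, df = 6): six basis columns, i.e. five internal knots plus the two
# boundary knots, allowing a non-linear age effect on the log-odds.
fit_spline <- glm(died ~ ns(age, df = 6) + copd, family = binomial, data = df)
summary(fit_spline)
```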

3.2 Model building and evaluation

From the candidate variables a three-level multilevel prognostic model was developed using the lme4 R package [19, 20]. For the model, we used random intercepts for hospital and surgeon. The final model was selected using forward stepwise regression based on AIC [21], using a modified version of a function by Rense Nieuwenhuis [22].

The AUC was used as a measure of discrimination. Because of the low mortality rate in CABG and valve surgery, and to allow comparability with models fit on different base rates, the Brier Skill Score was used as a measure of overall performance instead of the Brier score [12]. Next, we used the Hosmer-Lemeshow statistics and made a calibration plot to assess the model's precision (also called calibration). Since we used the same dataset for model building and validation, we used bootstrapping to correct for optimism and to acquire confidence intervals for the AUC and the Brier Skill Score. We repeated the same selection method used for the final model in each of a hundred bootstrap samples to adjust for model selection bias.

Finally, we performed a small simulation to see how the model would react to different patient samples by taking 500 samples per hospital, with the same sample size as the hospital and with replacement, using $P(y = 1)^i$ as the weight for each patient, where $P(y = 1)$ was the predicted probability of death. As $i$ increases, the chance of a patient with a high predicted probability of death being selected also increases, and the chance of a patient with a low predicted probability of death being selected decreases. We then calculated the observed/expected mortality ratio, also called the Standardized Mortality Ratio (SMR), and the observed/mean mortality ratio for each hospital. We stopped when one of the hospitals had an average of less than 20% unique patients in the bootstrap samples, to prevent the simulation from estimating the SMR using only a small number of patients.
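A sketch of this weighted resampling for a single hospital and a single exponent i (hypothetical inputs; obs is the outcome vector and pred the model's predicted probabilities, as in the earlier sketches):

```r
simulate_smr <- function(obs, pred, i, B = 500) {
  replicate(B, {
    idx <- sample(length(obs), replace = TRUE, prob = pred^i)  # weight P(y=1)^i
    sum(obs[idx]) / sum(pred[idx])  # observed / expected mortality = SMR
  })
}
smr <- simulate_smr(df$died, p_hat, i = 2)  # i > 0 favors sicker patients
quantile(smr, c(0.025, 0.5, 0.975))         # median SMR and 95% interval
```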


4. Results

4.1 Data

Three hospitals were selected to create a model, containing 9729 patients who had a CABG or valve operation during a period of three years (2009-2011). Seventy-three patients had an unknown outcome and were excluded, leaving 9656 patients to create the model. Observed mortality across hospitals ranged from 2.6% to 3.5%. The median duration of admission based on care activities was 8 days (IQR 5). Of the 8701 patients with a registered admission in a separate registry, 13 (0.15%) patients had a different duration compared to their admissions based on healthcare activities. The average difference in admission duration of these 13 patients was 4.5 days (maximum 13 days).

4.1.1 Selection of candidate variables

All variables from the Society of Thoracic Surgeons' dataset were screened for usability in our prediction model. Thirty-nine variables were found usable for our model. Age was added as a spline function with seven knots. Next to a variable for known history of a tumor, eleven variables for tumor location were added based on location or performing specialty: brain and nerve, breast, gastrointestinal, gynecologic, head and neck, hematologic, skeleton and soft tissue, skin, thorax, urologic (including testis and prostate), and other or unspecified. The definition of each variable can be found in Appendix A – Definition of candidate variables, while the full list of candidate variables with their associated mortality can be found in Appendix B – List of candidate variables and associated mortality. A small list of candidate variables is shown in Table 4.2.

Variable                     Included           Excluded
Total                        9656               73
Male                         6903    71.5%      55    75.3%
Age (mean / SD)              67.4 (10.2)        63.4 (11.4)
Urgent                       591     6.1%       8     11.0%
Elective                     7100    73.5%      64    87.7%
CABG operation               7643    79.2%      59    80.8%
Valve repair/replacement     3500    36.2%      26    35.6%

TABLE 4.1 – Gender, age, urgency of the operation and operation type of included and excluded patients.

Variable                                    N      %      Mortality, %
Total                                       9656   100.0  3.81
Sex
  Male                                      6903   71.5   3.07
  Female                                    2753   28.5   5.67
CABG operation
  Yes (including combination valve/other)   7643   79.2   3.57
  No                                        2013   20.8   4.72
Valve repair/replacement
  Yes (including combination CABG/other)    3500   36.2   6.20
  No                                        6156   63.8   2.45
Concomitant (cardiac) operation
  Yes                                       766    7.9    6.27
  No                                        8890   92.1   3.60
Number of bypasses
  0                                         2013   20.8   4.72
  1                                         172    1.8    2.91
  2-3                                       3792   39.3   2.06
  4                                         2145   22.2   2.66
  Unknown                                   1534   15.9   8.67
Number of valves
  0                                         6156   63.8   2.45
  1                                         1774   18.4   3.72
  2                                         231    2.4    11.69
  3                                         36     0.4    16.67
  Unknown                                   1459   15.1   8.09
Urgency
  Urgent                                    591    6.1    11.00
  No urgency or unknown                     9065   93.9   3.34

TABLE 4.2 – Seven candidate variables. The full list can be found in Appendix B – List of candidate variables and associated mortality.


4.2 Prediction model

Using the stepwise variable selection method, a prediction model with 30 fixed effects was selected. With these 30 fixed effects, the estimated random effects were 0 for all hospitals and surgeons.

The selected model had good discriminative ability, with an AUC of 0.82. The bootstrap-corrected AUC using 100 samples was 0.79 (95% CI 0.76 to 0.82). A boxplot of the predicted probabilities in both outcome groups is shown in Figure 4.2.

Fixed effect                      OR      CI 2.5%   CI 97.5%
Gender
  Male                            1.00
  Female                          1.43    1.11      1.85
Age                               -
Urgency
  Elective or unknown             1.00
  Urgent                          2.77    1.88      4.06
Number of valves
  Unknown                         1.00
  0                               1.39    0.71      2.69
  1                               1.07    0.36      3.15
  2                               3.93    1.23      12.60
  3                               4.06    0.93      17.77
Kidney function
  Absent or unknown               1.00
  Reduced kidney function         13.04   4.49      37.94
  On dialysis                     11.37   6.16      20.97
Concomitant aortic surgery
  No                              1.00
  Yes                             6.24    3.78      10.28
Concomitant balloon pump
  No                              1.00
  Yes                             11.61   5.71      23.60
Previous CABG surgery
  No                              1.00
  Yes                             7.13    2.78      18.31
Number of arteries
  Unknown                         1.00
  0                               0.36    0.12      1.06
  1                               0.33    0.11      1.04
  2 or 3                          0.19    0.10      0.38
  4                               0.27    0.13      0.53
CVD
  Absent or unknown               1.00
  Present                         2.49    1.38      4.49
Myocardial infarction (timing)
  Absent or unknown               1.00
  <1 day                          4.99    2.51      9.90
  2-7 days                        1.52    0.72      3.16
  8-21 days                       1.72    0.81      3.66
  >21 days                        1.69    0.85      3.35
COPD
  Absent or unknown               1.00
  Present                         2.12    1.27      3.55
Heart failure
  Absent or unknown               1.00
  Present                         4.23    2.02      8.83
Angina pectoris
  Absent or unknown               1.00
  Angina, stable                  1.66    0.80      3.42
  Angina, unstable                2.26    1.24      4.15
Diabetes with insulin pump
  Absent or unknown               1.00
  Present                         24.55   1.99      303.54

TABLE 4.3 – Estimated odds ratios for mortality in CABG and valve surgery using the mixed-effects logistic regression model.

FIGURE 4.1 – The ROC curve for our model (purple line) with AUC = 0.82, together with ROC curves when splitting the dataset into ten random groups, with AUC_min = 0.74 and AUC_max = 0.83.

FIGURE 4.2 – A boxplot showing the distribution of predicted probabilities by outcome.


The p-values of the Hosmer-Lemeshow H-test and C-test were 0.01 and 0.75, respectively, indicating poor calibration according to the H-test and good calibration according to the C-test. Figure 4.3 displays the calibration plots, showing observation groups based on fixed points (A) and percentiles (B). Note the low number of patients in the higher-probability groups in calibration plot (A).

Finally, the Brier Skill Score was 0.10 and the bootstrap-adjusted Brier Skill Score 0.07 (95% CI 0.04 to 0.09).


FIGURE 4.3 – The calibration plots for our model, together with grouped observations based on fixed points (A) and percentiles (B). The mortality for each group in A is displayed as a ratio.


4.3 Simulation

The simulation plots for each individual hospital (A, B, C) and for the three hospitals combined (D) can be found in Figure 4.4. The x-axis shows the weight $P(y = 1)^x$ used for patient inclusion in each bootstrap sample, based on the predicted probability of death. The mean mortality of the bootstrap samples is displayed at the top of the four plots for each value of $x$. At $P(y = 1)^0$, all patients have equal weight and the samples thus represent a normal bootstrap sample from the dataset. The solid lines represent the observed/expected mortality for each of the three hospitals, and the dashed lines the corresponding 95% confidence intervals based on 500 bootstrap samples. All three hospitals show an SMR near 1 at $P(y = 1)^0$. The SMR of hospitals A and C drops as patients with a lower predicted probability of death receive a higher weight and therefore a higher chance of being selected into the sample.

The points in the simulation plots indicate the hospital's mortality in the bootstrap samples divided by its true mean mortality. The points in each plot therefore start near 0 at $P(y = 1)^{-3}$, where patients with a low predicted probability of death are more likely to be selected and the mortality of the bootstrap samples is lower than the true mortality. As patients with a higher predicted probability of death are favored beyond $P(y = 1)^0$, the ratio of bootstrap mortality to true mortality rises as well.


FIGURE 4.4 – Simulation plots for each hospital individually (A, B, C) and for the three hospitals combined (D). All plots show the SMR for 500 bootstrap samples, using weight $P(y = 1)^x$, where $P(y = 1)$ is the predicted probability of death according to our model (solid lines). Dotted lines represent the 95% CI according to the bootstrap samples. In plots A, B and C, the unadjusted mortality is visible as individual points, as the ratio $\text{mortality}_{\text{bootstrap}} / \text{mortality}_{\text{original}}$. Finally, the unadjusted mean mortality of all hospitals according to the bootstrap samples is shown on the top x-axis.


5. Discussion

5.1 Main findings

In this thesis we described the development and validation of a multilevel prognostic model for the outcome of open-heart surgery based on administrative data. The optimism-corrected AUC of our model was 0.79 and the Brier Skill score 0.07. If we apply the widely used guidelines for interpreting AUC scores (0.5 ≤ AUC < 0.6: fail; 0.6 ≤ AUC < 0.7: poor; 0.7 ≤ AUC < 0.8: fair; 0.8 ≤ AUC < 0.9: good; AUC ≥ 0.9: excellent), our model has a fair (bordering on good) discriminative performance. The Brier Skill score of 0.07 was very low and indicates that the predictive performance of the model is only slightly better than that of a non-informative model. The Hosmer-Lemeshow C-statistic did not show evidence of miscalibration. However, the H-statistic indicated poor calibration, probably due to the small groups of patients with a high probability of mortality.

Given this low accuracy and the over-prediction for high-risk patients, one should be reluctant to use this prediction model to adjust for case mix when benchmarking hospitals or care providers. One way to improve model performance could be to link patient data from different hospitals together to form a more accurate history of each patient. Another solution would be to include some basic clinical variables in the model, since all variables except age from the administrative database were dichotomous or categorical, and hence a lot of predictive value is lost. A third option is to include more patients to improve the accuracy.

In a simulation we have shown that the model adjusts fairly well when severity of illness (and hence mortality) increases. This indicates that the model would correct well for a case mix of patients with a high probability of death after the operation. The SMR for the smallest hospital (C) dropped for bootstrap samples that favored patients with a high predicted mortality. However, this is probably due to the small number of patients with a predicted mortality > 20% (n = 19).

When simulating a decrease in severity of illness, the model showed less stability and large uncertainty in the SMR. The large confidence intervals, the drop in SMR for hospitals A and C, and the rise in SMR for hospital B might be due to effects missing from the prognostic model. This would result in unidentified patients with a higher probability of death and overestimated risk for patients with a lower probability of death, explaining the drop in the SMR for hospitals A and C and the rise in the SMR of hospital B.

5.2 Strengths & weaknesses

Our model performed similarly to current models for cardiac surgery based on clinical data, which achieved high discrimination but poor calibration [23, 24, 25, 26]. Our model used a hierarchical regression method to adjust for within-group correlation, although this appeared to have little to no effect. We also applied bootstrapping to correct for optimism and carried out a simulation to understand the effect of severity of illness on the SMR.

In this study, we did not assess the quality of the administrative data. As the administrative data used in this study were comparable in structure to the national DIS data, we assumed similar data quality between the systems. Although avoidance of high-risk patients is not expected for an administrative dataset, it is at least theoretically possible that some patients received a more severe operative class (e.g. from one bypass to two) than required. This could increase revenue and thus bias the dataset used for model development. Finally, the 30-day mortality of discharged or transferred patients was not monitored.


Data from only three hospitals were available to create a model. Although this resulted in a fair number of patients suitable for inclusion, we believe the model's accuracy could have been higher if more hospitals had been available. The limited number of hospitals could also be an explanation for the absence of random effects.

To create a model, we chose mixed-effects logistic regression, while other, more sophisticated methods like Bayesian analysis, support vector machines, or neural networks could have been used. However, current model performance in the CABG domain is similar across these methods [27, 28]. In addition, the potential of such sophisticated models is limited when only categorical data are available.

5.3 Implications of the study

The model created in this study could be used for benchmarking purposes, either by hospitals to benchmark themselves against others, or by Performation to benchmark multiple hospitals at once using case-mix correction, giving an indication of a hospital's performance on CABG and valve surgery. However, caution is advised due to the low accuracy. We believe the model created is not accurate enough to be used for purposes other than benchmarking at the hospital level.

To get an understanding the model’s performance in changing patient severity, we used a bootstrap approach to simulate the case mix adjusted outcome of the three hospitals in this study. The model proved to be unstable for patients with a low estimated probability of death. The use of a similar method could be useful tool while making other prediction models, although validity of this method needs to be further explored.

5.4 Future research

Our model performed relatively well on discrimination but had poor accuracy, like similar models in cardiac surgery; however, no comparison could be made on the same dataset due to the lack of available clinical data. Future research could address the true added value, or lack thereof, of prediction models in cardiac surgery based on administrative data.

The simulation used to get an understanding of changing patient severity is not a conventional method to measure a model's performance. However, further research on the validity and usefulness of this kind of simulation for prognostic model testing could be interesting, as it showed not only problems in accuracy, but also where these problems are located.


References

[1] E. Hannan, H. Kilburn Jr, M. Lindsey and R. Lewis, "Clinical versus administrative data bases for CABG surgery. Does it matter?," Medical Care, vol. 30, no. 10, pp. 892-907, 1992.

[2] P. Aylin, A. Bottle and A. Majeed, "Use of administrative data or clinical databases as predictors of risk of death in hospital: comparison of models," British Medical Journal, vol. 334, no. 7602, p. 1044, 2007.

[3] S. Brinkman, A. Abu-Hanna, A. van der Veen, E. de Jonge and N. de Keizer, "A comparison of the performance of a model based on administrative data and a model based on clinical data: effect of severity of illness on standardized mortality ratios of intensive care units," Critical Care

Medicine, vol. 40, no. 2, pp. 373-378, 2012.

[4] H. Krumholz, Y. Wang, J. Mattera, Y. Wang, L. Han, M. Ingber, S. Roman and S. Normand, "An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with an acute myocardial infarction," Circulation, vol. 113, no. 13, pp. 1683-1692, 2006.

[5] D. Shahian, S. Normand, D. Torchiana, S. Lewis, J. Pastore, R. Kuntz and P. Dreyer, "Cardiac surgery report cards: comprehensive review and statistical critique," Ann Thorac Surg, vol. 72, no. 6, pp. 2155-2168, 2001.

[6] I. Vaartjes, I. van Dis, Visseren, F. L. J and M. L. Bots, "Hart- en vaatziekten in Nederland," Mouthaan Grafisch Bedrijf, Papendrecht, 2011.

[7] V. Fuster and B. B. Kelly, Promoting Cardiovascular Health in the Developing World: A Critical Challenge to Achieve Global Health, Washington, D.C.: National Academies Press, 2010.

[8] L. H. Cohn, "Fifty Years of Open-Heart Surgery," Circulation, vol. 107, no. 17, pp. 2168-2170, 2003.

[9] L. v. Herwerden, L. Noyez, H. Takkenberg, M. Versteegh and L. Wijgergangs, "De Nederlandse Dataregistratie Hartchirurgie: Resultaten van samenwerking tussen 16 Nederlandse hartchirurgische centra," Utrecht, 2012.

[10] A. Donabedian, "The Quality of Care: How Can It Be Assessed?," JAMA, vol. 260, no. 12, pp. 1743-1748, 1988.

[11] H. Goldstein and D. J. Spiegelhalter, "League Tables and Their Limitations: Statistical Issues in Comparisons of Institutional Performance," Journal of the Royal Statistical Society. Series A (Statistics in Society), vol. 159, no. 3, pp. 385-443, 1996.

[12] E. W. Steyerberg, Clinical Prediction Models - A Practical Approach to Development, Validation, and Updating, Springer, 2009.


[13] E. Steyerberg, S. Bleeker, H. Moll, D. Grobbee and K. Moons, "Internal and external validation of predictive models: A simulation study of bias and precision in small samples," Journal of Clinical Epidemiology, vol. 56, no. 5, pp. 441-447, 2003.

[14] F. Hasaart, "Incentives in the Diagnosis Treatment Combination," Maastricht, 2011.

[15] S. v. d. Veer, N. d. Keizer, A. Ravelli, S. Tenkink and K. Jager, "Improving quality of care. A systematic review on how medical registries provide information feedback to healthcare providers," International Journal of Medical Informatics, vol. 79, pp. 305-323, 2010.

[16] P. Stijn, "Data Quality of the Dutch DBC Information System," Utrecht, 2012.

[17] S. Siregar, R. Groenwold, M. Versteegh, L. Noyez, W. ter Burg, M. Bots, Y. van der Graaf and L. van Herwerden, "Gaming in risk-adjusted mortality rates: effect of misclassification of risk factors in the benchmarking of cardiac surgery risk-adjusted mortality rates," J Thorac Cardiovasc Surg, vol. 145, no. 3, pp. 781-789, 2013.

[18] The Society of Thoracic Surgeons, "Data Collection | STS," 24 August 2007. [Online]. Available: http://www.sts.org/sts-national-database/database-managers/adult-cardiac-surgery-database/data-collection. [Accessed 21 January 2013].

[19] R Core Team, R: A Language and Environment for Statistical Computing, Vienna, 2013.

[20] D. Bates, M. Maechler and B. Bolker, lme4: Linear mixed-effects models using S4 classes, 2012.

[21] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716-723, 1974.

[22] R. Nieuwenhuis, "R-Sessions 32: Forward.lmer: Basic stepwise function for mixed effects in R," [Online]. Available: http://www.rensenieuwenhuis.nl/r-sessions-32/.

[23] S. Nashef, F. Roques, L. Sharples, J. Nilsson, C. Smith, A. Goldstone and U. Lockowandt, "EuroSCORE II," European Journal of Cardio-Thoracic Surgery, vol. 41, no. 4, pp. 734-745, 2012.

[24] D. Shahian, S. O'Brien, G. Filardo, V. Ferraris, C. Haan, J. Rich, S. Normand, E. DeLong, C. Shewan, R. Dokholyan, E. Peterson, F. Edwards and R. Anderson, "The Society of Thoracic Surgeons 2008 cardiac surgery risk models: part 1--coronary artery bypass grafting surgery," The Annals of Thoracic Surgery, vol. 88, no. 1 Suppl, pp. S2-S22, 2009.

[25] S. O'Brien, D. Shahian, G. Filardo, V. Ferraris, C. Haan, J. Rich, S. Normand, E. DeLong, C. Shewan, R. Dokholyan, E. Peterson, F. Edwards and R. Anderson, "The Society of Thoracic Surgeons 2008 cardiac surgery risk models: part 2--isolated valve surgery," The Annals of Thoracic Surgery, vol. 88, no. 1 Suppl, pp. S23-S42, 2009.

[26] D. Shahian, S. O'Brien, G. Filardo, V. Ferraris, C. Haan, J. Rich, S. Normand, E. DeLong, C. Shewan, R. Dokholyan, E. Peterson, F. Edwards and R. Anderson, "The Society of Thoracic Surgeons 2008 cardiac surgery risk models: part 3--valve plus coronary artery bypass grafting surgery," The Annals of Thoracic Surgery, vol. 88, no. 1 Suppl, pp. S43-S62, 2009.


[27] R. Lippmann and D. Shahian, "Coronary artery bypass risk prediction using neural networks," The Annals of Thoracic Surgery, vol. 63, no. 6, pp. 1635-1643, 1997.

[28] R. Orr, "Use of a probabilistic neural network to estimate the risk of mortality after cardiac surgery," Medical Decision Making, vol. 17, no. 2, pp. 178-185, 1997.


List of abbreviations

AIC Akaike information criterion

AUC Area Under the receiver operating characteristic Curve

BS Brier Score

BSS Brier Skill Score

CABG Coronary Artery Bypass Graft

CUSUM CUmulated SUM

DBC Diagnosis Treatment Combination

HL Hosmer Lemeshow

IQR Interquartile range

LOESS LOcal regrESSion

PCI Percutaneous Coronary Intervention

ROC Receiver Operating Characteristic


Appendix A – Definition of candidate variables

Angina pectoris

The patient receives the highest value applicable for type of angina pectoris:

0. No angina pectoris present or known

1. Angina pectoris, unspecified

The patient has a DBC diagnosis 202 of specialism 20 which started on or before the day of operation and has not been closed before the day of operation

2. Angina, stable

The patient has a DBC diagnosis 203 or 7 of specialism 20 which started on or before the day of operation and has not been closed before the day of operation

3. Angina, unstable

The patient has a DBC diagnosis 3 or 4 of specialism 20 which started on or before the day of operation and has not been closed before the day of operation

For history of angina pectoris, the patient receives one of the following values:

0. No angina pectoris
1. Angina unspecified, stable or unstable, as specified above, which started on or before the day of operation.
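For illustration, the sketch below shows how such a definition could be derived in R from DBC records; the assumed layout (a data frame with columns specialism, diagnosis, start_date and end_date per patient) and the function name are hypothetical, not the actual thesis code.

```r
# Hedged sketch: angina pectoris level for one patient, following the
# definition above. 'dbc' holds that patient's DBC records; 'op_date' is
# the date of operation. An open-ended DBC has end_date == NA.
angina_level <- function(dbc, op_date) {
  open_at_op <- dbc$specialism == 20 &
    dbc$start_date <= op_date &
    (is.na(dbc$end_date) | dbc$end_date >= op_date)
  lvl <- 0
  if (any(open_at_op & dbc$diagnosis == 202))         lvl <- 1  # unspecified
  if (any(open_at_op & dbc$diagnosis %in% c(203, 7))) lvl <- 2  # stable
  if (any(open_at_op & dbc$diagnosis %in% c(3, 4)))   lvl <- 3  # unstable
  lvl                                   # highest applicable value wins
}
```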

Cardiac arrhythmia

The patient receives the highest value applicable for ventricular arrhythmia:

0. No ventricular arrhythmia

1. Ventricular arrhythmia

The patient has a DBC diagnosis 403 of specialism 20 which continued until at least two weeks before the operation.

The patient receives the highest value applicable for atrial arrhythmia:

0. No atrial arrhythmia

1. Atrial arrhythmia

The patient has a DBC diagnosis 404 of specialism 20 which continued until at least two weeks before the operation.

The patient receives the highest value applicable for junctional arrhythmia:

0. No junctional arrhythmia

1. Junctional arrhythmia

The patient has a DBC diagnosis 401 or 402 of specialism 20 which continued until at least two weeks before the operation.

The patient receives the highest value applicable for other arrhythmias:

0. No other arrhythmia

1. Other arrhythmias

The patient has a DBC diagnosis 409 of specialism 20 which ended on the day of operation or within the two weeks before it.

The patient receives the highest value applicable for history of arrhythmia: 0. No history of arrhythmia

1. History of arrhythmias

The patient has a DBC diagnosis 401, 402, 403, 404 or 409 of specialism 20 which ended on the day of operation or within the two weeks before it.

Myocardial infarction

The patient receives the value for myocardial infarction-timing corresponding to the most recent myocardial infarction.

0. No myocardial infarction
1. Myocardial infarction, same day

The patient has a DBC diagnosis 9, 11, 13, 204 or 205 of specialism 20 which started on the day of operation

2. Myocardial infarction, 1-7 days

The patient has a DBC diagnosis 9, 11, 13, 204 or 205 of specialism 20 which started 1 to 7 days before the operation.

3. Myocardial infarction, 8-21 days

The patient has a DBC diagnosis 9, 11, 13, 204 or 205 of specialism 20 which started 8 to 21 days before the operation

4. Myocardial infarction, >21 days

The patient has a DBC diagnosis 9, 11, 13, 204 or 205 of specialism 20 which started more than 21 days before the operation.

The patient receives the value for myocardial infarction corresponding to the most recent myocardial infarction:

0. No myocardial infarction

1. Myocardial infarction, unspecified

The patient has a DBC diagnosis 9, 11 or 13 of specialism 20 which started on or before the day of operation

2. Myocardial infarction, no ST-elevation

The patient has a DBC diagnosis 205 of specialism 20 which started on or before the day of operation

3. Myocardial infarction, ST-elevation

The patient has a DBC diagnosis 204 of specialism 20 which started on or before the day of operation
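The timing variable depends on locating the most recent infarction relative to the operation date; below is a hedged R sketch under the same assumed DBC layout as in the angina example (hypothetical function name, not the actual thesis code).

```r
# Hedged sketch: myocardial infarction timing for one patient, bucketing the
# start date of the most recent MI DBC into the categories defined above.
mi_timing <- function(dbc, op_date) {
  mi <- dbc$specialism == 20 &
    dbc$diagnosis %in% c(9, 11, 13, 204, 205) &
    dbc$start_date <= op_date
  if (!any(mi)) return(0)                                # no MI
  days <- as.numeric(op_date - max(dbc$start_date[mi]))  # most recent MI
  if (days == 0) {
    1                                                    # same day
  } else if (days <= 7) {
    2                                                    # 1-7 days
  } else if (days <= 21) {
    3                                                    # 8-21 days
  } else {
    4                                                    # >21 days
  }
}
```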

Renal function

The patient receives the highest value applicable:

0. Normal renal function or unknown

1. Reduced renal function

The patient has a DBC diagnosis 321, 323, 324 or 325 of specialism 13 which started on or before the day of operation and has not been closed before the day of operation

2. Patient is on dialysis

The patient has a DBC diagnosis 322, 326, 327, 328, 331, 332, 333, 334, 335, 336, 337 or 338 of specialism 13 which started on or before the day of operation and has not been closed before the day of operation

Endocarditis

The patient receives the highest value applicable:

0. No endocarditis

1. Treated endocarditis

The patient has a DBC diagnosis 432 of specialism 13, DBC diagnosis 3407 of specialism 16 or DBC diagnosis 57 or 702 of specialism 20 which ended before the operation.

2. Active endocarditis

The patient has a DBC diagnosis 432 of specialism 13, DBC diagnosis 3407 of specialism 16 or DBC diagnosis 57 or 702 of specialism 20 which started on or before the day of operation and has not been closed before the day of operation
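This definition, unlike the previous ones, separates DBCs that were closed before the operation (treated) from DBCs still open at the operation (active); a hedged R sketch under the same assumed layout as above (hypothetical function name):

```r
# Hedged sketch: endocarditis status for one patient. A DBC closed before
# the operation counts as treated (1); one still open counts as active (2).
endocarditis_level <- function(dbc, op_date) {
  endo <- (dbc$specialism == 13 & dbc$diagnosis == 432) |
          (dbc$specialism == 16 & dbc$diagnosis == 3407) |
          (dbc$specialism == 20 & dbc$diagnosis %in% c(57, 702))
  active  <- endo & dbc$start_date <= op_date &
             (is.na(dbc$end_date) | dbc$end_date >= op_date)
  treated <- endo & !is.na(dbc$end_date) & dbc$end_date < op_date
  if (any(active)) 2 else if (any(treated)) 1 else 0  # highest value wins
}
```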

Diabetes & insulin pump therapy

The patient receives the highest value applicable:

0. No diabetes

1. Diabetes, unspecified

The patient has a DBC diagnosis 223 of specialism 13, DBC diagnosis 7104, 7113 or 7114 of specialism 16, DBC diagnosis 902 of specialism 18 or DBC diagnosis 222 of specialism 35, which started on or before the day of operation

2. Diabetes, without complications

The patient has a DBC diagnosis 221 of specialism 13 which started on or before the day of operation and has not been closed before the day of operation

3. Diabetes, with complications

The patient has a DBC diagnosis 222 of specialism 13 which started on or before the day of operation and has not been closed before the day of operation

The patient receives the highest value applicable for insulin pump therapy:

0. No chronic insulin pump therapy

1. Diabetes with chronic insulin pump therapy

The patient has a DBC diagnosis 223 of specialism 13 or DBC diagnosis 7113 of specialism 16 which started on or before the day of operation

COPD

The patient receives the highest value applicable:

0. No COPD
