Citation/Reference: Wynants L, Collins G (2016). Key steps and common pitfalls in developing and validating risk models: a review. BJOG: An International Journal of Obstetrics and Gynaecology (in press).
Archived version: author manuscript; the content is identical to that of the submitted manuscript, before refereeing.
Journal homepage: http://www.bjog.org/view/0/index.html
Author contact: laure.wynants@esat.kuleuven.be, +32 (0)16 32 76 70
Key steps and common pitfalls in developing and validating risk models: a review

Ms. Laure WYNANTS, MSc; Leuven, Belgium; KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics; KU Leuven, iMinds Medical IT Department; Kasteelpark Arenberg 10, Box 2446, 3001 Leuven, Belgium

Dr. Gary S COLLINS, PhD; Oxford, UK; Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford; Windmill Road, Oxford OX3 7LD, UK

Dr. Ben VAN CALSTER, PhD; Leuven, Belgium; KU Leuven, Department of Development and Regeneration; Herestraat 49, Box 7003, 3000 Leuven, Belgium

Corresponding author: Ben Van Calster
Address: KU Leuven, Department of Development and Regeneration, Herestraat 49, Box 7003, 3000 Leuven, Belgium
Work phone number: +32 16 37 77 88
E-mail: ben.vancalster@med.kuleuven.be

Running title: Risk models: key steps and common pitfalls
Word count: abstract: 86; main text: 3904
Abstract

Risk models to estimate an individual's risk of having or developing a disease are abundant in the medical literature, yet many do not meet the methodological standards that have been set to maximise generalisability and utility. This paper presents an overview of ten steps from the conception of the study to the implementation of the risk model, and discusses common pitfalls along the way. We cover crucial aspects of study design, data collection, model development, and performance evaluation, and discuss how to bring the model to clinical practice.

Key words: clinical prediction model, logistic regression, model development, model reporting, model validation, risk model
Introduction

In recent years we have seen an increasing number of papers about risk models.[1] The term risk model refers to any model that predicts the risk that a condition is present (diagnostic) or will develop in the future (prognostic).[2] Risk prediction is usually performed using multivariable models that aim to provide reliable predictions in new patients.[3, 4] Risk models may aid clinicians in making treatment decisions based on patient-specific measurements. In that sense risk models may fuel personalised and evidence-based medicine, and enhance shared decision making.[5]

To maximise their potential, it is imperative that risk models are developed carefully and validated rigorously. Unfortunately, reviews have demonstrated that methodological and reporting standards are often not met.[1, 6-8] We present a roadmap with ten steps to summarise the process from the conception of a risk model to its final implementation. We address these steps one by one and highlight common pitfalls (Table 1). Technical details on the methods for the development and evaluation of risk models are presented in boxes for the interested reader.
Case studies

Three case studies that illustrate the ten steps are presented in the supplementary material (Table S1). The first concerns the development of the ADNEX model for the preoperative diagnosis of suspicious ovarian tumours.[9] The ADNEX model predicts whether a tumour is benign, borderline malignant, stage I ovarian cancer, advanced stage (II-IV) ovarian cancer, or secondary metastatic cancer. The second involves the development of a prediction model to assess the risk of operative delivery.[10] The model predicts the need for instrumental vaginal delivery or caesarean section for fetal distress or failure to progress. The third deals with the external validation of two models developed in the United States for the prediction of successful vaginal delivery after a previous caesarean section.[11] The models' transportability to the Dutch setting is investigated.
Step I: Before getting started

It is pivotal to define up front what clinical purpose the risk model should serve,[12] as well as the target population and clinical setting. A mismatch with clinical needs is not uncommon and makes the model useless (Table 1, pitfall 1). Next, ask whether a new model needs to be developed at all: risk models for the same purpose often already exist.[1] If so, check the characteristics of these models (e.g., the context and setting they were developed for, and the included predictors) as well as the extent to which they have been validated. Search strategies for finding relevant existing risk models are available.[2, 13] Systematic reviews indicate that risk models are often developed in vain because there was no clear clinical need, several other models already exist, and/or the models are not validated.[14-17]
Step II: Study design and setup

We encourage investigators to discuss the study setup beforehand and to write a protocol that covers each of the ten steps.[18] For complete transparency, consider publishing the protocol and registering the study.[19] The preferred study design is a prospective cross-sectional cohort design for diagnostic risk models and a prospective longitudinal cohort design for prognostic risk models. Both designs study a group of patients that should be representative of the population in which the model is intended to be applied. By prospective, we mean that the data are collected with the primary aim of developing a risk model. Box 1 presents alternative designs and data sources with their limitations. To ensure an unbiased sample, a consecutive sample of patients is preferred. To enhance the generalisability and transportability of the model, multicentre data collection is recommended. It is important to carefully identify the predictor variables for inclusion in the study, such as established risk factors, variables that are useful or easily obtainable in the clinical context of the model's intended use,[20] and promising variables whose predictive value has yet to be assessed. The definitions of these variables should be unambiguous and standardised, and measurement error should be avoided (see Box 1). The outcome to be predicted should be clearly defined in advance (see Box 1). Finally, bear in mind that the potential predictors should be available at the moment the risk model will be used during patient care. For example, when predicting the risk of a birth defect based on measurements at the nuchal scan, it makes no sense to include birth weight in the risk model.

However carefully the study is designed, data are rarely complete. A common mistake is to automatically exclude patients with missing data and perform the analysis only on the complete cases (Table 1, pitfall 2). Besides reducing the sample size, this may lead to a biased sample resulting in a model with poor generalisability, because there is nearly always an underlying reason for missing values.[21] The preferred approach is to replace ('impute') the missing data with sensible values. Preferably several plausible values are imputed (using 'multiple imputation') to acknowledge that imputed values are themselves estimates and by definition uncertain.[22, 23]
Box 1. Study set-up and design: technical details

Alternative designs
Risk models can be developed as a secondary analysis of existing data, for example from a registry study or a randomised clinical trial, or retrospectively from hospitals' medical health records. A common limitation of using existing data is that they were collected for a different purpose, so that potential or even established risk factors may have been collected in a suboptimal fashion or not at all (i.e., plenty of missing data). This can seriously affect the performance and credibility of the model. For data from randomised trials, generalisability is a concern due to strong inclusion/exclusion criteria or volunteer bias. Furthermore, in the presence of an effective treatment, some thought also needs to be given to how treatment should be handled in the analysis.[24] If the outcome to be predicted is rare, it may be useful to identify sufficient cases and recruit a random sample of non-cases, using a case-cohort or nested case-control design.[25] The prediction model then needs to be adjusted for the sampling frequency of non-cases to ensure reliable predictions.[26]
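As a minimal sketch of such an adjustment (our illustration, under the assumption that all cases but only a random fraction f of non-cases were sampled), the intercept of a logistic model can be corrected while the other coefficients stay untouched:

```latex
% Intercept correction for non-case sampling, assuming all cases and a random
% fraction f (0 < f < 1) of non-cases are included in the development data:
\beta_0^{\text{corrected}} = \hat{\beta}_0 + \ln(f)
```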
Measurement error
Measurement error is common for many predictors and may influence model performance.[27] Random measurement error tends to attenuate or dilute estimated regression coefficients, although the opposite is also observed.[28] It may be useful to set up an interobserver agreement study, or at least discuss the likely interobserver agreement of potential predictors, particularly when subjectivity in determining the predictor's values is present.[29] When a large multicentre dataset is available, the amount of systematic differences between centres can be assessed for each predictor variable.[30] These differences can be caused by several factors, such as systematic differences in measurements, equipment, national procedures, or centre populations.
Defining the outcome to be predicted
For diagnostic models, the presence or absence of the condition should be measured by the best available and commonly accepted method, the 'reference standard'. The outcome measurement is typically available only after the candidate predictors have been assessed. The time interval should be specified and kept short. Preferably, patients receive no treatment during this interval. For prognostic models, the outcome should be a clinically relevant event occurring in a certain period in the future. In both diagnostic and prognostic settings, assessing the outcome blinded to information on the predictors avoids unwanted review bias.
Handling missing data
A popular method is (multiple) imputation by 'chained equations', also called fully conditional specification.[31] Using this approach, missing information can be estimated ('imputed') using related variables for which information is available, including the outcome and variables that are related to the missingness. Several papers provide guidance and illustrations of imputation in the context of risk models.[21-23, 32, 33] Most imputation approaches assume that values are 'missing at random' (MAR).[34] This means that missing values do not occur in a completely random fashion, but are random 'conditional on the available data'. For example, if values are more often missing in older patients, then MAR holds if patient age is available.
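As a minimal sketch (our illustration, not a method prescribed in the text), chained-equations-style imputation is available in scikit-learn; proper multiple imputation draws several completed datasets and pools the analysis results, for example with Rubin's rules:

```python
# Sketch of chained-equations ('fully conditional specification') imputation
# with scikit-learn's experimental IterativeImputer. The data frame and
# variable names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age": [34, 29, np.nan, 41, 38],
    "biomarker": [1.2, np.nan, 0.8, 2.4, np.nan],
    "outcome": [0, 1, 0, 1, 0],  # the outcome should be part of the imputation model
})

# Multiple imputation: draw several completed datasets with different random
# seeds; each would then be analysed separately and the estimates pooled.
imputed_sets = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(df),
        columns=df.columns,
    )
    for seed in range(5)
]
```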
Step III: Modelling strategy

Before delving into any statistical analysis, it is recommended to define a modelling strategy, which should preferably be specified in the study protocol.[3, 4] Regression methods such as logistic regression (for short-term diagnostic or prognostic outcomes) or Cox regression (for long-term prognostic outcomes) are the most frequently used approaches to develop risk models. One could also use flexible methods such as support vector machines (see Box 2).[35] An important consideration is the complexity of the model relative to the available sample size (see Box 2). Overly complex models are often overfitted: they seem to perform well on the data used to derive the model, but perform poorly on new data (Table 1, pitfall 3).[36] Recommendations are to control the ratio of the number of events to the number of variables examined, referred to as events per variable (EPV). Although EPV is a common term, it is more appropriate to consider the total number of estimated parameters (i.e., all regression coefficients, including all those considered prior to any variable selection) rather than just the number of variables in the final model.[3] For binary outcomes, the number of events is the number of individuals in the smallest outcome category. A value of 10 EPV is frequently recommended,[37, 38] but a more realistic guideline is the following: 10 EPV is a minimum when the model is prespecified, although additional 'shrinkage' (see Box 2 and Step IV) of model coefficients may be required; a value of 20 EPV may alleviate the need for shrinkage, whilst 50 EPV is recommended when statistical (data-driven) variable selection is used.[4, 39-41]
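To make the EPV bookkeeping concrete, here is a toy calculation with hypothetical numbers:

```python
# Toy EPV calculation (all numbers hypothetical).
n_events = min(150, 850)  # events = size of the smallest outcome category
n_parameters = 12         # every candidate coefficient counts: dummy variables,
                          # spline terms, and terms later dropped by selection
epv = n_events / n_parameters
print(f"EPV = {epv:.1f}")  # 12.5: above the minimum of 10, but below the 20
                           # that may alleviate the need for shrinkage
```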
It is common practice to use some form of statistical variable selection to reduce the number of variables and obtain a parsimonious model. Backwards elimination is preferred because this approach starts with a full model (i.e., including all variables) and eliminates non-significant variables one by one.[3] Although convenient, these methods have important limitations, especially in small datasets.[3, 4, 36, 39, 40] Due to repeated statistical testing, stepwise selection methods lead to overestimated coefficients (resulting in overfitting) and overoptimistic p-values (Table 1, pitfall 4).[42] The same is true when selecting variables based on their univariate association with the outcome, which should be avoided.[3] Statistical significance is often not the best criterion for inclusion or exclusion of predictors. A preferable strategy is to use subject matter knowledge to select predictors a priori (see Box 2).[3, 4]
When dealing with categorical variables, categories with little data may be combined to reduce the number of parameters. Categorising continuous variables should be avoided, because this entails a substantial loss of information.[43] To maximise predictive ability, it is advisable to assess whether the variable has a linear effect or might need a transformation (see Box 2). Investigations of nonlinearity should be kept within reasonable limits relative to the number of available events, to avoid overly complex models (Table 1, pitfall 5). For the same reason, it is recommended to control EPV by specifying in advance which interaction terms are known or potentially relevant, instead of testing all possible interaction terms (Box 2).[3, 4]

Box 2. Modelling strategy: technical details
Flexible models
The machine learning literature offers a class of models that focus on automation and computational flexibility. Examples include neural networks, support vector machines, and random forests.[44] However, these methods struggle with reliable probabilistic estimation[45, 46] and interpretability. Moreover, at least for clinical applications, machine learning methods generally do not yield better performance than regression methods.[46-50]

Overfitting
Two samples of patients from the same population will always be different due to random variation. Capturing random peculiarities in a sample should be avoided; we should focus on identifying the 'true' underlying associations with the outcome. Modelling random idiosyncrasies is called overfitting.[36] Overfitting is more likely when the sample size is too small relative to the total number of (candidate) predictors considered.[3]

The need for shrinkage
Fitting logistic regression models using standard 'maximum likelihood' yields unbiased parameter estimates provided the sample size is sufficiently large.[51] Nevertheless, because parameter estimates are combined to obtain risk predictions, these predictions contain overfitting even in relatively large datasets: predicted risks are too extreme, in that higher risks are overestimated and lower risks are underestimated. Shrinkage of model coefficients (see Step IV) can alleviate this problem.
Subject matter knowledge
Based on experience and the scientific literature, domain experts can often judge the likely importance of predictors a priori. In addition, predictors can be judged in terms of subjectivity,[29, 30] financial cost, and invasiveness. For example, the variable age is easy and cheap to obtain, whereas measurements based on magnetic resonance imaging can be difficult and expensive. Nevertheless, statistical selection remains of interest when subject matter knowledge falls short.[3, 52] Subject matter knowledge may also prove useful when scrutinising the statistically fitted model. For example, an effect in the opposite direction to what is expected is suspicious. The exclusion of such predictors can be beneficial for the robustness and transportability of the model.[40] This criterion was used, for instance, to eliminate body mass index from a risk model to predict successful vaginal birth after caesarean section.[53]

Nonlinearity
Continuous variables may have a nonlinear effect. For example, biomarker effects are often logarithmic: for low biomarker values, small differences have a strong influence on the risk of disease, whereas for high values small differences do not matter much. When the number of events is small, linear effects can be assumed or nonlinearity can be investigated for the most important predictors only. Popular methods to model nonlinearity are spline functions (e.g., restricted cubic splines) and multivariable fractional polynomials (MFP).[3, 52] An interesting feature of MFP is that it combines the assessment of nonlinearity with backward variable selection.
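A minimal sketch of a restricted (natural) cubic spline term in a logistic model, using a patsy formula in statsmodels (the data frame df and variable names are hypothetical):

```python
# Sketch: model a possibly nonlinear biomarker effect with a natural cubic
# regression spline basis ('cr', 3 degrees of freedom) while keeping age linear.
import statsmodels.formula.api as smf

model = smf.logit("malignant ~ cr(biomarker, df=3) + age", data=df).fit()
print(model.summary())
```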
Interactions
An interaction occurs when the coefficient of a predictor depends on the value of another predictor. The number of possible interaction terms grows exponentially with the number of predictors, and the power to detect interactions is low.[54] As a result, many statistically significant interactions are hard to replicate. Moreover, it is unlikely that interaction terms will dramatically improve predictions. An interesting exception is the interaction between fetal heart activity and gestational sac size when predicting the chance of pregnancy viability beyond the first trimester: larger gestational sac sizes increase the chance of viability when fetal heart activity was observed, but decrease the chance of viability when no such activity was seen.[55]
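A minimal sketch of prespecifying a single interaction term (hypothetical variable names, loosely inspired by the viability example):

```python
# Sketch: 'heart_activity * sac_size' expands to both main effects plus the
# interaction term heart_activity:sac_size, whose coefficient lets the effect
# of sac size depend on whether fetal heart activity was observed.
import statsmodels.formula.api as smf

model = smf.logit("viable ~ heart_activity * sac_size", data=df).fit()
print(model.params)
```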
Step IV: Model fitting

Once the development strategy has been defined, it can be implemented. For smaller sample sizes (e.g., EPV < 20), shrinkage methods may be considered, whereby models are penalised towards simplicity (e.g., small regression coefficients can be shrunk towards zero).[40] Common penalisation methods include ridge regression and the LASSO.[56]
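A minimal sketch with scikit-learn (X and y are hypothetical numpy arrays of predictors and binary outcomes; the penalty strength is chosen by cross-validation):

```python
# Sketch of penalised logistic regression. penalty="l2" gives ridge-type
# shrinkage; penalty="l1" with solver="liblinear" or "saga" gives the LASSO.
from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV(penalty="l2", Cs=10, cv=5, scoring="neg_log_loss")
model.fit(X, y)
print(model.C_, model.coef_)  # chosen penalty strength and shrunken coefficients
```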
Step V: Validation of model performance

The proof of the pudding is in the eating, certainly for risk models. It is key to evaluate model performance. We distinguish three aspects of model performance (see Box 3 for an elaboration, and the sketch after this list):

- Discrimination assesses whether estimated risks are different for patients with and without the disease. The most well-known measure is the c-statistic, also known as the area under the ROC curve (AUC) for binary outcomes (Figure 1). It is the probability that a randomly selected event has a higher predicted probability than a randomly selected non-event.
- Calibration assesses the agreement between the predicted risks and the corresponding observed event rates. This is preferably assessed using a calibration plot (Figure 2).[4, 57]
- Discrimination and calibration do not address the clinical consequences of using a risk model.[58-60] One measure to assess the clinical utility for decision making is the Net Benefit, which is plotted in a decision curve (Figure 3).[61, 62] Net Benefit relies on the relationship between the risk threshold and the relative importance of true vs false positives: when we adopt a risk threshold of 10% to select patients for treatment, we accept the harm of treating 9 patients without the disease to gain the benefit of treating 1 patient with the disease. This means that we accept up to 9 false positives per true positive. Net Benefit uses this relationship to correct the proportion of true positives for the proportion of false positives.
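The sketch below (our illustration; y and p are hypothetical numpy arrays of 0/1 outcomes and predicted risks) computes one measure per aspect: the c-statistic, the calibration intercept and slope, and the Net Benefit at a 10% threshold:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

# Discrimination: c-statistic (AUC).
auc = roc_auc_score(y, p)

# Calibration: slope from regressing the outcome on the linear predictor;
# intercept ('calibration-in-the-large') with the linear predictor as offset.
lp = np.log(p / (1 - p))
slope = sm.Logit(y, sm.add_constant(lp)).fit(disp=0).params[1]
intercept = sm.GLM(y, np.ones(len(y)), family=sm.families.Binomial(),
                   offset=lp).fit().params[0]

# Clinical utility: Net Benefit at risk threshold pt = 10%.
pt = 0.10
pos = p >= pt
tp = np.sum(pos & (y == 1)) / len(y)  # proportion of true positives
fp = np.sum(pos & (y == 0)) / len(y)  # proportion of false positives
net_benefit = tp - fp * pt / (1 - pt)
```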
We further distinguish three types of validation (a bootstrap sketch follows this list):

- Apparent validation refers to evaluation on the exact same data on which the model was developed. The resulting performance will usually be optimistic. The amount of optimism depends strongly on how the model was developed, including the sample size and the number of variables examined.
- Internal validation involves an independent evaluation using the dataset that was used to develop the model.[63] The most popular approach is to randomly split the dataset into a development set and a validation set; however, this approach is inefficient and should be avoided (Table 1, pitfall 6). Alternative approaches such as cross-validation or bootstrapping are recommended.[64] Publications describing the development of new risk models should always include an internal validation.[2, 26]
- External validation involves an evaluation on a different dataset that is collected at a different time point and/or at a different location (see below).[63]
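As an illustration of bootstrap internal validation (a sketch assuming a plain logistic model and numpy arrays X and y; in practice the entire modelling strategy, including any variable selection, must be repeated in every bootstrap sample):

```python
# Sketch of optimism-corrected AUC via the bootstrap.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
apparent = roc_auc_score(y, LogisticRegression().fit(X, y).predict_proba(X)[:, 1])

optimism = []
for _ in range(200):                              # 200 bootstrap repetitions
    idx = rng.integers(0, len(y), len(y))         # sample patients with replacement
    m = LogisticRegression().fit(X[idx], y[idx])  # redevelop the model on the sample
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

corrected_auc = apparent - np.mean(optimism)
```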
Box 3. Validation of performance: technical details

Risk models vs classification rules
The use of a risk threshold turns the risk model into a classification (or decision) rule. Thresholds (or cut-offs) can be selected in several ways.[65, 66] At the level of the individual patient, the relative consequences of a false positive or false negative can be taken into account. For example, a threshold below 50% indicates that a false negative is considered more harmful than a false positive. At the population level, a desired balance in terms of sensitivity and specificity may be sought. This framework is more useful for cost-effectiveness studies and for national guidelines that formulate recommendations for clinical practice. When a threshold is defined, patient classification and actual disease status can be cross-tabulated and summarised with measures such as sensitivity, specificity, positive predictive value, and negative predictive value. The positive and negative predictive values directly depend on the event rate (i.e., prevalence for diagnostic outcomes); however, in practice sensitivity and specificity also vary with the event rate.[67]
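A minimal sketch of these cross-tabulated measures at a hypothetical 10% threshold (y and p again hypothetical arrays of outcomes and predicted risks):

```python
import numpy as np

pt = 0.10
pred = p >= pt  # classify as diseased when the predicted risk reaches the threshold
tp = np.sum(pred & (y == 1)); fp = np.sum(pred & (y == 0))
fn = np.sum(~pred & (y == 1)); tn = np.sum(~pred & (y == 0))

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)  # positive predictive value
npv = tn / (tn + fn)  # negative predictive value
```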
Calibration
Calibration plots are used to check whether predicted risks correspond to observed event rates.[4, 57] They can be supplemented with estimates of the calibration slope and intercept.[4, 68] The Hosmer-Lemeshow statistic is frequently used to test for miscalibration, but suffers from serious drawbacks, e.g., low power and an inability to assess the magnitude or direction of (mis)calibration. We therefore recommend not to use it.[69, 70]
Clinical utility
Despite being introduced recently, measures for clinical utility have already been recommended by several leading journals.[62, 71-75] Poor discrimination and miscalibration reduce clinical utility, but good discrimination and calibration do not guarantee utility.[76] However, measures for clinical utility are not always of interest. For example, a risk model to merely inform pregnant couples of the chance of a successful pregnancy beyond the first trimester[55] would not require an analysis of clinical utility, as no decision needs to be made that is dependent on a threshold. Calibration is the key measure in such contexts.
Model comparison
Models can be compared by calculating the difference in the c-statistic, by comparing calibration, and by comparing Net Benefit or another metric for clinical usefulness.[77] The significance of an added predictor can be based on a test for the predictor's regression coefficient rather than on tests for performance improvement (e.g., change in AUC).[78] Reclassification statistics such as the net reclassification improvement (NRI) and integrated discrimination improvement (IDI) have fuelled intense debate.[79-81] We advise caution when using NRI and IDI, because they depend on calibration in the sense that models may appear advantageous simply because of poor calibration.[82-84]
Step VI: Model presentation and interpretation

The exact formula of the model (including the intercept or the baseline survival at a given time point) should always be reported in the publication to allow others to use it, including for validation by independent investigators.[2, 85] To aid uptake of risk models, they are often simplified or presented in alternative formats, including score charts, nomograms, or colour bars.[3, 4, 55, 73, 86] Clear eligibility criteria should be presented along with the model, including the ranges of continuous predictors, so that users are made aware if they are extrapolating beyond the ranges observed in the development data.
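For illustration (a generic form of our own, not a formula from the case studies), a logistic regression risk model with two predictors is fully reported by publishing the intercept and all coefficients:

```latex
% Generic form of a reported logistic regression risk model; beta_0 (intercept)
% and beta_1, beta_2 (coefficients) must all be published for others to use it.
\text{estimated risk} = \frac{1}{1 + \exp\!\left(-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)\right)}
```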
Step VII: Model reporting

Complete and transparent communication of model development and validation ensures that others can fully understand what was done. Reviews of reporting quality have repeatedly revealed clear shortcomings (Table 1, pitfall 7).[6-8, 87] This has led to comprehensive guidelines and checklists for manuscript preparation, such as the recent TRIPOD statement.[2, 26, 88]
Step VIII: External validation

The real test of a model involves an evaluation on new data, either collected at a later point in time from the same centre(s) (temporal validation) or collected at different centres (geographical validation).[89] It is disappointing that the literature contains many more publications on the development of new risk models than on external validations of existing models (Table 1, pitfall 8).[8, 20, 90] It is better to develop fewer models that hold more promise of being robust and useful. In addition, when several models for the same purpose exist, it is recommended to directly compare these models in an external validation study.[77] Details on external validation are provided in Box 4.
Box 4. External validation: technical details

External validation
A reliable external validation study requires a sufficiently large sample size: at least 100 but preferably 200 events are recommended.[70, 91-93] External validation often results in poorer performance with respect to discrimination or calibration.[90] Many factors can explain the results at external validation, so these results have to be interpreted carefully, often in the context of differences in case-mix between the development and validation datasets.[94, 95] For example, discrimination is typically lower in more homogeneous populations. Even temporal validation may show performance degradation, for example because the centre changed from a secondary to a tertiary care centre, or because new clinical guidelines have changed the population.[96]

Updating
If the performance of a model is disappointing, updating the model is a sensible solution that is more efficient than developing a completely new one, because the original model contains useful information. Different approaches exist, such as intercept adjustment, rescaling of model coefficients, model refitting, and model extension.[97-99]
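A minimal sketch of the first two updating approaches in statsmodels (our illustration; y_val is the hypothetical validation-set outcome and lp the original model's linear predictor in that set):

```python
import numpy as np
import statsmodels.api as sm

# Intercept adjustment: re-estimate only the intercept, keeping the original
# coefficients fixed by entering the linear predictor as an offset.
upd_intercept = sm.GLM(y_val, np.ones(len(y_val)),
                       family=sm.families.Binomial(), offset=lp).fit()

# Recalibration: re-estimate the intercept plus one overall slope that
# rescales all original coefficients at once.
upd_slope = sm.GLM(y_val, sm.add_constant(lp),
                   family=sm.families.Binomial()).fit()
print(upd_intercept.params, upd_slope.params)
```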
Step IX: Impact studies

The ultimate aim of many risk models is to improve patient care. Therefore, the final step would be an impact study, perhaps together with a cost-effectiveness analysis.[20, 100, 101] Preferably, impact studies are (cluster) randomised studies whose primary endpoints are clinical care parameters such as length of hospitalisation, number of unnecessary operations, days off work, time to diagnosis, morbidity, or quality of life.[100, 102, 103] Unfortunately, few risk models reach this stage of investigation,[20, 100] even though good predictive performance is no guarantee of a beneficial impact on patient outcomes.[104, 105]
Step X: Model implementation

To increase the uptake of a model, a user-friendly implementation can be provided.[106] The model can, for example, be implemented in a spreadsheet for use with office software, or made accessible on websites (e.g., www.qrisk.org). With the rise of smartphones and tablets, implementation is currently shifting towards mobile applications (apps). Nevertheless, disseminating inadequately validated risk models has to be avoided.
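As an illustration only, a published logistic model can be wrapped in a small function; every name and coefficient below is a hypothetical placeholder, not a real model:

```python
import math

# Hypothetical published coefficients (placeholders for illustration only).
COEFS = {"intercept": -2.3, "age_per_year": 0.04, "biomarker": 0.8}

def predicted_risk(age: float, biomarker: float) -> float:
    """Return the predicted risk from the (hypothetical) published formula."""
    lp = (COEFS["intercept"]
          + COEFS["age_per_year"] * age
          + COEFS["biomarker"] * biomarker)
    return 1 / (1 + math.exp(-lp))
```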
Conclusion

The development and validation of risk models should be appropriately planned, conducted, presented, and reported. Risk models should only be developed when there is a clinical need, and external validation studies should be given the attention and prominence they deserve. The development of a robust model and an informative validation is not straightforward, with several pitfalls looming along the way. Perfection is impossible, but adhering to current methodological standards is important to arrive at a good model that has the potential to be useful in clinical practice.
Acknowledgments

Disclosure of interest
The authors report no conflict of interest.

Contribution to Authorship
All authors contributed to the conception of the study. LW and BVC performed the literature search and drafted the manuscript. GSC contributed to the content of the drafts and suggested additional literature. All authors read and approved the final manuscript.

Details of Ethical Approval
This study did not require ethical approval because it did not involve any human or animal subjects, nor did it make use of hospital records.

Funding
This work was supported by KU Leuven (grant C24/15/037), Research Foundation Flanders (FWO grant G049312N), Flanders' Agency for Innovation by Science and Technology (IWT Vlaanderen, grant IWT-TBM 070706-IOTA3), and the Medical Research Council (grant G1100513). LW is a PhD fellow of IWT Vlaanderen.
References

1. Kleinrouweler CE, Cheong-See FM, Collins GS, Kwee A, Thangaratinam S, Khan KS, et al. Prognostic models in obstetrics: available, but far from applicable. Am J Obstet Gynecol. 2016; 214: 79-90.e36.
2. Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015; 162: W1-W73.
3. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, NY: Springer, 2001.
4. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, NY: Springer, 2009.
5. Schoorel EN, Vankan E, Scheepers HC, Augustijn BC, Dirksen CD, de Koning M, et al. Involving women in personalised decision-making on mode of delivery after caesarean section: the development and pilot testing of a patient decision aid. BJOG. 2014; 121: 202-9.
6. Mallett S, Royston P, Dutton S, Waters R, Altman DG. Reporting methods in studies developing prognostic models in cancer: a review. BMC Med. 2010; 8: 20.
7. Bouwmeester W, Zuithoff NP, Mallett S, Geerlings MI, Vergouwe Y, Steyerberg EW, et al. Reporting and methods in clinical prediction research: a systematic review. PLoS Med. 2012; 9: 1-12.
8. Collins GS, de Groot JA, Dutton S, Omar O, Shanyinde M, Tajar A, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014; 14: 40.
9. Van Calster B, Van Hoorde K, Valentin L, Testa AC, Fischerova D, Van Holsbeke C, et al. Evaluating the risk of ovarian cancer before surgery using the ADNEX model to differentiate between benign, borderline, early and advanced stage invasive, and secondary metastatic tumours: prospective multicentre diagnostic study. BMJ. 2014; 349: g5920.
10. Schuit E, Kwee A, Westerhuis ME, Van Dessel HJ, Graziosi GC, Van Lith JM, et al. A clinical prediction model to assess the risk of operative delivery. BJOG. 2012; 119: 915-23.
11. Schoorel EN, Melman S, van Kuijk SM, Grobman WA, Kwee A, Mol BW, et al. Predicting successful intended vaginal delivery after previous caesarean section: external validation of two predictive models in a Dutch nationwide registration-based cohort with a high intended vaginal delivery rate. BJOG. 2014; 121: 840-7.
12. Hand DJ. Deconstructing statistical questions. J R Stat Soc Ser A. 1994; 157: 317-56.
13. Geersing GJ, Bouwmeester W, Zuithoff P, Spijker R, Leeflang M, Moons KG. Search filters for finding prognostic and diagnostic prediction studies in Medline to enhance systematic reviews. PLoS One. 2012; 7: e32844.
14. Wyatt JC, Altman DG. Commentary: Prognostic models: clinically useful or quickly forgotten? BMJ. 1995; 311: 1539-41.
15. Vickers AJ, Cronin AM. Everything you always wanted to know about evaluating prediction models (but were too afraid to ask). Urology. 2010; 76: 1298-301.
16. Macleod MR, Michie S, Roberts I, Dirnagl U, Chalmers I, Ioannidis JP, et al. Biomedical research: increasing value, reducing waste. Lancet. 2014; 383: 101-4.
17. Kaijser J, Sayasneh A, Van Hoorde K, Ghaem-Maghami S, Bourne T, Timmerman D, et al. Presurgical diagnosis of adnexal tumours using mathematical models and scoring systems: a systematic review and meta-analysis. Hum Reprod Update. 2014; 20: 449-62.
18. Peat G, Riley RD, Croft P, Morley KI, Kyzas PA, Moons KG, et al. Improving the transparency of prognosis research: the role of reporting, data sharing, registration, and protocols. PLoS Med. 2014; 11: e1001671.
19. Altman DG. The time has come to register diagnostic and prognostic research. Clin Chem. 2014; 60: 580-2.
20. Steyerberg EW, Moons KG, van der Windt DA, Hayden JA, Perel P, Schroter S, et al. Prognosis Research Strategy (PROGRESS) 3: prognostic model research. PLoS Med. 2013; 10: e1001381.
21. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009; 338: b2393.
22. Ambler G, Omar RZ, Royston P. A comparison of imputation techniques for handling missing predictor values in a risk model with a binary outcome. Stat Methods Med Res. 2007; 16: 277-98.
23. Vergouwe Y, Royston P, Moons KG, Altman DG. Development and validation of a prediction model with missing predictor data: a practical approach. J Clin Epidemiol. 2010; 63: 205-14.
24. Liew SM, Doust J, Glasziou P. Cardiovascular risk scores do not account for the effect of treatment: a review. Heart. 2011; 97: 689-97.
25. Ganna A, Reilly M, de Faire U, Pedersen N, Magnusson P, Ingelsson E. Risk prediction measures for case-cohort and nested case-control designs: an application to cardiovascular disease. Am J Epidemiol. 2012; 175: 715-24.
26. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015; 350: g7594.
27. Khudyakov P, Gorfine M, Zucker D, Spiegelman D. The impact of covariate measurement error on risk prediction. Stat Med. 2015; 34: 2353-67.
28. Carroll RJ, Delaigle A, Hall P. Nonparametric prediction in measurement error models. J Am Stat Assoc. 2009; 104: 993-1014.
29. Stiell IG, Wells GA. Methodologic standards for the development of clinical decision rules in emergency medicine. Ann Emerg Med. 1999; 33: 437-47.
30. Wynants L, Timmerman D, Bourne T, Van Huffel S, Van Calster B. Screening for data clustering in multicenter studies: the residual intraclass correlation. BMC Med Res Methodol. 2013; 13.
31. White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011; 30: 377-99.
32. Moons KG, Donders RA, Stijnen T, Harrell FE Jr. Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol. 2006; 59: 1092-101.
33. Janssen KJ, Vergouwe Y, Donders AR, Harrell FE Jr, Chen Q, Grobbee DE, et al. Dealing with missing predictor values when applying clinical prediction models. Clin Chem. 2009; 55: 994-1001.
34. Little RJ, Rubin DB. Statistical Analysis with Missing Data. Hoboken, NJ: John Wiley & Sons, 2002.
35. Steyerberg EW, van der Ploeg T, Van Calster B. Risk prediction with machine learning and regression methods. Biom J. 2014; 56: 601-6.
36. Babyak MA. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med. 2004; 66: 411-21.
37. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996; 49: 1373-9.
38. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996; 15: 361-87.
39. Steyerberg EW, Eijkemans MJ, Habbema JD. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol. 1999; 52: 935-42.
40. Steyerberg EW, Eijkemans MJ, Harrell FE Jr, Habbema JD. Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets. Med Decis Making. 2001; 21: 45-56.
41. Wynants L, Bouwmeester W, Moons KG, Moerbeek M, Timmerman D, Van Huffel S, et al. A simulation study of sample size demonstrated the importance of the number of events per variable to develop prediction models in clustered data. J Clin Epidemiol. 2015.
42. Chatfield C. Model uncertainty, data mining and statistical inference. J R Stat Soc Ser A. 1995; 158: 419-66.
43. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006; 25: 127-41.
44. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY: Springer, 2009.
45. Kruppa J, Liu Y, Biau G, Kohler M, Konig IR, Malley JD, et al. Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J. 2014; 56: 534-63.
46. Van Hoorde K, Van Huffel S, Timmerman D, Bourne T, Van Calster B. A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform. 2015; 54: 283-93.
47. Van Calster B, Condous G, Kirk E, Bourne T, Timmerman D, Van Huffel S. An application of methods for the probabilistic three-class classification of pregnancies of unknown location. Artif Intell Med. 2009; 46: 139-54.
48. Van Holsbeke C, Van Calster B, Bourne T, Ajossa S, Testa AC, Guerriero S, et al. External validation of diagnostic models to estimate the risk of malignancy in adnexal masses. Clin Cancer Res. 2012; 18: 815-25.
49. Austin PC, Tu JV, Ho JE, Levy D, Lee DS. Using methods from the data-mining and machine-learning literature for disease classification and prediction: a case study examining classification of heart failure subtypes. J Clin Epidemiol. 2013; 66: 398-407.
50. Tollenaar N, Van der Heijden P. Which method predicts recidivism best? A comparison of statistical, machine learning and data mining predictive models. J R Stat Soc Ser A. 2013; 176: 565-84.
51. Nemes S, Jonasson JM, Genell A, Steineck G. Bias in odds ratios by logistic regression modelling and sample size. BMC Med Res Methodol. 2009; 9: 56.
52. Royston P, Sauerbrei W. Multivariable Model-Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables. Chichester: John Wiley & Sons, 2008.
53. Naji O, Wynants L, Smith A, Abdallah Y, Stalder C, Sayasneh A, et al. Predicting successful vaginal birth after Cesarean section using a model based on Cesarean scar features examined by transvaginal sonography. Ultrasound Obstet Gynecol. 2013; 41: 672-8.
54. Greenland S. Tests for interaction in epidemiologic studies: a review and a study of power. Stat Med. 1983; 2: 243-51.
55. Bottomley C, Van Belle V, Kirk E, Van Huffel S, Timmerman D, Bourne T. Accurate prediction of pregnancy viability by means of a simple scoring system. Hum Reprod. 2013; 28: 68-76.
56. Ambler G, Seaman S, Omar RZ. An evaluation of penalised survival methods for developing prognostic models with rare events. Stat Med. 2012; 31: 1150-61.
57. Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med. 2014; 33: 517-35.
58. Pepe MS, Janes HE. Gauging the performance of SNPs, biomarkers, and clinical factors for predicting risk of breast cancer. J Natl Cancer Inst. 2008; 100: 978-9.
59. Vickers AJ, Cronin AM. Traditional statistical methods for evaluating prediction models are uninformative as to clinical value: towards a decision analytic framework. Semin Oncol. 2010; 37: 31-8.
60. Steyerberg EW, Pencina MJ, Lingsma HF, Kattan MW, Vickers AJ, Van Calster B. Assessing the incremental value of diagnostic and prognostic markers: a review and illustration. Eur J Clin Invest. 2012; 42: 216-28.
61. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006; 26: 565-74.
62. Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016; 352.
63. Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med. 1999; 130: 515-24.
64. Steyerberg EW, Harrell FE, Borsboom G, Eijkemans MJC, Vergouwe Y, Habbema JDF. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001; 54: 774-81.
65. Morrow DA, Cook NR. Determining decision limits for new biomarkers: clinical and statistical considerations. Clin Chem. 2011; 57: 1-3.
66. Steyerberg EW, Van Calster B, Pencina MJ. [Performance measures for prediction models and markers: evaluation of predictions and classifications]. Rev Esp Cardiol. 2011; 64: 788-94.
67. Leeflang MM, Rutjes AW, Reitsma JB, Hooft L, Bossuyt PM. Variation of a test's sensitivity and specificity with disease prevalence. CMAJ. 2013; 185: E537-44.
68. Cox DR. Two further applications of a model for binary regression. Biometrika. 1958; 45: 562-5.
69. Hosmer DW, Hjort NL. Goodness-of-fit processes for logistic regression: simulation results. Stat Med. 2002; 21: 2723-38.
70. Peek N, Arts DG, Bosman RJ, van der Voort PH, de Keizer NF. External validation of prognostic models for critically ill patients required substantial sample sizes. J Clin Epidemiol. 2007; 60: 491-501.
71. Vickers AJ. Prediction models: revolutionary in principle, but do they do more good than harm? J Clin Oncol. 2011; 29: 2951-2.
72. Mallett S, Halligan S, Thompson M, Collins GS, Altman DG. Interpreting diagnostic accuracy studies for patient care. BMJ. 2012; 345.
73. Balachandran VP, Gonen M, Smith JJ, DeMatteo RP. Nomograms in oncology: more than meets the eye. Lancet Oncol. 2015; 16: e173-e80.
74. Baker SG. Putting risk prediction in perspective: relative utility curves. J Natl Cancer Inst. 2009; 101: 1538-42.
75. Localio AR, Goodman S. Beyond the usual prediction accuracy metrics: reporting results for clinical decision making. Ann Intern Med. 2012; 157: 294-5.
76. Van Calster B, Vickers AJ. Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Making. 2015; 35: 162-9.
77. Collins GS, Moons KG. Comparing risk prediction models. BMJ. 2012; 344: e3186.
78. Pepe MS, Kerr KF, Longton G, Wang Z. Testing for improvement in prediction model performance. Stat Med. 2013; 32: 1467-82.
79. Cook NR, Paynter NP. Performance of reclassification statistics in comparing risk prediction models. Biom J. 2011; 53: 237-58.
80. Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008; 27: 157-72.
81. Leening MJ, Vedder MM, Witteman JC, Pencina MJ, Steyerberg EW. Net reclassification improvement: computation, interpretation, and controversies: a literature review and clinician's guide. Ann Intern Med. 2014; 160: 122-31.
82. Pepe MS, Fan J, Feng Z, Gerds T, Hilden J. The Net Reclassification Index (NRI): a misleading measure of prediction improvement even with independent test data sets. Stat Biosci. 2015; 7: 282-95.
83. Leening MJ, Steyerberg EW, Van Calster B, D'Agostino RB Sr, Pencina MJ. Net reclassification improvement and integrated discrimination improvement require calibrated models: relevance from a marker and model perspective. Stat Med. 2014; 33: 3415-8.
84. Hilden J, Gerds TA. A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index. Stat Med. 2014; 33: 3405-14.
85. Collins GS. How can I validate a nomogram? Show me the model. Ann Oncol. 2015; 26: 1034-5.
86. Van Belle V, Van Calster B. Visualizing risk prediction models. PLoS One. 2015; 10: e0132614.
87. Collins GS, Mallett S, Omar O, Yu LM. Developing risk prediction models for type 2 diabetes: a systematic review of methodology and reporting. BMC Med. 2011; 9: 103.
88. Janssens AC, Ioannidis JP, van Duijn CM, Little J, Khoury MJ. Strengthening the reporting of genetic risk prediction studies: the GRIPS statement. Genome Med. 2011; 3: 16.
89. Konig IR, Malley JD, Weimar C, Diener HC, Ziegler A. Practical experiences on the necessity of external validation. Stat Med. 2007; 26: 5499-511.
90. Siontis GC, Tzoulaki I, Castaldi PJ, Ioannidis JP. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol. 2015; 68: 25-34.
91. Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol. 2005; 58: 475-83.
92. Collins GS, Ogundimu EO, Altman DG. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med. 2016; 35: 214-26.
93. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016.
94. Debray TP, Vergouwe Y, Koffijberg H, Nieboer D, Steyerberg EW, Moons KG. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol. 2015; 68: 279-89.
95. Vergouwe Y, Moons KGM, Steyerberg EW. External validity of risk models: use of benchmark values to disentangle a case-mix effect from incorrect coefficients. Am J Epidemiol. 2010; 172: 971-80.
96. Strobl AN, Vickers AJ, Van Calster B, Steyerberg E, Leach RJ, Thompson IM, et al. Improving patient prostate cancer risk assessment: moving from static, globally-applied to dynamic, practice-specific cancer risk calculators. J Biomed Inform. 2015.
97. Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med. 2004; 23: 2567-86.
98. Janssen KJM, Moons KGM, Kalkman CJ, Grobbee DE, Vergouwe Y. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol. 2008; 61: 76-86.
99. Ankerst DP, Koniarski T, Liang Y, Leach RJ, Feng Z, Sanda MG, et al. Updating risk prediction tools: a case study in prostate cancer. Biom J. 2012; 54: 127-42.
100. Reilly BM, Evans AT. Translating clinical research into clinical practice: impact of using prediction rules to make decisions. Ann Intern Med. 2006; 144: 201-9.
101. Moons KGM, Altman DG, Vergouwe Y, Royston P. Prognosis and prognostic research: application and impact of prognostic models in clinical practice. BMJ. 2009; 338.
102. Stiell IG, Clement CM, Grimshaw J, Brison RJ, Rowe BH, Schull MJ, et al. Implementation of the Canadian C-Spine Rule: prospective 12 centre cluster randomised trial. BMJ. 2009; 339: b4146.
103. Ferrante di Ruffano L, Hyde CJ, McCaffery KJ, Bossuyt PMM, Deeks JJ. Assessing the value of diagnostic tests: a framework for designing and evaluating trials. 2012.
104. Van den Bruel A, Aertgeerts B, Bruyninckx R, Aerts M, Buntinx F, Hall H. Signs and symptoms for diagnosis of serious infections in children: a prospective study in primary care. Br J Gen Pract. 2007; 57: 538-46.
105. Siontis KC, Siontis GC, Contopoulos-Ioannidis DG, Ioannidis JP. Diagnostic tests often fail to lead to changes in patient outcomes. J Clin Epidemiol. 2014; 67: 612-21.
106. Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ. 2005; 330: 765.
Tables/Figure Captions

Table 1. Overview of potential pitfalls when developing or validating risk models.

Figure 1. ROC curve for the ADNEX model to distinguish between malignant and benign lesions. The black line represents the ROC curve for the ADNEX model, the grey diagonal line represents the ROC curve for a model without discriminative ability, and the dots represent the specificity and sensitivity at risk thresholds of 0.03, 0.05, 0.10 and 0.15 (labelled as risk threshold (specificity, sensitivity)).

Figure 2. Calibration plot of the ADNEX model at external validation. Thick line: loess smoother indicating the relation between the predicted probability of malignancy by the ADNEX model and the observed probability. Grey band: 95% confidence interval for the loess smoother. The thin diagonal line indicates perfect calibration. The histogram on the x-axis shows the distribution of predicted probabilities of malignancy, with the frequencies of predicted probabilities for events plotted above and for non-events below the x-axis.

Figure 3. Net Benefit of the ADNEX model at external validation. Dashed line: Net Benefit (NB) of the ADNEX model to distinguish between benign and malignant lesions. Grey line: NB of classifying all tumours as malignant. Black line at zero: NB of classifying all tumours as benign (none as malignant). Dot: NB of ADNEX at a threshold probability of 10%. NB is computed at various risk thresholds. If the probability of malignant disease predicted by ADNEX is higher than the risk threshold, the tumour is classified as malignant. Higher NB values indicate more clinical utility. For example, at a threshold of 10% and compared with classifying no tumours as malignant, the use of ADNEX leads to the equivalent of a net 37 (NB = 0.37) correctly identified malignancies per 100 patients, without an increase in the number of benign lesions classified as malignancies. Moreover, the NB of ADNEX is 0.033 greater than that of assuming all tumours are malignant. This is equivalent to a net 29 (= 0.033 × 100/(10/90)) fewer benign lesions classified as malignancies per 100 patients, compared with classifying all tumours as malignant.