Citation/Reference: Wynants L, Collins G (2016). Key steps and common pitfalls in developing and validating risk models: a review. BJOG: An International Journal of Obstetrics and Gynaecology (in press).
Archived version: author manuscript; the content is identical to that of the submitted manuscript, before refereeing.
Journal homepage: http://www.bjog.org/view/0/index.html
Author contact: laure.wynants@esat.kuleuven.be, +32 (0)16 32 76 70
Key steps and common pitfalls in developing and validating risk models: a review

Ms. Laure WYNANTS, MSc; Leuven, Belgium; KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics; KU Leuven, iMinds Medical IT Department; Kasteelpark Arenberg 10, Box 2446, 3001 Leuven, Belgium

Dr. Gary S COLLINS, PhD; Oxford, UK; Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford; Windmill Road, Oxford OX3 7LD, UK

Dr. Ben VAN CALSTER, PhD; Leuven, Belgium; KU Leuven, Department of Development and Regeneration; Herestraat 49, Box 7003, 3000 Leuven, Belgium

Corresponding author: Ben Van Calster
Address: KU Leuven, Department of Development and Regeneration, Herestraat 49, Box 7003, 3000 Leuven, Belgium
Work phone number: +32 16 37 77 88
E-mail: ben.vancalster@med.kuleuven.be

Running title: Risk models: key steps and common pitfalls
Word count: abstract: 86; main text: 3904
Abstract

Risk models to estimate an individual's risk of having or developing a disease are abundant in the medical literature, yet many do not meet the methodological standards that have been set to maximise generalisability and utility. This paper presents an overview of ten steps from the conception of the study to the implementation of the risk model, and discusses common pitfalls along the way. We cover crucial aspects of study design, data collection, model development, and performance evaluation, and discuss how to bring the model to clinical practice.

Key words: clinical prediction model, logistic regression, model development, model reporting, model validation, risk model
Introduction

In recent years we have seen an increasing number of papers about risk models.[1] The term risk model refers to any model that predicts the risk that a condition is present (diagnostic) or will develop in the future (prognostic).[2] Risk prediction is usually performed using multivariable models that aim to provide reliable predictions in new patients.[3, 4] Risk models may aid clinicians in making treatment decisions based on patient-specific measurements. In that sense risk models may fuel personalised and evidence-based medicine, and enhance shared decision making.[5]

To maximise their potential, it is imperative that risk models are developed carefully and validated rigorously. Unfortunately, reviews have demonstrated that methodological and reporting standards are often not met.[1, 6-8] We present a roadmap with ten steps to summarise the process from the conception of a risk model to its final implementation. We address these steps one by one and highlight common pitfalls (Table 1). Technical details on the methods for the development and evaluation of risk models are presented in boxes for the interested reader.
Case studies

Three case studies that illustrate the ten steps are presented in the supplementary material (Table S1). The first concerns the development of the ADNEX model for the preoperative diagnosis of suspicious ovarian tumours.[9] The ADNEX model predicts whether a tumour is benign, borderline malignant, stage I ovarian cancer, advanced stage (II-IV) ovarian cancer, or secondary metastatic cancer. The second involves the development of a prediction model to assess the risk of operative delivery.[10] The model predicts the need for instrumental vaginal delivery or caesarean section for fetal distress or failure to progress. The third deals with the external validation of two models developed in the United States for the prediction of successful vaginal delivery after a previous caesarean section.[11] The models' transportability to the Dutch setting is investigated.
Step I: Before getting started

It is pivotal to define up front what clinical purpose the risk model should serve,[12] as well as the target population and clinical setting. A mismatch with clinical needs is not uncommon and makes the model useless (Table 1, pitfall 1). Next, ask whether a new model needs to be developed at all: risk models for the same purpose often already exist.[1] If so, check the characteristics of these models (e.g., the context and setting they were developed for, and the included predictors) as well as the extent to which they have been validated. Search strategies for finding relevant existing risk models are available.[2, 13] Systematic reviews indicate that risk models are often developed in vain because there was no clear clinical need, several other models already exist, and/or the models are not validated.[14-17]
Step II: Study design and setup

We encourage investigators to discuss the study setup beforehand and to write a protocol that covers each of the ten steps.[18] For complete transparency, consider publishing the protocol and registering the study.[19] The preferred study design is a prospective cross-sectional cohort design for diagnostic risk models and a prospective longitudinal cohort design for prognostic risk models. Both designs study a group of patients that should be representative of the population in which the model is intended to be applied. By prospective, we mean that the data are collected with the primary aim of developing a risk model. Box 1 presents alternative designs and data sources with their limitations. To ensure an unbiased sample, a consecutive sample of patients is preferred. To enhance the generalisability and transportability of the model, multicentre data collection is recommended. It is important to carefully identify the predictor variables for inclusion in the study, such as established risk factors, variables that are useful or easily obtainable in the clinical context of the model's intended use,[20] and promising variables whose predictive value has yet to be assessed. The definitions of these variables should be unambiguous and standardised, and measurement error should be avoided (see Box 1). The outcome to be predicted should be clearly defined in advance (see Box 1). Finally, bear in mind that the potential predictors should be available at the moment the risk model will be used during patient care. For example, when predicting the risk of a birth defect based on measurements at the nuchal scan, it makes no sense to include birth weight in the risk model.

However carefully the study is designed, data are rarely complete. A common mistake is to automatically exclude patients with missing data and perform the analysis only on the complete cases (Table 1, pitfall 2). Besides reducing the sample size, this may lead to a biased sample resulting in a model with poor generalisability, because there is nearly always an underlying reason for missing values.[21] The preferred approach is to replace ('impute') the missing data with sensible values. Preferably several plausible values are imputed (using 'multiple imputation') to acknowledge that imputed values are themselves estimates and by definition uncertain.[22, 23]
Box 1. Study set-up and design: technical details

Alternative designs
Risk models can be developed as a secondary analysis of existing data, for example from a registry study or a randomised clinical trial, or retrospectively from hospitals' medical health records. A common limitation of using existing data is that they were collected for a different purpose, so that potential or even established risk factors may have been collected in a suboptimal fashion or not at all (i.e., plenty of missing data). This can seriously affect the performance and credibility of the model. For data from randomised trials, generalisability is a concern due to strong inclusion/exclusion criteria or volunteer bias. Furthermore, in the presence of an effective treatment, some thought also needs to be given to how treatment should be handled in the analysis.[24] If the outcome to be predicted is rare, it may be useful to identify sufficient cases and recruit a random sample of non-cases, using a case-cohort or nested case-control design.[25] The prediction model then needs to be adjusted for the sampling frequency of non-cases to ensure reliable predictions.[26]
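As a minimal sketch of such an adjustment (our illustration, under the assumption that all cases but only a random fraction f of non-cases were sampled), the intercept of a logistic model can be corrected while the other coefficients stay untouched:

```latex
% Intercept correction for non-case sampling, assuming all cases and a random
% fraction f (0 < f < 1) of non-cases are included in the development data:
\beta_0^{\text{corrected}} = \hat{\beta}_0 + \ln(f)
```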
Measurement error
Measurement error is common for many predictors and may influence model performance.[27] Random measurement error tends to attenuate or dilute estimated regression coefficients, although the opposite is also observed.[28] It may be useful to set up an interobserver agreement study, or at least discuss the likely interobserver agreement of potential predictors, particularly when subjectivity in determining the predictor's values is present.[29] When a large multicentre dataset is available, the amount of systematic differences between centres can be assessed for each predictor variable.[30] These differences can be caused by several factors, such as systematic differences in measurements, equipment, national procedures, or centre populations.
Defining the outcome to be predicted
For diagnostic models, the presence or absence of the condition should be measured by the best available and commonly accepted method, the 'reference standard'. The outcome measurement is typically available only after the candidate predictors have been assessed. The time interval should be specified and kept short. Preferably, patients receive no treatment during this interval. For prognostic models, the outcome should be a clinically relevant event occurring in a certain period in the future. In both diagnostic and prognostic settings, assessing the outcome blinded to information on the predictors avoids unwanted review bias.
Handling missing data
A popular method is (multiple) imputation by 'chained equations', also called fully conditional specification.[31] Using this approach, missing information can be estimated ('imputed') using related variables for which information is available, including the outcome and variables that are related to the missingness. Several papers provide guidance and illustrations of imputation in the context of risk models.[21-23, 32, 33] Most imputation approaches assume that values are 'missing at random' (MAR).[34] This means that missing values do not occur in a completely random fashion, but are random 'conditional on the available data'. For example, if values are more often missing in older patients, then MAR holds if patient age is available.
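As a minimal sketch (our illustration, not a method prescribed in the text), chained-equations-style imputation is available in scikit-learn; proper multiple imputation draws several completed datasets and pools the analysis results, for example with Rubin's rules:

```python
# Sketch of chained-equations ('fully conditional specification') imputation
# with scikit-learn's experimental IterativeImputer. The data frame and
# variable names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age": [34, 29, np.nan, 41, 38],
    "biomarker": [1.2, np.nan, 0.8, 2.4, np.nan],
    "outcome": [0, 1, 0, 1, 0],  # the outcome should be part of the imputation model
})

# Multiple imputation: draw several completed datasets with different random
# seeds; each would then be analysed separately and the estimates pooled.
imputed_sets = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(df),
        columns=df.columns,
    )
    for seed in range(5)
]
```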
Step III: Modelling strategy

Before delving into any statistical analysis, it is recommended to define a modelling strategy, which should preferably be specified in the study protocol.[3, 4] Regression methods such as logistic regression (for short-term diagnostic or prognostic outcomes) or Cox regression (for long-term prognostic outcomes) are the most frequently used approaches to develop risk models. One could also use flexible methods such as support vector machines (see Box 2).[35] An important consideration is the complexity of the model relative to the available sample size (see Box 2). Overly complex models are often overfitted: they seem to perform well on the data used to derive the model, but perform poorly on new data (Table 1, pitfall 3).[36] Recommendations are to control the ratio of the number of events to the number of variables examined, referred to as events per variable (EPV). Although EPV is a common term, it is more appropriate to consider the total number of estimated parameters (i.e., all regression coefficients, including all those considered prior to any variable selection) rather than just the number of variables in the final model.[3] For binary outcomes, the number of events is the number of individuals in the smallest outcome category. A value of 10 EPV is frequently recommended,[37, 38] but a more realistic guideline is the following: 10 EPV is a minimum when the model is prespecified, although additional 'shrinkage' (see Box 2 and Step IV) of model coefficients may be required; a value of 20 EPV may alleviate the need for shrinkage, whilst 50 EPV is recommended when statistical (data-driven) variable selection is used.[4, 39-41]
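To make the EPV bookkeeping concrete, here is a toy calculation with hypothetical numbers:

```python
# Toy EPV calculation (all numbers hypothetical).
n_events = min(150, 850)  # events = size of the smallest outcome category
n_parameters = 12         # every candidate coefficient counts: dummy variables,
                          # spline terms, and terms later dropped by selection
epv = n_events / n_parameters
print(f"EPV = {epv:.1f}")  # 12.5: above the minimum of 10, but below the 20
                           # that may alleviate the need for shrinkage
```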
It is common practice to use some form of statistical variable selection to reduce the number of variables and obtain a parsimonious model. Backwards elimination is preferred because this approach starts with a full model (i.e., including all variables) and eliminates non-significant variables one by one.[3] Although convenient, these methods have important limitations, especially in small datasets.[3, 4, 36, 39, 40] Due to repeated statistical testing, stepwise selection methods lead to overestimated coefficients (resulting in overfitting) and overoptimistic p-values (Table 1, pitfall 4).[42] The same is true when selecting variables based on their univariate association with the outcome, which should be avoided.[3] Statistical significance is often not the best criterion for inclusion or exclusion of predictors. A preferable strategy is to use subject matter knowledge to select predictors a priori (see Box 2).[3, 4]
When dealing with categorical variables, categories with little data may be combined to reduce the number of parameters. Categorising continuous variables should be avoided, because this entails a substantial loss of information.[43] To maximise predictive ability, it is advisable to assess whether the variable has a linear effect or might need a transformation (see Box 2). Investigations of nonlinearity should be kept within reasonable limits relative to the number of available events, to avoid overly complex models (Table 1, pitfall 5). For the same reason, it is recommended to control EPV by specifying in advance which interaction terms are known or potentially relevant, instead of testing all possible interaction terms (Box 2).[3, 4]

Box 2. Modelling strategy: technical details
Flexible models
The machine learning literature offers a class of models that focus on automation and computational flexibility. Examples include neural networks, support vector machines, and random forests.[44] However, these methods struggle with reliable probabilistic estimation[45, 46] and interpretability. Moreover, at least for clinical applications, machine learning methods generally do not yield better performance than regression methods.[46-50]

Overfitting
Two samples of patients from the same population will always be different due to random variation. Capturing random peculiarities in a sample should be avoided; we should focus on identifying the 'true' underlying associations with the outcome. Modelling random idiosyncrasies is called overfitting.[36] Overfitting is more likely when the sample size is too small relative to the total number of (candidate) predictors considered.[3]

The need for shrinkage
Fitting logistic regression models using standard 'maximum likelihood' yields unbiased parameter estimates provided the sample size is sufficiently large.[51] Nevertheless, because parameter estimates are combined to obtain risk predictions, these predictions contain overfitting even in relatively large datasets: predicted risks are too extreme, in that higher risks are overestimated and lower risks are underestimated. Shrinkage of model coefficients (see Step IV) can alleviate this problem.
Subject matter knowledge
Based on experience and the scientific literature, domain experts can often judge the likely importance of predictors a priori. In addition, predictors can be judged in terms of subjectivity,[29, 30] financial cost, and invasiveness. For example, the variable age is easy and cheap to obtain, whereas measurements based on magnetic resonance imaging can be difficult and expensive. Nevertheless, statistical selection remains of interest when subject matter knowledge falls short.[3, 52] Subject matter knowledge may also prove useful when scrutinising the statistically fitted model. For example, an effect in the opposite direction to what is expected is suspicious. The exclusion of such predictors can be beneficial for the robustness and transportability of the model.[40] This criterion was used, for instance, to eliminate body mass index from a risk model to predict successful vaginal birth after caesarean section.[53]

Nonlinearity
Continuous variables may have a nonlinear effect. For example, biomarker effects are often logarithmic: for low biomarker values, small differences have a strong influence on the risk of disease, whereas for high values small differences do not matter much. When the number of events is small, linear effects can be assumed or nonlinearity can be investigated for the most important predictors only. Popular methods to model nonlinearity are spline functions (e.g., restricted cubic splines) and multivariable fractional polynomials (MFP).[3, 52] An interesting feature of MFP is that it combines the assessment of nonlinearity with backward variable selection.
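A minimal sketch of a restricted (natural) cubic spline term in a logistic model, using a patsy formula in statsmodels (the data frame df and variable names are hypothetical):

```python
# Sketch: model a possibly nonlinear biomarker effect with a natural cubic
# regression spline basis ('cr', 3 degrees of freedom) while keeping age linear.
import statsmodels.formula.api as smf

model = smf.logit("malignant ~ cr(biomarker, df=3) + age", data=df).fit()
print(model.summary())
```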
Interactions
An interaction occurs when the coefficient of a predictor depends on the value of another predictor. The number of possible interaction terms grows exponentially with the number of predictors, and the power to detect interactions is low.[54] As a result, many statistically significant interactions are hard to replicate. Moreover, it is unlikely that interaction terms will dramatically improve predictions. An interesting exception is the interaction between fetal heart activity and gestational sac size when predicting the chance of pregnancy viability beyond the first trimester: larger gestational sac sizes increase the chance of viability when fetal heart activity was observed, but decrease the chance of viability when no such activity was seen.[55]
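A minimal sketch of prespecifying a single interaction term (hypothetical variable names, loosely inspired by the viability example):

```python
# Sketch: 'heart_activity * sac_size' expands to both main effects plus the
# interaction term heart_activity:sac_size, whose coefficient lets the effect
# of sac size depend on whether fetal heart activity was observed.
import statsmodels.formula.api as smf

model = smf.logit("viable ~ heart_activity * sac_size", data=df).fit()
print(model.params)
```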
Step IV: Model fitting

Once the development strategy has been defined, it can be implemented. For smaller sample sizes (e.g., EPV < 20), shrinkage methods may be considered, whereby models are penalised towards simplicity (e.g., small regression coefficients can be shrunk towards zero).[40] Common penalisation methods include ridge regression and the LASSO.[56]
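A minimal sketch with scikit-learn (X and y are hypothetical numpy arrays of predictors and binary outcomes; the penalty strength is chosen by cross-validation):

```python
# Sketch of penalised logistic regression. penalty="l2" gives ridge-type
# shrinkage; penalty="l1" with solver="liblinear" or "saga" gives the LASSO.
from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV(penalty="l2", Cs=10, cv=5, scoring="neg_log_loss")
model.fit(X, y)
print(model.C_, model.coef_)  # chosen penalty strength and shrunken coefficients
```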
Step V: Validation of model performance

The proof of the pudding is in the eating, certainly for risk models. It is key to evaluate model performance. We distinguish three aspects of model performance (see Box 3 for an elaboration, and the sketch after this list):

- Discrimination assesses whether estimated risks are different for patients with and without the disease. The most well-known measure is the c-statistic, also known as the area under the ROC curve (AUC) for binary outcomes (Figure 1). It is the probability that a randomly selected event has a higher predicted probability than a randomly selected non-event.
- Calibration assesses the agreement between the predicted risks and the corresponding observed event rates. This is preferably assessed using a calibration plot (Figure 2).[4, 57]
- Discrimination and calibration do not address the clinical consequences of using a risk model.[58-60] One measure to assess the clinical utility for decision making is the Net Benefit, which is plotted in a decision curve (Figure 3).[61, 62] Net Benefit relies on the relationship between the risk threshold and the relative importance of true vs false positives: when we adopt a risk threshold of 10% to select patients for treatment, we accept the harm of treating 9 patients without the disease to gain the benefit of treating 1 patient with the disease. This means that we accept up to 9 false positives per true positive. Net Benefit uses this relationship to correct the proportion of true positives for the proportion of false positives.
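The sketch below (our illustration; y and p are hypothetical numpy arrays of 0/1 outcomes and predicted risks) computes one measure per aspect: the c-statistic, the calibration intercept and slope, and the Net Benefit at a 10% threshold:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

# Discrimination: c-statistic (AUC).
auc = roc_auc_score(y, p)

# Calibration: slope from regressing the outcome on the linear predictor;
# intercept ('calibration-in-the-large') with the linear predictor as offset.
lp = np.log(p / (1 - p))
slope = sm.Logit(y, sm.add_constant(lp)).fit(disp=0).params[1]
intercept = sm.GLM(y, np.ones(len(y)), family=sm.families.Binomial(),
                   offset=lp).fit().params[0]

# Clinical utility: Net Benefit at risk threshold pt = 10%.
pt = 0.10
pos = p >= pt
tp = np.sum(pos & (y == 1)) / len(y)  # proportion of true positives
fp = np.sum(pos & (y == 0)) / len(y)  # proportion of false positives
net_benefit = tp - fp * pt / (1 - pt)
```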
We further distinguish three types of validation (a bootstrap sketch follows this list):

- Apparent validation refers to evaluation on the exact same data on which the model was developed. The resulting performance will usually be optimistic. The amount of optimism depends strongly on how the model was developed, including the sample size and the number of variables examined.
- Internal validation involves an independent evaluation using the dataset that was used to develop the model.[63] The most popular approach is to randomly split the dataset into a development set and a validation set; however, this approach is inefficient and should be avoided (Table 1, pitfall 6). Alternative approaches such as cross-validation or bootstrapping are recommended.[64] Publications describing the development of new risk models should always include an internal validation.[2, 26]
- External validation involves an evaluation on a different dataset that is collected at a different time point and/or at a different location (see below).[63]
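As an illustration of bootstrap internal validation (a sketch assuming a plain logistic model and numpy arrays X and y; in practice the entire modelling strategy, including any variable selection, must be repeated in every bootstrap sample):

```python
# Sketch of optimism-corrected AUC via the bootstrap.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
apparent = roc_auc_score(y, LogisticRegression().fit(X, y).predict_proba(X)[:, 1])

optimism = []
for _ in range(200):                              # 200 bootstrap repetitions
    idx = rng.integers(0, len(y), len(y))         # sample patients with replacement
    m = LogisticRegression().fit(X[idx], y[idx])  # redevelop the model on the sample
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

corrected_auc = apparent - np.mean(optimism)
```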
Box 3. Validation of performance: technical details

Risk models vs classification rules
The use of a risk threshold turns the risk model into a classification (or decision) rule. Thresholds (or cut-offs) can be selected in several ways.[65, 66] At the level of the individual patient, the relative consequences of a false positive or false negative can be taken into account. For example, a threshold below 50% indicates that a false negative is considered more harmful than a false positive. At the population level, a desired balance in terms of sensitivity and specificity may be sought. This framework is more useful for cost-effectiveness studies and for national guidelines that formulate recommendations for clinical practice. When a threshold is defined, patient classification and actual disease status can be cross-tabulated and summarised with measures such as sensitivity, specificity, positive predictive value, and negative predictive value. The positive and negative predictive values directly depend on the event rate (i.e., prevalence for diagnostic outcomes); however, in practice sensitivity and specificity also vary with the event rate.[67]
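A minimal sketch of these cross-tabulated measures at a hypothetical 10% threshold (y and p again hypothetical arrays of outcomes and predicted risks):

```python
import numpy as np

pt = 0.10
pred = p >= pt  # classify as diseased when the predicted risk reaches the threshold
tp = np.sum(pred & (y == 1)); fp = np.sum(pred & (y == 0))
fn = np.sum(~pred & (y == 1)); tn = np.sum(~pred & (y == 0))

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)  # positive predictive value
npv = tn / (tn + fn)  # negative predictive value
```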
Calibration
Calibration plots are used to check whether predicted risks correspond to observed event rates.[4, 57] They can be supplemented with estimates of the calibration slope and intercept.[4, 68] The Hosmer-Lemeshow statistic is frequently used to test for miscalibration, but suffers from serious drawbacks, e.g., low power and an inability to assess the magnitude or direction of (mis)calibration. We therefore recommend not to use it.[69, 70]
Clinical utility
Despite being introduced recently, measures for clinical utility have already been recommended by several leading journals.[62, 71-75] Poor discrimination and miscalibration reduce clinical utility, but good discrimination and calibration do not guarantee utility.[76] However, measures for clinical utility are not always of interest. For example, a risk model to merely inform pregnant couples of the chance of a successful pregnancy beyond the first trimester[55] would not require an analysis of clinical utility, as no decision needs to be made that is dependent on a threshold. Calibration is the key measure in such contexts.
Model comparison
Models can be compared by calculating the difference in the c-statistic, by comparing calibration, and by comparing Net Benefit or another metric for clinical usefulness.[77] The significance of an added predictor can be based on a test for the predictor's regression coefficient rather than on tests for performance improvement (e.g., change in AUC).[78] Reclassification statistics such as the net reclassification improvement (NRI) and integrated discrimination improvement (IDI) have fuelled intense debate.[79-81] We advise caution when using NRI and IDI, because they depend on calibration in the sense that models may appear advantageous simply because of poor calibration.[82-84]
Step VI: Model presentation and interpretation

The exact formula of the model (including the intercept or the baseline survival at a given time point) should always be reported in the publication to allow others to use it, including for validation by independent investigators.[2, 85] To aid uptake of risk models, they are often simplified or presented in alternative formats, including score charts, nomograms, or colour bars.[3, 4, 55, 73, 86] Clear eligibility criteria should be presented along with the model, including the ranges of continuous predictors, so that users are made aware if they are extrapolating beyond the ranges observed in the development data.
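For illustration (a generic form of our own, not a formula from the case studies), a logistic regression risk model with two predictors is fully reported by publishing the intercept and all coefficients:

```latex
% Generic form of a reported logistic regression risk model; beta_0 (intercept)
% and beta_1, beta_2 (coefficients) must all be published for others to use it.
\text{estimated risk} = \frac{1}{1 + \exp\!\left(-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)\right)}
```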
Step VII: Model reporting

Complete and transparent communication of model development and validation ensures that others can fully understand what was done. Reviews of reporting quality have repeatedly revealed clear shortcomings (Table 1, pitfall 7).[6-8, 87] This has led to comprehensive guidelines and checklists for manuscript preparation, such as the recent TRIPOD statement.[2, 26, 88]
Step VIII: External validation

The real test of a model involves an evaluation on new data, either collected at a later point in time from the same centre(s) (temporal validation) or collected at different centres (geographical validation).[89] It is disappointing that the literature contains many more publications on the development of new risk models than on external validations of existing models (Table 1, pitfall 8).[8, 20, 90] It is better to develop fewer models that hold more promise of being robust and useful. In addition, when several models for the same purpose exist, it is recommended to directly compare these models in an external validation study.[77] Details on external validation are provided in Box 4.
Box 4. External validation: technical details

External validation
A reliable external validation study requires a sufficiently large sample size: at least 100 but preferably 200 events are recommended.[70, 91-93] External validation often results in poorer performance with respect to discrimination or calibration.[90] Many factors can explain the results at external validation, so these results have to be interpreted carefully, often in the context of differences in case-mix between the development and validation datasets.[94, 95] For example, discrimination is typically lower in more homogeneous populations. Even temporal validation may show performance degradation, for example because the centre changed from a secondary to a tertiary care centre, or because new clinical guidelines have changed the population.[96]

Updating
If the performance of a model is disappointing, updating the model is a sensible solution that is more efficient than developing a completely new one, because the original model contains useful information. Different approaches exist, such as intercept adjustment, rescaling of model coefficients, model refitting, and model extension.[97-99]
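A minimal sketch of the first two updating approaches in statsmodels (our illustration; y_val is the hypothetical validation-set outcome and lp the original model's linear predictor in that set):

```python
import numpy as np
import statsmodels.api as sm

# Intercept adjustment: re-estimate only the intercept, keeping the original
# coefficients fixed by entering the linear predictor as an offset.
upd_intercept = sm.GLM(y_val, np.ones(len(y_val)),
                       family=sm.families.Binomial(), offset=lp).fit()

# Recalibration: re-estimate the intercept plus one overall slope that
# rescales all original coefficients at once.
upd_slope = sm.GLM(y_val, sm.add_constant(lp),
                   family=sm.families.Binomial()).fit()
print(upd_intercept.params, upd_slope.params)
```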
Step IX: Impact studies

The ultimate aim of many risk models is to improve patient care. Therefore, the final step would be an impact study, perhaps together with a cost-effectiveness analysis.[20, 100, 101] Preferably, impact studies are (cluster) randomised studies whose primary endpoints are clinical care parameters such as length of hospitalisation, number of unnecessary operations, days off work, time to diagnosis, morbidity, or quality of life.[100, 102, 103] Unfortunately, few risk models reach this stage of investigation,[20, 100] even though good predictive performance is no guarantee of a beneficial impact on patient outcomes.[104, 105]
Step X: Model implementation

To increase the uptake of a model, a user-friendly implementation can be provided.[106] The model can, for example, be implemented in a spreadsheet for use with office software, or made accessible on websites (e.g., www.qrisk.org). With the rise of smartphones and tablets, implementation is currently shifting towards mobile applications (apps). Nevertheless, disseminating inadequately validated risk models has to be avoided.
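As an illustration only, a published logistic model can be wrapped in a small function; every name and coefficient below is a hypothetical placeholder, not a real model:

```python
import math

# Hypothetical published coefficients (placeholders for illustration only).
COEFS = {"intercept": -2.3, "age_per_year": 0.04, "biomarker": 0.8}

def predicted_risk(age: float, biomarker: float) -> float:
    """Return the predicted risk from the (hypothetical) published formula."""
    lp = (COEFS["intercept"]
          + COEFS["age_per_year"] * age
          + COEFS["biomarker"] * biomarker)
    return 1 / (1 + math.exp(-lp))
```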
Conclusion

The development and validation of risk models should be appropriately planned, conducted, presented, and reported. Risk models should only be developed when there is a clinical need, and external validation studies should be given the attention and prominence they deserve. The development of a robust model and an informative validation is not straightforward, with several pitfalls looming along the way. Perfection is impossible, but adhering to current methodological standards is important to arrive at a good model that has the potential to be useful in clinical practice.
Acknowledgments

Disclosure of interest
The authors report no conflict of interest.

Contribution to Authorship
All authors contributed to the conception of the study. LW and BVC performed the literature search and drafted the manuscript. GSC contributed to the content of the drafts and suggested additional literature. All authors read and approved the final manuscript.

Details of Ethical Approval
This study did not require ethical approval because it did not involve any human or animal subjects, nor did it make use of hospital records.

Funding
This work was supported by KU Leuven (grant C24/15/037), Research Foundation Flanders (FWO grant G049312N), Flanders' Agency for Innovation by Science and Technology (IWT Vlaanderen, grant IWT-TBM 070706-IOTA3), and the Medical Research Council (grant G1100513). LW is a PhD fellow of IWT Vlaanderen.
References

1. Kleinrouweler CE, Cheong-See FM, Collins GS, Kwee A, Thangaratinam S, Khan KS, et al. Prognostic models in obstetrics: available, but far from applicable. Am J Obstet Gynecol. 2016; 214: 79-90.e36.
2. Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015; 162: W1-W73.
3. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, NY: Springer, 2001.
4. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, NY: Springer, 2009.
5. Schoorel EN, Vankan E, Scheepers HC, Augustijn BC, Dirksen CD, de Koning M, et al. Involving women in personalised decision-making on mode of delivery after caesarean section: the development and pilot testing of a patient decision aid. BJOG. 2014; 121: 202-9.
6. Mallett S, Royston P, Dutton S, Waters R, Altman DG. Reporting methods in studies developing prognostic models in cancer: a review. BMC Med. 2010; 8: 20.
7. Bouwmeester W, Zuithoff NP, Mallett S, Geerlings MI, Vergouwe Y, Steyerberg EW, et al. Reporting and methods in clinical prediction research: a systematic review. PLoS Med. 2012; 9: 1-12.
8. Collins GS, de Groot JA, Dutton S, Omar O, Shanyinde M, Tajar A, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014; 14: 40.
9. Van Calster B, Van Hoorde K, Valentin L, Testa AC, Fischerova D, Van Holsbeke C, et al. Evaluating the risk of ovarian cancer before surgery using the ADNEX model to differentiate between benign, borderline, early and advanced stage invasive, and secondary metastatic tumours: prospective multicentre diagnostic study. BMJ. 2014; 349: g5920.
10. Schuit E, Kwee A, Westerhuis ME, Van Dessel HJ, Graziosi GC, Van Lith JM, et al. A clinical prediction model to assess the risk of operative delivery. BJOG. 2012; 119: 915-23.
11. Schoorel EN, Melman S, van Kuijk SM, Grobman WA, Kwee A, Mol BW, et al. Predicting successful intended vaginal delivery after previous caesarean section: external validation of two predictive models in a Dutch nationwide registration-based cohort with a high intended vaginal delivery rate. BJOG. 2014; 121: 840-7.
12. Hand DJ. Deconstructing statistical questions. J R Stat Soc Ser A. 1994; 157: 317-56.
13. Geersing GJ, Bouwmeester W, Zuithoff P, Spijker R, Leeflang M, Moons KG. Search filters for finding prognostic and diagnostic prediction studies in Medline to enhance systematic reviews. PLoS One. 2012; 7: e32844.
14. Wyatt JC, Altman DG. Commentary: Prognostic models: clinically useful or quickly forgotten? BMJ. 1995; 311: 1539-41.
15. Vickers AJ, Cronin AM. Everything you always wanted to know about evaluating prediction models (but were too afraid to ask). Urology. 2010; 76: 1298-301.
16. Macleod MR, Michie S, Roberts I, Dirnagl U, Chalmers I, Ioannidis JP, et al. Biomedical research: increasing value, reducing waste. Lancet. 2014; 383: 101-4.
17. Kaijser J, Sayasneh A, Van Hoorde K, Ghaem-Maghami S, Bourne T, Timmerman D, et al. Presurgical diagnosis of adnexal tumours using mathematical models and scoring systems: a systematic review and meta-analysis. Hum Reprod Update. 2014; 20: 449-62.
18. Peat G, Riley RD, Croft P, Morley KI, Kyzas PA, Moons KG, et al. Improving the transparency of prognosis research: the role of reporting, data sharing, registration, and protocols. PLoS Med. 2014; 11: e1001671.
19. Altman DG. The time has come to register diagnostic and prognostic research. Clin Chem. 2014; 60: 580-2.
20. Steyerberg EW, Moons KG, van der Windt DA, Hayden JA, Perel P, Schroter S, et al. Prognosis Research Strategy (PROGRESS) 3: prognostic model research. PLoS Med. 2013; 10: e1001381.
21. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009; 338: b2393.
22. Ambler G, Omar RZ, Royston P. A comparison of imputation techniques for handling missing predictor values in a risk model with a binary outcome. Stat Methods Med Res. 2007; 16: 277-98.
23. Vergouwe Y, Royston P, Moons KG, Altman DG. Development and validation of a prediction model with missing predictor data: a practical approach. J Clin Epidemiol. 2010; 63: 205-14.
24. Liew SM, Doust J, Glasziou P. Cardiovascular risk scores do not account for the effect of treatment: a review. Heart. 2011; 97: 689-97.
25. Ganna A, Reilly M, de Faire U, Pedersen N, Magnusson P, Ingelsson E. Risk prediction measures for case-cohort and nested case-control designs: an application to cardiovascular disease. Am J Epidemiol. 2012; 175: 715-24.
26. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015; 350: g7594.
27. Khudyakov P, Gorfine M, Zucker D, Spiegelman D. The impact of covariate measurement error on risk prediction. Stat Med. 2015; 34: 2353-67.
28. Carroll RJ, Delaigle A, Hall P. Nonparametric prediction in measurement error models. J Am Stat Assoc. 2009; 104: 993-1014.
29. Stiell IG, Wells GA. Methodologic standards for the development of clinical decision rules in emergency medicine. Ann Emerg Med. 1999; 33: 437-47.
30. Wynants L, Timmerman D, Bourne T, Van Huffel S, Van Calster B. Screening for data clustering in multicenter studies: the residual intraclass correlation. BMC Med Res Methodol. 2013; 13.
31. White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011; 30: 377-99.
32. Moons KG, Donders RA, Stijnen T, Harrell FE Jr. Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol. 2006; 59: 1092-101.
33. Janssen KJ, Vergouwe Y, Donders AR, Harrell FE Jr, Chen Q, Grobbee DE, et al. Dealing with missing predictor values when applying clinical prediction models. Clin Chem. 2009; 55: 994-1001.
34. Little RJ, Rubin DB. Statistical Analysis with Missing Data. Hoboken, NJ: John Wiley & Sons, 2002.
35. Steyerberg EW, van der Ploeg T, Van Calster B. Risk prediction with machine learning and regression methods. Biom J. 2014; 56: 601-6.
36. Babyak MA. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med. 2004; 66: 411-21.
37. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996; 49: 1373-9.
38. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996; 15: 361-87.
39. Steyerberg EW, Eijkemans MJ, Habbema JD. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol. 1999; 52: 935-42.
40. Steyerberg EW, Eijkemans MJ, Harrell FE Jr, Habbema JD. Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets. Med Decis Making. 2001; 21: 45-56.
41. Wynants L, Bouwmeester W, Moons KG, Moerbeek M, Timmerman D, Van Huffel S, et al. A simulation study of sample size demonstrated the importance of the number of events per variable to develop prediction models in clustered data. J Clin Epidemiol. 2015.
42. Chatfield C. Model uncertainty, data mining and statistical inference. J R Stat Soc Ser A. 1995; 158: 419-66.
43. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006; 25: 127-41.
44. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY: Springer, 2009.
45. Kruppa J, Liu Y, Biau G, Kohler M, Konig IR, Malley JD, et al. Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J. 2014; 56: 534-63.
46. Van Hoorde K, Van Huffel S, Timmerman D, Bourne T, Van Calster B. A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform. 2015; 54: 283-93.
47. Van Calster B, Condous G, Kirk E, Bourne T, Timmerman D, Van Huffel S. An application of methods for the probabilistic three-class classification of pregnancies of unknown location. Artif Intell Med. 2009; 46: 139-54.
48. Van Holsbeke C, Van Calster B, Bourne T, Ajossa S, Testa AC, Guerriero S, et al. External validation of diagnostic models to estimate the risk of malignancy in adnexal masses. Clin Cancer Res. 2012; 18: 815-25.
49. Austin PC, Tu JV, Ho JE, Levy D, Lee DS. Using methods from the data-mining and machine-learning literature for disease classification and prediction: a case study examining classification of heart failure subtypes. J Clin Epidemiol. 2013; 66: 398-407.
50. Tollenaar N, Van der Heijden P. Which method predicts recidivism best? A comparison of statistical, machine learning and data mining predictive models. J R Stat Soc Ser A. 2013; 176: 565-84.
51. Nemes S, Jonasson JM, Genell A, Steineck G. Bias in odds ratios by logistic regression modelling and sample size. BMC Med Res Methodol. 2009; 9: 56.
52. Royston P, Sauerbrei W. Multivariable Model-Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables. Chichester: John Wiley & Sons, 2008.
53. Naji O, Wynants L, Smith A, Abdallah Y, Stalder C, Sayasneh A, et al. Predicting successful vaginal birth after Cesarean section using a model based on Cesarean scar features examined by transvaginal sonography. Ultrasound Obstet Gynecol. 2013; 41: 672-8.
54. Greenland S. Tests for interaction in epidemiologic studies: a review and a study of power. Stat Med. 1983; 2: 243-51.
55. Bottomley C, Van Belle V, Kirk E, Van Huffel S, Timmerman D, Bourne T. Accurate prediction of pregnancy viability by means of a simple scoring system. Hum Reprod. 2013; 28: 68-76.
56. Ambler G, Seaman S, Omar RZ. An evaluation of penalised survival methods for developing prognostic models with rare events. Stat Med. 2012; 31: 1150-61.
57. Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med. 2014; 33: 517-35.
58. Pepe MS, Janes HE. Gauging the performance of SNPs, biomarkers, and clinical factors for predicting risk of breast cancer. J Natl Cancer Inst. 2008; 100: 978-9.
59. Vickers AJ, Cronin AM. Traditional statistical methods for evaluating prediction models are uninformative as to clinical value: towards a decision analytic framework. Semin Oncol. 2010; 37: 31-8.
60. Steyerberg EW, Pencina MJ, Lingsma HF, Kattan MW, Vickers AJ, Van Calster B. Assessing the incremental value of diagnostic and prognostic markers: a review and illustration. Eur J Clin Invest. 2012; 42: 216-28.
61. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006; 26: 565-74.
62. Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016; 352.
63. Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med. 1999; 130: 515-24.
64. Steyerberg EW, Harrell FE, Borsboom G, Eijkemans MJC, Vergouwe Y, Habbema JDF. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001; 54: 774-81.
65. Morrow DA, Cook NR. Determining decision limits for new biomarkers: clinical and statistical considerations. Clin Chem. 2011; 57: 1-3.
66. Steyerberg EW, Van Calster B, Pencina MJ. [Performance measures for prediction models and markers: evaluation of predictions and classifications]. Rev Esp Cardiol. 2011; 64: 788-94.
67. Leeflang MM, Rutjes AW, Reitsma JB, Hooft L, Bossuyt PM. Variation of a test's sensitivity and specificity with disease prevalence. CMAJ. 2013; 185: E537-44.
68. Cox DR. Two further applications of a model for binary regression. Biometrika. 1958; 45: 562-5.
69. Hosmer DW, Hjort NL. Goodness-of-fit processes for logistic regression: simulation results. Stat Med. 2002; 21: 2723-38.
70. Peek N, Arts DG, Bosman RJ, van der Voort PH, de Keizer NF. External validation of prognostic models for critically ill patients required substantial sample sizes. J Clin Epidemiol. 2007; 60: 491-501.
71. Vickers AJ. Prediction models: revolutionary in principle, but do they do more good than harm? J Clin Oncol. 2011; 29: 2951-2.
72. Mallett S, Halligan S, Thompson M, Collins GS, Altman DG. Interpreting diagnostic accuracy studies for patient care. BMJ. 2012; 345.
73. Balachandran VP, Gonen M, Smith JJ, DeMatteo RP. Nomograms in oncology: more than meets the eye. Lancet Oncol. 2015; 16: e173-e80.
74. Baker SG. Putting risk prediction in perspective: relative utility curves. J Natl Cancer Inst. 2009; 101: 1538-42.
75. Localio AR, Goodman S. Beyond the usual prediction accuracy metrics: reporting results for clinical decision making. Ann Intern Med. 2012; 157: 294-5.
76. Van Calster B, Vickers AJ. Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Making. 2015; 35: 162-9.
77. Collins GS, Moons KG. Comparing risk prediction models. BMJ. 2012; 344: e3186.
78. Pepe MS, Kerr KF, Longton G, Wang Z. Testing for improvement in prediction model performance. Stat Med. 2013; 32: 1467-82.
79. Cook NR, Paynter NP. Performance of reclassification statistics in comparing risk prediction models. Biom J. 2011; 53: 237-58.
80. Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008; 27: 157-72.
81. Leening MJ, Vedder MM, Witteman JC, Pencina MJ, Steyerberg EW. Net reclassification improvement: computation, interpretation, and controversies: a literature review and clinician's guide. Ann Intern Med. 2014; 160: 122-31.
82. Pepe MS, Fan J, Feng Z, Gerds T, Hilden J. The Net Reclassification Index (NRI): a misleading measure of prediction improvement even with independent test data sets. Stat Biosci. 2015; 7: 282-95.
83. Leening MJ, Steyerberg EW, Van Calster B, D'Agostino RB Sr, Pencina MJ. Net reclassification improvement and integrated discrimination improvement require calibrated models: relevance from a marker and model perspective. Stat Med. 2014; 33: 3415-8.
84. Hilden J, Gerds TA. A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index. Stat Med. 2014; 33: 3405-14.
85. Collins GS. How can I validate a nomogram? Show me the model. Ann Oncol. 2015; 26: 1034-5.
86. Van Belle V, Van Calster B. Visualizing risk prediction models. PLoS One. 2015; 10: e0132614.
87. Collins GS, Mallett S, Omar O, Yu LM. Developing risk prediction models for type 2 diabetes: a systematic review of methodology and reporting. BMC Med. 2011; 9: 103.
88. Janssens AC, Ioannidis JP, van Duijn CM, Little J, Khoury MJ. Strengthening the reporting of genetic risk prediction studies: the GRIPS statement. Genome Med. 2011; 3: 16.
89. Konig IR, Malley JD, Weimar C, Diener HC, Ziegler A. Practical experiences on the necessity of external validation. Stat Med. 2007; 26: 5499-511.
90. Siontis GC, Tzoulaki I, Castaldi PJ, Ioannidis JP. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol. 2015; 68: 25-34.
91. Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol. 2005; 58: 475-83.
92. Collins GS, Ogundimu EO, Altman DG. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med. 2016; 35: 214-26.
93. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016.
94. Debray TP, Vergouwe Y, Koffijberg H, Nieboer D, Steyerberg EW, Moons KG. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol. 2015; 68: 279-89.
95. Vergouwe Y, Moons KGM, Steyerberg EW. External validity of risk models: use of benchmark values to disentangle a case-mix effect from incorrect coefficients. Am J Epidemiol. 2010; 172: 971-80.
96. Strobl AN, Vickers AJ, Van Calster B, Steyerberg E, Leach RJ, Thompson IM, et al. Improving patient prostate cancer risk assessment: moving from static, globally-applied to dynamic, practice-specific cancer risk calculators. J Biomed Inform. 2015.
97. Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med. 2004; 23: 2567-86.
98. Janssen KJM, Moons KGM, Kalkman CJ, Grobbee DE, Vergouwe Y. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol. 2008; 61: 76-86.
99. Ankerst DP, Koniarski T, Liang Y, Leach RJ, Feng Z, Sanda MG, et al. Updating risk prediction tools: a case study in prostate cancer. Biom J. 2012; 54: 127-42.
100. Reilly BM, Evans AT. Translating clinical research into clinical practice: impact of using prediction rules to make decisions. Ann Intern Med. 2006; 144: 201-9.
101. Moons KGM, Altman DG, Vergouwe Y, Royston P. Prognosis and prognostic research: application and impact of prognostic models in clinical practice. BMJ. 2009; 338.
102. Stiell IG, Clement CM, Grimshaw J, Brison RJ, Rowe BH, Schull MJ, et al. Implementation of the Canadian C-Spine Rule: prospective 12 centre cluster randomised trial. BMJ. 2009; 339: b4146.
103. Ferrante di Ruffano L, Hyde CJ, McCaffery KJ, Bossuyt PMM, Deeks JJ. Assessing the value of diagnostic tests: a framework for designing and evaluating trials. 2012.
104. Van den Bruel A, Aertgeerts B, Bruyninckx R, Aerts M, Buntinx F, Hall H. Signs and symptoms for diagnosis of serious infections in children: a prospective study in primary care. Br J Gen Pract. 2007; 57: 538-46.
105. Siontis KC, Siontis GC, Contopoulos-Ioannidis DG, Ioannidis JP. Diagnostic tests often fail to lead to changes in patient outcomes. J Clin Epidemiol. 2014; 67: 612-21.
106. Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ. 2005; 330: 765.
Tables/Figure Captions

Table 1. Overview of potential pitfalls when developing or validating risk models.

Figure 1. ROC curve for the ADNEX model to distinguish between malignant and benign lesions. The black line represents the ROC curve for the ADNEX model, the grey diagonal line represents the ROC curve for a model without discriminative ability, and the dots represent the specificity and sensitivity at risk thresholds of 0.03, 0.05, 0.10 and 0.15 (labelled as risk threshold (specificity, sensitivity)).

Figure 2. Calibration plot of the ADNEX model at external validation. Thick line: loess smoother indicating the relation between the predicted probability of malignancy by the ADNEX model and the observed probability. Grey band: 95% confidence interval for the loess smoother. The thin diagonal line indicates perfect calibration. The histogram on the x-axis shows the distribution of predicted probabilities of malignancy, with the frequencies of predicted probabilities for events plotted above and for non-events below the x-axis.

Figure 3. Net Benefit of the ADNEX model at external validation. Dashed line: Net Benefit (NB) of the ADNEX model to distinguish between benign and malignant lesions. Grey line: NB of classifying all tumours as malignant. Black line at zero: NB of classifying all tumours as benign (none as malignant). Dot: NB of ADNEX at a threshold probability of 10%. NB is computed at various risk thresholds. If the probability of malignant disease predicted by ADNEX is higher than the risk threshold, the tumour is classified as malignant. Higher NB values indicate more clinical utility. For example, at a threshold of 10% and compared with classifying no tumours as malignant, the use of ADNEX leads to the equivalent of a net 37 (NB = 0.37) correctly identified malignancies per 100 patients, without an increase in the number of benign lesions classified as malignancies. Moreover, the NB of ADNEX is 0.033 greater than that of assuming all tumours are malignant. This is equivalent to a net 29 (= 0.033 × 100/(10/90)) fewer benign lesions classified as malignancies per 100 patients, compared with classifying all tumours as malignant.