Machine Learning to Predict Bad Functional Outcome After Endovascular Treatment in Patients With Acute Ischemic Stroke

Bouke Postma

Student number: 10358072

Date of final version: August 14, 2018
Master's programme: Econometrics
Specialisation: Big Data Business Analytics
Supervisor: mw. dr. E. Aristodemou
Second reader: dhr. dr. N. P. A. van Giersbergen

Faculty of Economics and Business
Amsterdam School of Economics



Statement of Originality

This document is written by Bouke Mirko Louis Postma, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

Stroke is one of the major causes of death in the Western world. A new endovascular treatment has shown significant improvements and has generated a vast amount of data; in The Netherlands, the results are registered in the MR CLEAN registry. Using this dataset to accurately predict bad outcome after endovascular treatment in patients with ischemic stroke can help physicians improve health care, both by lowering mortality and by improving the emotional well-being of patients and their relatives. Assessing methods to handle missing data in the context of predictive modeling is also relevant for medical and economic reasons. Machine learning methods such as the support vector machine and the random forest may substantially improve predictive performance compared with the current standard in medical research: logistic regression. Therefore, this thesis compares the predictive performance of these three classification methods. Several single imputation methods are also compared with regard to their effect on the predictive performance of the different classifiers.

The classifiers and methods to handle missing data are optimized using (repeated) five-fold cross-validation. Then, the performances are compared. The results of this procedure are in turn checked for robustness using the performance on a hold-out test sample and by implementing different dichotomizations of the outcome.

In this way, it is established that the support vector machine performs significantly best. It is also shown that all classifiers perform best using pairwise deletion of variables with low predictive power, imputation of missing values, and subsequently listwise deletion of incomplete observations. The specific imputation technique that is used is not of significant importance; all imputation methods provide a similar significant improvement when employed in combination with pairwise and listwise deletion.


Contents

1 Introduction
2 Literature Review
  2.1 Clinical background
  2.2 Machine learning in medicine
3 Models
  3.1 Preliminary definitions
  3.2 Preprocessing
    3.2.1 Normalization
    3.2.2 Missing data
  3.3 Classifiers
    3.3.1 Logistic regression
    3.3.2 Random forest
    3.3.3 Support Vector Machine
  3.4 Model selection
    3.4.1 Area under the curve
    3.4.2 Validation method
    3.4.3 Hyper-parameter optimization for classifiers
    3.4.4 Optimization of classifiers and methods to handle missing data
4 Data
  4.1 MR CLEAN registry
  4.2 Descriptive analysis
  4.3 Data cleaning
5 Results
  5.1 Optimization of models without imputation
  5.2 Optimization of models with imputation
  5.3 Comparison of optimized performances
  5.4 Robustness tests
    5.4.1 Test-set performance
    5.4.2 Different threshold
6 Conclusion
A Lexicon
  A.1 Abbreviations
  A.2 List of relevant variables
B Data
  B.1 Numerical variables
  B.2 Categorical variables
  B.3 Missing values
C Optimization
  C.1 Hyper-parameter settings
  C.2 Repeated cross-validation results
D Robustness Checks
  D.1 Dichotomization 0-1 vs 2-6

1 Introduction

Background

Every year, 1-2 out of every 1000 people (Koopman, Van Dis, Vaartjes, Visseren, & Bots, 2014; Saka, Mcguire, & Wolfe, 2009; Truelsen et al., 2006) suffer an acute ischemic stroke, which occurs when an artery to the brain is blocked, resulting in a loss of blood circulation in the corresponding brain area. Half of these persons die or remain permanently disabled, making stroke one of the major causes of death (Benjamin et al., 2017; Wang et al., 2016) and the primary cause of disability in Europe and the United States (Adamson, Beswick, & Ebrahim, 2004).

In recent years there have been major developments in the treatment of ischemic stroke. A new endovascular treatment (EVT), in which the blood clot is removed by a catheter inserted through the groin, shows significant improvement in patient outcome (Goyal et al., 2015; Jovin et al., 2015; Saver et al., 2015). This treatment has therefore become the new standard in the Western world (Nederlandse Vereniging voor Neurologie, 2017; Powers et al., 2018a). To gather more evidence on the (positive) effects of the treatment, new patients have been monitored closely.

Accordingly, the treatment's results have been collected in several registries worldwide, represented in The Netherlands by the multi-center MR CLEAN registry. The main outcome measure of interest of this registry is the functional outcome 90 days after receiving the treatment (O. Berkhemer, 2017). This is measured by the modified Rankin Scale (mRS) of disability, which ranges from 0 (no symptoms of disability) to 6 (dead).

Predicting functional outcome after endovascular treatment

Using the data of this registry, the positive effect of the treatment has been demonstrated, and relevant causal factors for the outcome after treatment have been identified in multiple associative studies. However, so far there has been little research on predictive modeling of the functional outcome, although this is relevant for multiple reasons.

Firstly, a sound prediction could help physicians assess a patient's added risks and benefits of EVT compared to other treatments, such as intravenous thrombolysis. It could also help doctors optimize resources in case of scarcity, by differentiating between more and less promising patients.

Secondly, a well-functioning predictive model can give doctors at remote locations, where EVT is not possible, insight into the possible benefits of transferring a patient to a better equipped hospital. The time between stroke onset and treatment is one of the most important indicators of the functional outcome, and the estimated travel time to a different hospital can be taken into account in a prediction.

Lastly, a prediction may help patients and their families by giving more complete information on the patient's prospects. Effective communication is one of the core elements of modern patient-centered care (Kitson, Marshall, Bassett, & Zeitz, 2013). In addition, there exists a strong positive relation between patient satisfaction and outcome (Woolley, Kane, Hughes, & Wright, 1978).

Therefore, this thesis investigates which classification technique performs best at predicting a bad functional outcome after EVT. The techniques that are discussed are known to perform well on (binary) classification problems; multiple support vector machines with different kernels and random forests are compared with logistic regression, the current standard in the medical literature.

By using more advanced classification techniques and comparing them with logistic regression, this thesis extends previous research. A thorough literature search yields only a very limited number of studies on more advanced predictive modeling of outcome after stroke, and these use datasets that are substantially smaller than the one available for this thesis. Nevertheless, previous research does form a sound basis for this thesis; it has identified some important predictors while investigating the predictive performance of the logistic regression (Venema et al., 2017; Weimar, Ziegler, König, & Diener, 2002).

Missing data

Furthermore, studying the data available for this thesis gives relevant insights into how missing values are optimally treated in the context of predictive modeling. Although missing values are a common problem in medical data, little research has been done on strategies for handling them in order to make better predictions. Most (statistical) literature focuses on handling missing data so as to perform unbiased statistical inference. However, when the main goal is to accurately classify a new occurrence, different approaches may be desirable (Lo, Chernoff, Zheng, & Lo, 2015).

The importance of investigating methods to handle missing data is also supported by economic considerations. An example is cost-effectiveness analysis: when one wants to predict whether a certain intervention will provide good value for money for a patient, using an inappropriate method to handle missing data may lead to misleading results (Faria, Gomes, Epstein, & White, 2014). This could ultimately harm the quality of the decision whether or not to treat a patient.

Besides, it is not only health-care data that suffers from missing values. In general, socio-economic datasets often contain missing values (Schafer & Graham, 2002). Depending on the specific data, there may be several causes for the missingness; it can happen accidentally or due to an underlying reason. If, for example, a company wants to increase its sales by accurately predicting consumer behavior based on questionnaires, it is equally important to investigate methods to handle missing data.

The methods to handle missing data involve the creation of additional dummy variables for categorical variables, pairwise and listwise deletion, and imputation for numerical variables. Different imputation methods are compared in this thesis: mean-mode, median-mode, random sampling, and the k-nearest neighbors algorithm.

Research questions

In summary, improving predictive models of functional outcome after treatment for ischemic stroke is valuable for the quality of health care, and it is important to improve methods to handle missing data in conjunction with predictive modeling. This is motivated by both clinical and economic reasons.

Therefore, this thesis will use the extensive multi-center dataset of the MR CLEAN registry to compare different classifiers based on their out-of-sample predictive performance. Also, the predictive performances following different methods to handle missing data are compared.

Accordingly, the research questions are defined as follows:

• Do random forests and/or support vector machines perform better in predicting bad functional outcome after EVT than the current standard in medical research: logistic regression?

And if so: Which classifier performs best?

• Which method to handle missing data leads to the best performance when predicting bad functional outcome after EVT?

In order to answer these questions, Chapter 2 starts with a literature review, assessing the relevant clinical background and related previous research. Then, in Chapter 3, all relevant models are discussed. First, the models for handling missing data are explained. Then, the classifiers used are elaborated upon. The set-up of the normalization and, finally, the validation of the results are also discussed. Subsequently, Chapter 4 gives an overview of the MR CLEAN data used in this thesis, after which the results are presented in Chapter 5.

2 Literature Review

2.1 Clinical background

Stroke

There are two types of strokes: ischemic and hemorrhagic. The focus of this paper is on the ischemic stroke, which occurs when an artery to the brain is blocked, resulting in an interrupted supply of oxygen and nutrients to brain cells. Deprived of oxygen and nutrients, brain cells can be severely damaged within minutes. Ischemic stroke is the type of stroke that occurs most frequently and is one of the main causes of mortality in the Western world (Benjamin et al., 2017).

An ischemic stroke can occur for multiple reasons, including the spontaneous formation of blood clots in the blocked artery (thrombosis) or a dislodged clot getting stuck in the artery (embolism). These cases are defined as acute ischemic stroke. Other examples of (non-acute) causes of ischemic stroke are drug abuse, traumatic injury to the blood vessels of the neck, or disorders causing blood clotting.

(Acute) ischemic stroke can in turn be subdivided into large- and small-vessel occlusion. Large-vessel occlusion is the term used when blood supply through one of the brain's larger arteries is blocked, while small-vessel occlusion involves the smaller, yet deeper arteries in the brain.

Treatments of ischemic stroke

Until recently, the most common treatment for a stroke was intravenous thrombolysis (IVT). During this treatment, the blood clots blocking the intracerebral artery are dissolved using intravenously administered drugs.

The downside of IVT, however, is that it is only effective when applied within a relatively small time frame between stroke and treatment. It also shows limited efficacy for proximal arterial occlusions (Emberson et al., 2014; Wardlaw, Murray, Berge, & del Zoppo, 2014). In addition, IVT has many contraindications, such as recent surgery, coagulation abnormalities and a history of intracranial hemorrhage (Powers et al., 2018a).

Therefore, a newer, additional treatment has become widely adopted in the United States (Powers et al., 2018b), The Netherlands (Nederlandse Vereniging voor Neurologie, 2017), and the rest of continental Europe (Wahlgren et al., 2016). During this endovascular treatment, physicians try to retract or dissolve the blood clots using a catheter that enters through the groin. This method has proven to be effective for a wider range of patients and within a wider time frame between stroke and treatment than IVT (O. A. Berkhemer et al., 2015; Goyal et al., 2016; Jovin et al., 2015).

Functional outcome after ischemic stroke

Currently, the most prevalent measure of functional outcome in stroke research is the modified Rankin Scale (mRS) (Quinn, Dawson, Walters, & Lees, 2009; Van Swieten, Koudstaal, Visser, Schouten, & Van Gijn, 1988). The mRS is an ordinal scale that ranges between 0 (completely healthy) and 6 (dead), with the steps shown in Table 2.1.

Score  Description
0      No symptoms
1      No significant disability; able to carry out all usual duties and activities
2      Slight disability; unable to carry out all previous activities, but able to look after own affairs without assistance
3      Moderate disability; requiring some help, but able to walk without assistance
4      Moderately severe disability; unable to walk without assistance and unable to attend to own bodily needs without assistance
5      Severe disability; bedridden, incontinent and requiring constant nursing care and attention
6      Dead

Table 2.1: Modified Rankin Scale (mRS)

An overview of all predictors used in this thesis can be found in Table A.2. The relevance of these variables is established in agreement with medical experts and the studies described below.

Predictors of functional outcome

In order to improve the selection of patients with acute ischemic stroke for EVT, Venema et al. (2017) search for characteristics that are significantly related to the mRS at 90 days after EVT. They analyze combined data from the IMS III trial in the US, Canada, Australia and Europe and from the predecessor of the MR CLEAN registry: the MR CLEAN trial.

Using a multivariate logistic regression, the authors find the following predictors to be of significant importance (ordered from high to lower importance): age, baseline NIHSS score, systolic blood pressure, diabetes mellitus, pre-stroke mRS score, ASPECTS, location of occlusion, and collateral score. The effects of EVT also appear weakly better for patients who did not suffer an ischemic stroke before, patients with better collateral scores, and patients with shorter times to groin puncture (Saver et al., 2016). The occurrence of atrial fibrillation does not appear to be of significant importance.

With regard to the predictive performance for the dichotomized mRS score (0-2 versus 3-6), Venema et al. (2017) achieve an area under the curve (AUC) of 0.73. However, they may suffer from overfitting due to the relatively small size of their dataset, the absence of a separate test sample, and the inclusion of (many) interaction terms. The authors also mention calibration problems due to the combination of two different trials, as, e.g., the trials have different inclusion criteria. These issues are resolved in this thesis.

Intracranial hemorrhage

One of the main side effects of treatment for an ischemic stroke is a hemorrhagic stroke. This condition causes significantly higher mortality after treatment (Hao et al., 2017). Functional outcomes and rates of angiographic revascularization are significantly better for EVT than for IVT, and multiple meta-analyses have shown that the mortality and risk of intracerebral hemorrhage within 90 days after treatment do not differ significantly (Badhiwala et al., 2015; Goyal et al., 2016).

Previous research indicates that factors such as the direct results of the treatment, eligibility for the applied drugs, age, gender, NIHSS score, and glucose values correlate with the risk of a future hemorrhage (Kaesmacher et al., 2017; Ntaios, Papavasileiou, Michel, Tatlisumak, & Strbian, 2015). These variables can be of importance for predicting functional outcome after treatment as well, as the risk of a future hemorrhage may be linked to the future functional outcome.

2.2 Machine learning in medicine

Obermeyer and Emanuel (2016) clearly explain the opportunities and risks of using machine learning in medicine. One argument in favor is that machine learning excels at handling a relatively large number of independent variables, as well as an enormous number of observations. Machine learning algorithms are also known to combine predictors in nonlinear and highly interactive ways (Mullainathan & Spiess, 2017).

Besides, Obermeyer and Emanuel (2016) claim that machine learning will vastly outperform the prognostic tools developed so far to assist health professionals in treating patients. This is supported by the fact that current score-based prognostic models have limited complexity and few variables, as they were invented when gathering information was far more labor-intensive than today's automated recording of patient data in, e.g., electronic health records.

On the other hand, the authors stress the need to be cautious about the risk of overfitting, which can lead to overly optimistic estimates of a model's performance. This risk can be mitigated by testing the performance on a strictly separated dataset in addition to using cross-validation procedures.

Also, the authors point out that machine learning, although it often delivers good predictive performance, is not necessarily an appropriate tool for causal inference. This argument is further strengthened by the ’black box’ nature of some algorithms, which makes interpretation in some cases extremely difficult.


Machine learning to predict outcome

Some research has already been done on predicting functional outcome 90 days after EVT using machine learning algorithms more advanced than logistic regression. A good example is the work of Asadi, Dowling, Yan, and Mitchell (2014), who compare a support vector machine and a two-layer feed-forward neural network on a dataset of 107 patients. The outcome measure, mRS after 90 days, was dichotomized as relatively good (mRS < 2) versus bad (mRS ≥ 2). The authors show that the support vector machine performs best and is the most robust to overfitting, with a test-set accuracy of approximately 70 percent.

Another example of a successful implementation of machine learning is given by Bentley et al. (2014). The authors predict the outcome of IVT using CT brain scans, as well as other patient characteristics. This implementation of the support vector machine leads to an AUC of 0.74.

In a more general context, Chu and Hwang (2016) predict clinical deterioration at the wards of multiple hospitals. In this setting, the authors compare the predictive performance of various machine learning methods. The best-performing method is random forest, followed by gradient boosting machine, bagged trees, support vector machine and neural network. All of these methods outperform logistic regression with regards to out-of-sample AUC.

For further research, Asadi et al. (2014) also suggest the use of decision-tree-based algorithms. Chu and Hwang (2016) attribute the success of decision-tree-based algorithms to their flexibility and ability to capture complex interactions in the data.

All in all, machine learning algorithms often outperform logistic regression in terms of predictive performance. Especially random forest and support vector machine are often named as the best-performing algorithms. Therefore, these methods are used in this thesis to predict functional outcome after EVT. The intuition, a more technical description, and some advantages and disadvantages of all methods are given in the following chapter.

3 Models

3.1 Preliminary definitions

This chapter starts with discussing different preprocessing steps, including normalization and methods to handle missing data. Furthermore, the classifiers that are applied in this thesis are discussed. These include regular logistic regression, random forest, and support vector machine (SVM). Then, the methodology used for measuring the predictive performance and validating the respective results is discussed. Lastly, the grid search to find the optimal hyper-parameter settings for the respective classifiers is discussed. All of the calculations that are part of this thesis are executed using R (R Core Team, 2018). Before advancing to the next section, the common notation for all models is quickly introduced below.

Let us first define the scalar $y_i$ and the $(1 \times M)$ vector $x_i'$ as the dependent and independent variables, where $i = 1, \dots, N$, with $N$ the total number of observations and $M$ the number of independent variables. Then $y$ is an $(N \times 1)$ vector and $X$ a matrix of dimension $(N \times M)$.

Predicting bad functional outcome after an ischemic stroke is cast as a dichotomous classification problem, where for all $i \in \{1, 2, \dots, N\}$:

$$y_i = \begin{cases} 1 & \text{with probability } P(\text{mrs} \geq 5) = p_i, \\ 0 & \text{with probability } P(\text{mrs} < 5) = 1 - p_i. \end{cases}$$

This is Bernoulli distributed with the following log-likelihood function:

$$L_N(p) = \sum_{i=1}^{N} \left\{ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right\} \qquad (3.1)$$

3.2 Preprocessing

3.2.1 Normalization

The first preprocessing step that is performed is normalization. The numerical variables in the dataset all have different orders of magnitude and variances. In such cases, normalizing variables before fitting a model has been shown to improve the performance of some models (Ben-Hur & Weston, 2010; Liu, 2005). The computational cost also decreases if the input variables are normalized before the classifiers are trained.


Therefore, all variables are normalized using either min-max normalization or (z-)standardization. Min-max normalization is used for variables that are bounded by design, such as ASPECTS_BL. This method uniformly transforms variables to a pre-specified range; bounded variables are converted to the range $[0, 1]$ with the following formula:

$$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}. \qquad (3.2)$$

Normalization with z-standardization is used for unbounded variables by performing the familiar transformation:

$$x_{norm} = \frac{x - \bar{x}}{\hat{\sigma}}, \qquad (3.3)$$

where $\bar{x}$ is the sample mean and $\hat{\sigma}$ the sample standard deviation of the variable $x$.

It is important to note that the unbounded variables in the test set are standardized using the sample mean and standard deviation derived from the variables in the train set. This is to prevent information leakage from the test set to the train set, as this could result in too optimistic performance measures due to overfitting. The same principle is applied during k-fold cross-validation, which is explained in section 3.4.2.
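To make this concrete, below is a minimal R sketch of both transformations. The data frames train and test and the column names ASPECTS_BL and age are assumptions for illustration, not the thesis code.

# Min-max normalization for a bounded variable, using the train-set bounds
rng <- range(train$ASPECTS_BL, na.rm = TRUE)
train$ASPECTS_BL <- (train$ASPECTS_BL - rng[1]) / (rng[2] - rng[1])
test$ASPECTS_BL  <- (test$ASPECTS_BL - rng[1]) / (rng[2] - rng[1])

# z-standardization for an unbounded variable, reusing the train-set mean and
# standard deviation on the test set to prevent information leakage
mu <- mean(train$age, na.rm = TRUE)
s  <- sd(train$age, na.rm = TRUE)
train$age <- (train$age - mu) / s
test$age  <- (test$age - mu) / s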

3.2.2 Missing data

The second necessary preprocessing step involves dealing with missing data. As becomes clear in section 4.2, the dataset contains many missing values, both in the target variable mrs and in its predictors: approximately two thirds of the observations contain one or more missing values. Multiple approaches for handling missing data are available and discussed in this section, but first it is helpful to understand the three types of missingness that are commonly considered. Let us therefore define the vector $Z$ as the incomplete data, divided into an observed part, $Z_{obs}$, and a missing part, $Z_{mis}$, and let $R$ denote the pattern of missing values. The missingness can then be split into three distinct categories (Rubin, 1976):

• Missing completely at random (MCAR): the pattern of missing values is generated independently of both the observed and unobserved variables:

$$P(R \mid Z_{obs}, Z_{mis}) = P(R) \qquad (3.4)$$

• Missing at random (MAR): the pattern of missing values is not generated completely at random, but can be fully explained by other variables with complete information:

$$P(R \mid Z_{obs}, Z_{mis}) = P(R \mid Z_{obs}) \qquad (3.5)$$

• Missing not at random (MNAR): the missing values cannot be (fully) explained by the distribution of the observed values, but depend on the distribution of the incomplete variable itself:

$$P(R \mid Z_{obs}, Z_{mis}) \neq P(R \mid Z_{obs}) \qquad (3.6)$$


It is important to distinguish between these types of missingness, as different methods of handling missing data rely on assumptions about the underlying reason for missingness. MCAR is always preferred to MAR, which is in turn preferred to MNAR.

When performing causal inference, missing values that are MAR or MNAR often lead to biased results and underestimated standard errors (Donders, van der Heijden, Stijnen, & Moons, 2006). Only when data are MAR can they be imputed without bias, by using multiple imputation (Sterne et al., 2009).

The main idea of multiple imputation is to generate multiple imputed datasets using a Bayesian approach. Missing values are sampled from the variables' predictive distributions based on the observed data. The coefficients are then estimated for each imputed dataset, after which they are averaged and the standard errors can be estimated (Sterne et al., 2009).

Although this method works well in conjunction with logistic regression, it is not applicable to SVM or random forest, as the latter methods do not estimate coefficients that can be combined. Therefore, only single imputation techniques are applied in this thesis.

This may result in biased coefficients for the logistic regression (Sterne et al., 2009). Therefore, we should be cautious when drawing conclusions from the estimated coefficients.

When performing predictive analysis, however, single imputation does not necessarily nega-tively affect the performance of the logistic regression. The effect of different methods to handle missing data on the predictive performance can be estimated and validated by cross-validation and a separate train-test split as explained later in this chapter.

Still, the generalizability of a predictive model may suffer if missing data are MAR or MNAR. Any method to handle missing data for predictive modeling relies on the assumption that the data-generating processes for the observed and/or unobserved values do not change. Although we can validate this assumption to some extent by cross-validation and a separate train-test split, data that are not MCAR may have a negative impact.

The best way to ultimately validate the generalizability is obviously to test the performance of the models optimized in this thesis when predicting functional outcome on an unrelated dataset from, e.g., a different country. If the predictive performance of a model trained on the MR CLEAN dataset is comparable to the predictive performance of the same trained model on an unrelated dataset, the generalizability is further verified.

Adding dummy variables for missing values

The first method to handle missing data involves adding additional dummy variables for categorical variables that contain missing values. This method is fully efficient, as it uses all available information. Even when the variable is missing due to an unobserved reason (MNAR), e.g. a variable is consistently missing at one specific hospital, the dummy will function as an indicator for this underlying reason. If the patient population of that specific hospital then differs from the general population, the dummy can (partly) take on the effects of the unobserved reason.


If the true data-generating process for missingness changes, however, this negatively affects the predictive performance of the model on new data. Therefore, we must rely on the assumption that the data-generating processes of the observed and unobserved values remain constant, or on the assumption that the variables are MCAR.

Listwise deletion

Furthermore, listwise deletion can be performed. That is, all observations that contain at least one missing value are deleted, leaving only complete observations. This is the most common approach when data are assumed to be MCAR and enough complete observations are available. However, especially when only a few values are missing per observation (for many observations), this approach leads to a large loss of information and thus of efficiency. It may also bias the estimated regression coefficients in the case of MAR or MNAR.

Pairwise deletion

The second method to handle missing values, pairwise deletion, is performed by leaving variables with missing values out of the set of predictors. Similar to listwise deletion, this method is inefficient if the variables being left out contain useful information.

However, if certain variables are not important for the predictive performance, the best decision is to leave them out. This also reduces the risk of overfitting. In summary, to optimally perform pairwise deletion we need to find the right balance in the trade-off between information and the number of missing values.

An obvious method to find variables that are useful for prediction is to examine the coefficients of a logistic regression. However, as explained above, those coefficients could be biased if the missing values are not MCAR. In addition, Lo et al. (2015) show that significant variables are not necessarily good predictors. Moreover, logistic regression does not take into account the information gain that follows from the additional complete observations after pairwise deletion.

Therefore, other methods are used to find highly predictive rather than highly significant variables. First, we introduce a greedy algorithm to find the optimal candidates for pairwise deletion. Algorithm 1 iteratively searches for and removes the variable whose removal causes the biggest increase in the number of complete observations.


Algorithm 1 Pseudo-code of the greedy pairwise deletion algorithm for data df

1: function DEFINE_VAR_TO_DROP(df)
2:     Baseline ← # complete observations in df
3:     for var in {variables in df} \ {target} do
4:         df' ← df without variable var
5:         Drop incomplete rows from df'
6:         Improvement ← (# observations in df') − Baseline
7:         Save Improvement for var
8:     end for
9:     return the var with the largest Improvement
10: end function

Although this algorithm is fast and simple, it does not take into account the predictive value of variables. It relies on the very strong assumption that all variables are similarly important for the predictive performance. In that case, the trade-off between dropping information from variables and dropping information from observations can be clearly evaluated by cumulatively removing the resulting variables. If the assumption holds, there is a globally optimal number of variables to drop in the order defined by the greedy algorithm.

Another approach is to remove all possible combinations of variables and evaluate the respective effects on the predictive performance using cross-validation. This approach is computationally very expensive. However, it does not rely on any assumptions, except that the optimal hyper-parameter settings of each classification method are roughly captured by its optimization grid.

Imputation

The final solution to handle missing data is to impute the missing values. Several techniques can be used, such as simply imputing the mean, median or mode of a variable, imputing a randomly drawn value from the observed values, and more advanced clustering methods such as k-nearest neighbors (kNN). All of these methods have their advantages and disadvantages.

First of all, mean-mode and median-mode imputation fill in the mean or median value, respectively, for numerical variables, and the mode for categorical variables (if applicable). The advantage of these techniques is their simplicity and easy interpretability. However, they shrink the variance of a variable, which makes them unsuitable for the purpose of causal inference.

Secondly, random imputation draws a random value from the observed values. This method is simple and intuitive as well, and leaves the variance of the individual variables intact. A disadvantage is that it does not take into account any properties of the patient with missing value(s).


The final method applied in this paper, kNN, does use the patients' individual characteristics. First, it uses the observed values of a patient to find similar patients. Then, the imputed value is derived from the observed values of a predefined number, k, of the most similar patients.

To find which patients are the most similar, the Gower similarity coefficient is used (Gower, 1971). This coefficient is especially suitable for mixed data types, as it accommodates numerical, categorical and ordinal variables. The similarity between patients $i$ and $j$, defined as $S_{ij}$, is calculated as follows:

$$S_{ij} = \frac{\sum_k \delta_{ijk} S_{ijk}}{\sum_k \delta_{ijk}}, \qquad (3.7)$$

where

$$\delta_{ijk} = \begin{cases} 1, & \text{if variable } k \text{ is observed for both patients } i \text{ and } j \\ 0, & \text{otherwise} \end{cases} \qquad (3.8)$$

and, if applicable, the similarity of patients $i$ and $j$ regarding variable $k$ is defined as

$$S_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{r_k}, \qquad (3.9)$$

with $r_k$ the range of variable $k$, and $x_{ik}$ and $x_{jk}$ the values of patients $i$ and $j$ for variable $k$. After the $k$ patients with the highest similarity are found, a patient's missing value(s) can be estimated by performing the aforementioned mean-mode or median-mode imputation on the neighbors' observed values.

There is no theoretically optimal value for the number of neighbors, so this value must be found empirically. Whether to combine the observed values of the k nearest neighbors with mean-mode or median-mode imputation must also be investigated empirically.
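The following R sketch illustrates the idea, using the Gower dissimilarity from the cluster package. The helper knn_impute and its defaults are assumptions for illustration; the thesis does not specify its implementation.

library(cluster)  # daisy() computes Gower dissimilarity for mixed data types

knn_impute <- function(df, k = 5, num_fun = median) {
  d <- as.matrix(daisy(df, metric = "gower"))  # pairwise Gower dissimilarities
  for (i in which(!complete.cases(df))) {
    nn <- setdiff(order(d[i, ]), i)[1:k]       # the k most similar patients
    for (j in which(is.na(df[i, ]))) {
      obs <- df[nn, j][!is.na(df[nn, j])]      # neighbors' observed values
      if (length(obs) == 0) next               # no donor available
      df[i, j] <- if (is.numeric(obs)) {
        num_fun(obs)                           # mean or median for numerical
      } else {
        names(which.max(table(obs)))           # mode for categorical
      }
    }
  }
  df
}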

Missing values in the target variable

Regarding the observations with missing values in the target variable mrs, listwise deletion is the only feasible solution. Imputation of the target variable by any available method could establish artificial dependencies in the data that a predictive model could subsequently capture perfectly; this can be described as a 'self-fulfilling prophecy'. Subsequently measuring the predictive performance would then yield a biased estimate.

On the other hand, listwise deletion may cause a problem if the observations of the target variable are not MCAR. mrs is measured by contacting a patient or his/her caregivers after 90 days. A non-response might imply that the patient has already passed away, making data collection impossible. The population with missing values then has a higher expected value of mrs than the population without missing values. Removing the non-responses may therefore lead to an unrepresentative sample with lower generalizability, as the sample population differs from the general population.

Within the sample, this effect does not present itself through a lower predictive performance during cross-validation (see section 3.4.2), because both train and validation sets only use observations for which mrs is observed. Hence, to establish further generalizability outside the MR CLEAN dataset, validation of the results using external data is necessary.

3.3 Classifiers

3.3.1 Logistic regression

After the preprocessing steps described above are performed, the classifiers can be trained. The baseline classifier used in this thesis is the logistic regression, the most widely applied model in the medical literature (Churpek et al., 2016). It is relatively easy to interpret and by design well suited to dichotomous classification problems.

Estimation of the logistic regression follows from maximizing the log-likelihood in (3.1), where the cumulative logistic distribution function $p_i = F(x_i'\beta)$ is defined as follows:

$$F(x_i'\beta) = P(y_i = 1 \mid x_i) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} = \frac{1}{1 + e^{-x_i'\beta}} \qquad (3.10)$$

The model is based on the assumption that observations are independent over $i$, meaning that there are no duplicate observations. Furthermore, there should be little or no multicollinearity between the independent variables in order for the logistic regression to perform well.
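In R, this baseline model can be fitted with base glm. A sketch; the data frame train and the outcome name mrs_bad are hypothetical:

# Logistic regression by maximum likelihood; predictions are P(y = 1 | x)
fit <- glm(mrs_bad ~ ., family = binomial(link = "logit"), data = train)
p_hat <- predict(fit, newdata = test, type = "response")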

3.3.2 Random forest

The first more advanced machine learning algorithm applied in this thesis is the random forest, an extension of the regular decision tree. Therefore, the decision tree is explained first, after which bagging and random forest are introduced. Then, the advantages and disadvantages of the classifier are discussed.

A decision tree basically translates a binary classification problem into a step-wise series of simple decision rules regarding the predictive variables, in order to partition the data into two distinct subsets.

The regular decision tree is 'grown' as follows. Starting at the top, the possible divisions of all continuous variables (i.e., above or below a given threshold) and categorical variables are examined, so that all possible splits that partition the data are taken into account. The split that results in partitions with the highest purity, and thus distinctive power, is performed first. The purity of a branch $B$ resulting from a $c$-fold split is commonly measured by the entropy function $H(\cdot)$ (Breiman, 2001):

$$H(B) = -\sum_{i=1}^{c} f_i \log_2 f_i, \qquad (3.11)$$

where $f_i$ is the frequency in fold $i$ after the split.

The split that causes the biggest reduction in impurity is chosen and executed. This process repeats until all partitions at the end of the tree (the 'leaves') are completely pure, or until the number of observations in the leaves drops below a user-specified threshold. Then, the set is completely dichotomized.

A simple technique to improve the performance of a decision tree is bootstrap aggregating ('bagging'). Bagging samples randomly from the observations with replacement and trains a decision tree on each sample. After training a specified number of trees, binary predictions for new observations are composed by taking the mode of the predictions of all trained trees.

An extension of bagged decision trees that has been shown to greatly improve the generalization properties, and thus the predictive performance, is to also randomly sample the variables used as input for each tree. This method is called random forest and was introduced by Breiman (2001).

Random forest has proven to be one of the best performing machine learning algorithms for classification to date, despite its relative simplicity. This is due to its good generalization properties (Fernández-Delgado, Cernadas, Barro, & Amorim, 2014) and its robustness to outliers and noise (Breiman, 2001).

Random forest also tends to work well on data with weak inputs and significant class overlap, where it is hard to distinguish between classes, as is the case in medical diagnosis due to the natural complexity of the problem. Random forest is able to combine many weak individual classifiers into a better performing ensemble classifier, as long as their correlation is low (Breiman, 2001).

Practical considerations

The random forest classifiers are fit to the data using the randomForest package of Liaw and Wiener (2002). Most hyper-parameters are held fixed. The minimum size of the end leaves is set to one, the default for (binary) classification problems. The number of trees is set to 500; in a preliminary analysis this proves sufficiently large for the classifications on the MR CLEAN dataset to converge, as is also shown by Figure 3.1. In addition, the computational cost of fitting 500 trees on the dataset is acceptable (approximately 2.5 seconds).


Figure 3.1: Error rates of a random forest trained on the dataset with all observations, with no preprocessing except listwise deletion. The bad outcome error rate (green), good outcome error rate (red) and out-of-bag (OOB) error rate (black) are shown. The latter is calculated using predictions for the observations left out of the bootstrap sample. This example shows that 500 trees are sufficient for the classifier's performance to converge.

Furthermore, the only hyper-parameter that needs to be optimized is the number of variables that are randomly selected for fitting the individual decision trees. Section 3.4.3 gives a detailed description of the optimization process and the different options.

Lastly, the output of a random forest classifier is binary rather than a probability in the range $[0, 1]$, so the AUC cannot be calculated immediately. Hence, we first estimate the probabilities that $y_i = 1$, for all $i \in \{1, 2, \dots, N\}$, by calculating the frequency with which the individual decision trees classify the observation as 1. Then, we can estimate the ROC curve and thus calculate the AUC.
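A sketch of this workflow with the randomForest package; the data frame, the outcome name mrs_bad (a factor with levels 0 and 1), and the tuned value mtry_opt are assumptions:

library(randomForest)

# Fit 500 trees; the outcome must be a factor for classification
rf <- randomForest(mrs_bad ~ ., data = train, ntree = 500, mtry = mtry_opt)

# Vote fractions of the individual trees serve as probability estimates,
# from which the ROC curve and the AUC can be computed
p_hat <- predict(rf, newdata = test, type = "prob")[, "1"]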

3.3.3 Support Vector Machine

The last model, the SVM, is completely different from logistic regression and random forest. SVMs are based on the idea that any dataset, given the right mapping to an appropriate dimension with the right function, can be successfully separated by a hyperplane. To do so, an SVM first implicitly maps the input variables to an arbitrary (higher) dimension, either linearly or non-linearly, through the use of a kernel. Then, it constructs an optimal separating hyperplane based on a number of support vectors, by maximizing the margin between the transformed observations and the hyperplane.

The crux of the SVM is that it does not require the actual mapping to a possibly infinite-dimensional space, while it still benefits from the flexibility of the higher-dimensional parameter space. It is therefore said that SVMs circumvent the 'curse of dimensionality': they construct hyperplanes in very high-dimensional spaces while remaining computationally tractable and resistant to overfitting (Burges, 1998).

Let us now briefly discuss the technicalities of the SVM, as defined by Cortes and Vapnik (1995). The SVM finds an explicit boundary separating samples that belong to two different classes, labeled $+1$ and $-1$; in essence, it solves a binary classification problem. A linear SVM separation function is found by solving the following quadratic programming problem:

$$\min_{w,\,\xi,\,b} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i(w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, N \qquad (3.12)$$

where $b$ is the bias, $C$ the cost coefficient, and the $\xi_i$ are the slack variables that measure the level of misclassification of each sample $x_i$ in the case of non-separable data. This linear SVM can be generalized to nonlinear transformations by rewriting it as its dual problem and substituting the inner product by a kernel:

$$\min_{\lambda} \;\; \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j K(x_i, x_j) - \sum_i \lambda_i \quad \text{s.t.} \quad \sum_i \lambda_i y_i = 0, \quad 0 \leq \lambda_i \leq C, \quad i = 1, \dots, N \qquad (3.13)$$

where the $\lambda_i$ are Lagrange multipliers.

Examples of popular kernels that are used and compared in this thesis are the linear kernel:

$$K(x_i, x_j) = x_i' x_j, \qquad (3.14)$$

the polynomial kernel:

$$K(x_i, x_j) = (\alpha x_i' x_j + c)^d, \qquad (3.15)$$

and the (Gaussian) radial basis function:

$$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0. \qquad (3.16)$$

Empirical evidence must show which of the three kernels produces the best results on the MR CLEAN dataset, as this is highly problem-specific.

Note, however, that the linear kernel can be shown to be a degenerate version of the Gaussian kernel, such that the latter should always perform at least as well as the linear kernel when correctly optimized (Huang & Lin, 2016). Still, the linear kernel may show similar performance while being much more straightforward and faster to optimize: its only hyper-parameter is the cost C, while the Gaussian kernel has two hyper-parameters that must be optimized.

The biggest limitations of SVMs are the large influence of the chosen kernel and its respective hyper-parameter(s), the speed when dealing with very large datasets, and the relatively poor handling of discrete data (Burges, 1998). Although the dataset at hand is not very large, finding the best kernel and parameters is certainly a relevant issue. Moreover, the dataset includes many discrete or categorical variables that may be relevant for predicting the outcome; with the right rescaling, however, this problem can be circumvented (Burges, 1998).

Practical considerations

SVMs are estimated using the kernlab package of Karatzoglou, Smola, Hornik, and Zeileis (2004). The required set of hyper-parameters differs per classifier. The different options and the optimization method are discussed in detail in section 3.4.3.

Similar to random forest, the SVM does not output a value in the range $[0, 1]$ but a classification. Therefore, the probabilities must be estimated. This is done using Platt's probabilistic outputs with three-fold cross-validation (Lin, Lin, & Weng, 2007; Platt, 1999).
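A sketch with kernlab, where the tuning values and names are placeholders: setting prob.model = TRUE makes ksvm fit Platt's probabilistic outputs, so that class probabilities can be predicted.

library(kernlab)

svm_fit <- ksvm(mrs_bad ~ ., data = train,
                kernel = "rbfdot",           # Gaussian radial basis kernel
                kpar = list(sigma = 0.001),  # placeholder kernel width
                C = 10,                      # placeholder cost coefficient
                prob.model = TRUE)           # Platt scaling for probabilities
p_hat <- predict(svm_fit, newdata = test, type = "probabilities")[, "1"]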

3.4 Model selection

The goals of this thesis are to compare the predictive performances of the different classifiers after optimizing their respective hyper-parameter settings, as well as to examine the effects of different methods to handle missing data. Therefore, a framework is required to validly measure and compare the out-of-sample predictive performance of different models.

Accordingly, this section first introduces the measure of predictive performance. Then, the validation procedures are explained. Lastly, the optimization methods are given.

3.4.1 Area under the curve

The performance measure used in this thesis is the AUC, the most widely used measure for predictive models in medicine. The main reason to use the AUC instead of, e.g., the accuracy is the frequent presence of class imbalance, which makes prediction accuracy inappropriate. This becomes especially clear in a situation where, e.g., only 10% of all patients have a certain condition: simply classifying all new patients as 'healthy' then yields a very high accuracy of 90%, yet this is clearly not a desirable predictor. The AUC is robust to such class imbalances.
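For instance, with predicted probabilities p_hat and the observed outcomes y of a validation fold, the AUC can be computed in R with, e.g., the pROC package (the thesis does not name the package it uses; this is one common choice):

library(pROC)

# AUC: the probability that a randomly chosen 'bad outcome' patient receives
# a higher predicted probability than a randomly chosen 'good outcome' patient
auc_value <- auc(roc(response = y, predictor = p_hat))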

3.4.2 Validation method

To obtain a valid estimate of the out-of-sample AUC, the dataset is split into a train and a test set before (repeated) cross-validation is performed. This (80/20) split is carried out before any preprocessing step, such as normalization or imputation. The test set remains untouched until the optimization of the models based on the train set is finalized. In the end, this set is used to calculate predictive performance measures and compare them with the cross-validation measures, which lets us validate the cross-validation results (see section 5.4.1).


Now, (repeated) five-fold cross-validation is performed on the train set. The basic intuition behind cross-validation is that an unbiased estimate of the out-of-sample predictive performance can be obtained by averaging the predictive performances over several different train and validation splits. This is done by splitting the train set into five equal folds, where each fold is isolated once to function as validation set. Each time, the models are trained using the remaining observations, and the resulting trained classifier is used to predict the outcome in the validation set. The predictions and the true values of the validation set can then be compared using a performance measure of choice.

Figure 3.2: Diagram explaining the cross-validation procedure

Subsequently, the performance measures are averaged over all five folds, leaving one estimate of the test performance for every model. This cross-validation performance is used to compare and optimize the different classifiers and their respective hyper-parameter settings, as well as to compare the different methods to handle missing data. A diagrammatic explanation of the cross-validation process is given in Figure 3.2.

To obtain a more robust estimate, the cross-validation procedure can be repeated after shuffling the train set. Repeated cross-validation gives estimators with lower variance, and it outperforms other validation methods (Kim, 2009). However, as it is computationally expensive, performing repeated cross-validation is not always feasible.
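A bare-bones R sketch of one round of five-fold cross-validation, using the logistic regression as example classifier and the pROC functions from the previous sketch; the data frame and outcome name are assumptions:

set.seed(1)  # reproducible fold assignment
folds <- sample(rep(1:5, length.out = nrow(train)))

cv_auc <- sapply(1:5, function(k) {
  fit <- glm(mrs_bad ~ ., family = binomial, data = train[folds != k, ])
  p   <- predict(fit, newdata = train[folds == k, ], type = "response")
  as.numeric(auc(roc(train$mrs_bad[folds == k], p)))  # fold-specific AUC
})
mean(cv_auc)  # cross-validation estimate of the out-of-sample AUC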

3.4.3 Hyper-parameter optimization for classifiers

As seen in section 3.3, the different classification methods have several hyper-parameters that influence their predictive performance. This poses a challenge when we want to compare the predictive performance of the various classifiers and methods to handle missing data on an equal footing: a fair comparison can only be made when each classifier's hyper-parameters lead to a performance that is close to optimal. Therefore, we perform grid searches to find the optimal hyper-parameters for the random forest and support vector machine classifiers.

The ranges of values over which the grid search is performed are defined manually for all classifier-specific hyper-parameters. First, all models are estimated and validated through five-fold cross-validation for a wide range of hyper-parameters, which gives insight into the appropriate orders of magnitude. Then, the final grids are defined in such a way that all observed optimal hyper-parameters fall within the grids, with a margin added in case new situations have different optimal settings. During this preliminary analysis, different methods to handle missing values are included.

A complete overview of all values used to form the classifiers' respective hyper-parameter grids can be found in Table C.1 in the appendix. To illustrate: the hyper-parameter settings tried for the Gaussian SVM are $C \in \{10^{1.5i} \mid i = 1, \dots, 8\}$ and $\sigma \in \{0.0005j \mid j = 1, 2, \dots, 5\}$. This leads to $5 \cdot 8 = 40$ models that need to be estimated and compared using five-fold cross-validation.

Finally, the choice of the grid values is validated throughout the process by monitoring the optimal hyper-parameter values: preferably, a hyper-parameter's estimated optimal values are frequently located near the median of its grid range rather than near its outer bounds.
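Schematically, the grid search for the Gaussian SVM can be written as follows. This is a sketch: the grid values follow the (reconstructed) example above, and cv_auc() is a hypothetical helper returning the five-fold cross-validation AUC for one hyper-parameter setting.

# expand.grid yields all 40 combinations of the two hyper-parameters
grid <- expand.grid(C = 10^(1.5 * (1:8)), sigma = 0.0005 * (1:5))

grid$auc <- apply(grid, 1, function(g) {
  cv_auc(train, C = g[["C"]], sigma = g[["sigma"]])
})
best <- grid[which.max(grid$auc), ]  # setting with the highest CV AUC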

3.4.4 Optimization of classifiers and methods to handle missing data

In order to compare the performances of the different methods for missing values and all classification methods, we perform $r = 100$ times repeated $K = 5$-fold cross-validation on the train set. The entire procedure is captured in Algorithm 2 below.

Under normality assumptions, the results of this procedure can be combined with a two-sample t-test to find the model with the best out-of-sample AUC when the population equals the train set. It can also be assessed per classifier which method to handle missing data performs best, and vice versa. The normality assumptions are tested using the adjusted Jarque-Bera test (Urzúa, 1996).

Note that the test and cross-validation sets only contain complete cases after pairwise removal. Accordingly, all models and their respective hyper-parameter settings are optimized based on the predictive performance on complete cases only. A problem may occur when the data are not MCAR: the population of complete cases may then differ from the population of incomplete cases, and testing and cross-validation may yield overestimated AUC values that are not representative of the predictive performance on the general population.

This problem is inevitable during the repeated cross-validation process: it is computationally infeasible to also perform imputation on the validation set, because every single observation must be imputed using only the train data. This is especially demanding in the case of kNN imputation, as searching the k nearest neighbors in the train data for every patient with missing values in the new sample is an exhaustive process.

The reason that every observation in the validation set must be imputed using only train data is that this approximates the real-life scenario: new patients are admitted to the hospital one by one, so possible missing values can only be imputed using patients admitted before that time. Thus, an out-of-sample prediction can only use the train data.

Nonetheless, it is computationally feasible to correctly impute the test set. Therefore, the predictive performance on imputed test sets is investigated once, as a robustness check, in section 5.4.

Algorithm 2 Pseudo-code of the repeated cross-validation over all imputation and estimation methods

1: function MULTICOMPARE(data, imp_set, est_set, hyper_set, r, K)
2:     ▷ data = raw data, imp_set = set of imputation methods, est_set = set of estimators, hyper_set = set with all estimator-specific hyper-parameter grids
3:     for i ← 1 : r do
4:         Shuffle data
5:         for k ← 1 : K do
6:             validation_set ← fold k of data
7:             train_set ← data \ validation_set
8:             Normalize both sets using only data from train_set
9:             for imp_method in imp_set do
10:                Perform pairwise deletion on train_set and validation_set
11:                Impute train_set using imp_method    ▷ Can be costly
12:                for estimator in est_set do
13:                    grid ← estimator-specific hyper-parameter grid from hyper_set
14:                    for hyperparameters in grid do    ▷ Expensive step
15:                        Train estimator using hyperparameters on train_set
16:                        Store AUC of trained estimator when predicting validation_set
17:                    end for
18:                end for
19:            end for
20:        end for
21:        Calculate average performance measure per estimator per setting over folds
22:        Store max performance measure for all estimators
23:    end for
24:    return an (r × length(est_set) × length(imp_set)) array containing all maximal performance measures for the different imputation and estimation methods
25: end function

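A condensed, runnable Python analogue of Algorithm 2 is sketched below, assuming scikit-learn (the thesis's own code is not shown). It uses synthetic data, two imputation methods, and two estimators, and omits the pairwise-deletion and grid-search steps for brevity; as in the algorithm, normalization and imputation use the train folds only and validation is restricted to complete cases.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import KNNImputer, SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan  # inject missingness

    imputers = {"mean": SimpleImputer(), "knn": KNNImputer(n_neighbors=5)}
    estimators = {"logit": LogisticRegression(max_iter=1000),
                  "rf": RandomForestClassifier(n_estimators=100, random_state=0)}
    r, K = 5, 5  # the thesis uses r = 100 repeats
    scores = {(i, e): [] for i in imputers for e in estimators}

    for rep in range(r):
        cv = StratifiedKFold(n_splits=K, shuffle=True, random_state=rep)
        for tr, va in cv.split(X, y):
            scaler = StandardScaler().fit(X[tr])       # normalize on train data only
            X_tr, X_va = scaler.transform(X[tr]), scaler.transform(X[va])
            keep = ~np.isnan(X_va).any(axis=1)         # validate on complete cases only
            for iname, imp in imputers.items():
                X_imp = imp.fit_transform(X_tr)        # impute the train fold only
                for ename, est in estimators.items():
                    est.fit(X_imp, y[tr])
                    p = est.predict_proba(X_va[keep])[:, 1]
                    scores[(iname, ename)].append(roc_auc_score(y[va][keep], p))

    for key, vals in scores.items():
        print(key, round(float(np.mean(vals)), 3))     # mean CV AUC per combination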

4.1 MR CLEAN registry

The Multicenter Randomised Controlled Trial of Endovascular Treatment for Acute Ischemic Stroke in the Netherlands (MR CLEAN) registry includes 1488 patients from 16 different centers who all received EVT (including stent retriever thrombectomy, aspiration, and alternative methods) for acute ischemic stroke within 6.5 hours from symptom onset, between March 2014 and June 2016. It is an ongoing, prospective, observational study in all centers that perform EVT in the Netherlands.

The primary outcome measure in this paper is the mRS score. The dataset furthermore consists of 45 variables, including information about previous diseases, smoking, baseline blood pressure, mRS, medication use, and treatment-specific parameters, e.g. the time between stroke onset and treatment. Please refer to Table A.2 for the complete list of variables.

Inclusion criteria

In comparison with other clinical trials that monitor EVT, the MR CLEAN registry has the most broadly defined inclusion criteria. This is a good argument in favor of the external validity of results derived from the data. However, we should keep in mind that physician judgment most likely influenced patient enrollment (Jansen, Mulder, & Goldhoorn, 2018). This could negatively affect the generalizability of the classifiers trained in this thesis.

4.2 Descriptive analysis

To gain preliminary insight into and understanding of the data, some descriptive statistics are shown. Please refer to Table 4.1 for a summary of the categorical variables, and to Table 4.2 for summary statistics of the numerical variables.


Variable                 Observed  # mrs=1  # mrs=0  NAs  NA/total  # cat  Categories
smoking                  1145      325      719      343  0.23      3      No, Yes, Unknown
mrs dich                 1363      470      893      125  0.08      2      No, Yes
ct bl is                 1377      421      838      111  0.07      2      No, Yes
intracranvma c           1378      430      829      110  0.07      2      Absent, Present
collaterals              1381      431      834      107  0.07      4      absent collaterals, filling <50% of occluded area, >50% but <100%, 100% of occluded area
ct bl ht                 1394      429      847      94   0.06      2      No, Yes
intracranath c           1398      438      843      90   0.06      2      Absent, Present
ct bl leuk               1402      435      847      86   0.06      2      No, Yes
ct bl has                1403      431      853      85   0.06      2      No, Yes
occlsegment c            1408      438      851      80   0.05      10     None, Intracranial ICA, Tandem occlusion, ICA-T, Proximal M1, Distal M1, M2, M3, A1, A2
ct bl old                1414      443      851      74   0.05      2      No, Yes
cbs occlsegment recoded  1422      442      859      66   0.04      9      ICA-T, Proximal M1, Distal M1, Intracranial ICA, M2, M3, A1, A2, None
prev hc                  1441      454      867      47   0.03      2      No, Yes
med statin               1457      461      874      31   0.02      2      No, Yes
prev mi                  1459      459      882      29   0.02      2      No, Yes
prev pad                 1460      458      880      28   0.02      2      No, Yes
med hdu                  1460      462      874      28   0.02      2      No, Yes
med noa                  1462      464      876      26   0.02      2      No, Yes
prev af                  1466      462      883      22   0.01      2      No, Yes
prev ht                  1469      464      881      19   0.01      2      No, Yes
med apt                  1469      466      880      19   0.01      2      No, Yes
med hep                  1469      465      881      19   0.01      2      No, Yes
med cou                  1477      468      885      11   0.01      2      No, Yes
prev str                 1479      465      891      9    0.01      2      No, Yes
prev dm                  1479      466      889      9    0.01      2      No, Yes
ivtrom                   1485      469      891      3    0.00      2      No, Yes
sex                      1488      470      893      0    -         2      Female, Male
wkend                    1488      470      893      0    -         2      Weekday, Weekend
offhours                 1488      470      893      0    -         2      No, Yes
refhospital              1488      470      893      0    -         2      No, Yes

Table 4.1: Summary statistics of categorical data, ordered by number of missing values


Variable    min   max   mean   sd    n     NAs  NA/total
crp         0     471   15.8   36.8  1187  301  0.20
INR         0.8   4.7   1.2    0.4   1214  274  0.18
CBS BL      0     10    5.7    2.5   1238  250  0.17
trombo      9     866   251.9  91.8  1301  187  0.13
creat       33    845   86.0   34.8  1302  186  0.13
glucose     3.1   26.5  7.4    2.6   1316  172  0.12
mrs         0     6     3.5    2.0   1363  125  0.08
ASPECTS BL  0     10    8.1    2.0   1423  65   0.04
rr dias     34    148   81.6   15.7  1441  47   0.03
gcs         3     15    12.6   2.3   1443  45   0.03
rr syst     68    245   149.5  24.5  1446  42   0.03
NIHSS BL    2     42    15.5   6.0   1458  30   0.02
premrs      0     5     0.7    1.2   1461  27   0.02
age         18    99    68.6   14.4  1488  0    0.00
togroin     48    390   215.0  72.3  1488  0    0.00

Table 4.2: Summary statistics of numerical data, ordered by number of missing values

The histograms in Figure B.2 and Figure B.3 are also of interest. There, the occurrences of all variables are plotted, distinguishing between patients with a good, bad, and unknown outcome. This yields a few interesting insights.

Firstly, the distribution of age given a bad outcome appears to differ from the distribution given a good outcome: the mean given a bad outcome appears to be higher. The same seems to hold for NIHSS BL and togroin. With regard to the categorical variables, the distributions conditional on the outcome seem to differ for most variables. This is a first positive sign for the relevance of such variables.

Secondly, the laboratory variables INR, creat and crp appear to have highly skewed distributions with some extreme outliers. This is something to keep in mind during further analysis.

Missing data

As visible in Table 4.1 and Table 4.2, the dataset contains many missing values, both in the target variable mrs and in its predictors. The variables smoking, CBS BL, mrs and the laboratory variables crp, INR, trombo, and creat all contain over 10% missing values. Approximately two thirds of the observations are incomplete.

By visualizing the occurrences of missing values in the data, it becomes clear that the missingness of observations is not strictly coincidental: variables are often missing together with other variables. For instance, the variables related to the non-contrast CT scans are often missing together with the variables related to the CTA scan. Also, the laboratory parameters INR, trombo, creat, crp and glucose are often missing together. The variables containing information about statin use are also frequently missing jointly. Finally, when one variable about a patient's medication use is missing, the others are missing as well. Please refer to appendix B.3 for detailed visualizations of the missing values using the VIM package (Kowarik & Templ, 2016).
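As a rough numerical counterpart to such visualizations, the joint missingness structure can also be tabulated directly; a minimal pandas sketch with illustrative toy values:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"INR":    [1.1, np.nan, 1.3, np.nan],
                       "trombo": [250.0, np.nan, 260.0, np.nan],
                       "age":    [70.0, 65.0, np.nan, 80.0]})
    miss = df.isna().astype(int)
    print(miss.T @ miss)  # counts of how often variables are missing together
    print(miss.corr())    # correlation between the missingness indicators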


Using the same VIM package, we can create scatter-plots of the missing values of the target variable mrs against all other numerical variables, as seen in Figure 4.1. The distributions of the missing mrs values, represented by the red lines on the diagonal, are very similar to the distributions of the observed mrs values. It is important that the target variable is MCAR or MAR, as explained in section 3.2.2, and this appears to be the case.
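A crude tabular analogue of this visual check is to compare a predictor's distribution between patients with observed and with missing mrs; the sketch below uses synthetic values loosely matching Table 4.2:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"age": rng.normal(68.6, 14.4, 200),
                       "mrs": rng.choice([0, 1, 2, 3, np.nan], size=200)})
    # Similar summary rows for the two groups are consistent with MCAR/MAR.
    print(df.groupby(df["mrs"].isna())["age"].describe())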

A practical reason for the missingness could be human error during the registration of the data. It is possible that some of the persons involved in the patients' treatments (almost) never fill in all the required information in the electronic patient record (EPR). Physicians, for example, are not obliged to fill in a treatment-specific form with mandatory input fields; instead, they are asked to report their findings in the general EPR used by their respective employers. This way, some persons may structurally forget to register certain properties that are important for the clinical trial. Another source of missingness could be mistakes during the encoding of the written records from the different EPR systems into the MR CLEAN database. This process is done by hand, so human error may cause relevant information to be left out.

Figure 4.1: Missing values of target variable mrs plotted against other (observed) numerical variables; distributions of missing values (red) and observed values (blue) on the diagonal


4.3 Data cleaning

Apart from the binary variable ivtrom, where some missing values are denoted by a 9 instead of being left empty, no data cleaning steps have to be performed.
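This single recoding step might look as follows in pandas (a sketch; the thesis's own code is not shown):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"ivtrom": [0, 1, 9, 1]})
    df["ivtrom"] = df["ivtrom"].replace(9, np.nan)  # sentinel 9 becomes a missing value
    print(df)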


5.1 Optimization of models without imputation

As a first result, the models are trained using only the complete observations. If all possibly relevant variables are included, the full train set consists of 448 complete observations on 62 variables after one-hot encoding. The hyper-parameters of the random forest and SVM classifiers are optimized by a grid search.

Firstly, we add an additional dummy representing 'Unknown' for all categorical variables with missing values, resulting in 562 complete observations on 86 variables after one-hot encoding. As seen in Figure 5.1, this improves the performance of all classification methods except logistic regression.

The decrease in performance for the logistic regression may be explained by overfitting: adding more variables adds more coefficients and thus increases the risk of overfitting. In addition, logistic regression is the only classifier that is not protected against overfitting. Random forest is protected against overfitting by design (Breiman, 2001), and for the SVM classifier, overfitting is penalized through the C parameter, which is optimized using cross-validation.

Figure 5.1: Model performances before and after adding extra dummy variables for categorical variables with missing values. AUC is the average over 100 repetitions of five-fold cross-validation using grid search for model optimization. The upper and lower error bars represent one standard deviation.
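The 'Unknown' dummy construction can be sketched as follows, assuming pandas (the variable name is taken from Table 4.1; the thesis's own code is not shown):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"smoking": ["Yes", np.nan, "No", np.nan]})
    df["smoking"] = df["smoking"].fillna("Unknown")  # missingness becomes a category
    print(pd.get_dummies(df, columns=["smoking"]))   # adds a smoking_Unknown dummy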


Pairwise deletion

As a second improvement, pairwise deletion is performed on the numerical variables. Table 5.1 shows the sequence in which the first ten variables should be deleted according to the greedy algorithm defined in section 3.2.2. The results are globally in line with the prevalence of missing values per variable, as seen in Tables 4.1 and 4.2.

Deleted variable  Improvement  Complete observations
None              0            562
CBS BL            88           650
INR               65           715
crp               72           787
ASPECTS BL        31           818
glucose           30           848
trombo            31           879
creat             114          993
gcs               37           1030
NIHSS BL          20           1050

Table 5.1: Best ten candidates for iterative pairwise deletion according to the greedy algorithm. This algorithm iteratively chooses which variable to drop, based on the respective increase in complete observations after the drop.
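A runnable sketch of this greedy procedure, on synthetic data (the function name and toy variables are illustrative):

    import numpy as np
    import pandas as pd

    def greedy_pairwise_deletion(df, n_steps):
        kept, order = list(df.columns), []
        for _ in range(n_steps):
            base = len(df[kept].dropna())
            # Gain in complete observations from dropping each remaining variable.
            gains = {c: len(df[[k for k in kept if k != c]].dropna()) - base
                     for c in kept}
            best = max(gains, key=gains.get)  # variable with the largest gain
            order.append((best, gains[best]))
            kept.remove(best)
        return order

    rng = np.random.default_rng(0)
    toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
    toy[toy.abs() > 1.5] = np.nan  # inject missing values
    print(greedy_pairwise_deletion(toy, 2))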

Training the models after cumulatively deleting the first through ninth variables according to the greedy algorithm does not necessarily improve the performance of the models. As visible in Figure 5.2, removing the first variable, CBS BL, results in a small increase in AUC for all models. However, additionally removing INR causes the AUC to decline. Removing crp and ASPECTS BL does not have a large effect, while removing glucose causes the performances to decrease again. When trombo and creat are removed, the performance increases. Removing gcs and NIHSS BL again results in a strong decline in performance for all models.


Figure 5.2: Model performances for cumulative pairwise deletion of variables according to the greedy algorithm. AUC is the average over 100 repetitions of five-fold cross-validation using grid search for model optimization.

These results are not in line with our expectations under the assumption that all variables are of similar importance. Under this assumption, we expect the AUC to be a concave function of the number of variables dropped according to the greedy algorithm, which would yield a clear global optimum. As this is not the case, the assumption that all variables are similarly important for the predictive performance may be invalid.

Therefore, a brute force approach is applied as well, to investigate whether variables should be deleted and, if so, which ones. Using this method, all 2047 possible combinations of the numerical variables are dropped once, and the resulting average performances over all optimized models are compared. Note that the cross-validation process for all possible combinations is repeated only once, due to the extensive computational cost of a single iteration (approximately 15 hours). As an example, Figure 5.3 shows the thirty best results of this experiment.

Observing which variables appear most frequently in, e.g., the best-performing i ∈ {10, 20, ..., 200} combinations shows a consistent result: pairwise deletion is best performed on the following variables, ordered by decreasing importance: INR, crp, creat, trombo, CBS BL, rr syst, rr dias, and ASPECTS BL. From now on, this is regarded as the best possible order in which to perform pairwise removal, as it takes the most information into account.
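The enumeration behind this brute-force search is straightforward; the sketch below (generic variable names, with the costly model evaluation stubbed out) reproduces the count of 2047 subsets:

    from itertools import combinations

    variables = [f"x{i}" for i in range(11)]  # 11 candidate numerical variables
    subsets = [c for r in range(1, len(variables) + 1)
               for c in combinations(variables, r)]
    print(len(subsets))  # 2^11 - 1 = 2047 non-empty subsets, each dropped and scored once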


Figure 5.3: Top 30 CV AUC values, averaged over all algorithms, after pairwise deletion of different combinations of variables

Variable    Missing values  Greedy  Brute force
crp         1               3       2
INR         2               2       1
CBS BL      3               1       5
trombo      4               6       4
creat       5               7       3
glucose     6               4       9
ASPECTS BL  7               5       8
rr dias     8               10      7
gcs         9               8       11
rr syst     10              11      6
NIHSS BL    11              9       10
premrs      12              12      12

Table 5.2: Order in which the variables should be deleted according to the absolute number of missing values, the greedy algorithm, and the brute force approach

It is interesting to note that this optimal order of pairwise deletion is not always in line with the absolute number of missing values per variable, nor with the outcome of the greedy algorithm. Both CBS BL and glucose turn out to be less beneficial candidates for removal than expected based solely on their numbers of missing values. This implies that both variables have relatively high predictive power when observed. By the same reasoning, we can state that creat has relatively low predictive power when observed.

5.2 Optimization of models with imputation

Optimizing pairwise deletion combined with kNN imputation

To find the combination of pairwise deletion and kNN imputation that leads to the best predictive performance, multiple options are compared. The classification methods are optimized using grid-search hyper-parameter tuning with five-fold cross-validation on the train data. Again, cross-validation is performed only once due to high computational costs.

Now, the remaining optimization problem can be narrowed down to the following two hyper-parameters:


• The number of variables that are deleted, in the order INR, crp, creat, trombo, CBS BL, rr syst, rr dias, and ASPECTS BL

• The number of neighbors that are used for kNN imputation

Looking at the results in Figure 5.4, it becomes clear that the classifiers trained on data imputed with kNN perform best when only INR, crp and creat are removed. Therefore, further analysis uses pairwise deletion of these three laboratory values.

It is unclear, however, which number of neighbors is optimal. Therefore, two numbers of neighbors are analyzed further: eight and thirteen, which both seem to perform comparably well. This restriction is necessary because performing repeated cross-validation on all possible numbers of neighbors in the next section would be computationally intractable.

Figure 5.4: Average five-fold CV AUC for different numbers k of neighbors in kNN imputation and different numbers of deleted variables. Classifiers are optimized using grid search.

Optimizing pairwise deletion combined with other imputation methods

The optimization problem for the other imputation methods is more straightforward, as only one hyper-parameter exists: the number of variables that are deleted. The analysis can be performed using the results in Figure 5.5. Consistent with the kNN imputation, removing the
