Dealing with the Temporal Structure of Routine Primary Care Data

MSc Thesis (Afstudeerscriptie)

written by Ioannis Pantazis

(born June 3rd, 1993 in Athens, Greece)

under the supervision of Mr Hine van Os and Dr Mark Hoogendoorn, and submitted to the Board of Examiners in partial fulfillment of the

requirements for the degree of

MSc in Computational Science

at the Vrije Universiteit Amsterdam and Universiteit van Amsterdam.

Date of the public defense: September 23, 2019
Members of the Thesis Committee: Mr Hine van Os


Abstract

Stroke is one of the leading causes of death and disability in the Netherlands, as well as worldwide. In an effort to identify patients at risk of stroke in time, this thesis implements a long-term prediction model while tackling various issues within the process. While providing insight into the impact of cardiovascular prediction models in clinical practice and into the challenges of handling Electronic Health Records, the thesis focuses on survival analysis: prediction models that take the time until the occurrence of the event into account. First, since extracting information from Electronic Health Records has proven to be a demanding procedure, two preprocessing techniques are put to the test, one of which incorporates the aspect of time into the process by defining time intervals based on the patients' visits. In addition, survival implementations of the machine learning techniques of gradient boosting and random forests are compared to the traditional Cox survival analysis method, as well as to support vector machines; all of these methods are subject to hyperparameter tuning and cross validation. The final topic of interest is the inclusion of a decade-long period of data, which is examined on whether it provides a noticeable increase in the accuracy of the model's predictions in comparison to a yearly period. The results indicate that the Cox model can provide a thorough insight, regardless of the increased number of factors in the model. The survival implementation of gradient boosting performs relatively better than the survival random forests, while it is clear that including solely the data of the short roll-in period is sufficient for accurate predictions.

Keywords: Survival Analysis, Electronic Health Records, Gradient Boosting, Survival Random Forests, Cox method, Support Vector Machines


Acknowledgements

I would like to express my deepest gratitude to my supervisor Mr Hine van Os, whose ideas, understanding and motivation played a vital role in the completion of this thesis, and who created a friendly atmosphere that allowed me to be as productive as I could be.

I would also like to express my appreciation to Dr Mark Hoogendoorn for giving me the opportunity to work on the most exciting project I have worked on thus far, and for always being eager to provide insights and help whenever they were needed.

Finally, I would like to thank my family for the constant emotional and practical support throughout all those years; even though we have spent so much time apart, I have always kept you closest to my heart.


Contents

1 Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Contribution
  1.4 Structure

2 Background
  2.1 The clinical entity of stroke
  2.2 The cardiovascular prediction models in clinical practice
  2.3 Dealing with routinely available EHR data
  2.4 Survival Analysis

3 Methods
  3.1 Survival Analysis Methods
    3.1.1 Kaplan-Meier method
    3.1.2 Cox Model
    3.1.3 Survival Support Vector Machine
  3.2 Gradient Boosting
    3.2.1 Concordance Index of GBM
    3.2.2 Variable Importance of GBM
  3.3 Survival Random Forests
    3.3.1 Concordance Index of Survival Random Forests
    3.3.2 Variable Importance of Survival Random Forests

4 Data Exploration and Preprocessing
  4.1 Raw files
    4.1.1 Verrichtingen.tab
    4.1.2 Patient.tab
    4.1.3 ICPC.tab
    4.1.4 Journaal.tab
    4.1.5 Bloeddruk.tab
    4.1.6 Roken.tab
    4.1.7 Labbloed.tab
  4.2 Main Dataset
  4.3 Aspect of time - Time-Interval Preprocessing
  4.4 Preprocessed Data
    4.4.1 Roll-In Period
    4.4.2 Follow-Up Period

5 Experimental setup
  5.1 Hyperparameter Tuning
    5.1.1 Hyperparameter Tuning in Survival Support Vector Machines
    5.1.3 Hyperparameter Tuning in Survival Random Forests
  5.2 Validation
  5.3 Predictor Selection
  5.4 Procedure

6 Results
  6.1 Experiments on main dataset
  6.2 Experiments on Time-Interval Preprocessed Dataset

7 Discussion

8 Conclusion

Appendices

A Data Exploration Figures
  A.1 Main Dataset
    A.1.1 Datasets of Roll-In Period
    A.1.2 Datasets of Roll-In and Follow-Up Period

B Feature Importance Figures
  B.1 Feature Importance of Gradient Boosting and Survival Random Forests on Main Dataset
  B.2 Feature Importance of Gradient Boosting and Survival Random Forests on Time-Interval Preprocessing of Roll-In period
  B.3 Feature Importance of Gradient Boosting and Survival Random Forests on Time-Interval Preprocessing of Roll-In and Follow-Up period


List of Figures

1 Example of right censored data (source: [21])
2 Illustration of survival function over time (source: [22])
3 Illustration of two cases of cumulative hazard function over time (source: [22])
4 Illustration of the SVM approach with a two dimensional data set, split into two classes; circles and triangles (source: [26])
5 Distribution of visits in Verrichtingen.tab
6 Distribution of birthyears in Patient.tab
7 Distribution of recorded ICPC codes in ICPC.tab
8 Distribution of recorded ICPC codes in Journaal.tab
9 Distribution of recorded RRDIKA measurements
10 Distribution of recorded RRSYKA measurements
11 Distribution of smoking habits
12 Distribution of recorded CHHDBMI measurements
13 Distribution of recorded CHOLBMT measurements
14 Distribution of recorded GLUCBNU measurements
15 Distribution of age of patients in main dataset
16 Distribution of time to event in main dataset
17 Distribution of age of patients in time-interval preprocessing of roll-in period
18 Distribution of time to event (in days) in time-interval preprocessing of roll-in period
19 Distribution of time to event (in days) in time-interval preprocessing of follow-up period
20 Feature importance of gradient boosting performed on the main dataset
21 Feature importance of gradient boosting performed on time-interval preprocessing of roll-in period (all cases of granularity)
22 Feature importance of gradient boosting performed on time-interval preprocessing of roll-in and follow-up period (all cases of granularity)
23 Distribution of age of patients in main dataset
24 Distribution of gender of patients in main dataset
25 Distribution of time to event in main dataset
26 Distribution of recorded ICPC codes in main dataset
27 Distribution of recorded medication codes in main dataset
28 Distribution of blood measurements in main dataset
29 Distribution of cholesterol measurements in main dataset
30 Distribution of recorded smoking habits in main dataset
31 Distribution of recorded ICPC codes in time-interval preprocessing of roll-in period (Granularity of 2 months)
32 Distribution of recorded ICPC codes in time-interval preprocessing of roll-in period (Granularity of 1 month)
33 Distribution of recorded ICPC codes in time-interval preprocessing of roll-in period (Granularity of 2 weeks)
34 Distribution of blood measurements in time-interval preprocessing of roll-in period (Granularity of 2 months)
35 Distribution of blood measurements in time-interval preprocessing of roll-in period (Granularity of 1 month)
36 Distribution of blood measurements in time-interval preprocessing of roll-in period (Granularity of 2 weeks)
37 Distribution of cholesterol measurements in time-interval preprocessing of roll-in period (Granularity of 2 months)
38 Distribution of cholesterol measurements in time-interval preprocessing of roll-in period (Granularity of 1 month)
39 Distribution of cholesterol measurements in time-interval preprocessing of roll-in period (Granularity of 2 weeks)
40 Distribution of recorded smoking habits in time-interval preprocessing of roll-in period (Granularity of 2 months)
41 Distribution of recorded smoking habits in time-interval preprocessing of roll-in period (Granularity of 1 month)
42 Distribution of recorded smoking habits in time-interval preprocessing of roll-in period (Granularity of 2 weeks)
43 Distribution of recorded ICPC codes in time-interval preprocessing of follow-up period (Granularity of 2 months)
44 Distribution of recorded ICPC codes in time-interval preprocessing of follow-up period (Granularity of 1 month)
45 Distribution of recorded ICPC codes in time-interval preprocessing of follow-up period (Granularity of 2 weeks)
46 Distribution of blood measurements in time-interval preprocessing of follow-up period (Granularity of 2 months)
47 Distribution of blood measurements in time-interval preprocessing of follow-up period (Granularity of 1 month)
48 Distribution of blood measurements in time-interval preprocessing of follow-up period (Granularity of 2 weeks)
49 Distribution of cholesterol measurements in time-interval preprocessing of follow-up period (Granularity of 2 months)
50 Distribution of cholesterol measurements in time-interval preprocessing of follow-up period (Granularity of 1 month)
51 Distribution of cholesterol measurements in time-interval preprocessing of follow-up period (Granularity of 2 weeks)
52 Distribution of recorded smoking habits in time-interval preprocessing of follow-up period (Granularity of 2 months)
53 Distribution of recorded smoking habits in time-interval preprocessing of follow-up period (Granularity of 1 month)
54 Distribution of recorded smoking habits in time-interval preprocessing of follow-up period (Granularity of 2 weeks)
55 Feature importance of gradient boosting performed on main dataset
56 Feature importance of survival random forests performed on main dataset
57 Feature importance of gradient boosting performed on benchmark variables of time-interval preprocessing of roll-in period (all cases of granularity)
58 Feature importance of survival random forests performed on benchmark variables of time-interval preprocessing of roll-in period (all cases of granularity)
59 Feature importance of gradient boosting performed on age and gender of time-interval preprocessing of roll-in period (all cases of granularity)
60 Feature importance of survival random forests performed on age and gender of time-interval preprocessing of roll-in period (all cases of granularity)
61 Feature importance of gradient boosting performed on benchmark variables of time-interval preprocessing of roll-in and follow-up period (all cases of granularity)
62 Feature importance of survival random forests performed on benchmark variables of time-interval preprocessing of roll-in and follow-up period (all cases of granularity)
63 Feature importance of gradient boosting performed on age and gender of time-interval preprocessing of roll-in and follow-up period (all cases of granularity)
64 Feature importance of survival random forests performed on age and gender of time-interval preprocessing of roll-in and follow-up period (all cases of granularity)


List of Tables

1 50 Most Important Variables of the Survival Gradient Boosting Model
2 Scores of survival and machine learning methods performed on the main dataset
3 Scores of survival and machine learning methods performed on time-interval preprocessing of roll-in period
4 Scores of survival and machine learning methods performed on time-interval preprocessing of roll-in and follow-up period


1 Introduction

1.1 Motivation

Stroke is one of the leading causes of death and disability in the Netherlands [1], [2]. Although the treatment of stroke has improved drastically, prevention is key. Preventive measures for stroke and other cardiovascular diseases, such as heart infarction, are based on long-term cardiovascular risk prediction models. Numerous prediction models have been built aiming to detect the risk of stroke and other cardiovascular diseases [3]. However, their predictive performance is limited because they are based on research cohorts with only a restricted number of stroke cases; partially because of this limited sample size, the models often incorporate only a very limited number of traditional vascular risk factors such as smoking, hypertension, and hypercholesterolemia. Nonetheless, during the last decade, enormous amounts of routinely collected data have become available from Electronic Health Records (EHRs). Although these large numbers offer unprecedented opportunities for the development of prediction models, dealing with EHR data has always been a challenge [4], due to the poor quality of medical data, which either leads to uncertain results or to ignoring a large part of the data during the prediction process [5]. Another challenge in using EHR data for prediction is that a baseline assessment of all risk factors, as in traditional scientific research, is missing. The aim of this thesis is therefore to determine the optimal analysis strategy for dealing with the highly temporal nature of EHR data. The final model will provide a long-term prediction of stroke risk, which could be used to steer preventive measures in time.

1.2 Research Questions

Cardiovascular risk prediction models generally provide long-term risk predictions. As patients have varying follow-up times in cohorts, long-term prediction in particular warrants time-to-event analysis (also called survival analysis) to deal with loss to follow-up within the cohort. In the medical literature, the Cox model [6] has traditionally been used for this purpose. However, as the potential number of predictors is large and high-dimensional interactions between them are potentially important, the focus in researching survival analysis methods is set on efficiently dealing with these interactions, by proposing survival implementations of known machine learning algorithms. The fact that these survival variants of common machine learning algorithms have only been scarcely researched in medicine, and not yet at all in the cardiovascular domain, leads to the first research question.

1) Will the discriminative performance of survival implementations of machine learning models be higher than that of the traditional Cox model?

To analyze data directly from EHRs, one must find a solution for the fact that a baseline measurement of all predictors has not been set, as is usually the case in research cohorts. Two strategies deal with this problem. The first is the construction of a baseline assessment of all predictors out of a roll-in period at the start of the follow-up time. The second strategy is to consider multiple intervals per patient across the entire follow-up time. In this thesis, we choose intervals defined by general practitioner (GP) visits. The two strategies for dealing with the temporal nature of EHR data have not yet been compared, and are the topic of the second research question.

2) Will the incorporation of the data extracted from the entirety of the follow-up period lead to better discriminative performance in comparison to that of data extracted solely in the roll-in period?

Including multiple time intervals per patient will be computationally demanding. Therefore, a careful definition of these time intervals is crucial. Even though a finer granularity in the definition of the multiple time intervals per patient across the follow-up period makes more information available to the models, the third research question asks whether this additional information actually results in better model performance.

3) What is the trade-off in the performance of models based on data extracted with a varying granularity of time intervals per patient, in comparison to that of models based on data extracted solely at a particular time point?

1.3 Contribution

The contribution of this thesis is twofold. First, it lies in its applicability to a real-life problem. The thesis aims at creating a prediction model focusing on the occurrence of a main cardiovascular event using primary care data from EHRs. This model will be further developed into software that is planned to be tested in GP practices in 2021. Hence, tackling the survival analysis related questions presented in the scope of the thesis is essential in order to make an impact in clinical practice. Second, this thesis aims to answer multiple methodological questions and to gain insights that could be reused for research on numerous other clinical topics dealing with EHR data. These include the question of how to deal with clinical data without a clearly defined baseline assessment of predictors, which models are the most suitable for a survival analysis of complex data with a large number of predictors, and which steps should be taken in order to achieve a time-interval preprocessing of EHR data.

1.4 Structure

The remainder of this thesis is organized as follows. Chapter 2 provides the reader with background information on what ischemic stroke is, the difficulties of handling routinely available EHRs, and how prediction models are used in medicine, and introduces survival analysis, a key concept of the thesis. Chapter 3 presents the different survival and machine learning methods used, while Chapter 4 offers an overview of the files used in the experiments, as well as the preprocessing technique that was followed. Chapter 5 provides an insight into the experimental setup and Chapter 6 contains the derived results. Finally, Chapters 7 and 8 comment on the results, pinpoint the most insightful conclusions and offer room for discussion of ideas that could potentially enrich the proposed model.


2 Background

This section presents a brief introduction to the clinical entity of stroke, the challenges in handling medical data, and the manner in which prediction models are used in medicine, providing an insight into the main issues that had to be tackled and how the derived results should be interpreted. The section ends by introducing the concept of survival analysis.

2.1 The clinical entity of stroke

Stroke is the collective term for brain infarction (also known as ischemic stroke) and brain hemorrhage. During a brain infarction, an occlusion of a cerebral blood vessel leads to restricted blood supply. This in turn leads to a reduction of oxygen, which quickly causes death of brain tissue if the blood flow is not restored in time. A brain hemorrhage is caused by the bursting of an artery in the brain, which leads to localized bleeding in the surrounding tissues and kills brain cells. Ischemic stroke is much more prevalent, accounting for 87% of all strokes according to the American Stroke Association. In the Netherlands, the number of stroke cases in 2020 is expected to be 27% higher than in 2000 [7]. The most common pathophysiological mechanism underlying stroke is macrovascular atherosclerosis, which thickens and hardens the arterial walls and is characterized by plaque deposits of lipids, fibrous connective tissue, calcium, and other blood substances; the traditional risk factors contributing to macrovascular atherosclerosis are smoking, hypertension and diabetes [7]. Although acute treatment of stroke has improved drastically over the last decades, prevention remains a key concern. In the Netherlands, the efforts for stroke prevention are organized in primary care and are coordinated by GPs. A thorough cardiovascular risk management system is instrumental for a valid and precise estimation of the risk of main cardiovascular events such as stroke.

2.2 The cardiovascular prediction models in clinical practice

Due to the great impact of cardiovascular disease on our society, there have been, since the dawn of modern medicine, increasing investments in the development of cardiovascular risk prediction models. This is illustrated by a meta-analysis that found about 150 such models in the literature in 2015 [3]. The vast majority of these models are based on traditional research cohorts, whose risk factors are common traditional vascular elements such as age and smoking habits. However, in the last decade, enormous routinely available clinical datasets have become available through linking EHRs of multiple EHR distributors and healthcare providers. In the Netherlands, multiple private parties have created such data lakes and made them available for clinical research. An example is Stizon, which allows the Leiden University Medical Center (LUMC) primary care public health department to research a dataset of 3 million GP patients throughout the Netherlands. A direct result of the availability of large multicenter registries and administrative databases is the need for fitting methodologies derived from predictive analytics, especially predictive methods that efficiently deal with large datasets including large numbers of predictors, such as machine learning techniques.

Machine learning techniques, such as random forests, gradient boosting machines and neural networks, have been used to improve the accuracy of predictive modelling by exploiting complex interactions between risk factors, and have helped to identify many cases that would benefit from preventive treatment as well as cases that would otherwise receive unnecessary intervention [8]. These techniques have also been used to identify predictors of disease risk, such as smoking and age for cardiovascular disease [3], even though individual machine learning techniques perform differently according to the problem at hand [9, 10, 11].

However, it is important to clarify the role and limits of risk models. Predictive modelling is heavily restricted by assumptions that may not apply to real-world situations, as well as by modeler bias. Therefore, a model cannot predict exactly what will happen to a patient in the case of a serious illness or a complication after surgery. For stroke, for instance, a patient whose measurements are in accordance with the values of the factors of a stroke prediction model will not necessarily experience such an event. Nevertheless, such a model is an extremely useful tool for decision making [12].

2.3 Dealing with routinely available EHR data

Building a prediction model based on data directly from EHRs is a complex task due to the numerous challenges encountered when dealing with EHRs. The data may contain only a restricted number of measurements, or sparsely recorded ones [13], since clinicians only collect the metrics they deem necessary at each clinic visit [4] and clinical processes vary [14, 15]. Because of this, the concept of missing data takes on an entirely different meaning in EHR data. Common methods for dealing with missingness will generally not work, because of the extent of the missingness and because these values are likely missing not at random. Currently, there is no universal cleaning and standardization technique that handles data missingness in EHRs; discarding information from each patient's record [5] or building oversimplified models with limited consideration of context [14] is a common approach. Handling incomplete data is another task, taken care of by various imputation techniques, where missing values are either filled in by deriving statistics from the rest of the measurements or discarded if their impact is insignificant.

However, apart from trying to define standard content for EHRs, such as Fast Healthcare Interoperability Resources [16], data scientists have found various ways to tackle these issues: implementing deep learning approaches that incorporate the entire EHR for better predictions, by mapping the EHR data into a highly curated set of structured predictor variables and then feeding these variables into a statistical model [5]; mitigating the associated biases via an appropriate and thoughtful study design reflecting the way a patient interacts with the particular health system [14, 15]; aggregating the time-varying data into different summary statistics instead of using only the most recent value as a predictor [17]; combining text mining tools and special post-processing to facilitate information retrieval, since EHRs tend to contain many data points and can frequently be linked with outside sources that help to reduce possible confounding, improve imputation and capture the outcomes of interest [16, 18]; as well as building better classification systems [16].

2.4 Survival Analysis

Survival analysis is generally defined as a set of methods for analyzing data where the outcome variable is the time until the occurrence of an event of interest [19], [20]. The event can be death, occurrence of a disease, marriage, divorce, etc. For example, if the event of interest is heart attack, then the survival time can be the time in years until a person develops a heart attack.

In survival analysis, subjects are usually followed over a specified time period and the focus is on the time at which the event of interest occurs. However, it is worth acknowledging that some of these subjects may be censored. Observations are called censored when the information about their survival time is incomplete; the most commonly encountered form is right censoring. The types of right censoring are:

• Fixed type I censoring occurs when a study is designed to end after C years of follow-up. In this case, everyone who does not have an event observed during the course of the study is censored at C years.

• In random type I censoring, the study is designed to end after C years, but censored subjects do not all have the same censoring time. This is the main type of right-censoring we will be concerned with. An illustrated example is shown in Figure 1.

• In type II censoring, a study ends when there is a prespecified number of events.

Regardless of its type, censoring comes with the assumption that it is non-informative about the event; that is, the censoring is caused by something other than the impending failure. Censoring is an important issue in survival analysis, representing a particular type of missing data. Censoring that is random and non-informative is usually required in order to avoid bias in standard survival analysis.

Figure 1: Example of right censored data (source: [21])

Unlike ordinary regression models, survival methods correctly incorporate information from both censored and uncensored observations in estimating important model parameters. The dependent variable in survival analysis is composed of two parts: one is the time to event and the other is the event status, which records whether the event of interest occurred or not. One can then estimate two functions that are dependent on time, the survival and hazard functions, which are key concepts in survival analysis for describing the distribution of event times.
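The thesis itself does not prescribe a software representation, but as an illustration of this two-part dependent variable, the sketch below stores each patient's outcome as an (event indicator, follow-up time) pair; the library choice (scikit-survival) and the toy values are assumptions made only for the example.

```python
import pandas as pd
from sksurv.util import Surv

# One row per patient; the two-part outcome is (event status, follow-up time).
outcomes = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "event":      [True, False, True],    # False = right-censored observation
    "time_days":  [420.0, 1830.0, 95.0],  # time to event or to censoring
})

# scikit-survival represents the outcome as a structured array of (event, time).
y = Surv.from_arrays(event=outcomes["event"], time=outcomes["time_days"])
print(y)
```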

The survival function S(t) gives, for every time, the probability of surviving (or not experiencing the event) up to that time. In mathematical terms, if T denotes the response variable (T ≥ 0), the survival function is

S(t) = \mathrm{Prob}(T > t) = 1 - F(t)    (1)

where F(t) denotes the probability that the event has occurred by time t. The survival function has three properties, which are also shown in Figure 2:

1. It is a non-increasing function.

2. S(t = 0) = 1; the probability of surviving past time 0 is 1.

3. S(t = ∞) = 0; as time goes to infinity, the survival curve goes to 0.


Figure 2: Illustration of survival function over time (source: [22])

The hazard function h(t) gives the probability that the event will occur, per time unit, given that an individual has survived up to the specified time:

h(t) = \lim_{\Delta t \to 0} \frac{\mathrm{Prob}(t < T \le t + \Delta t \mid T > t)}{\Delta t} = \frac{f(t)}{S(t)}    (2)

where f(t) is the probability density function of the survival time t and S(t) is the survivor function (the probability of surviving beyond a certain point in time). The cumulative hazard describes the accumulated risk up to time t, H(t) = \int_0^t h(u)\,du, also illustrated in Figure 3.

Figure 3: Illustration of two cases of cumulative hazard function over time (source: [22])

While S(t), H(t) or h(t) are often of direct interest, many other quantities of interest (e.g., the median survival time) may subsequently be estimated from knowing either the hazard or the survival function. It is generally of interest in survival studies to describe the relationship of a factor of interest (e.g. treatment) to the time to event, in the presence of several covariates such as age and gender. A number of models are available to analyze the relationship of a set of predictor variables with the survival time.
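As a small numerical illustration (not taken from the thesis), consider an exponentially distributed event time with constant hazard λ: the snippet below checks that h(t) = f(t)/S(t) and that H(t) accumulates the hazard over time.

```python
import numpy as np

lam = 0.1                          # constant hazard rate per time unit
t = np.linspace(0.01, 30.0, 300)

S = np.exp(-lam * t)               # survival function S(t)
f = lam * np.exp(-lam * t)         # density of the event time f(t)
h = f / S                          # hazard h(t) = f(t) / S(t); equals lam
H = np.cumsum(h) * (t[1] - t[0])   # cumulative hazard by numerical integration

assert np.allclose(h, lam)
print(round(H[-1], 2), round(lam * t[-1], 2))  # both close to lam * 30
```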


3 Methods

This section introduces the survival analysis related methods. Furthermore, it gives an insight into gradient boosting and survival random forests, as well as into the concepts of concordance index and variable importance, which are vital for the interpretation of the results.

3.1 Survival Analysis Methods

This subsection continues the discussion of survival analysis presented in the background, by providing an insight into the Kaplan-Meier and Cox methods, which estimate survival probabilities, as well as into the survival implementation of support vector machines.

3.1.1 Kaplan-Meier method

The Kaplan-Meier method is a non-parametric method used to estimate the survival probability from observed survival times [23]. The survival probability at time t_i, S(t_i), is calculated as follows:

S(t_i) = S(t_{i-1}) \left(1 - \frac{d_i}{n_i}\right)    (3)

where:

• S(t_{i-1}) is the probability of being alive at t_{i-1}

• n_i is the number of patients alive just before t_i

• d_i is the number of events at t_i

• t_0 = 0 and S(0) = 1

The estimated probability (S(t)) is a step function that changes value only at the time of each event. It is also possible to compute the confidence intervals for the survival probability. The Kaplan-Meier survival curve, a plot of the Kaplan-Meier survival probability against time, provides a useful summary of the data that can be used to estimate measures such as median survival time.
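A minimal NumPy sketch of equation (3) follows; the follow-up times and event indicators are invented for illustration and are not thesis data.

```python
import numpy as np

def kaplan_meier(time, event):
    """Return the distinct event times t_i and the estimates S(t_i) of eq. (3)."""
    time, event = np.asarray(time, float), np.asarray(event, bool)
    event_times = np.unique(time[event])       # times with at least one event
    surv, s = [], 1.0
    for t_i in event_times:
        n_i = np.sum(time >= t_i)              # patients at risk just before t_i
        d_i = np.sum((time == t_i) & event)    # events at t_i
        s *= 1.0 - d_i / n_i                   # S(t_i) = S(t_{i-1}) * (1 - d_i/n_i)
        surv.append(s)
    return event_times, np.array(surv)

t, s = kaplan_meier(time=[2, 3, 3, 5, 8, 8, 9], event=[1, 1, 0, 1, 0, 1, 0])
print(dict(zip(t, np.round(s, 3))))
```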

3.1.2 Cox Model

The Kaplan-Meier method is an example of univariate analysis, which means that it describes the survival according to one factor under investigation, but it does not take into account the impact of others. Additionally, Kaplan-Meier curves are useful only when the predictor variable is categorical (e.g.: treatment A vs treatment B; males vs females). They do not work easily for quantitative predictors such as weight or age.


An alternative method is the Cox proportional hazards regression analysis, which works for both quantitative and categorical predictor variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time. In other words, the Cox method allows us to examine how specified factors influence the rate of a particular event happening (e.g., infection or death) at a particular point in time [24]. This rate is commonly referred to as the hazard rate. Predictor variables (or factors) are usually termed covariates in the survival-analysis literature.

The Cox model is expressed by the hazard function h(t). In order to get a better insight into how the Cox model works, the hazard function is restated here: it can be interpreted as the risk of experiencing the event at time t and can be estimated as follows:

h(t) = h_0(t) \exp(b_1 x_1 + b_2 x_2 + \dots + b_p x_p)    (4)

where:

1. t represents the survival time

2. h(t) is the hazard function, determined by a set of p features x_1, x_2, ..., x_p

3. the coefficients b_1, b_2, ..., b_p measure the impact of the features

4. the term h_0 is called the baseline hazard; it corresponds to the value of the hazard if all the x_i are equal to zero. This baseline hazard may vary over time.

The Cox model can be written as a multiple linear regression of the logarithm of the hazard on the variables x_i, with the baseline hazard being an intercept term that varies with time. The quantities \exp(b_i) are called hazard ratios (HR). A value of b_i greater than zero, or equivalently an HR greater than one, indicates that as the value of the i-th term increases, the event hazard increases and thus the length of survival decreases. Thus, an HR above 1 indicates a term that is positively associated with the event probability, and thus negatively associated with the length of survival.
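The thesis does not state which implementation of the Cox model was used; as a hedged sketch, the model of equation (4) can be fitted with scikit-survival and the hazard ratios recovered as exp(b_i). The two covariates and all values below are hypothetical toy data, reused by later sketches in this chapter.

```python
import numpy as np
import pandas as pd
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.util import Surv

# Toy covariates: age in years and a binary smoking indicator (hypothetical).
X = pd.DataFrame({"age":    [63, 52, 70, 48, 81, 59, 66, 74, 44, 57],
                  "smoker": [1,  0,  1,  0,  1,  0,  1,  1,  0,  0]})
y = Surv.from_arrays(
    event=[True, True, False, True, True, False, True, True, False, True],
    time=[400, 950, 1200, 900, 90, 1500, 650, 300, 1100, 1600])

cox = CoxPHSurvivalAnalysis().fit(X, y)
hazard_ratios = np.exp(cox.coef_)              # exp(b_i) of equation (4)
print(dict(zip(X.columns, hazard_ratios.round(2))))
print("C-index on the training data:", round(cox.score(X, y), 2))
```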

3.1.3 Survival Support Vector Machine

One alternative approach to the Proportional Hazards (PH) model is to use support vector machines (SVMs). SVMs were first proposed as a learning method for dichotomous outcomes [25]. The formulation of an SVM supposes a target variable Y ∈ {−1, 1} and covariates X ∈ R^d. Assuming that the two target classes are linearly separable, there exists a linear function f(x) = ⟨ψ, x⟩ + b. This function is known as a kernel function and it is used to take data as input and transform it into the required form. Even though in this example it is linear, the function could be of a different type, such as nonlinear, polynomial, radial basis function (RBF), or sigmoid. The SVM task is to find the separating hyperplane H(ψ, b) = {x : ⟨x, ψ⟩ + b = 0} for the two classes with the maximal margin.


The margin is defined as the smallest distance between any data point and the separating hyperplane. Data points lying at this distance from the separating hyperplane are called support vectors. They determine the margin, and must satisfy either f(x) = 1 or f(x) = −1. If the two classes are not linearly separable, misclassifications can be allowed. This is done by introducing slack variables ξ_i ≥ 0, which allow but penalize misclassifications. The slack variable for an individual i is defined as ξ_i = |y_i − f(x_i)|. Hence, ξ_i = 0 if i is correctly classified, ξ_i < 1 if it is inside the margin and on the correct side of the hyperplane, ξ_i > 1 if the data point lies on the wrong side of the hyperplane, and ξ_i = 1 if observation i lies exactly on the hyperplane. Figure 4 [26] presents an illustrative example of the SVM approach for a two-dimensional data set. Data points are grouped into the classes y = 1 (circles) and y = −1 (triangles). The filled and darkened data points are the support vectors defining the margin. The two red data points lying inside the margin are misclassified. They are penalized depending on their location on the wrong or the correct side of the hyperplane, represented by the dashed line. All the other data points are correctly classified and therefore not penalized.

Figure 4: Illustration of the SVM approach with a two dimensional data set, split into two classes; circles and triangles (source: [26])

To achieve the goal of the SVM approach described above, the following optimization problem is posed in primal space:

\min_{\psi, b, \xi} \ \frac{1}{2}\|\psi\|^2 + \gamma \sum_{i=1}^{n} \xi_i    (5)

subject to y_i(⟨x_i, ψ⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1, ..., n, where ψ, b and the slack variables ξ_i are unknown, n is the number of individuals, and γ > 0 is a regularization parameter controlling the maximal margin and misclassification penalties. Instead of solving the optimization problem in the primal space, the optimization problem is transformed into the dual problem, and the Lagrange function is optimized in the dual space.

The SVM method can be extended to survival analysis [27, 28, 29], and different optimizers can be used to train such a survival SVM. The training data consist of n triples (x_i, y_i, δ_i), where x_i is a d-dimensional feature vector, y_i > 0 the survival or censoring time, and δ_i ∈ {0, 1} the binary event indicator. Using the training data, the objective is to minimize the following function [30]:

\arg\min_{w, b} \ \frac{1}{2} w^T w + \frac{\alpha}{2} \left[ r \sum_{(i,j) \in P} \max(0,\, 1 - (w^T x_i - w^T x_j))^2 + (1 - r) \sum_{i=0}^{n} \zeta_{w,b}(y_i, x_i, \delta_i)^2 \right]    (6)

\zeta_{w,b}(y_i, x_i, \delta_i) = \begin{cases} \max(0,\, y_i - w^T x_i - b) & \text{if } \delta_i = 0 \\ y_i - w^T x_i - b & \text{if } \delta_i = 1 \end{cases}    (7)

P = \{(i, j) \mid y_i > y_j \wedge \delta_j = 1\}_{i,j=1,\dots,n}    (8)

The hyperparameter α > 0 determines the amount of regularization to apply: a smaller value increases the amount of regularization and a higher value reduces it. The hyperparameter r ∈ [0, 1] determines the trade-off between the ranking objective and the regression objective: if r = 1, the problem reduces to the ranking objective, and if r = 0, it reduces to the regression objective.
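This objective, with its regularization weight α and ranking/regression trade-off r, matches the formulation implemented by FastSurvivalSVM in scikit-survival, where r is exposed as rank_ratio. A hedged sketch, reusing the toy X and y from the Cox example above; the chosen values are illustrative, not the tuned ones from the thesis.

```python
from sksurv.svm import FastSurvivalSVM

# alpha corresponds to the regularization weight alpha in equation (6);
# rank_ratio corresponds to r (1.0 keeps only the ranking objective).
svm = FastSurvivalSVM(alpha=1.0, rank_ratio=1.0, max_iter=100, random_state=0)
svm.fit(X, y)
print("C-index on the training data:", round(svm.score(X, y), 2))
```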

3.2 Gradient Boosting

The gradient boosting machine (GBM) is an ensemble learning method which constructs a predictive model by additive expansion of sequentially fitted weak learners [31]. The general problem is learning a functional mapping y = F(x; β) from data \{x_i, y_i\}_{i=1}^{n}, where β is the set of parameters of F, such that some cost function \sum_{i=1}^{n} \Phi(y_i, F(x_i; β)) is minimized. Boosting assumes F(x) follows an additive expansion form F(x) = \sum_{m=0}^{M} \rho_m f(x; \tau_m), where f is called the weak or base learner, with weight ρ and a parameter set τ. Accordingly, \{\rho_m, \tau_m\}_{m=1}^{M} compose the whole parameter set β. Their values are obtained in a greedy stage-wise process:

1. Set an initial estimator f_0(x).

2. For each iteration m ∈ {1, 2, ..., M}, solve

(\rho_m, \tau_m) = \arg\min_{\rho, \tau} \sum_{i=1}^{n} \Phi(y_i, F_{m-1}(x_i) + \rho f(x_i; \tau))

GBM approximates the second step with two extra ones. First, it fits f(x; \tau_m) by \tau_m = \arg\min_{\tau} \sum_{i=1}^{n} (g_{im} - f(x_i; \tau))^2, where g_{im} = -\left[ \frac{\partial \Phi(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}. Second, it learns ρ by \rho_m = \arg\min_{\rho} \sum_{i=1}^{n} \Phi(y_i, F_{m-1}(x_i) + \rho f(x_i; \tau_m)). Afterwards, it updates F_m(x) = F_{m-1}(x) + \rho_m f(x; \tau_m). In practice, however, shrinkage is often introduced to control overfitting, and the update becomes F_m(x) = F_{m-1}(x) + \nu \rho_m f(x; \tau_m), where 0 < ν ≤ 1. If the weak learner is a regression tree, the complexity of f(x) is determined by tree parameters, for example the tree size (or depth) and the minimum number of samples in terminal nodes.

Compared to parametric models such as generalized linear models and neural networks, GBM does not assume any functional form of F but uses additive expansion to build up the model. This nonparametric approach gives more freedom, since it combines predictions from the ensemble of weak learners and so tends to yield more robust results than a single learner.

Ridgeway [32] adapted GBM to the Cox model. The cost function is the negative log partial likelihood:

\Phi(y, F) = -\sum_{i=1}^{n} \delta_i \left[ F(x_i) - \log\left( \sum_{j : t_j \ge t_i} e^{F(x_j)} \right) \right]
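A sketch of a survival gradient boosting model with this Cox partial-likelihood cost, again with scikit-survival and the same toy X and y as above; the hyperparameter values are arbitrary placeholders rather than the ones tuned in the thesis.

```python
from sksurv.ensemble import GradientBoostingSurvivalAnalysis

# loss="coxph" uses the negative log partial likelihood above as the cost Phi;
# learning_rate is the shrinkage factor nu and n_estimators the number of stages M.
gbm = GradientBoostingSurvivalAnalysis(loss="coxph", learning_rate=0.1,
                                       n_estimators=100, max_depth=3,
                                       random_state=0)
gbm.fit(X, y)
print("C-index on the training data:", round(gbm.score(X, y), 2))
```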

3.2.1 Concordance Index of GBM

A gradient boosting algorithm that learns the Concordance Index (C-index) is needed. Instead of focusing on Cox's partial likelihood, we choose to examine the C-index, since it is a widely used metric for evaluating survival models. The contribution is to tackle the problem from a nonparametric ensemble perspective based on gradient boosting [31].

The C-index is a commonly used performance measure of survival models. Intuitively, it is the fraction of all pairs of patients whose predictions have correct orders over the pairs that can be ordered. Formally, the C-index is

CI = \frac{1}{|P|} \sum_{(i,j) \in P} I(F(x_i) < F(x_j)) = \frac{1}{|P|} \sum_{i \in E} \sum_{j : t_j > t_i} I(F(x_i) < F(x_j))    (9)

Here P is the set of validly orderable pairs (i, j), with t_i < t_j; |P| is the number of pairs in P; E is the set of patients with an observed event; F(x) is the predicted survival time; and I is the indicator function of whether the condition in parentheses is satisfied or not. In the PH setting, the predicted survival time can be equivalently represented by the negative log relative hazard. The C-index estimates the probability that the order of the predictions of a pair of comparable patients is consistent with their observed survival information.
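In code, the same quantity can be obtained with concordance_index_censored from scikit-survival, which expects the event indicators, the observed times, and a risk score for which a higher value means a worse (shorter) predicted survival; continuing the toy example above:

```python
from sksurv.metrics import concordance_index_censored

risk_score = gbm.predict(X)   # higher value = higher predicted risk
cindex, concordant, discordant, tied_risk, tied_time = concordance_index_censored(
    y["event"], y["time"], risk_score)
print("C-index:", round(cindex, 2))
```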


3.2.2 Variable Importance of GBM

Variable importance is another aspect worth exploring. In the gradient boosting method, it provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions within the decision trees, the higher its relative importance. For a single decision tree, the importance is calculated as the amount by which each attribute's split points improve the performance measure, weighted by the number of observations the node is responsible for. The performance measure is the purity used to select the split points, known as the Gini index [33]. The feature importances are then averaged across all of the decision trees within the model.
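These impurity-based scores can be read directly from a fitted boosted model; a short sketch continuing the example above (the feature names are the toy ones, not the thesis predictors):

```python
import pandas as pd

importances = pd.Series(gbm.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```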

3.3 Survival Random Forests

The high-level algorithm of survival random forests (RSF) [34] is presented below; a minimal code sketch follows the list.

1. As a first step, B bootstrap samples are drawn from the original data. It is important to point out that each bootstrap sample excludes on average 37% of the data, called out-of-bag (OOB) data, which will be used later on.

2. A survival tree is grown for each bootstrap sample. At each node of the tree, p candidate variables are randomly selected and the node is split using the candidate variable that maximizes the survival difference between daughter nodes.

3. The tree is grown to full size under the constraint that a terminal node should have no fewer than d_0 > 0 unique deaths.

4. For each tree, the Cumulative Hazard Function (CHF) is calculated. The ensemble CHF is derived as the average over all trees.

5. Finally, using the OOB data, the prediction error for the ensemble CHF is calculated.
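The sketch below uses RandomSurvivalForest from scikit-survival as one possible implementation of this algorithm; n_estimators plays the role of B and min_samples_leaf is a rough stand-in for the terminal-node constraint d_0. The settings and the reused toy X and y are illustrative assumptions, not the thesis configuration.

```python
from sksurv.ensemble import RandomSurvivalForest

rsf = RandomSurvivalForest(n_estimators=100,    # B bootstrap samples / trees
                           min_samples_leaf=2,  # crude analogue of d_0
                           max_features="sqrt", # p candidate variables per split
                           random_state=0)
rsf.fit(X, y)

# Ensemble cumulative hazard function for the first toy patient.
chf = rsf.predict_cumulative_hazard_function(X.iloc[:1])
print("Ensemble CHF at t=365:", chf[0](365.0))
print("C-index on the training data:", round(rsf.score(X, y), 2))
```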

Central elements of the RSF algorithm are growing a survival tree and constructing the ensemble CHF. It is necessary to provide details [35] in order to understand these.

Binary survival tree. Survival trees are binary trees grown by recursive splitting of tree nodes. A tree is grown starting at the root node, which is the top of the tree comprising all the data. Using a predetermined survival criterion, the root node is split into two daughter nodes, known as the left and right daughter nodes. In turn, each daughter node is split, with each split giving rise to left and right daughters respectively. The process is repeated in a recursive fashion for each subsequent node. The optimum split for a node is found by searching over all possible variables x and split values c, and choosing the values x* and c* that maximize the survival difference. As a result, the tree pushes dissimilar cases apart. Eventually, as the number of nodes increases and dissimilar cases become separated, each node in the tree becomes homogeneous and is populated by cases with similar survival.

Terminal node prediction. Eventually the survival tree reaches a saturation point when no new daughters can be formed, because of the criterion that each node must contain a minimum of d_0 > 0 unique events; these nodes are called terminal nodes. Denoting the set of terminal nodes by T, let (T_{1,h}, δ_{1,h}), ..., (T_{n(h),h}, δ_{n(h),h}) be the survival times and 0-1 censoring indicators for the individuals (cases) in a terminal node h ∈ T. An individual i is said to be right-censored at time T_{i,h} if δ_{i,h} = 0; otherwise, if δ_{i,h} = 1, the individual is said to have experienced an event at T_{i,h}. The CHF is calculated based on these nodes. Let t_{1,h} < t_{2,h} < ... < t_{N(h),h} be the N(h) distinct event times. With d_{l,h} and Y_{l,h} defined as the number of events and the number of individuals at risk at time t_{l,h}, the CHF estimate for h is the Nelson-Aalen estimator [36]:

\hat{H}_h(t) = \sum_{t_{l,h} \le t} \frac{d_{l,h}}{Y_{l,h}}

Note that all cases within h have the same CHF, and that each case i has a d-dimensional covariate x_i (the notation x above refers to one coordinate of x_i). Let H(t|x_i) be the CHF for case i; to determine its value, we drop x_i down the tree. Because of the binary nature of a survival tree, x_i will fall into a unique terminal node h ∈ T. Thus, the CHF for i is the Nelson-Aalen estimator of x_i's terminal node: H(t|x_i) = \hat{H}_h(t) if x_i ∈ h. This defines the CHF for all cases and hence for the tree.
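A small NumPy sketch of the Nelson-Aalen estimate for the cases falling into one terminal node h; the values are invented for illustration.

```python
import numpy as np

def nelson_aalen(time, event):
    """Nelson-Aalen CHF: H_h(t) = sum over event times t_l <= t of d_l / Y_l."""
    time, event = np.asarray(time, float), np.asarray(event, bool)
    event_times = np.unique(time[event])
    d = np.array([np.sum((time == t) & event) for t in event_times])  # events d_l
    Y = np.array([np.sum(time >= t) for t in event_times])            # at risk Y_l
    return event_times, np.cumsum(d / Y)

t, H = nelson_aalen(time=[3, 4, 4, 7, 10, 12], event=[1, 0, 1, 1, 0, 0])
print(dict(zip(t, np.round(H, 3))))
```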

The bootstrap and OOB ensemble CHF. The CHF above is derived from a single tree. To compute an ensemble CHF, we average over the B survival trees. To describe both an OOB and a bootstrap estimate, first note that each tree in the forest is grown using an independent bootstrap sample. Define I_{i,b} = 1 if i is an OOB case for tree b; otherwise, set I_{i,b} = 0. Let H_b^*(t|x) denote the CHF of the tree grown from the b-th bootstrap sample. The OOB ensemble CHF for i is

H_e^{**}(t|x_i) = \frac{\sum_{b=1}^{B} I_{i,b} H_b^*(t|x_i)}{\sum_{b=1}^{B} I_{i,b}}

This is an average over the bootstrap samples in which i is OOB. Equivalently, H_e^{**}(t|x_i) can be calculated as follows: drop the OOB data down a survival tree grown from the in-bag (bootstrap) data, keep track of i's terminal node and its CHF, and afterwards take the average of these CHFs; this yields the equation above. In contrast, the bootstrap ensemble CHF for i is

H_e^*(t|x_i) = \frac{1}{B} \sum_{b=1}^{B} H_b^*(t|x_i)

which uses all survival trees and not just those for which i is OOB.

3.3.1 Concordance Index of Survival Random Forests

To estimate the prediction error, we use the concordance index, as explained in the subsection on GBM; its estimation, however, differs in the case of survival random forests. It estimates the probability that, in a randomly selected pair of cases, the case that fails first had the worse predicted outcome. The interpretation of the C-index as a misclassification probability is attractive, and is one reason it is used for prediction error. Another attractive feature is that, unlike other measures of survival performance, the C-index does not depend on a single fixed time for evaluation. The C-index also specifically accounts for censoring. The C-index is calculated using the following steps:

1. Form all possible pairs of cases over the data.

2. Omit those pairs whose shorter survival time is censored. Omit pairs i and j if T_i = T_j unless at least one is an event. Let Permissible denote the total number of permissible pairs.

3. For each permissible pair where T_i ≠ T_j, count 1 if the shorter survival time has the worse predicted outcome; count 0.5 if the predicted outcomes are tied. For each permissible pair where T_i = T_j and both are events, count 1 if the predicted outcomes are tied; otherwise, count 0.5. For each permissible pair where T_i = T_j but not both are events, count 1 if the event has the worse predicted outcome; otherwise, count 0.5. Let Concordance denote the sum over all permissible pairs.

4. The C-index C is defined by C = Concordance / Permissible.
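A direct (quadratic-time) transcription of these four steps into Python, assuming a vector of predicted risk scores in which a higher value means a worse predicted outcome; the example data are invented.

```python
import numpy as np

def c_index(time, event, risk):
    """C-index following the four steps above; higher risk = worse prediction."""
    time, event, risk = map(np.asarray, (time, event, risk))
    concordance = permissible = 0.0
    n = len(time)
    for i in range(n):
        for j in range(i + 1, n):
            if time[i] != time[j]:
                # Step 2: skip pairs whose shorter survival time is censored.
                shorter, longer = (i, j) if time[i] < time[j] else (j, i)
                if not event[shorter]:
                    continue
                permissible += 1
                # Step 3: 1 if the shorter time has the worse prediction, 0.5 if tied.
                if risk[shorter] > risk[longer]:
                    concordance += 1
                elif risk[shorter] == risk[longer]:
                    concordance += 0.5
            else:
                # Step 2: tied times are skipped unless at least one is an event.
                if not (event[i] or event[j]):
                    continue
                permissible += 1
                if event[i] and event[j]:
                    concordance += 1 if risk[i] == risk[j] else 0.5
                else:
                    ev, cen = (i, j) if event[i] else (j, i)
                    concordance += 1 if risk[ev] > risk[cen] else 0.5
    return concordance / permissible   # Step 4

print(c_index(time=[2, 3, 3, 5], event=[1, 0, 1, 1], risk=[0.9, 0.2, 0.6, 0.1]))
```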

As for the OOB prediction error, calculating C requires a predicted outcome, and the OOB ensemble CHF is used to define one. Because this value is derived from OOB data, it can be used to obtain an OOB estimate of C and, consequently, an OOB error rate. Let t^0_1, ..., t^0_m denote pre-chosen unique time points (in practice the unique event times t_1, ..., t_N are used). Two cases i and j are ranked by saying that i has a worse predicted outcome than j if

\sum_{l=1}^{m} H_e^{**}(t^0_l | x_i) > \sum_{l=1}^{m} H_e^{**}(t^0_l | x_j)

Using this rule, C is computed as outlined above; denote the OOB estimate by C**. The OOB prediction error, PE**, is defined as 1 − C**. Note that 0 ≤ PE** ≤ 1 and that a value PE** = 0.5 indicates prediction no better than random guessing.

3.3.2 Variable Importance of Survival Random Forests

As for variable importance in survival random forests, variables can be selected by filtering on the basis of their variable importance (VIMP) [35]. To calculate the VIMP for a variable x, drop the OOB cases down their in-bag survival tree; whenever a split on x is encountered, assign the case to a daughter node at random. The CHF from each such tree is calculated and averaged. The VIMP for x is the prediction error of this new ensemble, obtained by randomizing the assignments for x, minus the prediction error of the original ensemble.

Large importance values indicate variables with predictive ability, whereas zero or negative values identify non-predictive variables to be filtered out. Under the C-index, one can interpret VIMP in terms of misclassification. Recall that the C-index estimates the probability of correctly classifying two cases. Thus, the VIMP for x measures the increase (or drop) in misclassification error on the test data if x were not available.

One should be careful interpreting VIMP. Although tempting, it is incorrect to think that VIMP estimates the change in prediction error for a forest grown with and without a variable. For example, if two variables are highly correlated and both predictive, each can have a large VIMP value. Removing one variable and re-growing the forest may affect the VIMP for the other variable (its value might get larger), but the prediction error will likely remain unchanged. VIMP measures the change in prediction error on a fresh test case if x were not available, given that the original forest was grown using x. Although in practice this often equals the change in prediction error for a forest grown with and without x, conceptually the two quantities are different.
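The node-randomization VIMP described here is specific to the original RSF implementation; a commonly used, related alternative in the Python ecosystem is permutation importance, which measures the drop in the C-index when one variable's values are shuffled. A hedged sketch, continuing from the random survival forest example above:

```python
from sklearn.inspection import permutation_importance

# Related to, but not identical with, the VIMP above: each variable is permuted
# and the resulting decrease of the estimator's score (the C-index) is recorded.
result = permutation_importance(rsf, X, y, n_repeats=10, random_state=0)
for name, mean_importance in zip(X.columns, result.importances_mean):
    print(f"{name}: {mean_importance:.3f}")
```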


4 Data Exploration and Preprocessing

Data exploration is a procedure whose goal is to achieve a basic understanding of the data. The results of data exploration can be extremely useful in grasping the structure of the data, the distribution of the values, and the presence of extreme values and interrelationships within the data set [37]. This section provides insight into the raw files and the dataset on which the experiments were based, and discusses the preprocessing technique that was proposed and the datasets that were derived from it.

4.1 Raw files

The data originate from Stizon, an organization that collects GP data on a national scale and now holds data on up to 3 million patients, with a follow-up of over 10 years on average per patient. Stizon has bilateral arrangements with around 1200 GP practices, whereby the GP practices give Stizon yearly access to all data in their EHRs. Stizon then performs basic transformations to bring the data from the various EHR providers into a similar, standardized format. However, the data available for immediate analysis in the scope of this thesis was the subset for the Leiden & The Hague region, called the ELAN data (Extramuraal Leids Academisch Netwerk). This dataset includes around 110K GP patients, and the files provided by the Leiden University Medical Center are presented below:

• Verrichtingen.tab - Contains the dates of the recorded visits of each patient.

• Patient.tab - Contains info on each patient.

• Journaal.tab and ICPC.tab - Contain the ICPC and medication codes of each patient alongside the date on which they were recorded. The International Classification of Primary Care (ICPC) is the most widely used international classification for systematically capturing and ordering clinical information in primary care [38].

• Bloeddruk.tab - Contains info on blood pressure measurements of each patient.

• Roken.tab - Contains info on the smoking habits of each patient.

• Labbloed.tab - Contains info on blood related measurements of each patient.

4.1.1 Verrichtingen.tab

The Verrichtingen.tab file contains the recorded visits of each patient. It consists of only two columns: one for the patient's ID and one for the date of the visit, ranging from 01/01/2007 until 31/12/2017. After the data has been cleaned by dropping duplicates and cases of NaNs, 99877 unique patients remain. These visit dates are based on the financial administration of the GP practice resulting from GP visits, and are as such relatively trustworthy data for the definition of patient visits.

Figure 5: Distribution of visits in Verrichtingen.tab

Figure 5 shows that the vast majority of patients have just one recorded visit. The cases with thousands of visits refer to patients who may have been hospitalized and therefore have a record for each day.
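The cleaning described here (dropping duplicates and NaNs, restricting to the 2007-2017 window, and counting unique patients) is a routine pandas operation; the column names in the sketch are assumptions, since the file's exact headers are not listed in the thesis.

```python
import pandas as pd

# Hypothetical column names; the file holds one row per recorded GP visit.
visits = pd.read_csv("Verrichtingen.tab", sep="\t",
                     names=["patient_id", "visit_date"],
                     parse_dates=["visit_date"])

visits = visits.drop_duplicates().dropna()
visits = visits[(visits["visit_date"] >= "2007-01-01") &
                (visits["visit_date"] <= "2017-12-31")]

print("Unique patients:", visits["patient_id"].nunique())
print(visits.groupby("patient_id").size().describe())  # visits per patient
```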

4.1.2 Patient.tab

The Patient.tab file contains each patient's info: ID, age, birth and death year, as well as entry and exit date. Due to numerous irregularities related to absent data and date mismatches, the columns concerning the entry and exit date of each patient have been eliminated, and the corresponding info is derived from Verrichtingen.tab instead. The info regarding the death year of each patient has also been eliminated, since there is no sign of correlation between the death of the patient and an event of stroke. After the data has been cleaned by dropping duplicates and cases of NaNs, 131098 unique patients remain.

Gender    Total number    Percentage
Females   68646           52.32


Figure 6: Distribution of birthyears in Patient.tab

Figure 6 presents an overview of the patients' birth years, as derived from this file. It is important to point out that only the data of patients who are 18 years old or over are used in the experiments.

4.1.3 ICPC.tab

The ICPC.tab file contains various ICPC codes for each patient. For each patient's ID, the name of the recorded ICPC code is given, together with the corresponding date. The ICPC codes which were selected as features of the prediction model are explained below.

ICPC Codes               Outcome
T90 / T90.01 / T90.02    Diabetes
T93                      Hypercholesterolemia
K85 / K86 / K87          Hypertension
K90 / K90.01 / K90.02    Stroke

After the data has been cleaned by dropping the duplicates and cases of NaNs, 103473 unique cases of patients are derived.


Figure 7: Distribution of recorded ICPC codes in ICPC.tab

Figure 7 presents an overview of the recorded ICPC codes that were used later in the experiments. It is clear that just a fraction of the unique patient cases have a record of these particular ICPC codes.

4.1.4 Journaal.tab

The Journaal.tab is a file similar to ICPC.tab; it contains various ICPC codes for each patient alongside the date of each record. The focus remains on the ICPC codes that are used later in the experiments. After the data has been cleaned by dropping duplicates and cases of NaNs, 98723 unique patients remain.

Figure 8: Distribution of recorded ICPC codes in Journaal.tab

Figure 8 presents an overview of the recorded ICPC codes of interest. It is clear that just over half of unique patient cases have a recorded ICPC code of interest.


4.1.5 Bloeddruk.tab

The Bloeddruk.tab file contains info on blood pressure measurements of patients; each patient's ID has been assigned a blood-pressure-related type of measurement, its value, and its recorded date. The focus is on the following measurements:

Measurement   Outcome
RRDIKA        Diastolic Blood Pressure
RRSYKA        Systolic Blood Pressure

After the data has been cleaned by dropping duplicates and cases of NaNs, 16411 unique patients remain. The distributions of the RRDIKA and RRSYKA measurements are presented in Figures 9 and 10 below.

Figure 9: Distribution of recorded RRDIKA measurements


4.1.6 Roken.tab

The Roken.tab file contains the smoking habits of each patient, alongside the date on which the measurement was recorded. After the data has been cleaned by dropping duplicates and cases of NaNs, 8131 unique patients remain. The distribution of the patients' smoking habits is presented in Figure 11 below, where "Ja" represents current smokers, "Nooit" never-smokers and "Voorheen" former smokers.

Figure 11: Distribution of smoking habits

4.1.7 Labbloed.tab

The Labbloed.tab is a file that contains cholesterol related measurements of each patient. The focus is on the following cholesterol measurements which are explained below.

Blood Measurements    Outcome
CHHDBMI               Cholesterol Ratio
CHOLBMT               Cholesterol
GLUCBNU               Glucose

After the data has been cleaned by dropping the duplicates and cases of NaNs, 14384 unique cases of patients are derived. The distributions of the CHHDBMI, CHOLBMT and GLUCBNU measurements are presented in Figures 12, 13 and 14 below.


Figure 12: Distribution of recorded CHHDBMI measurements

Figure 13: Distribution of recorded CHOLBMT measurements


4.2 Main Dataset

The raw files were acquired in order to be preprocessed in the manner described in the following subsection, "Aspect of Time - Time-Interval Preprocessing". In addition to those files, a dataset, referred to as the main dataset from now on, was acquired; it contains the patient's age and gender, various ICPC and medication codes, blood and cholesterol related measurements, smoking habits, as well as the occurrence of and the time till the stroke event. The main dataset was created in the scope of the Hacking Women's Stroke study by Ted Etoehamowo [39] and contains info derived from the previously mentioned raw files as well as additional ones. This dataset also incorporates the absence of each measurement, containing variables that indicate whether each patient is missing a measurement or not. This preprocessing method offers the chance to include a patient in the experiment regardless of whether complete data is available for him or her. The focus is on a particular set of variables, from now on called the 'benchmark' variables, which capture the most influential factors in a stroke event. These are the variables of interest extracted above from the raw files, including the following medication related variables:

ATC-Code    Medication
A10         Diabetes Related
B01         Antithrombotic
C01         Cardiac Therapy
C02         Antihypertensives
C03         Diuretics
C04         Peripheral Vasodilators
C05         Vasoprotectives
C07         Beta Blocking Agents
C08         Calcium Channel Blockers
C09         RAAS-System Blockers
C10         Statins

In order to be usable by the prediction model, each entry in this dataset has to represent one patient. Each entry contains the patient's info and measurements that are derived by preprocessing the raw data. The patients in the dataset that were later used in the experiments satisfy the following three requirements: they have been examined by GPs in the period from 01/01/2007 to 31/12/2017, they have been followed for at least 6 years after their initial visit, and they are over 18 years old at the time of their first visit. In order to summarise each measurement for each patient throughout the years in just one entry, the mean value of the corresponding measurement over the years of the patient's visits, or up until the occurrence of the event, has been used. Afterwards, each measurement is characterized as High, Medium or Low by comparing it to the overall values of all patients. In case a particular measurement is absent, a column indicating the absence of the relevant measurement is set to 1. The overall preprocessing procedure lasted 572.689 seconds on an Intel Core i5-6500 @ 3.20GHz (16GB).
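The collapsing of repeated measurements into one row per patient, including the missingness indicators, could be sketched as follows. The column names (patient_id, value) and the use of tertiles for the High/Medium/Low split are assumptions of this example; the thesis only states that each mean is compared to the overall values of all patients.

```python
import pandas as pd

def summarise_measurement(patients: pd.DataFrame,
                          measurements: pd.DataFrame,
                          name: str) -> pd.DataFrame:
    """Collapse repeated recordings of one measurement into a single
    categorised value per patient, plus a missingness indicator."""
    # Mean recorded value per patient.
    per_patient = (measurements.groupby("patient_id")["value"]
                   .mean()
                   .rename(name + "_mean")
                   .reset_index())

    # Compare each patient's mean to the overall distribution; tertiles are
    # used here as a stand-in for the High/Medium/Low split.
    per_patient[name + "_category"] = pd.qcut(
        per_patient[name + "_mean"], q=3, labels=["Low", "Medium", "High"])

    # One row per patient; patients without any recording get the flag set to 1.
    merged = patients[["patient_id"]].merge(per_patient, on="patient_id",
                                            how="left")
    merged[name + "_missing"] = merged[name + "_category"].isna().astype(int)
    return merged
```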

Figure 15: Distribution of age of patients in main dataset
Figure 16: Distribution of time to event in main dataset

The figures above illustrate the distribution of the age and time to event of the dataset’s patients respectively. It is vital to point out that the patients of interest are the ones over 18 years old.

4.3 Aspect of Time - Time-Interval Preprocessing

Due to its design, the main dataset does not efficiently incorporate the fluctuations of measurements over the years, because it only includes data from the first year of follow-up (roll-in period); each value has been determined based on the mean recorded value of the corresponding measurement. In order to provide an insight into how each measurement fluctuates over time, an alternative method is proposed: taking time intervals into account for each patient.

The time intervals are defined based on the visits of the patient. A recorded date in the Verrichtingen.tab denotes the start of a time interval, which ends at the date of the patient's next visit. This next visit also defines the start of the following interval, and so on. Since the purpose of this preprocessing method is to present how each variable of interest varies over time, every variable is monitored for each time interval based on the dates that have been recorded in the corresponding .tab files. However, in order to avoid acquiring overly specific and short time intervals, whose fluctuating values might give a false impression of the patient's condition, the visits have been merged into three different cases of granularity: 2 weeks, 1 month and 2 months.
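A minimal sketch of how such intervals could be built from one patient's visit dates is given below. Expressing the granularity as a fixed number of days (14, 30 or 60 as stand-ins for 2 weeks, 1 month and 2 months) is an assumption of this example.

```python
import pandas as pd

def build_intervals(visit_dates: pd.Series, granularity_days: int = 14) -> pd.DataFrame:
    """Turn one patient's visit dates into consecutive time intervals,
    merging visits that fall within the chosen granularity window."""
    dates = visit_dates.sort_values().drop_duplicates().reset_index(drop=True)

    # Merge visits that are closer together than the granularity window.
    boundaries = [dates.iloc[0]]
    for d in dates.iloc[1:]:
        if (d - boundaries[-1]).days >= granularity_days:
            boundaries.append(d)

    # Each interval starts at one (merged) visit and ends at the next.
    return pd.DataFrame({"interval_start": boundaries[:-1],
                         "interval_end": boundaries[1:]})

# Example with four visits merged at a two-week granularity.
visits = pd.Series(pd.to_datetime(
    ["2007-01-05", "2007-01-12", "2007-03-20", "2007-08-01"]))
print(build_intervals(visits, granularity_days=14))
```

In the actual pipeline, such a function would be applied per patient to the visit dates taken from Verrichtingen.tab.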

Even though this preprocessing method provides an insight into the variation of the variables over time, the need for a baseline is still not satisfied. Therefore, the time intervals are split into a roll-in and a follow-up period. The roll-in period serves as the baseline and lasts up to 1 year after the patient's first visit. The follow-up period is defined as the period right after roll-in and can last up to 9 years.

Following these steps leads to the acquisition of 6 datasets, containing the patients' info and measurements throughout their roll-in and follow-up period, defined based on the 3 different cases of granularity. In each dataset, a patient might be represented by multiple rows, since each row corresponds to one of his or her time intervals. The goal is to change the format of the dataset and compress the information into a single row, without losing time variance as a factor in the prediction model. Thus, the Reliability metric is introduced and is defined as

\[
\text{Reliability} = \frac{\text{Number of Visits}}{\text{Duration of Time Intervals}} \qquad (10)
\]

The claim that supports this idea is that a smaller duration of time intervals and more frequent visits mean more confidence in the validity of the doctors' measurements. The multiple rows of each patient may be merged into one by using the Reliability metric and examining whether each variable of interest has low, middle or high values more frequently within these time intervals. The main drawback of this approach is the duration of the procedure, which lasted up to 40 hours on an Intel Core i5-6500 @ 3.20GHz (16GB).
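The Reliability metric and the per-patient merging could be sketched as below; taking the most frequent Low/Middle/High label per variable is one reading of the description above, and the exact way Reliability enters the merge is left as an assumption of this example.

```python
import pandas as pd

def reliability(intervals: pd.DataFrame) -> float:
    """Equation (10): number of visits divided by the total duration of the
    patient's time intervals (in days). The number of visits is approximated
    here by the number of interval rows."""
    n_visits = len(intervals)
    total_days = (intervals["interval_end"]
                  - intervals["interval_start"]).dt.days.sum()
    return n_visits / total_days

def merge_patient_rows(rows: pd.DataFrame, variables: list) -> pd.Series:
    """Collapse a patient's interval rows into one row by taking, for every
    variable of interest, the Low/Middle/High label that occurs most often."""
    return pd.Series({var: rows[var].mode().iloc[0] for var in variables})
```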

4.4 Preprocessed Data

This section presents the different datasets that are created based on the time-interval preprocessing method.

4.4.1 Roll-In Period

The patient's age and gender are unaffected by the different cases of granularity. In addition, it is important to point out that the recorded times to event are also unaffected, since the experiments focus on the time until the occurrence of the first event. Figures 17 and 18 present the corresponding information.


Figure 17: Distribution of age of patients in time interval preprocessing of roll-in period

Figure 18: Distribution of time to event (in days) in time interval preprocessing of roll-in period

Apart from the patient's age, gender and time to event, the datasets contain info regarding the recorded ICPC codes, blood and cholesterol measurements, as well as smoking habits. The corresponding figures can be found in the Appendix. As the granularity gets smaller, more measurements are recorded for each patient. However, the changes are not symmetrical, since not every possible measurement is recorded during a patient's visit. This leads to a loss of information for measurements that are recorded frequently, and are therefore merged into smaller time periods, whereas measurements that are recorded more sparsely are less affected by this approach. In addition, a noticeable difference from the main dataset is that the majority of measurements have either high or low values, a consequence of the preprocessing technique that the time-interval datasets are based on.

4.4.2 Follow-Up Period

As discussed in the previous subsection, the recorded times to event are unaffected by the different cases of granularity. Figure 19 shows that the vast majority of events occur within the first year of follow-up.


Figure 19: Distribution of time to event (in days) in time-interval preprocessing of follow-up period

As was the case for the datasets that contain info about the roll-in period of the patients, the datasets that contain info about the entire roll-in and follow-up period incorporate the patients' age, gender, time to event, ICPC codes, blood and cholesterol measurements as well as smoking habits. The distributions illustrating that information can also be found in the Appendix. The observations concerning the number of recorded measurements, the symmetry and the influence of the preprocessing technique hold true for the follow-up period, as they did in the roll-in. However, because the follow-up period has a notably longer duration than the roll-in, the number of recorded cases is significantly higher and the changes in the distribution are less distinguishable.


5 Experimental setup

This section describes the hyperparameter tuning and validation techniques that were used in the experiments, as well as the different groups of variables included in each model that was built.

5.1 Hyperparameter Tuning

Hyperparameter tuning is the process of choosing a set of optimal hyperparameters for a learning algorithm. This procedure has been performed by grid search, which works by searching exhaustively through a specified subset of hyperparameter values.

5.1.1 Hyperparameter Tuning in Survival Support Vector Machines

As shown in Equation 6, the hyperparameter of interest is α. The values for which the performance of the model was tested are 0.1 to 1 with a step of 0.1. The chosen value for the model is 0.4, which means that the amount of regularization is slightly increased.
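The thesis does not name the Python implementation here; assuming scikit-survival's FastSurvivalSVM, where the regularization weight is exposed as alpha, the grid search could be set up as in the sketch below, shown on synthetic data standing in for the preprocessed features.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sksurv.svm import FastSurvivalSVM
from sksurv.util import Surv

# Synthetic stand-in data: 200 patients, 5 features, censored survival outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = Surv.from_arrays(event=rng.integers(0, 2, 200).astype(bool),
                     time=rng.exponential(scale=365, size=200))

# Search alpha over 0.1 ... 1.0 in steps of 0.1, scored by the estimator's
# default concordance index, with 5-fold cross validation (Section 5.2).
param_grid = {"alpha": np.round(np.arange(0.1, 1.01, 0.1), 1)}
grid = GridSearchCV(FastSurvivalSVM(max_iter=1000, random_state=0),
                    param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)   # the thesis reports alpha = 0.4 on its own data
```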

5.1.2 Hyperparameter Tuning in Gradient Boosting

The hyperparameters that were chosen to be tuned in the gradient boosting method are:

1. learning rate: Shrinks the contribution of each tree by its value. The values for which the performance of the model was tested are 0.1 to 0.9 with a step of 0.1. The chosen value for the model is 0.5.

2. n estimators: Number of boosting stages to perform. The values are 100 to 1000 with a step of 100. The chosen value of the model is 400.

3. max depth: Maximum depth of the individual regression estimators. The corresponding values are 1, 3 and 5. The chosen value for the model is 3.

4. min samples split: The minimum number of samples required to split an internal node. The corresponding values are 1, 2 and 4. The chosen value for the model is 2.

5. min samples leaf: The minimum number of samples required to be at a leaf node. The corresponding values are 1, 2 and 4. The chosen value for the model is 1.

The chosen values for each factor result in the highest C-index out of all possible combinations.
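Assuming scikit-survival's GradientBoostingSurvivalAnalysis as the implementation (not stated explicitly in the text), the grid search could look as follows; a reduced grid is shown to keep the example small, and the full value ranges listed above would be substituted in practice.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sksurv.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.util import Surv

# Synthetic stand-in data for the preprocessed feature matrix and outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = Surv.from_arrays(event=rng.integers(0, 2, 300).astype(bool),
                     time=rng.exponential(scale=365, size=300))

# Reduced grid for illustration; the thesis searched the wider ranges above.
param_grid = {
    "learning_rate": [0.1, 0.5],
    "n_estimators": [100, 400],
    "max_depth": [1, 3],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}

grid = GridSearchCV(GradientBoostingSurvivalAnalysis(random_state=0),
                    param_grid, cv=5)   # scored by Harrell's concordance index
grid.fit(X, y)
print(grid.best_params_)
```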

(42)

5.1.3 Hyperparameter Tuning in Survival Random Forests

The procedure was repeated separately for survival random forests, since it was the only method run in R 3.5.3, whereas all the others were run in Python 3.7.3. The hyperparameters that were chosen to be tuned in the survival random forests method are:

1. ntree: Number of trees in the forest. The values are 100 to 1000 with a step of 100. The chosen value of the model is 200.

2. nodedepth: Maximum depth to which a tree should be grown. The corresponding values are 1, 3 and 5. The chosen value for the model is 3.
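The thesis ran this method in R; purely for comparability with the other sketches, an analogous Python setup with scikit-survival's RandomSurvivalForest is shown below, where n_estimators and max_depth play the roles of ntree and nodedepth. This is an illustrative substitute, not the implementation actually used.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

# Synthetic stand-in data, as in the previous sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = Surv.from_arrays(event=rng.integers(0, 2, 300).astype(bool),
                     time=rng.exponential(scale=365, size=300))

# n_estimators corresponds to ntree, max_depth to nodedepth.
param_grid = {"n_estimators": [100, 200, 500],
              "max_depth": [1, 3, 5]}

grid = GridSearchCV(RandomSurvivalForest(random_state=0),
                    param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```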

5.2 Validation

The validation technique used for the models is k-fold cross validation. Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample [40]. The procedure has a single parameter, k, which refers to the number of groups that a given dataset is to be split into; hence the name k-fold cross-validation. Cross-validation is primarily used in applied machine learning to estimate the predictive performance of a machine learning model on unseen data. By using a limited sample, an estimate is obtained of how the model is expected to perform in general when making predictions on data not used during training. Its simplicity makes it a popular method, and it generally results in a less biased estimate of the model's performance than other methods, such as a simple train/test split.

The steps of the k-fold cross validation are:

1. Shuffle the dataset randomly.

2. Split the dataset into k groups.

3. For each unique group take it as a hold out or test data set and take the remaining ones as a training data set. Afterwards, fit a model on the training set, evaluate it on the test set, retain the evaluation score and discard the model.

4. Summarize the skill of the model using the sample of model evaluation scores.

Each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. Therefore, each sample is used in the hold-out set once and used to train the model k-1 times. The results of a k-fold cross-validation run are summarized with the mean of the model skill scores. In the literature, k is usually 10 or 5. Due to the large size of the datasets in the experiments, the number of folds chosen was 5, since a larger number of folds would considerably increase the computational time.
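A minimal sketch of the 5-fold procedure with the concordance index as evaluation score is given below, again on synthetic data and assuming scikit-survival; the shuffling, splitting, fitting and averaging steps mirror the list above.

```python
import numpy as np
from sklearn.model_selection import KFold
from sksurv.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.metrics import concordance_index_censored
from sksurv.util import Surv

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = Surv.from_arrays(event=rng.integers(0, 2, 300).astype(bool),
                     time=rng.exponential(scale=365, size=300))

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    # Fit on the training folds, evaluate on the held-out fold.
    model = GradientBoostingSurvivalAnalysis(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    risk_scores = model.predict(X[test_idx])
    c_index = concordance_index_censored(y[test_idx]["event"],
                                         y[test_idx]["time"],
                                         risk_scores)[0]
    scores.append(c_index)

# Summarise the model's skill as the mean score over the 5 folds.
print("mean C-index:", round(float(np.mean(scores)), 3))
```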
