
Can predictive modeling be used to optimize the timing of discharge from the intensive care?

Annanina Koster
University of Groningen


Abstract

This thesis investigates whether, in order to diminish rising health care costs, the timing of discharging patients from the intensive care (IC) unit at the VU Medical Center (VUMC) can be optimized using predictive modeling. Discharging a patient too early leads to higher probabilities of mortality and readmission, which bring about unnecessarily high costs, while keeping patients on the IC for too long is also unnecessarily expensive. Accurate predictions of mortality and readmission probabilities can therefore help physicians decide on the best moment to discharge patients, and thus reduce costs.

Using a Logistic Regression and Neural Networks to model the binary classification problem of readmission/mortality versus no readmission/mortality, and using Logistic LASSO both as a feature selection method and as a model in its own right, it was found that Logistic LASSO outperforms a Logistic Regression and performs equally well as a Neural Network with or without feature selection. The models aim to minimize both the errors of incorrectly predicting readmission and mortality and those of incorrectly predicting no readmission and mortality.


Acknowledgements

First of all, I would like to thank my supervisor from the University of Groningen, Tom Boot, for all the critical questions he asked me and the time he took discussing this thesis with me.

Then I would like to thank Pacmed for providing me with this interesting dataset, showing me all the ins and outs of medical data analysis and providing me with a comfortable and inspiring work environment. I especially want to thank my supervisor Hidde Hovenkamp for all his hard work on the data, his bright ideas and all the brainstorm sessions he had with me. Moreover, I would like to thank my team member Mattia Fornasa for helping me out with some algebraic difficulties. Additionally, I want to thank the rest of the Pacmed team for helping me out with random Python and data science questions and providing me with a welcoming and inspiring surrounding for writing this thesis.

I would also like to thank Robert Feyerharm (SAS) for helping me out with the Logistic LASSO Coordinate Descent algorithm, and Robert Tibshirani and Trevor Hastie for answering my question regarding Logistic LASSO unexpectedly quickly.


Contents

1 Introduction
  1.1 Problem definition
  1.2 Modeling
  1.3 Hypotheses
2 Data Description
  2.1 Medical background and hypotheses
  2.2 Preprocessing
    2.2.1 Patient selection
    2.2.2 Aggregation of time-dependent variables
  2.3 Final dataset
    2.3.1 Summary statistics
    2.3.2 Dealing with missing observations
3 Methodology
  3.1 Theory
    3.1.1 Logistic Regression [13, 36, 42]
    3.1.2 Logistic LASSO [29, 31]
    3.1.3 Neural Networks [36, 47, 70, 85]
    3.1.4 N-fold cross-validation [36, 42]
    3.1.5 ROC curve, AUC & PR curve [36, 42]
    3.1.6 DeLong test [23, 34, 91]
    3.1.7 Feature importance
  3.2 Implementation
    3.2.1 Logistic Regression & LASSO
    3.2.2 Neural Networks
    3.2.3 Models
    3.2.4 N-fold cross-validation
    3.2.5 Feature importance
4 Results
  4.1 Missing observations imputation
  4.2 ROC curves, AUC values & PR curves
  4.3 Predictions
  4.4 Feature importance
    4.4.1 Variable categories
    4.4.2 Variables
5 Conclusion
6 Discussion
References
Appendices
  Appendix A: Tables with descriptives initial dataset
  Appendix B: Table with summary statistics final dataset
  Appendix C: Table with results after imputation with the median
  Appendix D: Figures with ROC and PR curves
  Appendix E: Figures with feature importance plots


1 Introduction

Over the past decades, concerns about rising health care costs have popped up all over the world [1,3,4,11, 22,27,52,58,64,66,75,95]. These rising costs generate problems such as increasing tax rates [1], higher unemployment rates in the health care sector due to higher labor costs [1] and eventually lower welfare for the world population due to lower quality health care services [1,22]. Several factors are believed to contribute to this phenomenon, such as the ever growing and aging population [22,52,58, 64,75], rapidly advancing medical technology [1,11, 22, 27,58,75], rising labor costs in the medical sector [75], the lack of cost-containment measures for new treatments and drugs [4,58] and the increasing occurrence of chronic diseases [1, 75]. There is also research suggesting an overall increasing unhealthy lifestyle as the main cause of the rising health care costs. This last problem could be solved by long-term solutions such as promoting a healthier lifestyle [1,58,66,95].

Nevertheless, a short-term solution to diminish these rising costs would be to try and reduce the immediate costs, such as those of using expensive technologies. However, this should preferably not negatively affect the level of welfare [1,66]. Hence, a decision should be made as to which costs should be reduced and how. These types of decisions can be made using the large amount of data available in the medical sector [1,15] for support, as can be seen in this thesis.

One instance where health care costs are especially high is the intensive care (IC) unit [20,97,101]. The high frequency with which actions are performed, the expensive technologies and treatments used and the continuous attendance of high-skilled, scarce and expensive staff members make the specific care needed by patients on the IC exceptionally expensive. Moreover, the demand for intensive care is rising [15,20], causing an increase in these already astonishing costs.

According to the Alliance for Health Reform (2012) [1] and the PWC Health Research Institute (2017) [4], unnecessary costs are frequently incurred in health care [58]. These costs, when detected, can in theory be cut without further consequences. On the IC, unfortunate timing of discharge of a patient can bring about two types of unnecessary costs [12,24,46,53].

First of all, keeping a patient on the IC while the intensive care provided there is no longer needed brings about unnecessarily high costs and risks for the patient [9,46]. Spending a day on the IC is costly, as mentioned before, and invasive for the patient [24], and spending unnecessary time on the IC should therefore preferably be avoided.

Secondly, it was found that being discharged from the IC too early leads to a higher chance of readmission and mortality [14,24,43,46]. When patients are discharged from the IC, they are believed to be in a stable condition [46] and are expected not to return to the IC or pass away [48], with the exception of patients who explicitly have a no-readmission status after being discharged. When patients decease after being discharged, the intensive care provided to them was not effective; as this expensive care has only limited availability, it might have been used to save another life. Patients readmitted to the IC often arrive in a worse condition than at the initial admission [24], requiring more expensive treatments and intensive care. This results in an increase of the length of stay of a patient on the IC [9,12,43,46,48,53], each day being expensive, and increased mortality probabilities [9,12,14,24,43,48,53]. Moreover, readmissions to the IC are unexpected, causing a higher chance of an adverse outcome [46]. These unexpected readmissions may also cause early discharge of more stable patients [43], which might increase their chance of being readmitted.

Hence, both too early and too late discharges can bring about unnecessary costs. As these costs can be high due to the costly intensive care environment, finding the optimal moment to discharge a patient can reduce total IC costs and decrease both mortality and readmission rates after discharge [24,43,97].

1.1 Problem definition


When the timing of discharge is optimized, length of stay and consequently costs at intensive care units may decrease. Therefore, in collaboration with Pacmed and the Intensive Care Unit (ICU) at the VU Medical Center (VUMC) in Amsterdam, this thesis focuses on the following question.

Can predictive modeling be used to optimize the timing of discharge from the intensive care?

In order to answer this research question, the substantial dataset of the VUMC intensive care unit will be used to try and obtain accurate predictions of mortality and readmission probabilities after discharge. The number of incorrectly predicted cases will be minimized. The predicted probabilities can then be used by physicians when deciding whether or not to discharge a certain patient at a specific moment in time. Perfect predictions would result in optimal timing of discharge.

Optimally timing discharge from the intensive care benefits not only the hospital, but the patient as well. Staying on the intensive care is unpleasant for patients, and being readmitted even more so [24]. Thus, finding an optimal moment for discharge will minimize their length of stay and their chance of readmission and mortality.

In recent years, as in the rest of the medical sector, intensive care units have become quite understaffed [18,30,40]. Employees are expensive and scarce. The physicians have long shifts and have their minds on many matters at the same time. It has been shown by Tarnow-Mordi et al [94,48] that when the workload on the IC increases, more patients decease. Since the issue of whether or not to discharge a patient is a pressing and costly matter of great importance, any automated (inexpensive) tool supporting the physician in making this decision faster and more precisely could be of great value.

1.2 Modeling

The data collection of an IC unit is immense. Apart from known demographic data of a patient, during a stay on the IC many values are recorded, most of them at high frequency, such as several (non-)invasive heart rate measurements, various blood measurements and mechanical ventilation rates. The dataset provided by the VUMC contained not only measurements of many variables over the entire stay at the IC, but also many patients. This results in enormous amounts of data which can be very informative for physicians when analyzed properly [1,2,15,49]. Predictive models can often find patterns in large amounts of data that humans tend to overlook [49]. This may provide data-based answers to some pressing questions IC physicians may have, such as whether to perform certain routine tests [6,15], what the risk factors are for mortality or readmission [9,20,43,81,82] or when to discharge a patient [19,33,38,39,49,67]. Concerning the last of these, at the moment of discharge from the IC physicians believe that the patient is in a stable condition [46,80]. If the patient's stability deteriorates afterwards, this was not foreseen by the physician. Appropriate models might be able to find patterns in the data that can predict readmission and mortality rates after discharge accurately. Physicians can use these predictions to determine whether or not to discharge a patient at a certain moment in time. Risk factors for readmission and mortality may also be provided by these models, possibly providing physicians with new insights as to what increases readmission or mortality probabilities after discharge [43,46].

This immense, unique dataset does provide some challenges. First, the variables measured over time do not all have the same frequency. Therefore, they have to be summarized by some statistics per patient, which increases the dimensionality of the data immensely. Moreover, the measurements taken at the IC are all used to determine the stability of a patient and are therefore quite correlated with each other. Furthermore, not all variables are expected to be linearly related to the outcome. Finally, the number of readmissions and deaths is small (5%), resulting in an imbalanced dataset. Predictive models have trouble finding patterns in highly imbalanced datasets [102]. The fact that the IC at the VU is a general IC, with a mixed population of patients, makes it even harder to find patterns to identify patients with possible adverse outcomes after being discharged.


Various predictive modeling techniques for classification are available. In this thesis, the predictive power of a Logistic Regression (LR) and a Neural Network (NN) will be compared. Logistic LASSO (LL) will be used as a feature selection technique, and its predictive power as a model on its own will also be compared.

Logistic regression is an econometric model used for classification problems [13]. In the field of medicine this model has also been used quite regularly to model and/or predict the outcome of patients [9,12,14,19,20,27,33,38,39,41,43,44,49,61,67,79,81,82,84,98]. Also in research on intensive care units, logistic regression has often been used to model and/or predict mortality or readmission probabilities [9,12,14,19,20,33,43,49,67,81,98].

Nowadays, Neural Networks are regularly used in many different fields, from agriculture to energy to medicine, in order to predict outputs accurately [5,19,27,38,41,49,57,61,71,84]. Moreover, in research on IC units they are used quite often for classification problems on the outcome of patients [19,27,49]. Neural Networks have also been implemented in Econometrics starting from the 1990s, mainly in the fields of finance and macroeconomics, where they are used for forecasting time series. Moreover, nonparametric estimation and learning by economic agents can also be performed by Neural Networks [8,17,28,32,37,45,51,69,96].

Logistic Regression and Neural Networks have been found to be the most often used models in the field of biomedicine by Dreiseitl and Ohno-Machado (2002) [26]. A Neural Network is in fact a generalization of a Logistic Regression that can detect non-linearities and multicollinearity in the data itself [19,24,26,27,36,38,41,45,70,84]. It can also, computationally speaking, be fit more easily on high-dimensional datasets, although overfitting is often a problem [19,26,36,70]. Logistic LASSO can be used to decrease the dimensionality and multicollinearity of the data [42,103] and fit the models on the important variables only, or as a predictive model on its own. Due to the shrinkage of estimates, it should generalize better and thus have good predictive power [103]. Overfitting can also be prevented by using cross-validation, as explained in Section 3.1. Another advantage of Neural Networks over a Logistic Regression is that when using online updating or mini-batch updating instead of batch updating, the network can be updated when a single new observation or a small number of observations becomes available, instead of fitting the entire model on the new, complete dataset again [70]. This means that for the IC data, when the outcome of a new patient is known, the model can directly learn from this new information and apply it to the next patient. A disadvantage of Neural Networks is that they have many unknown parameters that have to be tuned in order for the model to perform well, such as the number of hidden layers and their respective sizes [36,47]. All three models can experience difficulty with finding patterns in imbalanced datasets with a large variety of patients. However, the large number of available observations may help to overcome this problem.

The main conclusions from research comparing Neural Networks to a Logistic Regression are often that the Neural Network slightly outperforms the Logistic Regression. However, the increase in accuracy is negligible compared to the interpretability of the classical logistic model. Non-interpretable "black box" models such as Neural Networks are only preferred when they significantly improve the predictive power. Therefore, in the end, both models are usually rated equally well and a Logistic Regression would be preferred for implementation [19,49,61,84]. Previous similar research in the medical field on readmission and mortality probabilities at intensive care units often shows an improvement of a Neural Network over a Logistic Regression [27,49] or over the physicians themselves [39,62,88], or similar performances [19]. These studies, however, include only a few variables [14,19,81], only include variables that are obtained at the beginning or during the first 24 hours of the admission to the IC [19,33,81], have only very few observations [43,48,98] or are outdated [81].

1.3 Hypotheses

Performance of the models will be measured using the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve, since a higher AUC value indicates a high true positive rate and a low false positive rate (high true negative rate), which indicates good performance. Most similar research uses these metrics, making them good for comparison [26, 33, 39, 41,


Making use of all information included in the dataset of the VUMC, it is expected that a Neural Network can obtain an AUC higher than the 0.7 reported in Desautels et al (2017) [24] and in [14], or than the Logistic Regression used on only data from the first and last day as in [43], which gives an AUC of 0.75. However, improving upon the 0.9 AUC of [49] will be hard, as that study only predicts mortality, which is easier to predict.

Moreover, it is expected that a Neural Network will outperform a Logistic Regression as it can handle non-linearities in the data without pre-specifying them [36, 42]. It is also expected that Logistic LASSO will outperform a regular Logistic Regression, because it generalizes better on never-before-seen data. Moreover, a Neural Network with LASSO used for feature selection is expected to perform slightly better than without, due to a lower chance of overfitting. Finally, a Neural Network is expected to outperform Logistic LASSO due to its ability to handle non-linearities in the data.

When there are no non-linearities present in the data, Logistic LASSO is expected to perform best, as it can deal with high-dimensional data (unlike a regular Logistic Regression) and is less prone to overfitting (unlike a Neural Network). The best model is expected to give accurate predictions of mortality and readmission probabilities that can be used to determine the best moment to discharge a patient from the IC. This will reduce costs and mortality and readmission rates.

This thesis will contribute to the existing literature by fitting a predictive model to this unique intensive care dataset to predict readmission and mortality probabilities after discharge using all available data. The fitted models will be critically reflected upon in terms of performance. Since every ICU has its own characteristics, models can hardly ever be generalized to other units [48], making this research unique. Similar approaches, however, can be used for other IC units in the Netherlands or abroad. If the predicted probabilities are accurate enough, they can be used to support physicians when deciding when to discharge patients from the IC. The models may also provide important predictors which may not have been known to the physicians beforehand to have an impact on these probabilities. Better timing of discharge can reduce unnecessary costs made in the expensive IC environment and be valuable for a patient's well-being.


2 Data Description

The dataset used in this thesis was provided by Pacmed and obtained at the Intensive Care Unit (ICU) at the VU Medical Center in Amsterdam. The dataset was already pre-processed by Pacmed and this procedure could not be altered. An overview of this process will be provided in Section 2.2, as it may clarify some assumptions the analyses in this thesis are based on.

The original dataset of the VU contains 217 variables for 24446 patients treated on the 22-bed ICU from 2003 until 2016. The patients in this dataset agreed to their data being used for scientific research. The 217 variables include background information on the patients and recorded values from their entire stay at the IC unit. Since the IC at the VU is a general IC, critically ill patients with a variety of diagnoses are admitted. The recorded measurements are generally measured for all patients at the VUMC ICU and provide information on the stability of their vital functions, which must work properly for a patient to stay alive. These vital functions are breathing and circulation (heartbeat and blood pressure), which make sure the organs, especially the heart, brain and kidneys, receive enough oxygen and can function properly [21,46,48,39,56,86].

2.1 Medical background and hypotheses

Patients are admitted to the IC when their vital functions are threatened, thus when their condition is unstable or at high risk of becoming unstable. They may require, among other things, life support for failure of two or more organs, advanced respiratory support, invasive treatments only provided in the IC environment and/or intensive monitoring. In general this means that patients with potentially reversible organ failure or other life-threatening conditions, who are believed to be able to return to a stable healthy condition, receive temporary physiologic support [60,68,90]. The goal of a stay on the IC is to stabilize patients such that they can be safely discharged to a Medium Care (MC) Unit or a regular hospital ward, where they can be treated for the underlying cause of the life-threatening symptoms. Core measurements for stability are heart rate, blood pressure, respiratory rate, pulse oximetry, hourly urine output, temperature and blood gases [90]. These can be measured in various variants, resulting in the 217 variables.

Reasons for admission vary among patients, but the most common ones are in general respiratory problems with ventilator support needed, heart problems, intracranial (within the skull) bleeding or cerebral (brain) infarction, and sepsis (blood poisoning, infections) without mechanical ventilation. Other conditions and procedures regularly leading to IC admission are poisoning and toxic effects of drugs, excessive fluids in the lungs and respiratory failure, heart failure and shock, cardiac arrhythmia and conduction disorders, renal (kidney) failure with major complication or comorbidity (side disease), gastrointestinal hemorrhage (bleeding) with complication or comorbidity, and diabetes with complication or comorbidity [2,48,74]. Airway, respiratory, circulation and neurological interventions are top priority [90].

The general IC at the VUMC admits patients with serious illnesses, patients that suffered a serious accident, patients that are recovering from a major operation and patients that suffer from one of the previously mentioned or other life-threatening conditions.

Table 1: Description Initial Dataset: Simple Variables

Type of variable   Variable name        Explanation                Units   Logical min   Logical max
general            admission_weight     Weight at admission        kg      35            200
                   admission_length     Length at admission        cm      125           225
                   age                  Date of birth                      0             120
                   urine_catheter       Urinary-Catheter (CAD)
respiration        tracheo              Tracheostomy tube
                   detub_intub_ratio    Tube
                   respirated           Mechanically ventilated


It has more cardio patients than the average ICU in the Netherlands. Many of these admissions have their origin in planned cardiothoracic (heart, lungs) surgeries. Since patients recovering from surgeries have expected admissions, they have a lower chance of readmission according to Kaben et al (2008) [46].

Reasons for readmission are often the same as for the first admission when patients are readmitted quickly [9,12,14]. Other regularly occurring reasons for readmission are infections [14,20,41], respiratory and cardiac problems [12,14,97] or acute liver failure [20].

All 217 variables in the dataset have been assigned to one of 5 groups: time-independent (simple) variables (7), which do not need aggregation as they have only been measured at one point in time, numeric variables (95), categorical variables (51), ordinal variables (5) and medication types (59). Each group will be aggregated differently later on. Table 1 and Tables 16, 17, 18 and 19 in Appendix A show the division of the variables over these 5 groups. The variable names can be found in the second column of these tables and their explanation in the third. When available, the units in which the variables were measured and their logical minimum and maximum are also given. Values outside of these ranges were discarded. The only exception was the height of a patient, which was sometimes entered in centimeters instead of meters and could then be divided by 100 in order to obtain the correct value. In the end, most variables rarely had values outside of these bounds; the only exceptions were three variables with around 10% of their values discarded. This indicates that, in general, the data points were measured quite accurately and almost no typing mistakes were made when manually entering them into the system. The variables were divided into 21 categories, such as respiration, cardio and blood pressure, as can be seen in the first column of these tables.
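As a minimal sketch of this kind of range check, the snippet below sets values outside their logical bounds to missing rather than dropping the patient. The DataFrame, column names and the LOGICAL_BOUNDS mapping are hypothetical illustrations based on Table 1, not the actual Pacmed pipeline; the unit correction for patient height mentioned above is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the (min, max) bounds follow Table 1.
LOGICAL_BOUNDS = {
    "admission_weight": (35, 200),   # kg
    "admission_length": (125, 225),  # cm
    "age": (0, 120),                 # years
}

def apply_logical_bounds(df: pd.DataFrame) -> pd.DataFrame:
    """Set values outside their logical range to NaN rather than dropping the patient."""
    df = df.copy()
    for col, (lo, hi) in LOGICAL_BOUNDS.items():
        if col in df.columns:
            outside = ~df[col].between(lo, hi)
            df.loc[outside, col] = np.nan
    return df
```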

The category "respiration" contains 19 variables concerning (mechanical) ventilation and oxygen administering. Since mechanical ventilation is the most common technological support used on the IC [9,74,90] and respiratory problems are the most important reason for admission to the IC [9,43,46,48,97], this is hypothesized to be an important category of variables. Many vital problems come down to respiratory issues, as the vital organs need oxygen in order to function properly, and respiratory problems can indicate mortality and readmissions [48].

"Arterial_blood_gases" contains 5 variables mainly concerning the oxygen and carbon dioxide saturation of the blood. When these values are off they can cause serious damage to organs. However, there is a high correlation with mechanical ventilation [65] and according to [10] patient outcomes can still be predicted accurately with fewer measurements of these blood gases. For these reasons this category is hypothesized not to be of great importance.

"Blood_pressure" consists of 11 variables on blood pressure. Low blood pressure occurs when patients suffer from sepsis or heavy blood loss and can through these influence mortality rates. High blood pressure can be indicative of mortality [89,92]. However, it is a long-term problem and can only indicate mortality when causing other life-threatening symptoms. All in all, this category may not come up high in the importance ranking.

The 2 "cardio" variables measure the heart activity, a vital function of the body. It is one of the main causes of admission to the IC [48,49]. Therefore it is an important predictor of mortality for more patients than just cardio patients [43,48]. Consequently, it is expected to be important in this prediction problem.

"Chemistries" contains 8 variables which measure various blood values such as protein, glucose and bilirubin. Wide variations in blood glucose and protein levels, for example, have been closely tied to increased mortality [46,54]. The main reason for this is that IC patients are not in a condition to show signs of the consequences of these extreme blood measurements. Therefore, this category may be indicative.


(delirium or confusion assessment method for the ICU). A significantly increased consciousness when being discharged was found to indicate good quality of life afterwards [43,93]. Consciousness scores can also be used to predict mortality and readmissions [48]. This may reduce readmission and could therefore be an important variable category.

The 11 "electrolytes" measurements, such as ureum or kalium, can indicate mortality when imbalanced [55]. Elevated ureum levels can result in brain dysfunction, whereas low levels of kalium can cause heart rhythm dysfunction. This occurs quite frequently on the IC. Therefore, it is hypothesized to be an important variable group.

The category "hematology" contains 4 variables measuring several blood values concerning thrombosis, bleeds and transfusions. These can indicate bad outcomes of the IC stay. However, they can also be highly correlated with other types of variables, such as infections and cardiac problems, and consequently this is hypothesized not to be the most important group of variables.

"Infect_blood" consists of 11 variables that are all types of white blood cells that can indicate infections. Through sepsis, the overactive and toxic response to an infection, infections can cause mortality and this may therefore be an important group of variables [99]. Infections can also indicate readmissions [14,43,46,48].

"General well-being" has 2 variables representing the Nurse Activity Score both on the medium care and intensive care units. In Brazil it was found that a high NAS at discharge predicts higher mortality probabilities and a lower chance of readmission. However, a high NAS on the first day at the IC lowers the mortality probability [20]. Thus, NAS is expected to be predictive in this problem definition.

The 12 different places at which "temperature" is measured can give slightly varying values. These may be combined later on. High values mainly indicate infections [59,72]. Thus, temperature can be an important indicator for mortality or readmissions [48,59,72,73]. However, due to its strong relation to "infect_blood", and a high temperature being a side effect of having an infection, it is expected not to be very important. When predicting, two correlated variables will not be used at the same time. If one is slightly more informative than the other it will be kept in the model; otherwise one will be selected at random.

The "urine" category contains only 1 variable which gives urine output in ml. Urine output is often used as an indicator for acute kidney injury, and as the kidney is one of the vital organs and renal failure can indicate an increased probability of readmission [43,46,48] or mortality [48] this category is hypothesized to be of great importance.

The 41 "ECG-characteristics" measured are expected to be predictive since they also monitor cardiac measurements and therefore assess heart function [97]. Thus, they are comparable to cardio and hematology measurements. Probably not all of them will be used for predicting at the same time, and this category may or may not be important.

The category "general" contains 8 values that did not fit into any of the previous categories. These are the admission weight and length, age, gender, a binary variable indicating whether a urinary catheter was present at some point during the stay, and 3 ordinal variables indicating the number of bronchoalveolar lavages present, the scale of coughing and the tube size of the tracheostomy. According to Kaben et al [46] and [48,50] a higher age may also be predictive. Sex was also found to be predictive by some: some research found the male sex to have great influence [43], whereas other research found the female sex to be an important predictor [48]. This strongly depends on the IC unit at hand.


should not be discharged from the IC. Thus, it is expected that some of these medication types might be indicative. For instance, sedative medication was found to be an important indicator for readmission in [9].

It can be concluded that most categories of variables might be of predictive importance, as may combinations of some of them, as they all reflect the main vital functions that are essential for every patient’s stability. However, not all variables within one category have to be predictive and combinations of single variables might be as well.

For the 217 single variables it is much harder to hypothesize which ones are of great importance. However, it can be assumed that some respiratory-related variables, some heart-related variables, some blood measurements such as white blood cells (infections), O2 saturation and glucose, a temperature measurement, urine output, age, gender, length of stay [9,48,50] and some types of medication could be important variables when predicting mortality and readmission probabilities.

2.2 Preprocessing

In order to prepare this dataset with 24446 patients and observations from their entire stay at the IC, several assumptions were made and acted upon by Pacmed. The process described in this section could not be adjusted.

First of all, variables with fewer than 50 observations available were removed since they are likely not to be informative. However, variables with very low variability among the observations were not removed at this point in time. Moreover, the "length_of_stay" on the IC was manually added to the dataset and was derived as the departure time minus the arrival time and rounded to a number of days.

Afterwards, some filters were applied in order to select only relevant patients for this research. Then, the time-dependent variables were aggregated to daily levels and then to patient level.

2.2.1 Patient selection

First of all, the few patients admitted before 2003 (who were still on the IC at the beginning of 2003) were removed as their records were not complete. Patients younger than 18 were also discarded as children need special pediatric care [9,12,24,33,43,46,48]. Moreover, all patients were required to have at least one measurement during their entire stay of all of the following four core measurements in order to be considered valid patients: at least 1 heart rate measurement, 1 hematocrit value, 1 creatinine measurement and 1 intravenous (directly into the veins) type of medication received. These core measurements were derived from the vital life functions. Afterwards, patients who could not be found in the MDB (Microsoft Database: in the VU this is the Electronic Patient File of the whole hospital, the Hospital Information System) were removed, as for these patients it is unclear what happened to them after being discharged. Finally, the length of stay of a patient had to be between 0.5 and 7 days in order for the stay to be informative. Too short stays do not give enough data [39] and extremely long stays do not occur often enough to be informative [12,39]. Furthermore, mortality and readmissions were only counted when they occurred between 8 hours and 7 days after being discharged. When a patient "fails" too quickly, which means either having deceased or being readmitted, this should not be taken into account [24], and when failure occurs after 7 days it is considered an event uncorrelated with the preceding discharge. The value of 7 days was chosen because some research used quite low values of 48 or 72 hours, resulting in a significantly smaller dataset [9,12,14,24,48], and other research [39] used much larger values (30 days), which does not add many more useful patients since after 7 days most patients have left the IC [9,12,39].
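Expressed in code, the filters above could look roughly as follows. This is a sketch only: the column names (age, length_of_stay, the counts of the four core measurements, the admission year and an in_mdb flag) are hypothetical stand-ins for whatever the actual dataset uses.

```python
import pandas as pd

def select_patients(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the patient-selection filters described above (hypothetical column names)."""
    mask = (
        (df["admission_year"] >= 2003)              # incomplete pre-2003 records removed
        & (df["age"] >= 18)                         # no pediatric patients
        & df["length_of_stay"].between(0.5, 7.0)    # stay between 0.5 and 7 days
        # at least one measurement of each of the four core variables
        & (df["n_heartrate"] > 0)
        & (df["n_hematocrit"] > 0)
        & (df["n_creatinine"] > 0)
        & (df["n_iv_medication"] > 0)
        & df["in_mdb"]                              # traceable after discharge
    )
    return df.loc[mask]

def adverse_outcome(hours_to_event: float) -> bool:
    """Count readmission/mortality only between 8 hours and 7 days after discharge."""
    return 8.0 <= hours_to_event <= 7.0 * 24.0
```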


Figure 1: Histograms of interesting variables. Panels: (a) mean admission weight in kg; (b) mean heart rate per minute; (c) mean diastolic PAP (average) in mmHg; (d) mean respiratory rate per minute; (e) mean urine output in ml; (f) mean temperature in degrees Celsius; (g) admission age in years; (h) mean glucose in the blood; (i) average O2 saturation of the blood; (j) length of stay in days.

Note: This figure shows the histograms of both the non-readmitted/deceased patients (blue) and the readmitted/deceased patients (red). The height of the bars does not represent the count, but the percentage of values in the bin. All variables are the daily averages over all days of the entire stay; thus, more than one observation can be present per patient.

2.2.2 Aggregation of time-dependent variables


To daily level

Since most variables are not measured with the same frequency for all patients, these variables were first aggregated to a daily level to obtain similar data points for all patients. Since at the VUMC a morning shift starts at 08:00, it was assumed that a day consists of 24 hours and starts and ends at this time [2]. This means that a first or last day can contain values measured over less than 24 hours. This assumption was made since it gives, with the exception of the first and last day, all patients the same 24 hours over which daily measurements are calculated. Thus, daily patterns are the same for all patients. Therefore, these measurements can easily be compared, something that would not be the case for a rolling day start [39].

In order to capture as much information on a daily level as possible, for all numeric and ordinal variables the minimum, maximum, mean, standard deviation, the slope of a linear regression over all the observations that day, a binary variable indicating whether the variable was measured at all that day and a count of how often it was measured that day are calculated. These capture the most important daily information according to C. Hug (2009) [39] and keep more information than in [24], where only the average is used. Categorical variables are transformed into dummy variables per category. For each dummy the sum, maximum value (0 or 1) and mean are calculated per day. For the measurements on medication, for each day the value administered per dose is used, as well as the value per dose per kilogram of body weight of the patient and a binary variable indicating whether the medication was administered at all [39]. This resulted in 777 variables.
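A rough illustration of this daily aggregation is sketched below, assuming a long-format pandas DataFrame with hypothetical columns patient_id, timestamp, variable and value; the 08:00 day boundary is implemented by shifting timestamps back by eight hours before taking the calendar date.

```python
import numpy as np
import pandas as pd

def daily_summary(measurements: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw time-stamped values to one row per patient, day and variable.

    Assumes a long-format frame with hypothetical columns
    ['patient_id', 'timestamp', 'variable', 'value'], where 'timestamp' is a
    datetime column. A 'day' runs from 08:00 to 08:00, matching the morning shift.
    """
    df = measurements.copy()
    # Shifting timestamps back by 8 hours makes the calendar-day boundary
    # coincide with the 08:00 shift change.
    df["day"] = (df["timestamp"] - pd.Timedelta(hours=8)).dt.date

    def slope(values: pd.Series) -> float:
        # Slope of a linear regression of the values on their observation order.
        if len(values) < 2:
            return 0.0
        x = np.arange(len(values))
        return float(np.polyfit(x, values.to_numpy(dtype=float), 1)[0])

    return (
        df.groupby(["patient_id", "day", "variable"])["value"]
          .agg(min="min", max="max", mean="mean", std="std",
               slope=slope, occurred=lambda v: 1, count="count")
          .reset_index()
    )
```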

In Figure 1, on page 12, the histograms of some background variables are plotted. Per plot, the blue histogram is for the patient without readmission/mortality and gives the percentage of the total number of patients that are in a specific bin of values of the variable being considered. The red histograms represent the same, but for the patients who were readmitted or deceased.

Since admission age is time-independent, the histogram for this variable counts the percentage of observations for each value. For the time-dependent variables, on the other hand, the histograms contain observations for each patient for each day for their entire length of stay. From the 7 daily measurements available the mean was used.

The average admission weight seems to be centered around 75 kg, and the peaks are probably rounded values that occur more often. There seems to be almost no difference between the readmitted/deceased patients and the healthy ones. The problematic patients might have slightly lower weight on average.

The average heart rate seems to be slightly right-skewed and peak around 75 per minute for non-problematic discharges and is nicely distributed around 90 per minute for problematic discharges.

The average diastolic Pulmonary Artery Pressure (PAP) blood pressure measurements seem to peak around 10 mmHg for the non-problematic group and 20 mmHg for the adverse-outcome patients.

The mean respiratory rate is skewed to the right but clearly peaking around 15 per minute for both groups. For the readmission/mortality group this peak seems to be slightly more skewed.

Average urine output is also skewed to the right with its peak just below 100 ml. It seems to be slightly lower for the problematic discharge group.

The mean temperature is nicely centered around 37 degrees Celsius for both groups with a somewhat longer tail on the left. The admission age is skewed to the left for both groups with its peak for the non-problematic patients around 75 years and for the problematic patients a bit higher.


o2-saturation seems to peak at 5 liters per minute for both groups.

Finally, histogram 1j shows that when looking at the first 7 days, longer lengths of stay may negatively influence the outcome of a patient. However, when looking at more days (the farthest right tail has been cut off at 15 days in this plot), it seems that longer stays may positively influence the outcome of a patient. This plot confirms the assumption that after 7 days there are only very few patients left at the IC, which cannot give enough data to make solid predictions.

The non-available variables in this dataset mean that for an entire day no measurements were available for a certain patient. These values were disregarded when plotting these histograms and set to zero afterwards.

To patient level

In the end, the daily values have to be aggregated to patient level in order to be used in a Logistic Regression or Neural Network. This can be done in several ways. Separate models per day can be estimated or the daily values can be aggregated using various techniques [39].

According to Hug [39], aggregating the data may result in more useful models without losing predictive power. Previous research was often based on the first day of the ICU admission [19,33,49,67,81], since this indicates the severity of the illness, whereas [39] states that the final day can also be of great importance, and [20,43,97] use both. Measurements taken just before discharge show more about a patient's stability, and this can influence the probability of readmission [12,24]. Therefore, a backward weighted average

\[
w_{\text{backward}} = \frac{\left[\frac{1}{d}, \ldots, \frac{1}{2}, \frac{1}{1}\right]}{\sum_{i=1}^{d} \frac{1}{i}}, \tag{1}
\]

where d is the number of days the patient spent on the IC, was used to aggregate the daily data to patient level. It gives more weight to the final day the patient spends on the IC than to the days before; the weights decrease for every earlier day.
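A small sketch of this backward weighted average for a single variable, following equation (1); the only input convention assumed here is that the daily values are ordered from the first to the last day of the stay.

```python
import numpy as np

def backward_weighted_average(daily_values: np.ndarray) -> float:
    """Aggregate one variable's daily values to patient level, following eq. (1).

    `daily_values` holds one value per day, ordered from the first to the last
    day of the stay; the last day receives the largest weight.
    """
    d = len(daily_values)
    weights = 1.0 / np.arange(d, 0, -1)   # [1/d, ..., 1/2, 1/1]
    weights /= weights.sum()              # divide by sum_{i=1}^{d} 1/i
    return float(np.dot(weights, daily_values))

# Example: three days of daily mean heart rates; the last day dominates.
print(backward_weighted_average(np.array([80.0, 90.0, 100.0])))  # about 93.6
```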

2.3 Final dataset

The final dataset provided by Pacmed included 777 variables and 13374 patients. For this thesis, this dataset could not be adjusted.

Table 2: Summary statistics background variables

Variable                 Mean    Std     Min     Max      NaN
mean_admission_weight    79.20   15.98   36.00   186.00   469
mean_heartrate           59.32   25.89   0.00    168.88   0
mean_pap_dia             3.50    5.21    0.00    31.84    0
mean_resp_rate_pat       4.43    4.57    0.00    35.40    0
mean_urine_output        88.98   69.42   0.00    683.33   0
mean_temperature         21.66   11.61   0.00    40.05    0
ad_age                   64.08   15.05   18.03   104.24   0
mean_glucose             5.10    2.74    0.00    29.36    0
mean_resp_o2             2.62    2.04    0.00    18.00    0
gender                   1.33    0.47    1.00    2.00     276
length_of_stay           1.84    1.50    0.50    7.00     0


Table 3: Two-sample t-tests different means

Variable                 Mean_0   Mean_1   Std_0   Std_1   T-statistic   P-value
mean_admission_weight    79.37    75.83    15.96   16.00   -5.41         6.49e-08
mean_heartrate           59.43    57.21    25.67   29.72   -2.17         0.030
mean_pap_dia             3.63     0.99     5.26    3.17    -12.85        1.44e-37
mean_resp_rate_pat       4.49     3.41     4.59    4.51    -5.95         2.72e-09
mean_urine_output        89.85    72.42    69.81   59.06   -6.33         2.46e-10
mean_temperature         21.80    19.00    11.59   11.78   -6.08         1.27e-09
ad_age                   64.01    65.34    15.03   15.29   2.23          0.026
mean_glucose             5.10     5.20     2.74    2.88    0.92          0.36
mean_resp_o2             2.62     2.59     2.04    1.98    -0.46         0.65
gender                   1.33     1.37     0.47    0.48    2.23          0.026
length_of_stay           1.82     2.17     1.49    1.70    5.90          3.81e-09

Note: This table shows the means and standard deviations for readmitted/deceased patients and non-readmitted/deceased patients for 11 background variables. These are then used to calculate the t- and p-values according to equation 2. All variables have n_0 = 12706 observations in the non-readmitted/deceased group and n_1 = 668 in the readmitted/deceased group.

2.3.1 Summary statistics

Descriptive statistics of the variables can be found in Table 20 in Appendix B. It should be noted that the values might look odd since they are a backward weighted average of the daily values and the missing daily values were set to zero. Using zero values in the backward weighted average instead of not taking them into account decreases some of the values at patient level significantly. Since these values should not have been considered, the resulting values do not always represent the true data points. This can negatively influence the results. However, the data preprocessing process could unfortunately not be changed.

The dependent variable is a binary variable with 13374 observations, one for each patient in the final dataset. A value of one means that a patient was either readmitted to the ICU after being discharged, or passed away after being discharged from the ICU, or was first readmitted and then passed away. This adverse outcome occurred for 668 patients (4.99%). Of this 5%, 4.5% were readmitted to the IC and 1.5% passed away, which shows that some people indeed were first readmitted to the IC and then deceased. Having only few cases where y = 1 may give the models trouble in finding a pattern that recognizes for which patients they should predict y = 1. Therefore, the models will be likely to predict zeros where the true value would be one [102]. However, the literature shows that in similar research no attempt has been made to balance the datasets, as the true population is also imbalanced, and it is therefore believed that the models can be estimated on this imbalanced dataset [14,19,24,46].

For "gender" and the background variables from the histograms in Figure 1, the averages of these variables in the final dataset for the readmitted/deceased and non-readmitted/deceased groups have been compared using Welch’s t-test. The t-statistic was computed and compared to a t-distribution with v degrees of freedom as

\[
t = \frac{\text{mean}_0 - \text{mean}_1}{\sqrt{\frac{\text{std}_0^2}{n_0} + \frac{\text{std}_1^2}{n_1}}},
\quad \text{where} \quad
v \approx \frac{\left(\frac{\text{std}_0^2}{n_0} + \frac{\text{std}_1^2}{n_1}\right)^2}{\frac{\text{std}_0^4}{n_0^2(n_0-1)} + \frac{\text{std}_1^4}{n_1^2(n_1-1)}}, \tag{2}
\]

and mean_0 and mean_1 are the means of the non-readmitted/deceased and readmitted/deceased groups respectively, std_0 and std_1 their respective standard deviations, and n_0 = 12706 and n_1 = 668. The values for these tests can be found in Table 3. Since 11 tests are performed, the significance threshold is corrected for multiple testing to 0.05/11 ≈ 0.0045.
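For concreteness, equation (2) can be computed directly from the group summary statistics. The illustration below uses the mean_admission_weight values from Table 3; because the inputs are rounded and the sign depends on the order of the groups, the result need not match the reported t-statistic exactly.

```python
import numpy as np
from scipy import stats

def welch_t_test(mean0, std0, n0, mean1, std1, n1):
    """Welch's t-statistic, degrees of freedom and two-sided p-value, eq. (2)."""
    se2_0, se2_1 = std0 ** 2 / n0, std1 ** 2 / n1
    t = (mean0 - mean1) / np.sqrt(se2_0 + se2_1)
    v = (se2_0 + se2_1) ** 2 / (se2_0 ** 2 / (n0 - 1) + se2_1 ** 2 / (n1 - 1))
    p = 2 * stats.t.sf(abs(t), df=v)
    return t, v, p

# mean_admission_weight summary statistics taken from Table 3.
print(welch_t_test(79.37, 15.96, 12706, 75.83, 16.00, 668))
```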


The average "mean_admission_weight" of the patients was 79.20 (15.98) kg with a minimum of 36.00 kg and a maximum value of 186.00 kg. For 469 patients no admission weight was measured. For the non-problematic group of 12706 patients, the mean is 79.37 (15.96) and for the problematic group of 668 patients this is 75.83 (16.00). This results in a p-value of Welch's t-test smaller than 0.0045, thus rejecting the null hypothesis of the two averages being equal at the 5% significance level.

The average "mean_heartrate" is 59.32 (25.89) per minute with a minimum of 0 and a maximum of 169. For the non-readmitted/deceased patients this is 59.43 (25.67) and for the readmitted/deceased patients this is 57.21 (29.72). This results in a p-value of 0.030, thus the null hypothesis of the two being equal cannot be rejected at the 5% significance level.

The mean "mean_pap_dia", which is the mean average diastolic Pulmonary Artery Pressure (PAP), has an average of about 3.50 (5.21) and ranges from 0.00 to 31.84. For the non-readmissions/mortality group the mean is 3.63 (5.26) and for the readmissions/mortality group it is 0.99 (3.17). This results in a p-value of almost zero, confirming that the two means differ significantly at the 5% significance level.

The average "mean_resp_rate_pat" is 4.43 (4.57) per minute with a minimum of 0.00 as well and a maximum of 35.40. For the group of 12706 patients there is a mean of 4.49 (4.59) and for the group of 668 patients there is a mean of 3.41 (4.51). These values differ significantly.

The average "mean_urine_output" is 88.98 (69.42) ml with a minimum value of zero and a maximum of 683.33. The difference between the two groups, 89.85 (69.81) for the healthy group and 72.42 (59.06) for the unhealthy group, is significant.

The mean "mean_temperature" is 21.66 (11.61) with a minimum of 0.00 and maximum of 40.05 degrees Celsius. This value is considerably lower than was seen in the histogram in Figure 1. This can be explained by the fact that whenever temperature is not measured for an entire day somewhere during the stay, this value will count as zero in the weighted average, thus lowering the mean. For the group not deceased or being readmitted the mean is 21.80 (11.59) and for the group either deceased or being readmitted this mean is 19.00 (11.78). This difference is significant.

The average "ad_age", which is admission age, is 64.08 (15.05) years with a minimum of 18.03 and a maximum of 104.24. For non-problematic patients this is 64.01 (15.03) and for problematic patients this is 65.34 (15.29). A p-value of 0.026 of Welch's t-test shows that the null hypothesis of equal means cannot be rejected at the 5% significance level.

For "mean_glucose" the difference in means for the healthy and non-healthy discharges respectively is 5.10 (2.74) versus 5.20 (2.88). This difference is not significant at the 5% significance level, as the p-value of 0.36 is larger than 0.0045. For the entire sample the average is 5.10 (2.74) with zero as the minimum and 29.36 as the maximum value.

The mean "mean_resp_o2" is 2.62 (2.04) with a minimum value of 0.00 and a maximum of 18.00. The mean for the non-readmitted/deceased group is 2.62 (2.04) and 2.59 (1.98) for the readmitted/deceased group. These means do not differ significantly.

Of the 13374 patients, 8789 (2/3) have value 1 (men) and 4309 (1/3) have value 2 (women). For 276 patients the gender is not available; it was never entered into the system. Of these 8789 men, 418 were readmitted or passed away after being discharged, which equals 4.76%. For women this is 5.66%. This gives an average of 1.33 (0.47) for the non-problematic patients and 1.37 (0.48) for the problematic patients. A p-value of 0.026 indicates that the null hypothesis of equal means cannot be rejected at the 5% significance level.


The non-problematic discharges have a mean length of stay of 1.82 (1.49) days and the adverse-outcome discharges have a mean of 2.17 (1.70) days. This difference is significant.

2.3.2 Dealing with missing observations

Due to the aggregation of the data points to a daily level, many observations were discarded. However, the 7 descriptives are believed to capture the most important daily information. If a daily measurement was not present at all, it was replaced by zero. No distinction can then be made by the model between a true "NaN" (Not-A-Number) value and an actual value of zero. Moreover, it also negatively affects the final values of the backward weighted averages. However, in the final dataset received, these values were already set to zero before the aggregation to patient level. This is believed to decrease the predictive power of the models as information is lost.

After aggregating the time-dependent variables, the time-independent variables still contained some missing observations, which had to be dealt with. Two types of simple imputation were tried. Simple imputation has been shown to be better than simply removing all observations containing a missing value [16]; removing observations was also not possible in this case, since no observations would then be left in the dataset.

First of all, all NaN values that were still in the dataset have also been set to zero, meaning that all values of zero and all non-available measurements received a value of zero. This seems a natural follow-up to the procedure used to obtain this final dataset. Nevertheless, as the non-availables and zeros can now no longer be distinguished it is still expected to negatively influence the performance of the models.

Secondly, all zero values in the final dataset were replaced by NaNs, and then all NaNs per variable were replaced by the median of all available observations of that variable. Simple imputation by the median of each variable has been used as a baseline technique in previous research [16,80,105], as it is easy to implement. However, a drawback of this method is that for the time-dependent variables the backward weighted average has already been taken with the zero values used for the NaNs instead of ignoring them. Thus, the median is downward biased because of this, and imputing with this value causes a loss of information. Moreover, the distinction between true non-available and true zero values is gone, as they were all treated as non-available. This method is not expected to outperform the imputation with zero in this case.
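The two imputation strategies can be summarized as below, assuming the aggregated patient-level feature matrix is held in a pandas DataFrame; this is a sketch of the idea, not the exact Pacmed pipeline.

```python
import numpy as np
import pandas as pd

def impute_with_zero(X: pd.DataFrame) -> pd.DataFrame:
    """First strategy: every remaining NaN simply becomes 0."""
    return X.fillna(0.0)

def impute_with_median(X: pd.DataFrame) -> pd.DataFrame:
    """Second strategy: zeros are treated as missing, after which every NaN is
    replaced by the per-variable median of the available observations."""
    X = X.replace(0.0, np.nan)
    return X.fillna(X.median())
```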


3 Methodology

In order to model the binary classification problem of whether or not a patient is readmitted or deceases after being discharged from the IC, a Logistic Regression and a Neural Network will be used. Both can be used to predict the probabilities of readmission/mortality. The number of false positives and false negatives should be minimized in order to be able to optimally time the discharge of patients. Logistic LASSO will be used to reduce the dimensionality and multicollinearity of the data. This is especially important for the Logistic Regression, as it assumes little multicollinearity of the independent variables. The Logistic LASSO model itself has lower variance than a regular Logistic Regression model and may therefore also be used for predicting. Subsection 3.1 will discuss the theory behind these three models and how to fit them, methods to evaluate and compare their performance, and how to assess the predictive power of the included variables and categories of variables. Subsection 3.2 will explain how these methods have been applied to the data in order to obtain the results presented in Section 4.

3.1 Theory

Let y be an n × 1 vector containing all observations of the dependent variable. Let X* be an n × k matrix where each column is one of the k independent variables. Moreover, let X be an n × (k + 1) matrix, where a column of ones is added as the first column of the matrix X*. Finally, for the classical econometric models, let β* be a k × 1 vector containing the coefficients to be estimated; β* does not include an intercept β_0. Let β be a (k + 1) × 1 vector of coefficients to be estimated that does include this intercept β_0.

3.1.1 Logistic Regression [13, 36, 42]

When modeling a binary classification problem where the dependent variable y has values 0 or 1, as in this thesis, the linear regression

\[
y = X\beta + \varepsilon \tag{3}
\]

should not be applied for estimation of the coefficients β. The predicted probabilities ŷ could fall outside the 0-1 range, which is not possible for a probability. Therefore, a cumulative distribution function (cdf) is applied to the linear regression outcomes to scale them to the 0-1 range. A commonly used cdf in this case is the logit distribution [13]. Applying the logit distribution to equation 3 results in

\[
p_i = \Lambda(X_i\beta) = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}, \tag{4}
\]

where p_i is the probability that y_i = 1. It follows that 1 − p_i is the probability that y_i = 0.

The model in equation 4 can be estimated using the method of Maximum Likelihood (ML). The likelihood function is

\[
L(\beta) = \prod_{i=1}^{n} f(y_i \mid X_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i} = \prod_{i=1}^{n} [\Lambda(X_i\beta)]^{y_i}[1 - \Lambda(X_i\beta)]^{1 - y_i}, \tag{5}
\]

where independence over i = 1, ..., n is assumed. Since admissions of patients to the IC are supposed to be uncorrelated, this assumption most likely holds. In order to obtain the likelihood function, the probability density function is multiplied over all observations i. In this case the probability density function (pdf) is given by the probability mass function of the binomial distribution with one trial, f(y_i | X_i) = p_i^{y_i}(1 − p_i)^{1−y_i}, as y is Bernoulli distributed. The log-likelihood

\[
\ell(\beta) = \sum_{i=1}^{n} y_i \ln(p_i) + (1 - y_i)\ln(1 - p_i) = \sum_{i=1}^{n} y_i \ln[\Lambda(X_i\beta)] + (1 - y_i)\ln(1 - \Lambda(X_i\beta)) \tag{6}
\]

will be maximized, since sums are easier to maximize than products.


Solving this equation will give the ML estimates β̂_ML. Equation 7 can be simplified to

\[
\sum_{i=1}^{n} \frac{y_i - \Lambda(X_i\beta)}{\Lambda(X_i\beta)[1 - \Lambda(X_i\beta)]}\,\Lambda'(X_i\beta)\,X_i' = 0. \tag{8}
\]

In the logit case, Λ'(X_iβ) = Λ(X_iβ)[1 − Λ(X_iβ)], and this model can be written as

\[
g = \sum_{i=1}^{n} [y_i - \Lambda(X_i\beta)]\,X_i' = 0. \tag{9}
\]

Since there is no explicit solution for this equation, an iterative procedure such as Newton-Raphson can be used to solve it. It should converge quickly as equation 6 is globally concave [13].

The Newton-Raphson method updates

\[
\hat{\beta}_{s+1} = \hat{\beta}_s - H_s^{-1} g_s, \tag{10}
\]

where H_s is the k × k Hessian

\[
H_s = -\sum_{i=1}^{n} \Lambda'(X_i\hat{\beta}_s)\,X_i' X_i \tag{11}
\]

of the objective function (6), and g_s (9) is the first derivative of the objective function (6) evaluated at β̂_s. The updating proceeds until convergence; a convergence criterion, such as requiring the difference between β̂_{s+1} and β̂_s to be smaller than some small threshold ε, can be used. The estimates after convergence will be called β̂_ML.
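A compact sketch of this Newton-Raphson procedure for the logit model, following equations (9)-(11), is given below; it assumes X already contains the column of ones and is an illustration rather than the exact implementation used in the thesis.

```python
import numpy as np

def fit_logit_newton_raphson(X: np.ndarray, y: np.ndarray,
                             tol: float = 1e-8, max_iter: int = 100) -> np.ndarray:
    """Maximum-likelihood logit estimates via Newton-Raphson, eqs. (9)-(11).

    X is the n x (k+1) design matrix (first column of ones), y the vector of
    0/1 outcomes. Returns the converged coefficient vector.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))   # Lambda(X_i beta)
        g = X.T @ (y - prob)                        # score vector, equation (9)
        w = prob * (1.0 - prob)                     # Lambda'(X_i beta)
        H = -(X.T * w) @ X                          # Hessian, equation (11)
        beta_new = beta - np.linalg.solve(H, g)     # update step, equation (10)
        if np.max(np.abs(beta_new - beta)) < tol:   # convergence criterion
            return beta_new
        beta = beta_new
    return beta
```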

When predicting probabilities, the estimated coefficients β̂_ML are substituted into the original model (4) and the resulting probabilities p̂_i are the predicted probabilities. These can be used on their own or classified as zero or one using an appropriate threshold.
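To make this procedure concrete, the following is a minimal NumPy sketch of Newton-Raphson for the logit model (illustrative only and not taken from the thesis code; the names fit_logit_newton_raphson, tol and max_iter are assumptions):

import numpy as np

def fit_logit_newton_raphson(X, y, tol=1e-8, max_iter=100):
    """Fit a logit model by Newton-Raphson, following equations 4 and 9-11.

    X : (n, k+1) matrix including a leading column of ones
    y : (n,) vector of 0/1 outcomes
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # Lambda(X_i beta), equation 4
        g = X.T @ (y - p)                          # score vector, equation 9
        H = -(X * (p * (1 - p))[:, None]).T @ X    # Hessian, equation 11
        beta_new = beta - np.linalg.solve(H, g)    # update, equation 10
        if np.max(np.abs(beta_new - beta)) < tol:  # convergence criterion
            return beta_new
        beta = beta_new
    return beta

# predicted probabilities from the converged estimates (equation 4):
# p_hat = 1.0 / (1.0 + np.exp(-X @ fit_logit_newton_raphson(X, y)))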

3.1.2 Logistic LASSO [29,31]

The Least Absolute Shrinkage and Selection Operator (LASSO) adds a penalty term \lambda\sum_{j=2}^{k}|\beta_j| to the objective function of the logistic regression (6) that shrinks all estimates and sets the estimated coefficients of the least important variables exactly to zero. Thus, it is a method of selection, in contrast to the Ridge penalty \lambda\sum_{j=2}^{k}\beta_j^2, which only shrinks the estimates (a small value of β_j (< 1) results in an even smaller penalty, never resulting in β_j = 0). Feature selection can be of great value when modeling problems with large numbers of highly correlated features, which might slow down the fitting procedure, cause overfitting or reduce the interpretability of the results. Selecting only the relevant features might help overcome these problems and is therefore believed to be helpful in this research.

In order to use LASSO, the variables should be standardized to mean zero, \frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0, and standard deviation one, \sqrt{\frac{\sum_{i=1}^{n}\big(x_{ij} - \frac{1}{n}\sum_{i=1}^{n} x_{ij}\big)^2}{n - 1}} = 1 [29].
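As a minimal illustration (not from the thesis code; X_star denotes the raw n × k feature matrix), this standardization can be done in NumPy as follows:

import numpy as np

def standardize(X_star):
    """Scale each column to mean zero and sample standard deviation one."""
    mean = X_star.mean(axis=0)
    std = X_star.std(axis=0, ddof=1)   # ddof=1 matches the n-1 denominator above
    return (X_star - mean) / std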

The LASSO penalty added to the log-likelihood function of the Logistic Regression as given in equation 6 is -\lambda\sum_{j=2}^{k}|\beta_j|, as penalizing an objective function that is maximized means subtracting a positive term. Multiplying by minus one, the penalized objective function can be minimized over β. This penalized objective function can be written as

-\sum_{i=1}^{n}\big(y_i X_i\beta - \ln(1 + e^{X_i\beta})\big) + \lambda\sum_{j=2}^{k}|\beta_j|.    (12)

Equation 12 can be minimized using the cyclic Coordinate Descent algorithm where, in each iteration, all coefficients β_j are updated one after the other. The update step for each β_j can be found by first taking the second-order Taylor approximation

f(\tilde\beta) + \frac{1}{1!}\frac{\partial f(\tilde\beta)}{\partial \tilde\beta_j}(\beta_j - \tilde\beta_j) + \frac{1}{2!}\frac{\partial^2 f(\tilde\beta)}{\partial \tilde\beta_j^2}(\beta_j - \tilde\beta_j)^2    (13)

of the negative log-likelihood part of equation 12 with respect to β_j. This results in

-\sum_{i=1}^{n}\big(y_i X_i\tilde\beta - \ln(1 + e^{X_i\tilde\beta})\big) - (\beta_j - \tilde\beta_j)\sum_{i=1}^{n} x_{ij}\big(y_i - \tilde p_i(x_i)\big) + \frac{1}{2}(\beta_j - \tilde\beta_j)^2\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i) + \lambda\sum_{j=2}^{k}|\beta_j|,    (14)

where β̃_j is the value of β_j from the previous iteration of the algorithm, \tilde w_i(x_i) = \tilde p_i(x_i)(1 - \tilde p_i(x_i)) and \tilde p_i(x_i) = \frac{e^{X_i\tilde\beta}}{1 + e^{X_i\tilde\beta}}.

When minimizing equation 14 with respect to β_j, in some circumstances it is optimal to set β_j to zero. This is because in equation 14,

-\beta_j\sum_{i=1}^{n} x_{ij}\big(y_i - \tilde p_i(x_i)\big) - \beta_j\tilde\beta_j\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i) + \lambda|\beta_j|

are the only parts depending on β_j for which a large value of |β_j| might actually decrease equation 14, depending on the value of λ; the quadratic term \frac{1}{2}\beta_j^2\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i) always increases equation 14 and should therefore always be kept as small as possible. If

\Big|\sum_{i=1}^{n} x_{ij}\big(y_i - \tilde p_i(x_i)\big) + \tilde\beta_j\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i)\Big| \le \lambda,

the value of equation 14 will increase for any nonzero β_j, and therefore it is optimal to set β_j to zero. For this expression to be small, all of its components should be small. Since (y_i - \tilde p_i(x_i)) is the error between the predicted and true value and \tilde w_i(x_i) = \tilde p_i(x_i)(1 - \tilde p_i(x_i)) is smaller for more certain predictions (closer to either 0 or 1), it can be seen that β_j is set to zero when the previous value β̃_j is close to zero and the predictions are close to 0 or 1 with small errors. Thus, strong predictions combined with a small β̃_j indicate an irrelevant predictor. This condition also shows that for larger values of λ, more β_j's are set to zero.

Otherwise, if \Big|\sum_{i=1}^{n} x_{ij}\big(y_i - \tilde p_i(x_i)\big) + \tilde\beta_j\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i)\Big| > \lambda, equation 14 is minimized over β_j by taking its first derivative with respect to β_j and setting it to zero,

-\sum_{i=1}^{n} x_{ij}\big(y_i - \tilde p_i(x_i)\big) + (\beta_j - \tilde\beta_j)\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i) + \lambda\,\mathbf{1}(\beta_j > 0) - \lambda\,\mathbf{1}(\beta_j < 0) = 0,    (15)

which is then solved for β_j.

In order to solve equation 15 for β_j, all terms depending on β_j are moved to the right-hand side, such that

-\tilde\beta_j\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i) - \sum_{i=1}^{n} x_{ij}\big(y_i - \tilde p_i(x_i)\big) = -\beta_j\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i) - \lambda\,\mathbf{1}(\beta_j > 0) + \lambda\,\mathbf{1}(\beta_j < 0).    (16)

Equation 16 can be rewritten as

\frac{\tilde\beta_j\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i) + \sum_{i=1}^{n} x_{ij}\big(y_i - \tilde p_i(x_i)\big)}{\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i)} = \Bigg(\beta_j - \frac{\lambda}{\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i)}\Bigg)\mathbf{1}(\beta_j < 0) + \Bigg(\beta_j + \frac{\lambda}{\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i)}\Bigg)\mathbf{1}(\beta_j > 0).    (17)

Let z_j = \tilde\beta_j\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i) + \sum_{i=1}^{n} x_{ij}\big(y_i - \tilde p_i(x_i)\big). Then, as shown before, if |z_j| ≤ λ, β_j^{new} should be set to zero. When |z_j| > λ, z_j can either be positive or negative. When z_j > 0, \beta_j^{new} = \frac{z_j - \lambda}{\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i)}, and when z_j < 0, \beta_j^{new} = \frac{z_j + \lambda}{\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i)}. This follows from equation 17.

These two cases can be summarized by writing the numerator of the update step as S(z_j, λ) = sign(z_j)(|z_j| − λ)_+. Thus, the update step for β_j can be written as

\beta_j^{new} \leftarrow \frac{S(z_j, \lambda)}{\sum_{i=1}^{n} x_{ij}^2\,\tilde w_i(x_i)}.    (18)

In one iteration, this procedure is performed in turn for all values of j. This continues for a maximum number of iterations or until convergence, which can be checked by calculating an error vector after each iteration, where

\varepsilon_j = \frac{1}{n}\sum_{i=1}^{n} \tilde w_i\,\big(x_{ij}(\beta_j^{new} - \beta_j)\big)^2    (19)

for all variables j. Convergence is reached when \max_j(\varepsilon_j) < 10^{-4} [35].
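As a small sketch (not the thesis implementation), the soft-thresholding operator and the coordinate update of equation 18 can be written as follows, where z_j and denom_j (the sum of x_{ij}^2 w̃_i over i) are assumed to have been computed from the current residuals and weights:

import numpy as np

def soft_threshold(z, lam):
    """S(z, lambda) = sign(z) * (|z| - lambda)_+, the numerator of equation 18."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def coordinate_update(z_j, denom_j, lam):
    """Update for a single penalized coefficient beta_j (equation 18)."""
    return soft_threshold(z_j, lam) / denom_j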

It should be noted that the intercept should not be penalized, so its value of λ equals zero. Consequently, the update step for β_0 is always

\beta_0^{new} \leftarrow \frac{z_0}{\sum_{i=1}^{n} \tilde w_i(x_i)},    (20)

since x_{i0}^2 equals one for every observation i. When other variables should not be excluded from the model either, their value of λ_j can also be set to zero, so that their estimates are not penalized in the estimation procedure.

An implementation of this algorithm in Python can be found in Algorithm 1 below. This code actually implements a combination of Ridge and LASSO penalties, given by \lambda\sum_{j=2}^{k}\big[(1 - \alpha)\beta_j^2 + \alpha|\beta_j|\big]. Hence, for LASSO, α should be set to 1; then γ = λ. The algorithm works as follows.
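For reference, a small illustrative helper computing this combined penalty for a coefficient vector beta, with the intercept in position 0 left unpenalized (the function name is an assumption):

import numpy as np

def elastic_net_penalty(beta, lam, alpha=1.0):
    """lambda * sum_j [(1 - alpha) * beta_j**2 + alpha * |beta_j|], skipping beta_0."""
    b = np.asarray(beta)[1:]   # the intercept is not penalized
    return lam * np.sum((1 - alpha) * b**2 + alpha * np.abs(b))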

The starting values used are those of the regular Logistic Regression. The error vector is initialized to ones in order to pass the convergence criterion and perform at least one iteration. Then, for the first feature j, the numerator z and the denominator of equation 18 are calculated by first computing \tilde p_i(x_i) and \tilde w_i(x_i) for all observations i. \tilde w_i(x_i) is programmed such that it is bounded away from zero when it becomes very small, as multiplication or division by zero is undesirable. Using the numerator, the denominator and the update step in equation 18, β_j is updated and its error is calculated.


Algorithm 1: Logistic LASSO (cyclic Coordinate Descent)

import numpy as np
import sklearn.linear_model

# initializing variables
alpha = 1                 # alpha = 1 gives the pure LASSO penalty
lamb = 0.1
gamma = alpha * lamb

# starting values: an (almost) unpenalized Logistic Regression fit;
# X already contains a leading column of ones for the intercept
LR = sklearn.linear_model.LogisticRegression(penalty='l2', tol=0.00001, C=99999999999,
                                             max_iter=10000, warm_start=False,
                                             fit_intercept=False)
model = LR.fit(X, y)
beta = model.coef_[0]
beta1 = np.zeros(len(beta))
error = np.ones(len(beta))   # initialized to ones so at least one iteration is run

# coordinate descent algorithm
for l in range(0, 100):
    if np.max(error) > 10**(-4):
        for j in range(0, len(X[0])):
            z = 0
            denom = 0
            for i in range(0, len(X)):
                # current linear predictor and predicted probability (equation 4)
                yhat = beta[0]
                for k in range(1, len(X[0])):
                    yhat += X[i, k] * beta[k]
                prob = 1 / (1 + np.exp(-yhat))
                # keep the weight w away from zero for (near-)certain predictions
                if prob < 0.00001:
                    prob = 0
                    w = 0.00001
                elif prob > 0.99999:
                    prob = 1
                    w = 0.00001
                else:
                    w = prob * (1 - prob)
                z += X[i, j] * (y[i] - prob) + w * beta[j] * X[i, j]**2
                denom += w * X[i, j]**2
            if j == 0:
                # intercept: never penalized (equation 20)
                beta1[0] = z / denom
            elif z > 0 and gamma < np.abs(z):
                beta1[j] = (z - gamma) / (denom + (lamb - gamma))
            elif z < 0 and gamma < np.abs(z):
                # remaining branches follow the update rules of equations 18 and 19
                beta1[j] = (z + gamma) / (denom + (lamb - gamma))
            else:
                beta1[j] = 0
            error[j] = denom * (beta1[j] - beta[j])**2 / len(X)   # equation 19
            beta[j] = beta1[j]
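As an illustrative cross-check (not part of the thesis code), the coefficients produced by Algorithm 1 can be compared with scikit-learn's L1-penalized logistic regression. scikit-learn minimizes the penalty plus C times the negative log-likelihood, so setting C = 1/lamb should correspond roughly to the objective in equation 12; X_star denotes the standardized feature matrix without the column of ones, so that the intercept is left unpenalized in both fits:

from sklearn.linear_model import LogisticRegression

check = LogisticRegression(penalty='l1', solver='saga', C=1.0 / lamb,
                           fit_intercept=True, max_iter=10000, tol=1e-6)
check.fit(X_star, y)

# the slope coefficients should be close to those of the coordinate descent above
print(beta1[1:6])
print(check.coef_[0][:5])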

3.1.3 Neural Networks [36, 47, 70, 85]

A Neural Network is a generalization of a Logistic Regression model [26]. A simple Neural Network can model a Logistic Regression, as will be shown at the end of this section, but it can easily be generalized to account for non-linearities and multicollinearity in the data and can be fit on high-dimensional data. Therefore it is expected to work well for this particular dataset. Neural Networks assume, just like Logistic LASSO, that all variables have mean zero and standard deviation one.

Biological Background

An Artificial Neural Network (ANN) is a network of many perceptrons and is a simplified model of the human brain. A perceptron can be seen as a simplified version of a neuron. The human brain contains approximately 10^{12} neurons and one neuron has about 1000 neighbours. The neurons are connected by synapses. Thus, the human brain has a high connectivity [85].

Figure 2: Visualizations of a neuron (panel a) and a synapse (panel b).

A single neuron receives an electric pulse from a synapse at an incoming dendrite. If this pulse is above some threshold, it moves through the dendrite, through the soma (cell body) and leaves along an outgoing axon: the neuron has fired. The activity of a neuron increases when it fires more frequently. At the end of the axon the neuron releases neuro-transmitters into a synapse, releasing more of them as the activity increases. Figure 2 visualizes the neuron and the synapse. Since there are many axons releasing different amounts of neuro-transmitters into a synapse, these neuron activities are accumulated at the incoming dendrite of a neighbouring neuron. In mathematical notation, the incoming activity S_i of a neuron i can be written as

S_i = \sum_{j=1}^{n} w_{ij} S_j,    (21)

where w_{ij} is the synaptic weight between neuron j and i and S_j denotes the activity of neuron j before the synapse. The dendrite of a neuron processes this pulse with a non-linear activation function g(·). If g(S_i) passes a certain threshold, neuron i fires and the process repeats through the network. If w_{ij} = 0, there is no active synapse between neuron j and i. If w_{ij} < 0, the synapse is inhibitory: it decreases total activity. This is already a simplification of reality, since with w_{ij} < 0 and S_j < 0 an inactive neuron would be able to activate another neuron through an inhibitory synapse, which is not realistic.

Figure 3: A one layer feedforward Neural Network with k=2 and p=5

Artificial Neural Network

Since ANNs are a simplified model of the human brain, they have a lower connectivity. The input data is pushed through a structure of several neurons, resulting in the output. These neurons can be connected in all possible ways, so a few more simplifications have to be made. First of all, the structure of the network is limited to an input layer, several hidden layers each containing some number of neurons, and finally an output layer. This is called a feedforward network, as the weights only connect neurons to neurons closer to the output layer. Moreover, a bounded activation function is used: the logistic function gives an output between 0 and 1 and the tanh function an output between −1 and 1.

Since Neural Networks have their origin in the fields of Artificial Intelligence and Machine Learning, the terminology used to describe their properties can differ from that of Econometrics.

Consider an ANN with k neurons in the input layer, one hidden layer with p neurons and a single output, as in Figure 3. There are two types of connecting weights: a k × p matrix W connecting the input layer to the hidden layer, and a p × 1 vector v connecting the hidden layer to the output layer. Each of the p hidden neurons and the output neuron has a bias, w_{0p} and v_0 respectively, which functions as an intercept. The input data used is the n × k matrix X* with n observations and k variables as defined before, and the output variable is the n × 1 dependent variable y. Thus each of the k input neurons holds a vector of n observations.

Forward Propagation

In order to go from the input matrix X* to the output vector y, two steps have to be performed for each neuron, starting from random initial weights. Each input neuron j holds the n observations of variable j, X*_j. First, to obtain the corresponding values A_p at hidden neuron p, a linear combination of these inputs and weights is added to the bias of that neuron, w_{0p}, to obtain

Z_p = \Big(\sum_{j=1}^{k} X_j^* w_{jp}\Big) + w_{0p},    (22)

to which the non-linear activation function g(·) is applied to obtain

A_p = g(Z_p)    (23)

in order to create a bounded output.

When applying this to all neurons in the hidden layer, using some matrix algebra this results in the n × p matrix

A = g(X^* W + w_0).    (24)

To go from the hidden layer to the output, a linear combination of the hidden-layer values A and the weights v is made, the bias term is added and then the logistic activation function h(·) from equation 4 is applied, in order to get the n-vector of predicted probabilities

\hat p = h(Av + v_0).    (25)

This is how a one-layer neural network makes predictions between 0 and 1. However, these predictions might differ from the actual values the model is trained on. The network wants to learn from these errors and update the weights in an appropriate manner. This can be done by back-propagating the error.

Backward Propagation

Back-propagation is applied in order to improve the current model's fit and consequently get the predicted values closer to the true values. The first derivatives of the error function with respect to all the weights and biases are used to update those weights and biases. This means that four partial derivatives of the error have to be calculated: with respect to the weights v, the weights W and the biases w_0 and v_0.

First of all, the error function should be defined. The error usually used in binary classification problems is the cross-entropy error

H(y, \hat p) = -\sum_{i=1}^{n} y_i\log(\hat p_i) + (1 - y_i)\log(1 - \hat p_i),    (26)

where y is the vector of true output values [70]. This equals the negative log-likelihood function of the Logistic Regression, as can be seen in equation 6.
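A short illustrative NumPy version of this error, with clipping to avoid taking log(0):

import numpy as np

def cross_entropy(y, p_hat, eps=1e-12):
    """Cross-entropy error of equation 26; eps guards against log(0)."""
    p = np.clip(p_hat, eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))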

When filling in the expression for p̂ obtained before, the derivative with respect to the weight vector v is

\frac{\partial H(y, \hat p)}{\partial v} = \frac{\partial H(y, \hat p)}{\partial \hat p}\frac{\partial \hat p}{\partial v} = \sum_{i=1}^{n}\Big(-\frac{y_i}{\hat p_i} + \frac{1 - y_i}{1 - \hat p_i}\Big)\,h'(A_i v + v_0)\,A_i'.    (27)

Since h(·) denotes the logistic function, its first derivative is given by h'(x) = h(x)(1 − h(x)), which in this case gives p̂_i(1 − p̂_i). This results in

\frac{\partial H(y, \hat p)}{\partial v} = \sum_{i=1}^{n} (\hat p_i - y_i)\,A_i'.    (28)

The derivative with respect to the bias v_0 is, for the same reasons,

\frac{\partial H(y, \hat p)}{\partial v_0} = \sum_{i=1}^{n} (\hat p_i - y_i).    (29)

The first derivative with respect to the weights W is

\frac{\partial H(y, \hat p)}{\partial W} = \frac{\partial H(y, \hat p)}{\partial \hat p}\frac{\partial \hat p}{\partial A}\frac{\partial A}{\partial W} = \sum_{i=1}^{n} (\hat p_i - y_i)\big(v'\,g'(X_i^* W + w_0)'\,X_i^*\big)'.    (30)

Finally, the first derivative with respect to the biases w_0 is

\frac{\partial H(y, \hat p)}{\partial w_0} = \sum_{i=1}^{n} (\hat p_i - y_i)\,v'\,g'(X_i^* W + w_0).    (31)

For networks with more than 1 hidden layer, these calculations can be performed for all weights and biases between all layers in the same backward manner.

At each iteration, the error after forward propagation is calculated using the current weights. The partial derivatives as given in equations 28, 29, 30 and 31 can then be calculated. The weights and biases are then updated by taking a step in the direction of the negative gradient, for example

W \leftarrow W - \alpha\,\frac{\partial H(y, \hat p)}{\partial W},

where α is the learning rate, and analogously for v, w_0 and v_0. This continues until the error converges or a maximum number of iterations is reached.
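A minimal NumPy sketch of one such gradient step (illustrative only; it reuses the forward pass above with tanh as g, so g'(Z) = 1 − tanh(Z)^2, and the learning-rate name alpha is an assumption):

import numpy as np

def gradient_step(X_star, y, W, w0, v, v0, alpha=0.01):
    """One gradient-descent update of all weights and biases (cf. equations 27-31)."""
    Z = X_star @ W + w0
    A = np.tanh(Z)                                   # hidden activations
    p_hat = 1.0 / (1.0 + np.exp(-(A @ v + v0)))      # output probabilities
    delta = p_hat - y                                # common factor (p_hat_i - y_i)

    grad_v  = A.T @ delta                            # equation 28
    grad_v0 = delta.sum()                            # equation 29
    hidden  = (delta[:, None] * v) * (1 - A**2)      # error propagated to the hidden layer
    grad_W  = X_star.T @ hidden                      # cf. equation 30
    grad_w0 = hidden.sum(axis=0)                     # cf. equation 31

    return (W - alpha * grad_W, w0 - alpha * grad_w0,
            v - alpha * grad_v, v0 - alpha * grad_v0)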
