Interpretable Prediction of Acute Kidney Injury in the Intensive Care Unit

Academic year: 2021


MASTER THESIS

FACULTY OF MEDICINE

Interpretable Prediction of Acute Kidney Injury in the Intensive Care Unit

Student
Oleksandra Lvova, MSc
Student number: 12348929
Email: o.lvova@amc.uva.nl

Mentor
Dr. Iacopo Vagliano, Assistant Professor, Medical Informatics
Email: i.vagliano@amsterdamumc.nl
Amsterdam UMC, location AMC, Department of Medical Informatics

Tutor
Dr. Martijn Schut, Assistant Professor, Medical Informatics
Email: m.c.schut@amsterdamumc.nl
Amsterdam UMC, location AMC, Department of Medical Informatics

Location of Scientific Research Project
Department of Medical Informatics
Amsterdam Public Health Research Institute
Amsterdam UMC, University of Amsterdam
Meibergdreef 9
1105 AZ Amsterdam, The Netherlands

Project period: June 2020 – February 2021


ACKNOWLEDGEMENT

When I decided that I wanted to do a master's in Medical Informatics, I had no programming skills or experience of this kind at all, which made this part of the study quite challenging for me. However, I find such skills essential and of high practical benefit for a medical informatician. During the "Fundamentals of data science" module, taught by Dr. Martijn Schut, I encountered the topic of machine learning for the first time in my life. It immediately got my full attention and sparked my curiosity. Therefore, when picking a thesis project, my primary goal was to learn more about machine learning solutions in healthcare settings and to acquire the practical skills to apply them. I would like to thank Martijn not only for his thorough and considerate supervision throughout this thesis project but also for the knowledge, inspiration and motivation he provided me during my studies.

I would also like to thank my mentor, Iacopo Vagliano, who accepted me under his supervision even though I confessed to my very basic programming skills at the beginning of this project. He had endless patience and went above and beyond to support me in my study process by constantly sharing his knowledge, providing useful advice and believing in me even when I felt lost in the process and doubtful of my skills.

Also, I would like to thank Ms Wei-Hsiang Hsu (an MSc student who completed an internship on this project before I joined it) for her hard work and input. She wrote the draft code for data pre-processing and for the Logistic Regression and XGBoost models, which I used as a starting point and modified as needed for my project. The main changes I made to this code were: data analysis and description; adjusted feature selection and data cleaning; added variable transformations; truncated long records; code optimisations to address memory problems and execution time; adapted pre-processing for the LSTM; developed the RF and LSTM models; added hyperparameter tuning for all models; added evaluation plots; and added the interpretability assessment. On a personal note, I would like to thank my husband, who convinced me that it is never too late to pursue the study of my dreams and who made these exciting and rewarding two years of my life possible. Finally, I would like to thank my parents, who love me regardless of grades, diplomas and achievements, thereby giving me the confidence needed to complete this study successfully and to choose this challenging project as my thesis.


SUMMARY

Introduction and objective: Acute kidney injury (AKI) is a common condition that significantly increases the risk of long-term adverse health outcomes and is a frequent cause of mortality. It occurs particularly often among intensive care unit (ICU) patients, which makes prediction of this condition in the ICU setting highly relevant. Accurate and timely prediction of AKI could enable clinicians to act before a major clinical decline, improving care outcomes and optimising the use of resources. Machine learning applications have shown promising results in the field of prognosis in nephrology. However, integration of machine learning products in clinical settings requires that they be understandable and interpretable. Therefore, the development of prediction models intended for clinical use should focus not only on predictive performance but also on interpretability. The objective of this study was to develop a set of prognostic models to predict AKI using different machine learning algorithms and to assess and compare their predictive performance and interpretability.

Population: We used the MIMIC-III dataset, which includes patients admitted to the ICUs at the Beth Israel Deaconess Medical Center between 2001 and 2012. We selected patients who had at least one measurement of serum creatinine or urine output, were older than 18 years, and whose length of stay in the ICU was at least 48 hours (the time span for the AKI definition). AKI was defined according to the KDIGO (Kidney Disease: Improving Global Outcomes organisation) guidelines.

Methods: We developed prediction models using the following algorithms: Logistic Regression (LogR), Extreme Gradient Boosting (XGBoost), Random Forests (RF) and a Long Short-Term Memory recurrent neural network (LSTM). We used the Area Under the receiver operating characteristic Curve (AUC) and the Brier score to evaluate the models' discrimination. With regard to calibration, we visualised calibration curves. Interpretability was assessed using feature importance techniques.

Results: The final dataset contained information on 47,751 ICU stays of 34,516 patients and 35 features, including age, gender, ethnicity, vital signs, laboratory measurements and other predictors considered relevant for AKI prediction. RF showed the highest AUC of 0.93 and the lowest Brier score of 0.105; XGBoost achieved an AUC of 0.93 and a Brier score of 0.108. LogR reached an AUC of 0.90 and a Brier score of 0.128. The lowest predictive performance, an AUC of 0.83 and a Brier score of 0.202, was demonstrated by the LSTM. The feature importance assessment showed that the ensemble tree-based algorithms mostly rely on the same features, while the most important features in LogR, and especially in the neural network, differ more from those models.

Conclusion: Machine learning algorithms for AKI prediction in ICU patients have promising prognostic capabilities. Additionally, predictive performance and the effort required for interpretability do not necessarily trade off against each other. In our case, the LSTM, the most opaque model (due to its high complexity) and the one requiring the largest interpretability effort, did not achieve the highest predictive performance.


SAMENVATTING

Introduction and objective: Acute kidney injury (AKI) is a frequent condition with a significantly increased risk of long-term health problems and is a common cause of death. AKI occurs particularly often in intensive care (ICU) patients, which makes predicting AKI in ICU patients highly relevant. Accurate and timely prediction of AKI could enable clinicians to act in time (before a major clinical decline occurs), to improve health outcomes and to optimise the use of resources. Within nephrology, machine learning applications have shown promising results in the field of prognosis. However, the integration of machine learning products in a clinical environment requires understanding and interpretability. Therefore, the development of prediction models aimed at clinical use must focus not only on predictive performance but also on interpretability. The aim of this study was to develop a set of prognostic models to predict AKI using different machine learning algorithms and to assess and compare their predictive performance and interpretability.

Population: The data came from the MIMIC-III dataset, which contains data on patients admitted to the ICU of the Beth Israel Deaconess Medical Center between 2001 and 2012. Patients were included if they were older than 18 years, had an ICU length of stay of at least 48 hours (the period for diagnosing AKI), and had at least one measurement of serum creatinine or urine output. The definition of AKI was taken from the KDIGO (Kidney Disease: Improving Global Outcomes) guideline.

Methods: The MIMIC-III dataset originates from electronic health records and was labelled according to the KDIGO guidelines; we relied on the provided labelled dataset. We developed prediction models using the following algorithms: logistic regression (LogR), Extreme Gradient Boosting (XGBoost), Random Forests (RF) and a Long Short-Term Memory recurrent neural network (LSTM). We used the AUC to evaluate the models' discriminative ability, and computed the Brier score, a criterion that provides combined information about the precision and robustness of a model. For calibration, we visualised the predicted probabilities. We assessed the interpretability of the models on the basis of the importance of each feature.

Results: The final dataset contained information on a total of 47,751 ICU admissions of 34,516 patients and 35 features, including age, gender, ethnicity, vital signs and laboratory measurements. The selected features were relevant for predicting AKI. RF had the highest AUC (0.93) and the lowest Brier score (0.105). The LSTM model had the lowest AUC and Brier performance, with values of 0.83 and 0.202 respectively. The feature importance assessment showed that the algorithms based on ensemble trees largely use the same features, whereas LogR and the neural network use different features.

Conclusion: Applying machine learning algorithms to predict AKI in ICU patients shows promising results. Moreover, a model's predictive performance and interpretability do not necessarily involve a trade-off. In our case, the most complex model, and therefore the hardest to interpret, did not have the best predictive performance.

Keywords: AKI, ICU, prediction models, machine learning, interpretability of prediction models.


CONTENTS

1. INTRODUCTION
2. PRELIMINARIES
3. DATA AND METHOD
   3.1. Data description
      3.1.1. Population selection
   3.2. Method
      3.2.1. Feature selection
      3.2.2. Data pre-processing
      3.2.3. Label description and pre-processing
      3.2.4. Models’ study setting
      3.2.5. Machine learning algorithms
      3.2.6. Hyperparameters tuning
      3.2.7. Models’ predictive performance assessment
      3.2.8. Models’ interpretability assessment
      3.2.9. Statistical analyses
4. RESULTS
   4.1. Population description
   4.2. Features description
   4.3. Models’ predictive performance
   4.4. Models’ interpretability analysis
5. DISCUSSION
   5.1. Relation to other work
   5.2. Main findings and conclusion
   5.3. Strengths
   5.4. Limitations
   5.5. Recommendations for future work
REFERENCES


ABBREVIATIONS

AKI Acute kidney injury
AUC Area under the receiver operating characteristic curve
ICU Intensive care unit
KDIGO Kidney Disease: Improving Global Outcomes organisation
LIME Local Interpretable Model-agnostic Explanations
LogR Logistic regression
LOS Length of stay
LSTM Long Short-Term Memory recurrent neural network
MIMIC-III Medical Information Mart for Intensive Care III dataset v1.4
RF Random Forests
RNN Recurrent neural network
SHAP Shapley Additive Explanations
XGBoost Extreme Gradient Boosting


1. INTRODUCTION

Acute kidney injury (AKI) is a common and potentially life-threatening condition [1, 2, 3]. Mortality among hospital patients with AKI is significantly higher than among patients without AKI [4]. Patients who survive AKI face an increased risk of cardiovascular disease, chronic kidney disease, end-stage renal disease and other long-term adverse health outcomes [5]. Clinically, AKI is detected through a rise in serum creatinine, used as a marker of acute decline in renal function. This increase, however, lags behind the renal injury, which delays diagnosis and narrows the window for early, successful treatment [6, 7]. Preventative alerts generated by medical prognosis models could therefore help clinicians act before a major clinical decline, improve care outcomes and optimise the use of resources [8, 9]. The occurrence of AKI in the intensive care unit (ICU) is particularly high, often exceeding 50 percent [10, 11, 12], which makes prediction of AKI in the ICU setting highly relevant.

The field of prognosis in nephrology has seen rapid growth in machine learning applications [13]. Although machine learning models can provide accurate predictions, the lack of transparency in their functioning (i.e. their "black-box" nature) can be a major obstacle to practical application. In the medical domain the stakes are high: for a machine learning prediction model to be accepted in a clinical environment, it must not only provide precise predictions but also be interpretable [14].

The first goal of this thesis is to develop machine learning prognostic models that predict AKI in the ICU setting at the level of an individual ICU stay. The second goal is to compare the developed models in terms of both predictive performance and interpretability. Therefore, the following research questions were set:

1. How accurately can we predict AKI in the ICU settings with different machine learning models?

2. How interpretable are these prediction models and is there a trade-off between interpretability and predictive performance of the designed models?

To achieve these goals and answer the research questions, we took the following approach. First, we developed and optimised the following models: Logistic Regression (LogR), Extreme Gradient Boosting (XGBoost), Random Forests (RF) and a Long Short-Term Memory recurrent neural network (LSTM). Second, we evaluated their predictive performance and assessed their interpretability. Finally, we compared the models, considering both their predictive performance and interpretability.

This thesis is organised as follows. Section 2 provides an overview of the concept of interpretability and a taxonomy of its methods. Section 3 describes the study approach and the reasons for each choice. Section 4 is devoted to the results of this study. The thesis concludes with Section 5, which outlines related work on AKI prediction and on interpreting such models, followed by an analysis of the results and answers to the research questions. It also lists the strengths and limitations of this study and provides recommendations for future work.


2. PRELIMINARIES

This section provides an overview of the concept of interpretability and a taxonomy of its methods, since the topic of interpretability is not as widely known and used as other techniques described in this thesis.

The functioning of machine learning prognostic models is often not transparent; they are said to have a "black-box" nature. Understanding them therefore requires explanatory and interpretational effort to support their integration. Lawmakers are working towards regulating the effect that automated data processing has on the lives of individuals. A regulation of the European Parliament [15] describes the rights of individuals affected by algorithms: in particular, the right to know when decisions concerning them are influenced by automated decision making, and the right to obtain explanations about such influences. In addition to legal requirements, understanding predictions in the medical domain is needed for patient safety. Thus, integration of machine learning prediction models in healthcare requires better interpretability to establish social acceptance and safe application. This is why it is necessary to explore the topic of interpretability in research, and why, whenever machine learning solutions are considered for healthcare problems, interpretability should be taken into account.

First of all, there is no real consensus on the definition of interpretability in machine learning or on how to measure it, although there is initial research on this and on the techniques being applied [16]. Among the many attempts to define interpretability, I find the following two the most general while complementary to each other: interpretability is the degree to which a human can understand the cause of a decision [17], and interpretability is the degree to which a human can consistently predict the model's output [18]. One way to make machine learning interpretable is to use only interpretable models, such as linear models or decision trees. However, more complex models usually perform better. The alternative is to use interpretation methods that can be applied to a supervised machine learning model. In this project, we used the feature importance method to interpret the developed models. This method was chosen because it provides an intuitive explanation of the predictions and can be applied to all of the considered models. Below is a taxonomy of interpretability methods based on several (overlapping) aspects [19]. Each aspect description is followed by comments on where the feature importance method fits within that aspect of the taxonomy.

● Ad-hoc (intrinsic) vs post-hoc methods: intrinsic interpretability is achieved by restricting the complexity of the model and is focused on constructing new models. Post-hoc interpretability methods are applied to analyse models after they are trained. In this project the models were developed first and subsequently assessed for interpretability, requiring a post-hoc technique; feature importance is a post-hoc interpretability method.

● Model-specific vs model-agnostic methods: model-specific interpretability is limited to certain model classes, while model-agnostic methods can be applied to any algorithm. An example of a model-specific method is the analysis of the tree structure of a decision tree. We selected a model-agnostic method (feature importance), as we wanted to apply the same method to each of our models and compare the results afterwards. However, the separation between these two groups is sometimes blurred. For example, the implementation and analysis of feature importance depend on the underlying algorithm (i.e. the way the features influence the model's prediction). This tunes the application of the feature importance method to each algorithm being analysed, giving it traits of a model-specific method.


● Global vs local methods: global interpretability methods explain the entire model behaviour, while local methods focus on predictions for individual instances. For the purpose of model comparison we aimed to discover information concerning the whole model, so we selected a global method, namely feature importance, which shows how useful the features were for the models' predictions and predictive accuracy. Another example of a global interpretability method is a global surrogate model: an interpretable model developed to approximate the predictions of a "black-box" model. Examples of local methods are Local Interpretable Model-agnostic Explanations (LIME) [20] and Shapley Additive Explanations (SHAP) [21]. These are instance-wise techniques and could therefore complement the global method in future work to improve the understanding of the models.

● Classification of interpretability methods by their result: feature summary statistics (e.g. feature importance), feature summary visualisations (e.g. partial dependence plots), model internals (e.g. the tree structure of a decision tree), data point methods (which return newly created or existing data points) and approximation by intrinsically interpretable models (e.g. a global surrogate model). By its result, feature importance belongs to the feature summary statistic methods.
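As a concrete illustration of a global, post-hoc, model-agnostic method, permutation feature importance measures how much a model's score drops when a single feature column is shuffled. The sketch below is illustrative only; the function name, arguments and metric are our own, not the code used in this thesis:

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    """Global, model-agnostic feature importance: the average drop in a
    performance score when a single feature column is randomly permuted,
    breaking its relationship with the outcome."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j, leaving the other features intact
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - metric(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances
```

With accuracy or AUC as the metric, a larger drop means the model relies more on that feature; an unused feature yields a drop of zero.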

Overall, since interpretability methods are designed to increase understanding of a prediction model from different angles, different methods complement each other, revealing additional aspects of the model's functioning. Each additional method adds explanations to those already applied; this iterative process can keep improving our knowledge of the model's predictions, its modules and the predictor as a whole.


3. DATA AND METHOD

3.1. Data description

We used data from the MIMIC-III dataset, which contains de-identified medical information for approximately sixty thousand patient admissions to the ICUs at the Beth Israel Deaconess Medical Center between 2001 and 2012 [22]. AKI was defined according to the Kidney Disease: Improving Global Outcomes organisation (KDIGO) guidelines [23, 24]:

1. Increase in serum creatinine by ≥ 0.3 mg/dl (≥ 26.5 µmol/l) within 48 hours; or

2. Increase in serum creatinine to ≥ 1.5 times baseline within the prior seven days; or

3. Urine output volume < 0.5 ml/kg/h for six hours.
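The three KDIGO criteria above can be sketched as a simple check. This is an illustrative simplification under our own assumptions (hypothetical function name, inputs in mg/dl and ml); the actual MIMIC-III labels additionally encode severity stages:

```python
def aki_flag(creat_48h, baseline_creat, urine_ml_6h, weight_kg):
    """Simplified KDIGO AKI check (illustrative, not the study's labeller).

    creat_48h:      serum creatinine values (mg/dl) within the last 48 hours
    baseline_creat: baseline serum creatinine (mg/dl) over the prior seven days
    urine_ml_6h:    total urine output (ml) over the last six hours
    """
    rise_abs = max(creat_48h) - min(creat_48h) >= 0.3   # criterion 1
    rise_rel = max(creat_48h) >= 1.5 * baseline_creat   # criterion 2
    low_urine = urine_ml_6h / weight_kg / 6 < 0.5       # criterion 3
    return rise_abs or rise_rel or low_urine
```

Any one of the three criteria suffices to flag AKI, mirroring the "or" structure of the guideline.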

3.1.1. Population selection

Data were extracted for patients having at least one measurement of serum creatinine or urine output (two important predictors of AKI) [23, 24]. This subset contained 56,661 unique ICU stay identifiers for 42,078 patients. Only patients older than 18 years whose length of stay (LOS) in the ICU (calculated as the difference between the first and last recorded charttime within the ICU stay) was at least 48 hours were selected. The 48-hour interval is the time span before the prediction.

The unit for each prediction was the ICU stay identifier. Although some patients in the MIMIC-III database had multiple ICU admissions, this information was not used, and ICU stay records were treated as independent. Because of the temporal feature representation described in the data pre-processing section below, zero padding depended on the maximal LOS in the dataset. To prevent vanishing gradients in the LSTM [25] and excessive zero padding in the other models, we truncated the longest ICU stay records. As 95 percent of ICU stays were shorter than 35 days, only the records for the last 35 days were considered for stays exceeding 35 days. Whether to select records from the first or the last days of a stay depends on the intended clinical application of the prediction and, in a real-life project, should be agreed with clinicians. After selecting the records corresponding to the last 35 days of stay, one ICU stay had only one timestamp left (meaning the LOS was less than 48 hours) and was therefore excluded. Thus, the final dataset consisted of 47,751 ICU stays for 34,516 patients.
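The selection steps above can be sketched in pandas. The column names (`icustay_id`, `age`, `charttime`) are illustrative, not the actual MIMIC-III schema:

```python
import pandas as pd

def select_cohort(df):
    """Apply the cohort selection to a long-format table with one row per
    (icustay_id, charttime) record: adults only, LOS of at least 48 hours,
    and truncation of long stays to their last 35 days (sketch)."""
    df = df[df["age"] > 18]
    # LOS = difference between first and last recorded charttime per stay
    span = df.groupby("icustay_id")["charttime"].agg(["min", "max"])
    keep = span[(span["max"] - span["min"]) >= pd.Timedelta(hours=48)].index
    df = df[df["icustay_id"].isin(keep)]
    # Truncate long stays: keep only records from the last 35 days
    last = df.groupby("icustay_id")["charttime"].transform("max")
    return df[df["charttime"] >= last - pd.Timedelta(days=35)]
```

Each filter corresponds to one sentence of the selection procedure, applied in the same order.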

3.2. Method

3.2.1. Feature selection

The main AKI predictors, based on the KDIGO guidelines, are urine output and serum creatinine. In addition to these two main features, potentially useful predictors based on [26] and [27] were selected if their rate of missing values did not exceed 50%. This resulted in a set of 35 features, listed in Table 1.


Table 1. Features specified by data type.

Numerical features: age, serum anion gap, serum bicarbonate, blood urea nitrogen, serum chloride, serum creatinine, diastolic blood pressure, serum glucose, heart rate, serum hematocrit, serum hemoglobin, serum potassium, respiratory rate, serum sodium, blood oxygen saturation, systolic blood pressure, urine output at six hours, urine output at 12 hours, urine output at 24 hours, white cell count.

Discrete (binary) features: sedative medications, vasopressor medications, mechanical ventilation of the lungs.

Categorical features (transformed into binary features): gender (female, male); ethnicity (Asian, Black, Hispanic, Native American, White, other ethnicity, ethnicity unknown).

3.2.2. Data pre-processing

Data cleaning

Some timestamps in the initial tables contained measurements only for features not selected for this study (i.e. all selected features had missing values at those timestamps). Such empty rows were deleted. Entries representing the same feature were merged: glucose and creatinine each had inputs recorded in two tables with different timestamps, and these were merged into a single feature. Features from the vital signs table each had minimum, maximum and mean values; wherever values were missing, they were missing for all three measures. As these values were strongly correlated, only the mean value was used to obtain one value per feature, as in [26].

Variable transformation

Features from the laboratory measurements each had minimum and maximum values, which were identical in most cases; wherever values were missing, they were missing for both measures. Therefore, to obtain one value per feature, the mean of the maximum and minimum measures was calculated.

Time series processing

Time series measurements were made irregularly, and their sequence length (number of timestamps) varied significantly from one ICU stay to another. Therefore, resampling with grouping by a six-hour time span within each ICU stay identifier was applied, similarly to [27]: each day was broken into four six-hour periods, and all records occurring within the same period were grouped together. As a summary statistic over the resampling period, the mean value was used for numeric features and the maximum value for discrete (binary) features and the labels. After resampling, the maximum sequence length was 141 timestamps and the minimum was one; the mean length was 45 timestamps and the median 33. Figure 1 depicts the density function of the sequence length of the ICU stays (i.e. the number of timestamps).

Figure 1. Kernel density estimate of the sequence length (number of timestamps) of the ICU stays.
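The six-hour resampling can be sketched with pandas, assuming one long-format table per ICU stay; the column names are illustrative:

```python
import pandas as pd

def resample_stay(stay, numeric_cols, binary_cols):
    """Resample one ICU stay's irregular records into six-hour bins:
    mean for numeric features, max for binary features and labels (sketch)."""
    stay = stay.set_index("charttime").sort_index()
    agg = {c: "mean" for c in numeric_cols}
    agg.update({c: "max" for c in binary_cols})
    # Four six-hour bins per day; empty bins are dropped entirely
    return stay.resample("6h").agg(agg).dropna(how="all")
```

Applying this per ICU stay identifier reproduces the described grouping of all records that fall within the same six-hour period.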

Imputation of missing values

The table with ventilation, vasopressor and sedative data, containing categorical non-time-series data, had no missing values; the remaining tables did. Our main purpose was to fairly compare models in terms of predictive performance and interpretability, not to obtain the best possible performance through optimal pre-processing. Thus, the following simple imputation approaches were applied in order to treat all models equally and to ensure a fair evaluation:

1. No imputation: all missing values were simply replaced with zeros.

2. With imputation: a two-step approach was taken. First, the most common value of a feature within each ICU stay identifier was imputed, to stay close to the values typical for that particular stay. Second, where a feature was never measured within an ICU stay and therefore not imputed in the first step, the missing values were imputed with the most common value of the feature across all ICU stays, assumed to be close to the common value of the feature in the population.

The missing labels were carried forward within a resample limit. Even though AKI is considered to last for up to seven days after onset [28], we conservatively limited the carry-forward to four days, as done by [27]: the risk that a label no longer represents the patient's condition increases with time, as the information becomes less relevant and the patient's situation can change. The remaining missing labels were filled with zeros.
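The two-step imputation and the limited label carry-forward can be sketched as follows. This is a simplification with hypothetical column names; note that four days correspond to 16 six-hour bins:

```python
import pandas as pd

def impute_two_step(df, feature_cols):
    """Step 1: fill with the most common value within each ICU stay.
    Step 2: fill what remains with the most common value over all stays."""
    def stay_mode(s):
        m = s.mode()
        return s.fillna(m.iloc[0]) if not m.empty else s
    for col in feature_cols:
        df[col] = df.groupby("icustay_id")[col].transform(stay_mode)
        global_mode = df[col].mode()
        if not global_mode.empty:
            df[col] = df[col].fillna(global_mode.iloc[0])
    return df

def carry_labels(labels):
    """Carry labels forward for at most four days (16 six-hour bins),
    then fill the remaining missing labels with zero (no AKI)."""
    return labels.ffill(limit=16).fillna(0)
```

A feature never measured in a stay is untouched by step 1 and picked up by the population-level fill in step 2, matching the description above.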

Capping

To reduce the influence of data entry errors while preserving most of the information, outliers were capped rather than deleted: extreme values beyond the 1st and 99th percentiles of a feature were replaced with the respective percentile value of that feature.
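A minimal sketch of the capping step, assuming one NumPy array per feature:

```python
import numpy as np

def cap_outliers(x, lower_pct=1, upper_pct=99):
    """Clip extreme values to the 1st/99th percentile instead of
    deleting them, limiting the impact of data entry errors."""
    lo, hi = np.nanpercentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)
```

Clipping keeps every record in the dataset while bounding the value range, unlike outlier deletion.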


Categorical features encoding

Categorical variables (gender, ethnicity group, admission type) represented as text were converted to numeric form using one-hot encoding: each feature was represented as a vector in which the respective category had the value "1" while the other categories had the value "0". This way the dataset consisted exclusively of numeric values that could be fed to the prediction models. We noticed that for a few patients the ethnicity was stated inconsistently across ICU stays (e.g. a patient was admitted to the ICU the first time with ethnicity stated as "Native American" and the second time as "other"). As the prediction was made for each ICU stay identifier independently, this should not have caused significant distortion in our predictions.

Normalisation

We used a min-max scaler to normalise all numerical observations to a scale from zero to one. The categorical variables, already represented as zeros and ones through one-hot encoding, were not affected.
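The one-hot encoding and min-max normalisation steps can be sketched together. This is a simplified stand-in: the thesis code presumably used library implementations (e.g. scikit-learn's MinMaxScaler), whereas the scaling formula is written out explicitly here:

```python
import pandas as pd

def encode_and_scale(df, categorical_cols, numeric_cols):
    """One-hot encode categorical columns and min-max scale numeric
    columns to [0, 1]; the new 0/1 indicator columns are left untouched."""
    df = pd.get_dummies(df, columns=categorical_cols)
    for col in numeric_cols:
        lo, hi = df[col].min(), df[col].max()
        # Guard against constant columns to avoid division by zero
        df[col] = 0.0 if hi == lo else (df[col] - lo) / (hi - lo)
    return df
```

In practice the min and max would be computed on the training set only and reused for the test set, to avoid information leakage.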

Temporal features representation

The first dimension of the data was the number of ICU stays (instances) and the second the number of features. Every ICU stay had multiple timestamps (with varying sequence length), so time can be viewed as a third dimension of the data. Because of the input requirements of LogR, XGBoost and RF, we flattened the data into two dimensions, as done by Bret et al. [29]: the features were repeated as many times as there were timestamps in the ICU stay identifier with the maximal sequence length. Hence, the flattened data had two dimensions: ICU stays, and features times the maximal sequence length (timestamps). Figure 2 illustrates this.

Figure 2. Flattening 3D data to 2D.

To achieve this 2D shape, zero padding was performed for ICU stay identifiers with fewer timestamps than the maximal sequence length. For the LSTM model the data was not flattened; however, to obtain equal sequence lengths within a batch, zero padding was performed separately for each batch. To minimise zero padding, records were first sorted by their sequence length.
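The padding-and-flattening transformation of Figure 2 can be sketched in NumPy (a minimal sketch, not the project code):

```python
import numpy as np

def pad_and_flatten(stays):
    """Zero-pad variable-length stays to the maximal sequence length,
    then flatten time and features into one axis for LogR/XGBoost/RF.

    stays:   list of arrays of shape (n_timestamps_i, n_features)
    returns: array of shape (n_stays, max_len * n_features)
    """
    max_len = max(s.shape[0] for s in stays)
    n_feat = stays[0].shape[1]
    out = np.zeros((len(stays), max_len, n_feat))
    for i, s in enumerate(stays):
        out[i, :s.shape[0], :] = s  # the tail beyond the stay stays zero
    return out.reshape(len(stays), max_len * n_feat)
```

For the LSTM, only the padding step of this sketch would apply (per batch), keeping the three-dimensional shape intact.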


3.2.3. Label description and pre-processing

The unit for each prediction was the ICU stay identifier. The labels for AKI corresponded to the three KDIGO severity stages (1, 2, and 3: listed in increasing severity). The no-AKI label was recorded as “0”. Each ICU stay had more than one label and the number of labels depended on the number of timestamps. For the binary prediction three different thresholds could be used as follows:

1. Predicting any AKI (AKI stages 1, 2 or 3): threshold is 1.

2. Predicting moderate and severe AKI (AKI stages 2 or 3): threshold is 2.

3. Predicting severe AKI (AKI stage 3): threshold is 3.

For the purpose of this study we designed, tuned and compared all models at the threshold of one (no AKI vs any AKI). For the binary prediction, if an ICU stay identifier at a certain timestamp had an AKI stage equal to or greater than the threshold, it was assigned a positive label (i.e. "1").

Our approach can be extended to other severity thresholds and to a different number of hours before the predicted event; the goal of this project, however, is to predict any AKI stage within the next 48 hours.

As we resampled the data, the timestamps were spaced six hours apart. Thus, 48 hours corresponded to eight timestamps (48/6 = 8). We needed to exclude the measurements that were made less than 48 hours before the event and to place the labels at the timestamp 48 hours before it. Hence, we shifted the labels eight timestamps into the past for each ICU stay. The timestamps in the period less than 48 hours before the prediction were removed, as they did not have a known label.
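The 48-hour label shift can be sketched with pandas. This is a toy example assuming a per-stay DataFrame with one row per six-hour timestamp and a hypothetical `aki_label` column.

```python
import pandas as pd

# Sketch of the 48-hour label shift (8 six-hour timestamps): each row's
# target becomes the AKI label observed 8 timestamps later; the last 8
# timestamps have no known label and are dropped.
df = pd.DataFrame({"aki_label": [0] * 10 + [1] * 2})  # toy stay, 12 steps
df["target"] = df["aki_label"].shift(-8)              # look 8 steps ahead
df = df.dropna(subset=["target"])                     # drop unlabeled tail
print(list(df["target"]))  # [0.0, 0.0, 1.0, 1.0]
```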

To compare the LSTM's predictive performance with the other models, and because of the data representation requirements of LogR, XGBoost and RF, we needed to obtain a single label for each ICU stay identifier: positive (“1”) or negative (“0”). Figure 3 illustrates examples of assigning a single negative and a single positive label. ICU stays without a positive label were assigned a single negative label. For such ICU stays, the time point at which we performed a prediction was set to the last charttime of the ICU stay. ICU stays containing at least one positive label (AKI occurrence) were assigned a single positive label. In such ICU stays, the time point at which we performed a prediction was the first occurrence of AKI within the ICU stay.


Figure 3. Illustration of the single label assignment for a single negative label (above) and a single positive label (below). For a single negative label, all labels within an ICU stay were negative. The last timestamp (T10 on the figure) was set as a time point of prediction. Thus, T1 - T10 timestamps were used for the prediction as they corresponded to 48 hours before the event and earlier. For a positive label, the first label that turned positive (T7) was set as a prediction point and thus, T8 - T10 were disregarded, as these timestamps were less than 48 hours before the prediction point, while T1 - T7 were used for the prediction.
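The single-label assignment rule can be sketched as follows (toy per-timestamp labels; the variable names are ours):

```python
import pandas as pd

# Sketch: the prediction point is the first positive timestamp if the stay
# contains any positive label, otherwise the last timestamp of the stay.
labels = pd.Series([0, 0, 0, 1, 1, 0])        # toy per-timestamp labels
if (labels == 1).any():
    point, single_label = (labels == 1).idxmax(), 1   # first positive step
else:
    point, single_label = labels.index[-1], 0         # last timestamp
print(point, single_label)  # 3 1
```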

3.2.4. Models’ study setting

To answer the first research question, “How accurately can we predict AKI with different machine learning models?”, prediction models had to be developed and their predictive performance evaluated. Table 2 summarises the study setting and lists the control, independent and dependent variables.


Table 2. Study setting for the prediction models' development and performance evaluation, listing the control, independent and dependent variables per applicability to an algorithm.

All algorithms
  Control variables: Seed: 42. Features set: 35 features. Maximal ICU stay length: 35 days. Sampling interval: 6 hours. Resample limit: 4 days. Normalisation: min-max. AKI prediction threshold: none vs any AKI. Hours before the prediction: 48 hours.
  Independent variables: Algorithm used: LogR, XGBoost, RF, LSTM. Imputation: most frequent (with imputation), impute zero (no imputation).
  Dependent variables: AUC. Brier score.

LogR specific
  Control variables: Solver: saga. Max iterations: 1000. Class weight: balanced.
  Independent variables: Penalty: l1, l2. C: 0.001, 0.01, 0.1, 1, 10, 100. Fit intercept: True, False.

XGBoost specific
  Control variables: Subsample ratio of the training instances: 0.8. Subsample ratio by each tree: 0.8. Objective: binary logistic. Number of threads to use when loading data: 4.
  Independent variables: Number of estimators: 20, 30, 40, 50, 60, 70, 80. Maximum number of levels in a tree: 5, 7, 9. Minimum sum of instance weights in a child: 1, 3, 5, 7. Gamma: 0, 0.5, 1. Learning rate: 0.01, 0.1.

RF specific
  Control variables: Features to consider at a split: auto. Criterion: gini. Minimum weighted fraction of the sum total of weights required to be at a leaf node: 0.0. Number of leaf nodes: unlimited. Min impurity decrease: 0.0. Min impurity split: None.
  Independent variables: Number of estimators: 100, 200, 300, 400, 500, 600, 700, 800, 900. Maximum number of levels in a tree: 10, 20, 30, 40, 50, 60, none. Minimum number of samples required to split a node: 2, 5, 10. Minimum number of samples required at each leaf node: 2, 4. Bootstrap: True, False.

LSTM specific
  Control variables: Number of epochs: 80. Embedding size: 35 (number of features). Output size: 1 (binary prediction).
  Independent variables: Bidirectionality: True, False. Number of LSTM layers: 1, 2, 3. Dropout: 0% (none), 20%. Learning rate: 0.001, 0.0001.


3.2.5. Machine learning algorithms

Logistic regression algorithm

Binary LogR is a statistical model that predicts a binary outcome. In the logistic model, the log-odds for the target labeled “1” is a linear combination of one or more independent variables. The function that converts log-odds to a probability is the logistic function [30]. Based on a set probability threshold, the estimated probabilities that the binary target equals “1” given input X are converted into classes. Fitting the data is guided by the cost function derived from the statistical principle of maximum likelihood estimation [31]. LogR is a well-known model for classification tasks. However, when the relationships between the features and the outcome are non-linear and features interact with each other, decision trees are known to perform better [19].
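A minimal sketch of the LogR setup described in Table 2 (saga solver, L1 penalty, balanced class weights), fitted here on synthetic data rather than the MIMIC cohort:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic stand-in data; the real input is the flattened ICU matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
clf = LogisticRegression(solver="saga", penalty="l1", C=0.1,
                         class_weight="balanced", max_iter=1000)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # estimated P(y = 1 | X)
print(proba.shape)  # (200,)
```

Thresholding `proba` (e.g. at 0.5) converts the estimated probabilities into class labels.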

Random Forests algorithm

A decision tree is the building block of a RF and is an intuitive model that can be seen as a series of yes/no questions about the data, eventually leading to a prediction, guided by the reduction in Gini impurity. The Gini impurity of a node is the probability that a randomly chosen sample in the node would be incorrectly labeled if it was labeled according to the distribution of samples in the node [32]. The technical details of a decision tree lie in how the questions about the data are formed. A decision tree is a non-linear model built by constructing many linear boundaries. The RF method belongs to the bagging family and is an ensemble classifier: it fits multiple grown decision trees to bootstrapped data and combines their outputs [33]. RF and XGBoost are both popular ensemble tree-based methods; they differ in how the trees are built and in the manner results are combined. RF builds each tree independently, while gradient boosting builds one tree at a time, taking the previously built trees into account. RF combines results at the end of the process by averaging (“majority rule”), while gradient boosting combines results along the way.
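A minimal RF sketch using the final tuned hyperparameters reported in Section 3.2.6, again on synthetic stand-in data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic data stands in for the flattened ICU matrix; the hyperparameter
# values mirror the tuned RF settings from the thesis.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
rf = RandomForestClassifier(n_estimators=700, max_depth=40,
                            min_samples_split=10, min_samples_leaf=2,
                            bootstrap=False, random_state=42)
rf.fit(X, y)
print(len(rf.estimators_))  # 700 independently grown trees
```

The fitted ensemble also exposes `feature_importances_`, the impurity-based scores used later in Section 3.2.8.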

Extreme Gradient Boosting algorithm

As in RF, the basic building block of the algorithm is a decision tree, a non-parametric model. Each node represents a condition on a variable and each leaf corresponds to a prediction. Boosting is an ensemble technique that converts weak learners into strong estimators by sequential, iterative model training. In each iteration, a new model tries to correct the errors made in the previous iteration. Gradient boosting is an approach where in each iteration a new predictor is built to fit the pseudo-residuals of the previous predictor; the predictors are subsequently added together to make the final combined prediction. The gain calculated for the roots of a tree is based on the similarities calculated for all nodes. The gain is the average training loss reduction obtained when using a feature for splitting. It helps to determine how to split the data. Gamma is a parameter to which the gain is compared to determine pruning. Lambda is a regularisation hyperparameter, and increasing it results in more pruning [34].
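The sequential nature of boosting can be illustrated as follows. To keep the example self-contained it uses scikit-learn's GradientBoostingClassifier rather than XGBoost; the hyperparameter values mirror the tuned XGBoost settings.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

# Each added tree fits the pseudo-residuals of the ensemble built so far.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
gb = GradientBoostingClassifier(n_estimators=70, max_depth=7,
                                learning_rate=0.1, random_state=42)
gb.fit(X, y)
# staged_predict_proba exposes the ensemble after each boosting iteration,
# making the one-tree-at-a-time construction visible.
staged = list(gb.staged_predict_proba(X))
print(len(staged))  # 70, one intermediate model per added tree
```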

Long Short-Term Memory Recurrent Neural Network algorithm

Deep learning has achieved great success in predictive performance. When dealing with time-series data, recurrent neural networks (RNN) are known to perform well [35]. RNNs are a class of artificial neural networks that contain loops, allowing information to be stored within the network. This allows them to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process input sequences of variable length. We chose an LSTM model for our project. LSTM is a kind of RNN. LSTMs additionally have a gating mechanism that enables them to store long-term memory in addition to the working memory. The gating mechanism consists of an input gate that determines which information will be stored in the long-term memory; a forget gate that determines which information to discard; and an output gate that produces a new short-term memory after processing the current input, the previous short-term memory and the newly computed long-term memory [36]. Thus, LSTM is a model that makes multiple predictions for an instance over time, i.e. a model for continuous prediction. In our case the LSTM used multiple labels within each ICU stay: one label per six-hour interval (the interval after resampling) for training, and it therefore produced a prediction every six hours.

As our target is binary, we used binary cross-entropy loss with logits in our model [37]. For optimisation we used the adaptive moment estimation algorithm Adam, as it frequently results in competitive overall performance [38]. Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments [39]. The LSTM had the following structure: the first layer in our model was the embedding layer. This layer can compress the input features into a selected embedding size or retain the input size; it is a fully connected layer whose weights are learned during training. The embedding layer provided the data to one to three LSTM layers. As we considered both unidirectional and bidirectional LSTMs, the LSTM layer(s) were followed by a linear combination layer that can combine the bidirectional output. This layer was followed by a dropout layer; when the dropout parameter was set to zero, no dropout was applied. The final layer was a projection layer that produced the output of size one (binary prediction).
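The layer sequence just described can be sketched in PyTorch. The class name `AKILSTM` and the hidden size are illustrative assumptions; the actual thesis implementation may differ in detail.

```python
import torch
import torch.nn as nn

# Hedged sketch: embedding (fully connected) layer -> stacked bidirectional
# LSTM -> linear combination of the two directions -> dropout -> projection
# to a single logit per timestamp.
class AKILSTM(nn.Module):
    def __init__(self, n_features=35, hidden=35, layers=3, dropout=0.0):
        super().__init__()
        self.embed = nn.Linear(n_features, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.combine = nn.Linear(2 * hidden, hidden)  # merge both directions
        self.drop = nn.Dropout(dropout)
        self.project = nn.Linear(hidden, 1)           # one logit per step

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        h, _ = self.lstm(self.embed(x))
        return self.project(self.drop(self.combine(h))).squeeze(-1)

model = AKILSTM()
logits = model(torch.zeros(4, 10, 35))    # batch of 4 stays, 10 timestamps
print(logits.shape)  # torch.Size([4, 10])
```

The per-timestamp logits would be fed to `nn.BCEWithLogitsLoss` during training, matching the loss described above.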

3.2.6. Hyperparameters tuning

The procedure used for hyperparameters tuning is described in Section 3.2.7. Table 2 lists the hyperparameters that were tuned. Hyperparameters were tuned for LogR, XGBoost and RF using a randomized grid search with 5-fold cross validation on a training dataset. Information on the dataset split is provided in the next subsection. As a result, the following final parameters were selected: for LogR - penalty “l1” (lasso regression), regularisation parameter C “0.1”, fit intercept “True”; for XGBoost - number of estimators “70”, maximum number of levels in a tree “7”, minimum sum of instance weight in a child “3”, gamma “0.5”, learning rate “0.1”; for RF - number of estimators “700”, maximum number of levels in a tree “40”, minimum number of samples required to split a node “10”, minimum number of samples required at each leaf node “2”, bootstrap “False”.
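The randomized grid search with 5-fold cross validation can be sketched for LogR as follows (synthetic data; the grid mirrors Table 2):

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic stand-in for the training set; the LogR grid is the one listed
# as independent variables in Table 2.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
grid = {"penalty": ["l1", "l2"],
        "C": [0.001, 0.01, 0.1, 1, 10, 100],
        "fit_intercept": [True, False]}
search = RandomizedSearchCV(
    LogisticRegression(solver="saga", max_iter=1000, class_weight="balanced"),
    grid, n_iter=10, cv=5, scoring="roc_auc", random_state=42)
search.fit(X, y)
print(sorted(search.best_params_))  # ['C', 'fit_intercept', 'penalty']
```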

Appendix A shows the hyperparameters tuning options that we initially considered for the LSTM. However, such tuning would, with the available resources, require more time than the whole thesis project period. Therefore, to limit the computation time to an acceptable amount, we reduced the number of options considered. The parameter grid used for the model's tuning is shown in Table 3.


Table 3. Possible parameters combinations used for the hyperparameters tuning of the LSTM.

The following LSTM setting proved to have the best performance after the hyperparameters optimisation: no imputation; 3 bidirectional LSTM layers; no dropout; learning rate of 0.001.

3.2.7. Models’ predictive performance assessment

For LogR, XGBoost and RF the data was randomly split into training and testing sets with a ratio of 9:1. For the LSTM the dataset was split into training, validation and testing sets with a ratio of 8:1:1. The validation set was used to avoid overfitting of the neural network and to select the best model based on AUC during the training process. Subsequently, the predictive performance of the selected model was evaluated on the independent testing dataset, which was withheld during the model development phase. For a fair comparison we used the same split of ICU stay identifiers: the test dataset used for each model consisted of the data of the same ICU stay identifiers. The training set of LogR, XGBoost and RF was identical and was used for the hyperparameters tuning with 5-fold cross validation. For the LSTM model, part of the training dataset was used as a validation set during the training process, as explained above. Cross validation was not applied for LSTM hyperparameters tuning due to the time constraints of the project; a train/validation/test split was used in this case.
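The 9:1 split, and the further split of the training portion into 8:1 for the LSTM, can be sketched on ICU stay identifiers (synthetic identifiers for illustration):

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Splitting on stay identifiers keeps the same test stays for every model.
stay_ids = np.arange(100)                     # 100 toy ICU stay identifiers
train_ids, test_ids = train_test_split(stay_ids, test_size=0.1,
                                       random_state=42)
# For the LSTM, carve a validation set out of the 90% training portion
# (1/9 of 90 stays = 10 stays), giving an overall 8:1:1 ratio.
train_ids, val_ids = train_test_split(train_ids, test_size=1 / 9,
                                      random_state=42)
print(len(train_ids), len(val_ids), len(test_ids))  # 80 10 10
```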

The LSTM model is a continuous prediction model, meaning that it produces multiple predictions for each ICU stay. The number of predictions per ICU stay equals the sequence length: it produces a prediction of AKI occurring within the next 48 hours at each time point (every six hours). For the purpose of comparing the predictive performance of the LSTM with the other models, we used the predictions made at the time of the single label assignment as described in 3.2.3. Please refer to Figure 4 for an illustration. In Figure 3, which provides examples of assigning a single label, this time point would be T10 for the negative label and T7 for the positive label. To additionally evaluate the LSTM's continuous performance we used all of its predictions, noting that these were made at six-hour intervals.

Figure 4. Schematic illustration of picking the time point and LSTM prediction for performance evaluation for comparison with other models.

Precision-recall AUC is more informative than AUC when evaluating binary classifiers on imbalanced datasets [40]. The imbalance ratio of a dataset is the number of majority instances divided by the number of minority instances. In our dataset the imbalance ratio was around 1.4, and datasets are generally considered imbalanced only if the imbalance ratio exceeds 1.5 [41]. Consequently, our dataset was determined not to be imbalanced, and we therefore chose AUC to evaluate the models' discrimination. Additionally, we computed the Brier score, a criterion that provides combined information on the accuracy and robustness of a model. To assess the calibration of the best performing model of each algorithm, we visualised the predicted probability distributions and calibration curves.
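The two evaluation metrics can be computed with scikit-learn as follows (toy labels and probabilities for illustration):

```python
from sklearn.metrics import roc_auc_score, brier_score_loss

# Toy ground-truth labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true, y_prob)          # discrimination
brier = brier_score_loss(y_true, y_prob)     # mean squared error of probs
print(round(auc, 2), round(brier, 3))  # 0.75 0.158
```

The Brier score is the mean squared difference between the predicted probabilities and the actual outcomes, so lower values are better, while a higher AUC indicates better discrimination.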

3.2.8. Models’ interpretability assessment

To interpret the developed models we decided to apply the feature importance method. Feature importance refers to techniques that assign a score to input features based on how useful they are for the prediction [42]. We selected the feature importance method because it can explain different types of data and can be applied to various algorithms, making it suitable for comparing the models. Moreover, as this method is global, it has higher coverage in interpretability than local methods, which are often only effective for a limited number of data instances. Another advantage of this method is that it is rather intuitive and provides insight into the relationship between each model's prediction and the features. For each model we computed feature importances on the test dataset. The argument for using the test data is that we should use unseen data, since feature importance relies on measurements of the model error. Feature importance based on training data can produce misleadingly high importance scores when in reality the predictor was simply overfitting. With the test dataset, we can therefore reason about how much a feature contributes to the performance of the model on unseen data [19].

Feature importance for LogR

In LogR, the set of coefficients used in the weighted sum to make a prediction is found during model training. As our data had been scaled prior to fitting the model, the coefficients can be used as a type of feature importance score. A positive coefficient means that an increase of the feature value increases the predicted probability for class “1”, while a negative coefficient means that an increase of the feature value decreases it. The coefficients do not sum up to one. As our data was represented in such a way that each feature was repeated as many times as there were timestamps, we needed to apply an aggregate function to the coefficients. Thus, we calculated the mean of the obtained coefficients for each feature to obtain a single averaged coefficient per feature that represented the relative feature importance. As coefficients in logistic regression are wrapped in a logistic function, they do not influence the probability linearly, as in linear regression. A coefficient X for a numeric feature should be interpreted as follows: increasing the respective feature by one unit multiplies the odds (the probability of class “1” divided by the probability of class “0”) by exp(X). For binary features, the interpretation is that changing the feature to the other category changes the estimated odds by exp(X) [19].
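A small worked example of the coefficient aggregation and the odds interpretation. The feature-major layout of the flattened coefficients is an assumption for illustration; the real layout depends on the flattening order.

```python
import numpy as np

# Sketch of the per-feature aggregation: with the flattened representation,
# each feature has one coefficient per timestamp; averaging yields a single
# score per feature. The coefficient values are stand-ins.
n_features, max_len = 3, 4
coefs = np.arange(n_features * max_len, dtype=float)
per_feature = coefs.reshape(n_features, max_len).mean(axis=1)
print(per_feature.tolist())  # [1.5, 5.5, 9.5]

# Odds interpretation: a coefficient X means a one-unit increase in the
# feature multiplies the odds P(class 1) / P(class 0) by exp(X).
print(round(np.exp(0.7), 2))  # 2.01
```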

Feature importance for XGBoost and RF

For the ensemble tree models, i.e. XGBoost and RF, the feature importance method was embedded within the respective software packages. The importance of the features of the XGBoost model was measured by the gain criterion, namely the average training loss reduction obtained when using a feature for splitting [43]. In the RF model the importance is determined by the total decrease in node impurity (weighted by the probability of reaching that node) averaged over all trees of the ensemble. Node impurity and loss reduction are clearly related, so these scoring methods are similar. As our data was represented in such a way that each feature was repeated as many times as there were timestamps, we needed to apply an aggregate function to obtain a single score per feature. We summed the scores per feature; for the ensemble tree models the resulting feature importances across all features sum up to one.

Feature importance for LSTM

To obtain the feature importance for the LSTM model we applied the integrated gradients method [44]. It provides attributions for each instance. To understand the attributions for the model as a whole, they were averaged across all inputs to obtain an aggregated value per feature. The obtained feature importances show the relative contribution of each feature to the prediction. Similarly to the LogR coefficients, the derived feature attributions do not sum up to one and can be positive or negative, indicating a positive or negative association with the prediction, respectively.

3.2.9. Statistical analyses

Data was extracted in the form of six CSV files. First, SQL views were created in the local MIMIC-III database. Subsequently, the data from each view was exported as a CSV file with pgAdmin III v1.22.2. Each exported file was named after the corresponding view. The extracted tables (our raw data) were:

1. kdigo_stages_measured.csv contained time-series measurements of creatinine, urine output for the last six, 12 and 24 hours and the respective labels.

2. icustay_detail-kdigo_stages_measured.csv contained non-temporal variables of patient demographics such as: age (numerical), gender (binary), ethnicity group (categorical) and type of admission (categorical).

3. labs-kdigo_stages_measured.csv contained time-series data of the laboratory tests.

4. vitals-kdigo_stages_measured.csv contained time-series data of the measurements of vital signs.


5. vents-vasopressor-sedatives-kdigo_stages_measured.csv contained non-temporal information on whether mechanical ventilation, vasopressor or sedative medications were applied.

6. diagnoses_icd_aki_measured.csv contained information on a number of diagnoses potentially relevant for AKI, but related to the whole hospital stay. Most of the diagnoses (935 out of 939) occurred in fewer than 20% of patients, and there was also no certainty about when during the hospital stay such a diagnosis became known. Thus, to avoid potential biases, it was decided not to include this table in this thesis. Moreover, the current representation of this table caused a problem with memory usage. Therefore, only tables 1-5 were used for this study. The SQL scripts, containing more information about the data extraction, are provided in the MIMIC code repository [45].

All data pre-processing and machine learning modeling tasks were performed on a MacBook Pro with a 2.3 GHz Dual-Core Intel Core i5 processor and 8 GB 2133 MHz LPDDR3 memory, using Python version 3.8.3 (64-bit) on macOS (Darwin, OS release 19.6.0). We used Jupyter Notebook and the main packages pandas 1.0.5, numpy 1.18.5 and scikit-learn 0.23.1. The LogR and RF classifiers were implemented using scikit-learn 0.23.1, XGBoost using the xgboost package version 1.1.1, and the LSTM using PyTorch 1.6.0. The automated grid search was performed using scikit-learn 0.23.1. The figures were drawn using matplotlib 3.2.2 and seaborn 0.10.1. For the feature importance analysis of the LSTM, the package captum 0.3.0 was additionally applied.


4. RESULTS

4.1. Population description

The characteristics of the selected population are summarised in Table 4, stratified by the outcome (whether the patient had AKI of any stage or not). In the MIMIC-III dataset the age of all patients older than 89 years was recorded as 300 years. To arrive at more realistic statistics for this feature, we set the age of patients older than 89 to 90 years for the purpose of calculating the mean and the respective percentiles. For the same reason, capping was applied, as described in the Methods, to avoid distortion of the data description by outliers.

Table 4. Characteristics and ICU stay details of the selected population stratified by presence or absence of AKI.

                                     Overall             No AKI             AKI of any stage
Patients
  Number of patients                 34,516 (100%)       9,353 (27.1%)      25,163 (72.9%)
Gender
  Male patients                      19,603              5,251              14,352
  Female patients                    14,913              4,102              10,811
Ethnicity of patients
  White                              24,739              6,548              18,191
  Black                              2,675               763                1,912
  Hispanic                           1,098               400                698
  Asian                              829                 237                502
  Native                             18                  6                  12
  Unknown                            4,220               1,022              3,198
  Other                              937                 287                650
Hospital admissions
  Admissions to the hospital         44,136              9,974              34,162
Hospital admission types
  Emergency admissions               36,547              8,496              28,051
  Elective admissions                6,462               1,250              5,212
  Urgent admissions                  1,127               228                899
Patients with only one ICU stay      26,824              8,630              18,194
Patients with 2-6 ICU stays          7,491               721                6,770
Patients with 7-23 ICU stays         201                 2                  199
Number of ICU stays
  Number of ICU stays                47,751              10,228             37,532
Age during ICU stays
  Mean, years                        75.1                68.3               77.0
  25% / 50% / 75%, years             53.6 / 66.2 / 77.9  46.3 / 60.1 / 74.4 55.6 / 67.6 / 78.5
LOS of ICU stays
  Mean, days                         11.1                7.3                12.1

Overall, 95 percent of the ICU lengths of stay were shorter than 35 days. The probability density function of the length of stay for all ICU stays is shown in Figure 5, and for the stays that did not exceed 35 days in Figure 6.

Figure 5. Probability density function (kernel density estimation) of ICU length of stay.

Figure 6. Probability density function (kernel density estimation) of ICU length of stay for stays that did not exceed 35 days (95% of all ICU stays).

The occurrence of AKI of different stages varied amongst ICU stays. During some ICU stays only AKI of a certain stage manifested, while during others all three stages of AKI developed within the same ICU stay. Figure 7 illustrates the intersection of occurrences of AKI of different stages in the ICU stays that we studied. The black dots mark the presence of AKI of the respective stage, while the grey dots represent its absence. The bars represent the number of ICU stays that belong to each intersection category.


4.2. Features description

Descriptive statistics for the numerical features and for the only categorical feature that had missing values (ethnicity) are shown in Table 5.

Table 5. Features descriptive statistics.

Feature                        Mean (SD)        25th/50th/75th percentiles   Missing values within the feature (%)

Table kdigo_stages_measured.csv
  Creatinine after merging     1.57 (1.55)      0.70 / 1.00 / 1.70           81.24
  Urine output (last 6 hours)  1.54 (1.57)      0.60 / 1.06 / 1.88           12.31
  Urine output (last 12 hours) 1.43 (1.26)      0.63 / 1.07 / 1.78           12.31
  Urine output (last 24 hours) 1.38 (1.11)      0.66 / 1.09 / 1.73           12.31

Table vitals-kdigo_stages_measured.csv
  Heart rate                   101.04 (32.43)   77 / 92 / 118                12.95
  Systolic blood pressure      121.49 (23.09)   104 / 119 / 137              36.02
  Diastolic blood pressure     60.21 (14.02)    50 / 59 / 69                 36.04
  Respiratory rate             20.18 (5.85)     16 / 20 / 24                 30.55
  Oxygen saturation            97.18 (2.73)     96 / 98 / 99                 32.68

Table icustay_detail-kdigo_stages_measured.csv
  Ethnicity (unknown)          categorical      categorical                  10.40

Table labs-kdigo_stages_measured.csv
  Anion gap                    13.71 (3.64)     11 / 13 / 16                 43.09
  Bicarbonate                  25.26 (4.92)     22 / 25 / 28                 42.19
  Chloride                     103.82 (6.09)    100 / 104 / 108              36.89
  Glucose after merging        137.50 (51.76)   104 / 126 / 157              15.42
