Predicting the length of stay for optimizing hospital level capacity planning in Dutch hospitals

Master thesis

Vincent Stangenberger

Master of Medical Informatics, University of Amsterdam

Graduate intern, LOGEX bv

Personal details

Vincent Stangenberger, MSc

10320830, vincent.stangenberger@gmail.com

Supervision

Rolf Bremmer, PhD

Senior Associate, rolf.bremmer@logex.nl

Lucas van Rossem, MSc

Senior Associate, lucas.vanrossem@logex.nl

Danielle Sent, PhD

Assistant Professor, d.sent@amc.uva.nl

Ameen Abu-Hanna, PhD

Full professor of Medical Informatics, a.abu-hanna@amc.uva.nl

Place of the scientific research project

Abstract

A hospital needs to treat its patients with a limited amount of resources. Policy decisions have to be made in order to allocate resources optimally within a hospital. An important metric often used for these policy decisions is the Length of Stay (LoS). The LoS is influenced by multiple patient-related and organizational factors and is difficult to predict on a patient level due to its skewed distribution. We hypothesize that the LoS on patient level can be accurately predicted by incorporating the patient history into a predictive model. However, the factors that influence the LoS and a method to accurately predict the LoS are still unknown. Here we show factors that are independently associated with the LoS and how to model the LoS to create accurate predictions. We found that factors related to patient history have a strong association with the LoS, while time-related factors did not have an association with the LoS. Furthermore, we found that a Support Vector Regressor could accurately predict the LoS with a Mean Absolute Error of 1.4 days. In addition, a Bayesian attention-based recurrent neural network is able to predict the LoS of patients with a higher LoS more accurately. Our results demonstrate a clear set of factors that are associated with the LoS and show the ability to accurately predict the LoS. We anticipate that our research on the identified factors and the predictive modelling of the LoS will be a starting point for improving decision making related to capacity planning in healthcare. With the ability to predict the LoS, the discharge policy could be improved. By making tailor-made treatment plans, patient satisfaction will increase and a cost reduction will be obtained. Furthermore, we hope researchers will use our methodology for building and evaluating prediction models to improve LoS prediction.

Keywords — Length of Stay, Supervised Machine Learning, Neural Networks, Knee Replacement

Summary

A hospital has to treat its patients with a limited amount of resources. To distribute these resources as optimally as possible within a hospital, policy decisions have to be made. An important aspect of these decisions is the length of stay (LoS) of patients. The length of stay is influenced by multiple patient-related and organizational factors and is difficult to predict at the patient level because of its skewed distribution. We hypothesize that the length of stay of a patient can be accurately predicted by using the patient history in the predictive model. However, it is unknown which factors influence the length of stay and which method can accurately predict it. In this thesis we show which factors influence the length of stay and how the length of stay can be predicted accurately. We found strong associations between factors related to medical history and the length of stay, while we found no associations between time-related factors and the length of stay. In addition, we found that a Support Vector Regressor predicts the length of stay most accurately, with a Mean Absolute Error of 1.4 days. Furthermore, our self-built predictive model was better able to predict patients with a longer length of stay. Our results show a clear set of factors that are associated with the length of stay and also show that the length of stay can be predicted accurately. We expect that this research into the identified factors and the prediction of the length of stay can serve as a starting point for improving decision making in healthcare. With the ability to predict the length of stay, the discharge policy of hospitals can be further improved. By making personal treatment plans, patient satisfaction will increase and a cost reduction will be realized. In addition, we hope that researchers will use our methodology for building and evaluating predictive models to improve the prediction of the length of stay.

Keywords — Length of Stay, Supervised Machine Learning, Neural Networks, Knee Replacement

Acknowledgement

During this internship and the writing of this thesis, I learned a lot. I particularly enjoyed the room for experimentation at LOGEX, where I was encouraged to push my boundaries further. I also enjoyed the welcoming and friendly atmosphere at LOGEX. On my first day I was welcomed with the words: “welcome to the LOGEX family”. From this day on, I truly felt that I was part of the family.

In this section I would like to thank everyone who has contributed to this thesis. First, I would like to thank my daily supervisors from LOGEX. Rolf and Lucas, thank you for the opportunity to do my internship at LOGEX. I enjoyed our discussions about my thesis and the future of healthcare in general. I hope that I fostered your enthusiasm about machine learning and predictive modeling. I also want to thank my supervisors from the university. Ameen, thank you for our in-depth discussions and your critical feedback. Although we approach predictive models from a different angle and tend to stick with it, I value the complete view it gave me. Danielle, thank you for your extensive supervision and helpful feedback during the first phase of the internship. It helped me to be successful later on. I would also like to thank the people who reviewed my thesis before I dared to hand it in. Willemijn, Alma, and Frank, thank you for your critical notes. Finally I would like to thank my family for their support during this whole thesis period.

Hospitals are faced with the task of treating their patients regardless of the given resources. In classical economics these resources typically include land, labour, and capital. However, with the rising costs of healthcare and the increasing number of treatment options, the available resources are often scarce [3]. In combination with the wish of the government and the medical professionals to improve quality of care, an incentive to work cost-effectively has emerged. Constantly assessing, comparing, and optimizing resource allocation leads to improved quality and a reduction of overall healthcare costs [4]. Different efficiency indicators have been used to monitor quality of care. These indicators can be grouped into structure, process, and outcome indicators [8]. For example, the volume of procedures performed by a physician is often an indicator of “being good” at this specific procedure. Other examples of indicators are the number of beds, the number of clinical personnel, and the allocation of budget [9]. Still today, it remains a challenge to make informed decisions based on these indicators, even though these decisions affect adverse events and patient-related outcomes, which contribute substantially to the costs. Capacity planning and data-driven decision making on a tactical level have proved to be beneficial for cost-effective utilization of the provided resources.

There are various methods to plan utilization of scarce resources [7]. Often these methods use data about the Length of Stay, arrival rates, and patient volumes. The Length of Stay (LoS) is defined as the duration of a single episode of hospitalization measured in number of inpatient days as shown in Definition 1.1. LoS is strongly related to costs and to quality of treatment and therefore an important measure for hospitals to monitor [9].

Definition 1.1. Length of Stay (LoS) is the duration of a single episode of hospitalization measured in the number of consecutive inpatient days. The LoS is calculated by subtracting the day of admission from the day of discharge for a single patient.
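The calculation in Definition 1.1 can be illustrated with a minimal sketch. The function name `length_of_stay` and the example dates are hypothetical and only serve as an illustration; they are not taken from the data set used in this thesis.

```python
from datetime import date


def length_of_stay(admission: date, discharge: date) -> int:
    """Return the LoS in days: day of discharge minus day of admission."""
    if discharge < admission:
        raise ValueError("discharge date lies before admission date")
    return (discharge - admission).days


# Hypothetical episode: admitted on 1 March, discharged on 4 March.
example_los = length_of_stay(date(2017, 3, 1), date(2017, 3, 4))
```

Under this definition a patient who is admitted and discharged on the same day has a LoS of 0 days.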

Cost-effective capacity planning is only achievable when predictions are used to anticipate logistic events. We obtain such predictions using prediction models that use historic data in order to predict the future. Because the LoS is strongly correlated with cost and quality of care, the prediction of the LoS is a research topic of interest in healthcare. Across many sub-domains the LoS is investigated, often in combination with mortality and re-admissions. It is hard to predict the LoS because its distribution across individual patients is skewed, often resembling a Poisson distribution. Figure 1.1 shows a typical distribution of the LoS as found by Carter and Potts [1]. Electronic health record data of knee replacement surgery at one UK hospital was used to create Figure 1.1. Newer predictive models may be better at dealing with skewed data but have not yet been applied to the problem at hand.

Figure 1.1: Typical distribution of the LoS

Most developed models did not have a high predictive performance for predicting the LoS [1, 10, 11]. Models that proved to have predictive performance could not be generalized and were only effective in very specific patient groups. This means that it is hard to predict the LoS for an individual patient but easier for patients clustered in diagnosis groups. One approach often used in practice is the Average Length of Stay (ALoS). The ALoS is often used instead of the LoS because the ALoS is calculated using patient groups. However, a LoS for each individual patient would be more useful for capacity planning, since the ALoS is of limited use due to the skewness shown in Figure 1.1. For example, when there is a wide range of LoS values within a specific patient group, the estimated ALoS does not represent this patient group.
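A small numerical sketch may clarify why the ALoS can misrepresent a skewed patient group. The LoS values below are invented for illustration and are not taken from Carter and Potts or from the thesis data.

```python
from statistics import mean, median

# Hypothetical LoS values (in days) for ten patients in one patient group.
# Most patients stay 2-5 days; two long stays create a right-skewed tail.
los_days = [2, 2, 3, 3, 3, 4, 4, 5, 9, 21]

alos = mean(los_days)          # Average Length of Stay for the group
typical = median(los_days)     # the "middle" patient
share_below_alos = sum(d < alos for d in los_days) / len(los_days)
```

In this invented group the ALoS is 5.6 days, while the median is 3.5 days and 80% of the patients stay shorter than the ALoS. Planning beds for each patient on the basis of the ALoS would therefore overestimate the stay of most patients and strongly underestimate the stay of a few.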

It remains a challenge to build a predictive model for the LoS which is applicable for the general hospital population. However, the need for such a predictive model increases when adverse events, such as a prolonged LoS, could be anticipated on and costs could be reduced. Eventually with accurate LoS predictions on a patient level, personal treatment plans can be made which facilitates communication with the patient and makes the patient expectations manageable. This will result in increased patient satisfaction and a cost reduction.

1.1 Knee replacements

In this thesis we will focus on the prediction of the LoS after knee replacement surgery, since it is a highly standardized procedure. A knee replacement is a surgical intervention where the whole or half of the knee is replaced with a prosthesis. First the surgeon makes a vertical incision on the knee. The worn-out cartilage becomes visible and is removed completely. The top of the femur and the tibia are also removed to make room for artificial prosthetic parts. Between these parts an artificial prosthetic disk is placed to make it easy to move the knee. The patella is sometimes removed and replaced with an artificial one. However, recent scientific literature advises keeping the natural patella in place. There is a national medical guideline for knee replacements in the Netherlands [5].

The number of knee replacements increases every year [12]. In the Netherlands a total of 19,566 knee replacement surgeries took place in 2010 [2]. These surgeries are performed in any general or specialized hospital.

In most cases knee replacement is performed because of osteoarthritis in the knee. This means that the cartilage in the knee is no longer functional in absorbing pressure. This causes friction on the femur and tibia. Patients experience pain and often immobility. There are several other reasons to consider knee replacement surgery such as earlier damage or removal of the meniscus, avascular necrosis, mucosal disease such as rheumatism, gout or an infection, and bleeding in the joint due to diseases such as hemophilia [5].

In addition to these causes, certain risk factors have been identified for knee replacement surgery. Knee replacement surgery often occurs in patients above the age of 50. It is also known that women are more susceptible to a diagnosis related to knee replacements. Furthermore, overweight, heredity, heavy load on the joints due to work or sports, higher bone mass, and damage to the cartilage are associated with the need for knee replacements [6].

1.2 Research questions and objectives

This thesis addresses the need for a high-quality prediction model to predict the LoS. We answer the following two distinct research questions.

1. Which factors are associated with the LoS?

• Which patient characteristics are associated with the LoS?

• Which organizational and other factors are associated with the LoS?

2. What is the predictive performance of different families of predictive models on a LoS data set?

The aim of this project is to investigate the factors that are independently associated with the LoS and to build and compare models on the prediction of the LoS for an individual patient undergoing knee replacement surgery. The scientific contribution of the thesis is twofold. First, we present a list of factors that are independently associated with the LoS. Second, we create a benchmark of predictive models for this specific learning task and assess the possibility of using these predictive models in a real clinical setting.

1.3 Outline

This thesis is structured as follows. First, literature is searched to find the factors that are associated with the LoS in general and for knee replacements in particular. Furthermore, the current state of predictive models in medicine is researched. This leads to an understanding of available predictive models and a list of variables that are used for the predictive modelling. These two points are discussed in Section 2.1 of Chapter 2. Chapter 3 discusses the methodology and the data sources used. The predictive models found in the literature research and additional predictive models are compared in a benchmark. In addition, the benchmark is extended with a self-developed predictive model. All predictive models in the benchmark are evaluated on the same learning task. This is summarized in Chapter 4. Chapter 5 discusses our findings in relation to current scientific literature and provides an overall conclusion.

References

[1] Evelene M Carter and Henry WW Potts. “Predicting length of stay from an electronic patient record system: a primary total knee replacement example”. In: BMC medical informatics and decision making 14.1 (2014), p. 26.

[2] CBS. Operaties in het ziekenhuis; soort opname, leeftijd en geslacht, 1995-2010. 2014. url: http://statline.cbs.nl/StatWeb/publication/?VW=T&DM=SLNL&PA=80386NED&LA=NL (visited on 11/29/2017).

[3] CBS. Zorguitgaven; kerncijfers. 2017. url: http://statline.cbs.nl/Statweb/publication/?DM=SLNL&PA=83037NED (visited on 05/18/2017).

[4] Mark R Chassin et al. Accountability measures: using measurement to promote quality improvement. 2010.

[5] Federation of medical specialists. Guideline Totale knieprothese. https://richtlijnendatabase.nl/richtlijn/totale_knieprothese/totale_knieprothese_-_korte_beschrijving.html. Online; accessed 21 November 2017. 2013.

[6] Henrik Husted, Gitte Holm, and Steffen Jacobsen. “Predictors of length of stay and patient satisfaction after hip and knee replacement surgery: fast-track experience in 712 patients”. In: Acta orthopaedica 79.2 (2008), pp. 168–173.

[7] Spencer S Jones et al. “Forecasting daily patient volumes in the emergency department”. In: Academic Emergency Medicine 15.2 (2008), pp. 159–170.

[8] Jan Mainz. “Defining and classifying clinical indicators for quality improvement”. In: International Journal for Quality in Health Care 15.6 (2003), pp. 523–530.

[9] Petsunee Thungjaroenkul, Greta G Cummings, and Amanda Embleton. “The impact of nurse staffing on hospital costs and patient length of stay: a systematic review”. In: Nursing Economics 25.5 (2007), p. 255.

[10] Ilona Willempje Maria Verburg et al. “Which models can I use to predict adult ICU length of stay? A systematic review”. In: Critical care medicine 45.2 (2017), e222–e231.

[11] Ilona WM Verburg et al. “Comparison of regression methods for modeling intensive care length of stay”. In: PloS one 9.10 (2014), e109684.

[12] Alexander M Weinstein et al. “Estimating the burden of total knee replacement in the United States”. In: The Journal of bone and joint surgery. American volume 95.5 (2013), p. 385.

Chapter 2

2.1 Introduction

In this chapter related literature is discussed. First, the search strategy is discussed to explain how literature was found. Second, the factors found in literature that are associated with the LoS are presented. Third, the predictive models used to predict the LoS are presented and discussed, followed by the predictive models used in medicine in general.

2.2 Search strategy

A scoping literature review was performed to find related literature. PubMed and Google Scholar were searched for relevant literature. The following search terms and variations of these search terms were used. Only Dutch and English articles were included.

• Length of Stay

• Prediction

• Predictive model

• Knee replacement

Articles that presented factors that influenced the LoS or articles that described or used predictive models to predict the LoS were included. The search results were first assessed by their title and then abstract. Next the full text was read and the article was included if applicable to our research. The references of relevant articles were used to obtain more relevant articles.

2.3 Results

Factors associated with Length of Stay for knee replacements

Six studies that found associations between factors and the LoS for knee replacements are included. However, often only a limited set of factors is analyzed because of restrictions of the available data set [3]. The factors found in literature can be grouped into four categories:

• Static patient specific variables

• Physiological variables

• Temporal variables

• Organizational variables

Static patient specific variables are factors which do not change frequently and are specific for the patient. Examples are age of the patient, place of residence, and co-morbidity scores. Physiological variables are related to the physiological state of the patient. Blood levels, results of medical tests, and the ability to use muscles are examples of physiological variables. Note that co-morbidity scores are not grouped into the physiological variables since the co-morbidity scores are aggregates of physiological variables. The temporal variables have an aspect related to time or a duration. The day of admission, the time between visits, and the time from surgery to mobility are examples of temporal variables. Jensen et al. found that diseases, symptoms, and treatments within a time span are related [17]. The last group of factors consists of organizational variables. This group of variables is related to the structure, process, and outcomes of a single hospital. Examples of organizational variables are the consultant who advised the surgery, whether a certain procedure, such as an X-ray scan, is performed, and the annual number of knee replacement surgeries.

In general, most studies agree that age, gender, admission date, and co-morbidity have a high impact on the LoS. An extensive list of factors that are associated with LoS for knee replacements can be found in Appendix A. For the specific case of LoS for knee replacement surgery some other relevant factors were found in more specific settings. Carter and Potts found that age, gender, consultant, discharge destination (home, hospital), day of admission, social deprivation, and ethnicity had a significant impact on the LoS [3]. Kulinskaya, Kornbrot, and Gao found that admission method, discharge destination, provider (hospital) type, specialty, and National Health Service (NHS) region had a significant impact on the LoS [22]. Smith et al. found that year of admission, consultant, the stair score, the walking-aid score, and age had a significant impact on the LoS [29]. Escalante and Beardmore found that the use of bone cement and an operating time of 6 hours or longer also had an impact on the LoS [13]. It is hard to say which variables will influence the LoS and how large their effect will be, since these studies were performed in different settings and often with a low number of patients. Nevertheless, age, gender, admission date, and co-morbidity appear to have the most effect on the LoS.

Furthermore, the Charlson co-morbidity index is an important measure to predict the LoS [12]. The Charlson co-morbidity index is calculated by assigning points to specific diagnoses and summing these points. In addition, the American Society of Anaesthesiologists (ASA) score is also a commonly used variable to predict the LoS [28]. The ASA score is determined by the anesthesiologist. This score is used to assess the fitness of a patient to undergo surgery. Dobbins et al. showed that the Charlson co-morbidity index is more informative than the ASA score when predicting the LoS [12].
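The point-summing principle of the Charlson co-morbidity index can be sketched as follows. This is a simplified illustration, not the implementation used in this thesis: the dictionary contains only a subset of the conditions, the condition names are hypothetical labels, and the weights should be checked against the published index before any real use.

```python
# Illustrative subset of Charlson-style weights (points per condition).
# Assumption: the labels are made up; real data would use diagnosis codes.
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "diabetes_uncomplicated": 1,
    "hemiplegia": 2,
    "moderate_severe_renal_disease": 2,
    "moderate_severe_liver_disease": 3,
    "metastatic_solid_tumor": 6,
}


def charlson_score(diagnoses):
    """Sum the points of all distinct known conditions of one patient."""
    # set() ensures a condition that is recorded twice is counted once;
    # conditions that are not part of the index contribute 0 points.
    return sum(CHARLSON_WEIGHTS.get(d, 0) for d in set(diagnoses))


example_score = charlson_score(
    ["myocardial_infarction", "hemiplegia", "hemiplegia", "osteoarthritis"]
)
```

For the hypothetical patient above the score is 1 + 2 = 3, because the repeated condition is counted once and osteoarthritis is not part of the index.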

Predictive modeling of the Length of Stay in general

Predictive modeling in medicine is a topic of interest for many research groups because of the potential impact on healthcare [18]. Often conventional statistical data analysis is used in medicine [3, 31, 33]. The conventional predictive models use statically captured data such as age or certain blood measures determined at one point in time. Conventional predictive models, such as Cox regression or logistic regression models, are effective at finding associations between risk factors and outcome. However, conventional predictive models do not excel in predictive performance since they are simple models which are not optimized for prediction and cannot use all data, such as unstructured data [33]. Machine learning models are an addition to the conventional predictive models because they are optimized for prediction tasks and use computationally intensive techniques. These machine learning models can be further divided into traditional machine learning models and deep learning models. Traditional machine learning models and statistical models are tuned by changing their input variables (feature engineering), while deep learning models are often used in combination with unstructured data and have a different approach to predictive modeling. With deep learning the architecture of the model determines good performance (architecture engineering). To address these differences, the conventional statistical models and traditional machine learning models are discussed in this section. The deep learning models are discussed in the Representation learning section.

Jiang et al. researched the use of predictive models in medicine [18]. In general, the most popular application areas for predictive models are cancer, neurology, and cardiology. Within these application areas the predictive models are applied to early detection and diagnosis, treatment, outcome prediction, and prognosis evaluation. Especially in diagnostic imaging deep learning is widely used. Looking at the predictive models themselves, Jiang et al. found that support vector machines and neural networks, especially convolutional neural networks, are the two most used predictive models [18]. Appendix B can be consulted for further understanding of the discussed predictive models and predictive models in general.

The most used method for predicting LoS for knee replacements is multivariate regression [3, 16, 29, 32]. No literature was found using other predictive models for predicting the LoS for knee replacements. However, for prediction of the LoS in other settings, other predictive models have been used. Barnes et al. compared three supervised machine learning algorithms to predict the probability of discharge for the general hospital population [1]. Normal logistic regression, tree-based methods (random forests), and ensemble techniques were compared with the judgment of clinicians. Carter and Potts recognized the skewed distribution of the LoS and incorporated this into their modelling approach. Both Poisson regression and negative binomial regression are appropriate methods for modeling skewed data [3]. Jones et al. also compared different approaches to model temporal data for the purpose of predicting patient volume at a given time [20]. A simple linear regression is used as reference method and is compared with seasonal autoregressive integrated moving average, exponential smoothing, time series regression, and neural networks. Time series regression proved to be the best method in this specific case. However, simple linear regression also produced robust results and Jones et al. argue that it is easier to use simple linear regression than more complex methods such as time series regression. Furthermore, neural networks are becoming a more popular predictive model to use in medicine [27, 35].

An overview of the models found in literature is shown in Table 2.1. For each model, the table shows the learning task, the study that used the model, and the results reported by that study. Since the studies have different outcomes and report different error measures, we grouped the results of the studies into four groups: poor results, moderate results, robust results, and mixed results. The measures reported and discussed by each study, together with the expectations of its authors, were the decisive factor in assigning models to a group. Poor results, moderate results, and robust results represent poor to high performance, while mixed results represents an inconsistent performance of the predictive model as reported by the study. Linear and logistic regression appear to have mixed results on modeling the LoS because of the skewed nature of the LoS distribution. Neural networks also appear to have mixed results due to differences in architectural design.
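To make the idea of Poisson regression for a skewed count outcome such as the LoS concrete, the sketch below fits a model with one predictor, log(mu) = b0 + b1 * x, using Newton's method in plain Python. It is a minimal illustration under stated assumptions, not the modelling approach of Carter and Potts or of this thesis: the function name `fit_poisson`, the predictor `x`, and the six data points are all hypothetical.

```python
import math


def fit_poisson(x, y, iterations=25, tol=1e-10):
    """Fit log(mu) = b0 + b1 * x to count data y by Newton's method."""
    # Start from the intercept-only model: mu equals the mean of y.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the Poisson log-likelihood with respect to (b0, b1).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian (Fisher information), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system in closed form.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical data: x could be a co-morbidity score, y the LoS in days.
x = [0, 1, 2, 3, 4, 5]
y = [1, 2, 2, 3, 5, 8]
b0, b1 = fit_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
```

In such a model exp(b1) can be read as the multiplicative change in the expected LoS per unit increase of the predictor. When the variance of the LoS is clearly larger than its mean (over-dispersion), a negative binomial regression is usually the more appropriate choice of the two.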

Table 2.1: Predictive models found in literature

Predictive model | Outcome | Study | Results
Linear regression | LoS (continuous outcome) | Husted, Holm, and Jacobsen, Smith et al., Verburg et al. [16, 29, 32] | Mixed results
Logistic regression | LoS (binary outcome) | Barnes et al., Verburg et al. [1, 32] | Mixed results
Random forests | Probability of discharge | Barnes et al. [1] | Moderate results
Seasonal autoregressive integrated moving average | Patient volume at given time | Jones et al. [20] | Moderate results
Exponential smoothing | Patient volume at given time | Jones et al. [20] | Moderate results
Time series regression | Patient volume at given time | Jones et al. [20] | Robust results
Neural networks | LoS, Patient volume at given time | Jones et al. [20] | Mixed results
Poisson regression | LoS | Carter and Potts, Rowan et al., Wrenn et al. [3, 27, 35] | Robust results
Negative binomial regression | LoS | Carter and Potts [3] | Robust results
Conditional random fields | Mortality and readmission | Venugopalan et al. [31] | Robust results
Temporal Cox regression | Mortality and readmission | Venugopalan et al. [31] | Moderate results

To predict the LoS different data sources are used. Often for specific diseases a small predictor set is used, consisting of a combination of physiological and demographic data of patients [3]. These data sources lack generalizability since these specific physiological measurements are not captured for every patient in the general hospital population. Other studies use administrative data in combination with demographic data. This type of data source is more generalizable but lacks specific clinical parameters, which are often more informative. Some data sources consist of both types of data [19]. However, all these data sources have in common that records are about individual patients and consist of recorded visits over time. Patient data has both a static and a temporal nature. For instance, the heart rate of a patient could be measured every hour. Between these data points there could exist a relation over time which could be valuable to the predictive task. Methods which utilize the temporal nature of medical data, such as conditional random fields and temporal Cox regression, lack overall understanding of patient conditions and other related factors [31]. This raises a need for more complex predictive models which can handle multiple different types of data and can use this data to make accurate predictions.

Table 2.1 shows the current state-of-the-art predictive models in medicine. Four robust models were found. However, these models still have limitations as mentioned by the authors or are tested in very specific settings with, for example, small patient populations. The models in Table 2.1 are recreated in the benchmark if fit for our use case and will be compared on the task to predict the LoS for knee replacement surgery.

Representation learning

Ideally, to excel at a predictive task an effective representation of the data is needed. This representation can be obtained by manual selection and transformation of features, as in traditional machine learning, or in an automated way, as seen with deep learning. In the field of machine learning, learning representations of data has proved valuable for common predictive tasks [2]. Representation learning aims to learn effective representations of data regardless of the predictive task at hand. By transforming high-dimensional raw data into continuous fixed-size vectors, the underlying patterns in the data are efficiently captured.

The predictions of a predictive model are only as good as the quality of the representation of the input data. In other words: “garbage in is garbage out”. Representation learning emerged from speech recognition and signal processing, where convolutional neural networks were used [2]. In the field of natural language processing, representation learning methods are widely adopted with the use of the popular word embedding models [24].

Various methods have been used for representation learning. A simple approach is one-hot encoding [26]. One-hot encoding encodes each possible value by a single 1 in a 1 × N vector, where N equals the number of unique values or a number set by the user. However, this approach does not capture the relation between two variables, since the resulting vectors are assigned arbitrarily. The relation between two variables is often expressed by the distance between the vectors of these two variables. The relations between variables are valuable because they increase the predictive performance. With one-hot encoding the distance between the vectors of two separate variables does not represent the underlying relationship.
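The limitation of one-hot encoding described above can be shown with a short sketch. The vocabulary of procedure names is hypothetical and chosen only to illustrate the point.

```python
import math


def one_hot(value, vocabulary):
    """Encode value as a 1 x N vector with a single 1 at its index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1
    return vector


# Hypothetical vocabulary; two of the three procedures are clinically similar.
vocabulary = ["knee_replacement", "hip_replacement", "appendectomy"]
knee = one_hot("knee_replacement", vocabulary)
hip = one_hot("hip_replacement", vocabulary)
appendix = one_hot("appendectomy", vocabulary)

# Euclidean distances between the encoded procedures.
dist_knee_hip = math.dist(knee, hip)
dist_knee_appendix = math.dist(knee, appendix)
```

Every pair of distinct one-hot vectors lies at the same distance (the square root of 2), so the encoding cannot express that a knee replacement is more similar to a hip replacement than to an appendectomy.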

A common statistical approach is to inspect and transform variables manually. This requires specific domain knowledge, and it is uncertain whether the optimal representation is found. Furthermore, when the number of variables increases, this process becomes labour intensive and therefore does not scale [9]. More recent methods use neural networks or deep learning for representation learning. This shift has a twofold benefit: it promotes re-use of features, and deep learning is able to capture high-level features, which improves the performance on the predictive task [2].

Representation learning has proved to be effective when used in combination with re-use of features, better known as transfer learning [2]. Transfer learning is the application of general knowledge to specific learning tasks. It is about transferring knowledge across different predictive tasks from a shared understanding. This approach has two main advantages:

• When using a shared understanding, the generalizability increases, which leads to increased predictive performance in comparison to task-specific models.

• The amount of (labeled) data needed to train the actual prediction model is smaller, since the data is represented more effectively.

In medicine, representation learning and transfer learning are quite new and have not been explored to their full potential. However, medicine in general could benefit from this approach by learning effective representations of medical concepts. Patient data often consists of chronologically ordered visits with unordered medical concepts within these visits. In combination with the other static patient variables this becomes a complex data structure. By leveraging representation learning approaches, various data sources could be used to obtain high performance in multiple predictive tasks. However, representation learning in healthcare would require additional constraints compared to other application domains.

A requirement in medicine is that the learned representation must be interpretable because of the possible implication of a prediction for a patient. Interpretability is harder when using deep learning approaches such as recurrent neural networks. Another requirement is to have a scalable approach to learn representations over a large population. A first approach was designed by Forestier et al. [15], who used the tf ∗ idf framework to transform clinical activities into vectors and obtained good performance on six tasks while having interpretable vectors. Another approach to address these requirements adapted the Skip-Gram approach by Mikolov et al. to learn vectors for medical concepts [8, 10, 11, 25]. This technique, which also originates from the natural language processing community, seems promising for the healthcare domain.

Electronic health records store large amounts of codes, which are often not used. By modifying the Skip-Gram algorithm, effective representations of medical codes can be found. However, the proposed modifications did not use the available data to its full extent, since the temporal aspect is not used. A combination of the following properties of this data needs to be taken into account:

• Static patient variables

• The sequence of visits

• The co-occurrence of medical codes (diagnosis codes and activity codes)

An attempt to combine two of these properties, the static patient variables and the co-occurrence of medical codes, is the Med2Vec system [9]. Med2Vec is a representation learning method which combines the popular Skip-Gram algorithm by Mikolov et al. with static patient data. The Med2Vec system proved to be superior to traditional representation learning approaches in the prediction of clinical codes [9]. However, the temporal nature of the sequence of visits is not used to its full extent, since no sequence modelling is performed. This gives room for improvement.

2.4 Discussion and Conclusion

In Chapter 2, we identified factors that are associated with the LoS. Age, gender, admission date, and co-morbidity have a high impact on the LoS. Other relevant factors that are associated with the LoS can be found in Appendix A.

In addition, predictive models used in medicine are presented in this Chapter. More recent predictive models are not yet used to model the LoS. Four of the models found had robust results, as reported in the corresponding articles. However, we note that there is room for improvement by using all aspects of data related to the LoS. We identified three aspects of data which need to be addressed in order to effectively predict the LoS: static patient variables, the sequence of visits, and the co-occurrence of medical codes. Furthermore, we note that deep learning and representation learning approaches could potentially be successful in modeling the LoS. Such a predictive model should satisfy the requirements of being interpretable and scalable. An explanation of the predictive models can be found in Appendix B.


Our literature search is limited since it is a scoping literature review instead of a systematic literature review. This could potentially result in missed articles. In addition we used a subjective method to classify predictive models from the found articles as shown in Table 2.1.

This thesis will further investigate the prediction of the LoS for knee replacements. A benchmark consisting of all common predictive models and models used in the literature will be created. The models are evaluated on a data set consisting of medical declarations enriched with patient characteristics. To this end, predictive models will be trained with the three described properties in mind. Finally, recommendations will be made about predicting the LoS for knee replacements in order to compare performance between hospitals.


References

[1] Sean Barnes et al. “Real-time prediction of inpatient length of stay for discharge prioritization”. In: Journal of the American Medical Informatics Association 23.e1 (2015), e2–e10.

[2] Yoshua Bengio, Aaron Courville, and Pascal Vincent. “Representation learning: A review and new perspectives”. In: IEEE transactions on pattern analysis and machine intelligence 35.8 (2013), pp. 1798–1828.

[3] Evelene M Carter and Henry WW Potts. “Predicting length of stay from an electronic patient record system: a primary total knee replacement example”. In: BMC medical informatics and decision making 14.1 (2014), p. 26.

[4] CBS. Operaties in het ziekenhuis; soort opname, leeftijd en geslacht, 1995-2010. 2014. url: http://statline.cbs.nl/StatWeb/publication/?VW=T&DM=SLNL&PA=80386NED&LA=NL (visited on 11/29/2017).

[5] CBS. Zorguitgaven; kerncijfers. 2017. url: http://statline.cbs.nl/Statweb/publication/?DM=SLNL&PA=83037NED (visited on 05/18/2017).

[6] Mark R Chassin et al. Accountability measures: using measurement to promote quality improvement. 2010.

[7] Edward Choi et al. “Doctor ai: Predicting clinical events via recurrent neural networks”. In: Machine Learning for Healthcare Conference. 2016, pp. 301–318.

[8] Edward Choi et al. “Medical concept representation learning from electronic health records and its application on heart failure prediction”. In: arXiv preprint arXiv:1602.03686 (2016).

[9] Edward Choi et al. “Multi-layer representation learning for medical concepts”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2016, pp. 1495–1504.

[10] Youngduck Choi, Chill Yi-I Chiu, and David Sontag. “Learning low-dimensional representations of medical concepts”. In: AMIA Summits on Translational Science Proceedings 2016 (2016), p. 41.

[11] Lance De Vine et al. “Medical semantic similarity with a neural language model”. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM. 2014, pp. 1819–1822.

[12] Timothy A Dobbins et al. “Assessing measures of comorbidity and functional status for risk adjustment to compare hospital performance for colorectal cancer surgery: a retrospective data-linkage study”. In: BMC medical informatics and decision making 15.1 (2015), p. 55.

[13] A Escalante and TD Beardmore. “Predicting length of stay after hip or knee replacement for rheumatoid arthritis.” In: The Journal of rheumatology 24.1 (1997), pp. 146–152.

[14] Federation of medical specialists. Guideline Totale knieprothese. https://richtlijnendatabase.nl/richtlijn/totale_knieprothese/totale_knieprothese_-_korte_beschrijving.html. Online; accessed 21 November 2017. 2013.

[15] Germain Forestier et al. “Finding discriminative and interpretable patterns in sequences of surgical activities”. In: Artificial Intelligence in Medicine 82 (2017), pp. 11–19.

[16] Henrik Husted, Gitte Holm, and Steffen Jacobsen. “Predictors of length of stay and patient satisfaction after hip and knee replacement surgery: fast-track experience in 712 patients”. In: Acta orthopaedica 79.2 (2008), pp. 168–173.

[17] Anders Boeck Jensen et al. “Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients”. In: Nature communications 5 (2014).

[18] Fei Jiang et al. “Artificial intelligence in healthcare: past, present and future”. In: Stroke and Vascular Neurology (2017), svn–2017.

[19] Alistair EW Johnson et al. “MIMIC-III, a freely accessible critical care database”. In: Scientific data 3 (2016).

[20] Spencer S Jones et al. “Forecasting daily patient volumes in the emergency department”. In: Academic Emergency Medicine 15.2 (2008), pp. 159–170.

[21] Uri Kartoun et al. “The MELD-Plus: A generalizable prediction risk score in cirrhosis”. In: PloS one 12.10 (2017), e0186301.

[22] Elena Kulinskaya, Diana Kornbrot, and Haiyan Gao. “Length of stay as a performance indicator: robust statistical methodology”. In: IMA Journal of Management Mathematics 16.4 (2005), pp. 369–381.

[23] Jan Mainz. “Defining and classifying clinical indicators for quality improvement”. In: International Journal for Quality in Health Care 15.6 (2003), pp. 523–530.

[24] Tomas Mikolov et al. “Efficient estimation of word representations in vector space”. In: arXiv preprint arXiv:1301.3781 (2013).

[25] Jose Antonio Minarro-Gimenez, Oscar Marin-Alonso, and Matthias Samwald. “Exploring the application of deep learning techniques on medical text corpora.” In: Studies in health technology and informatics 205 (2014), pp. 584–588.

[26] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.

[27] Michael Rowan et al. “The use of artificial neural networks to stratify the length of stay of cardiac patients based on preoperative and initial postoperative factors”. In: Artificial Intelligence in Medicine 40.3 (2007), pp. 211–221.

[28] Meyer Saklad. “Grading of patients for surgical procedures”. In: Anesthesiology: The Journal of the American Society of Anesthesiologists 2.3 (1941), pp. 281–284.

[29] IDM Smith et al. “Pre-operative predictors of the length of hospital stay in total knee replacement”. In: Bone & Joint Journal 90.11 (2008), pp. 1435–1440.

[30] Petsunee Thungjaroenkul, Greta G Cummings, and Amanda Embleton. “The impact of nurse staffing on hospital costs and patient length of stay: a systematic review”. In: Nursing Economics 25.5 (2007), p. 255.

[31] Janani Venugopalan et al. “Combination of static and temporal data analysis to predict mortality and readmission in the intensive care”. In: Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE. IEEE. 2017, pp. 2570–2573.

[32] Ilona Willempje Maria Verburg et al. “Which models can I use to predict adult ICU length of stay? A systematic review”. In: Critical care medicine 45.2 (2017), e222–e231.


[33] Ilona WM Verburg et al. “Comparison of regression methods for modeling intensive care length of stay”. In: PloS one 9.10 (2014), e109684.

[34] Alexander M Weinstein et al. “Estimating the burden of total knee replacement in the United States”. In: The Journal of bone and joint surgery. American volume 95.5 (2013), p. 385.

[35] Jesse Wrenn et al. “Estimating patients length of stay in the emergency department with an artificial neural network”. In: AMIA Annual Symposium Proceedings. Vol. 2005. American Medical Informatics Association. 2005, p. 1155.


Chapter 3


This Chapter will discuss the materials and methods employed during this thesis. First, the data sources used and the patient population studied are presented. Then the data processing approach and data pipeline are discussed. Furthermore, an analysis for finding associations between the variables and the outcome is shown. The final sections explain the set-up used for the creation of the benchmark of predictive models. This includes model evaluation, the evaluation measures used, the prediction timing, and the prediction models themselves.

3.1 Data sources and study cohort

Processed and standardized data from multiple Electronic Health Records (EHR) is used in this thesis. The source data is first stored and processed by LOGEX, a health analytics company in the Netherlands. The original data sources reside in the EHR systems of local hospitals. Data is extracted from the EHRs, de-identified, and then standardized using mappings, processing rules, and an extensive manual data validation.

LOGEX has data from 90% of all hospitals in the Netherlands, which amounts to about 60 hospitals. Most of these hospitals are traditional second line of care hospitals and provide regular or specialized care. The data delivery of these hospitals consists of data of the three most recent years. This data is delivered by the hospital to LOGEX on a quarterly basis. The hospital provides three tables with patient data, registered activities, and registered care products. None of the data can be traced back to individual patients. All care related activities need to be registered in the EHR for billing purposes, which makes the EHR a complete data source for research.

For the purpose of this thesis, data from 2015 to 2017 is extracted from the database of LOGEX and further processed to fit the needs of predictive modeling. The datasets contain patient demographics, provider orders, diagnoses, procedures, and expensive medications from all inpatient and outpatient encounters. No free text or unstructured data is available for analysis. The selected patients had at least one care product with knee replacement surgery between 2015 and 2017 in the Netherlands and did not die during the care trajectory or immediately afterwards.

By combining data from multiple hospitals and processing it using mappings, processing rules, and an extensive manual data validation, LOGEX makes sure the data quality is high. Although errors can occur, registration of activities is likely to be highly accurate for multiple reasons. First, there is a legal obligation to invoice this data to insurance companies for reimbursement. Second, hospitals use this data to analyze their practice and therefore have a clear benefit from correct registrations. Finally, there are many standard and random checks performed by hospitals and third parties, often including automated detection of deviant combinations of declarations. The database of LOGEX was used before in scientific research [14]. For this study no records were removed, because it is assumed that the data is correct. Another argument for not removing any data is that a prediction model in production also needs to cope with the possibility of a registration error as input.

Another problem we have to cope with is missing data. Initial analysis showed that 6% of all data points for three specific variables are missing. This is due to the data delivery process by the hospitals. In 2015, specific data for the operating theatre were added to the normal data delivery process. However, the hospitals did not adapt to this new specification at the same point in time. This results in a small portion of missing data in our data set. As with the registration errors, we decided to keep the missing data. The missing data was coded as minus one in the data set. An overview of the variables we used and their definitions can be found in Appendix D.

3.2 Data representation and processing

In order to process the LOGEX data a pipeline is built. The pipeline extracts, transforms, and processes data from the LOGEX SQL database and loads the result into a Mongo NoSQL database. We chose a NoSQL database because its schema freedom makes it easier to create a representation of the data that fits the needs of machine learning. An overview of the pipeline is shown in Figure 3.1 and will be further explained in the next section.

Figure 3.1: Pipeline used for data representation and processing

Overview of pipeline

Figure 3.1 shows two main components: the LOGEX production environment and the system used for this thesis. With custom Python code written by the author of this thesis, the data is mapped between these two components. The relational table format of the input data is transformed into a hierarchical structure where the patient is at the top level. One level below the top level the care trajectories within a hospital are mapped. Each trajectory consists of multiple activities performed by a medical professional. This is the lowest level within our hierarchy.
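The mapping from relational rows to the patient-centred hierarchy can be sketched as follows. This is a minimal illustration and not the pipeline code of this thesis: the field names (`patient_id`, `trajectory_id`, `activity_code`, `date`) and the example rows are assumptions made for the example.

```python
# Hypothetical flat rows as they might come from a relational activity table.
rows = [
    {"patient_id": "p1", "trajectory_id": "t1", "activity_code": "A2", "date": "2016-03-05"},
    {"patient_id": "p1", "trajectory_id": "t1", "activity_code": "A1", "date": "2016-03-01"},
    {"patient_id": "p1", "trajectory_id": "t2", "activity_code": "B1", "date": "2016-06-10"},
    {"patient_id": "p2", "trajectory_id": "t3", "activity_code": "A1", "date": "2017-01-02"},
]

def build_patient_documents(rows):
    """Group flat rows into patient -> care trajectories -> activities."""
    patients = {}
    for row in rows:
        patient = patients.setdefault(
            row["patient_id"],
            {"patient_id": row["patient_id"], "trajectories": {}},
        )
        trajectory = patient["trajectories"].setdefault(
            row["trajectory_id"],
            {"trajectory_id": row["trajectory_id"], "activities": []},
        )
        trajectory["activities"].append(
            {"code": row["activity_code"], "date": row["date"]}
        )

    documents = []
    for patient in patients.values():
        trajectories = list(patient["trajectories"].values())
        for trajectory in trajectories:
            # ISO dates sort chronologically when compared as strings.
            trajectory["activities"].sort(key=lambda a: a["date"])
        documents.append(
            {"patient_id": patient["patient_id"], "trajectories": trajectories}
        )
    return documents

documents = build_patient_documents(rows)
```

Each resulting dictionary has the shape of one document in a document store, with the patient at the top level and the activities at the lowest level.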

From the Mongo NoSQL database variables are selected and extracted. These variables are stored in a comma separated file for use and reuse by the predictive models. The variables are also used to describe our research population as shown in Table 4.1 in Chapter 4. Table 4.1 describes the variables as either categorical variables or continuous variables. Categorical variables are described with a count and a percentage of the total population. Continuous variables are first checked for normality and then described with a mean and a standard deviation if normally distributed, or a median and an interquartile range otherwise.

In Figure 3.1 we make a distinction between traditional models and deep learning models. This distinction is needed since deep learning methods are able to process unstructured data while traditional machine learning requires data in a tabular format. This distinction is purely based on the input to the model. The traditional models and deep learning models used will be described in further detail in Section 3.6.

The following sections will address the approach to identify factors associated with the LoS and the creation of the benchmark of predictive models. First the statistical analysis is explained. Then we describe our evaluation measures and the set-up of our benchmark of the predictive models. Finally, the predictive models themselves will be described.

3.3 Statistical analysis of factors associated with the LoS

To identify factors that are independently associated with the LoS, a regression analysis is used. In this thesis we consider a factor to be a variable that is extracted, and often calculated, from the original data source and that is not part of the baseline model. A baseline model to correct for case-mix and confounding was built. This baseline model used age, sex, and Charlson score as covariates [5]. Age and sex are common confounding variables. We decided to also include the Charlson score as a covariate, since it is a measure of co-morbidity, which directly influences the LoS and other variables. All available variables, illustrated in Table 4.1, were added separately to the baseline model. The model was checked for multicollinearity, which was not found. Odds ratios and confidence intervals were reported. A p-value below 0.05 was considered significant. This analysis was performed in the R 3.4.2 statistical language [12].

3.4 Model evaluation and analysis

Predictive models can be evaluated with three different approaches namely internal validation, temporal validation, and external validation [1]. A common approach is internal validation where the data set is split into a train and a test set. The predictive model is developed on the train set and evaluated on the test set. Often methods such as bootstrapping or cross-validation are used. Temporal validation is similar to internal validation with the addition that a data set is split based on time. In this case the predictive model is evaluated on future data. The last approach is external validation which uses another data set, for example another hospital or another patient population. In this work we employ a customized internal validation approach.

All predictive models are compared under the same regime, which is the main principle of our evaluation approach. This leads to results which are comparable across all models and can be generalized. The patient data stored in the Mongo database is randomly split into a train and a test set. Two thirds of the data is assigned to the training set and the remaining third forms the test set. The performance of the models is reported on the test set. To prevent overfitting and to obtain a reliable performance estimate, the models are developed on the training set and the test set is only used for final evaluation. The training set is used to optimize the predictive models and to find the optimal configurations. This is done with five-fold cross-validation. Appendix E shows the tested configurations for the predictive models. The optimal configuration is defined as the configuration with the lowest average error score over the five folds. When the optimal configuration is found, this model is evaluated on the test data, and the result is stored in the benchmark as the final measure of the performance of the model.
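The evaluation regime described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the benchmark code: the toy LoS values, the two "configurations" (predicting the mean or the median), and the helper names are all hypothetical, and the mean absolute error stands in for the error score.

```python
import random
import statistics

def train_test_split(items, test_fraction=1 / 3, seed=0):
    """Randomly assign one third of the items to the test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n))
        yield training, validation
        start += size

def select_configuration(train_items, configurations, fit, error, k=5):
    """Return the configuration with the lowest mean error over k folds."""
    best_config, best_score = None, float("inf")
    for config in configurations:
        scores = []
        for train_idx, val_idx in k_fold_indices(len(train_items), k):
            model = fit([train_items[i] for i in train_idx], config)
            scores.append(error(model, [train_items[i] for i in val_idx]))
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score

# Toy LoS values in days (assumption, for illustration only).
los_days = [2, 3, 3, 4, 2, 5, 3, 14, 4, 3, 2, 6, 3, 4, 21,
            3, 2, 4, 5, 3, 3, 2, 4, 3, 9, 3, 4, 2, 5, 3]

def fit(values, config):
    """A trivial 'model': a constant prediction for every patient."""
    return statistics.mean(values) if config == "mean" else statistics.median(values)

def error(model, values):
    """Mean absolute error of the constant prediction."""
    return sum(abs(v - model) for v in values) / len(values)

train, test = train_test_split(los_days)
best_config, cv_error = select_configuration(train, ["mean", "median"], fit, error)
# The test set is touched exactly once, after the configuration is fixed.
test_error = error(fit(train, best_config), test)
```

Keeping the test set out of the configuration search is what makes the reported test error an honest estimate.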

Evaluation Measure

In this thesis we predict the LoS for patients undergoing knee replacement surgery. To this end, the predictive models predict the number of days after surgery until a patient is discharged. Our main evaluation measure is the Mean Squared Log Error (MSLE) shown in Equation 3.1, where y denotes the ground truth and ŷ denotes the prediction. This measure is common for regression problems and penalizes errors relative to the size of the value: an error of a given size is penalized heavily when the values are small and only lightly when they are large. In other words, the larger the predicted value, the larger the error margin may be. For clinical purposes it is mainly of interest that a model predicts precisely within the first fourteen days after surgery. If the patient stays longer than fourteen days, the precise number of days is less important, since clinicians know that this patient will occupy a bed for a longer term. In addition to the MSLE, some other measures are also reported, such as the Mean Absolute Error and the Median Absolute Error.

$$\mathrm{MSLE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left( \log(1 + y_i) - \log(1 + \hat{y}_i) \right)^2 \tag{3.1}$$
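A direct implementation of Equation 3.1 makes the scale dependence of the MSLE concrete. The sketch below uses only the standard library; the example values are illustrative and not taken from the data set.

```python
import math

def msle(y_true, y_pred):
    """Mean Squared Log Error, following Equation 3.1."""
    n = len(y_true)
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / n

# The same absolute error of two days, once on a short and once on a long stay.
short_stay_error = msle([2], [4])
long_stay_error = msle([20], [22])

# The short-stay error is penalized far more heavily than the long-stay error.
print(short_stay_error, long_stay_error)
```

The use of `log1p` keeps the measure defined for a LoS of zero days.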

In order to quantify the uncertainty of the evaluation measure we use bootstrapping to calculate confidence intervals. The whole test set is re-sampled using 1000 iterations. The confidence interval is calculated using the 2.5 and the 97.5 percentile from the resulting evaluation measures.
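The bootstrap procedure can be sketched as follows. This is a minimal illustration, assuming resampling with replacement of (truth, prediction) pairs and a simple nearest-rank percentile; the function names and toy values are hypothetical.

```python
import random

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, measure, iterations=1000, seed=0):
    """Percentile confidence interval of `measure` over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(iterations):
        # Resample patient indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(measure([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    # Nearest-rank approximation of the 2.5 and 97.5 percentiles.
    lower = scores[int(0.025 * iterations)]
    upper = scores[min(int(0.975 * iterations), iterations - 1)]
    return lower, upper

# Toy test-set values in days (assumption, for illustration only).
y_true = [2, 3, 4, 3, 5, 2, 6, 3, 4, 3]
y_pred = [3, 3, 3, 4, 4, 2, 3, 3, 5, 2]

lower, upper = bootstrap_ci(y_true, y_pred, mean_absolute_error)
```

Any evaluation measure with the same two-argument signature, such as an MSLE function, can be passed in place of `mean_absolute_error`.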

3.5 Prediction timing

The developed predictive models for the LoS are only beneficial at certain points in time. For instance, when we know how long a patient will stay before a patient enters the hospital, we could use this knowledge to influence our decision making. However, at this point in time there is only a limited set of variables available to predict with. We decided that the right moment to predict is immediately after surgery. At this point in time we have the most information about the patient and a prediction is very useful for logistic processes.

The nature of this study was retrospective because we used enriched claims data from the EHR. We collected as much data for each patient as possible and ordered it chronologically. All data after the target surgery is removed, in line with our prediction timing.
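Removing information that would not be available at prediction time can be sketched as a simple filter. The record layout and activity names are hypothetical, and treating activities on the day of surgery as available is an assumption of this sketch.

```python
from datetime import date

def truncate_at_surgery(activities, surgery_date):
    """Keep activities up to and including the surgery date, in chronological order.

    Including same-day activities is an assumption of this sketch.
    """
    kept = [a for a in activities if a["date"] <= surgery_date]
    return sorted(kept, key=lambda a: a["date"])

# Hypothetical activity records for one patient.
activities = [
    {"code": "follow_up", "date": date(2016, 5, 20)},
    {"code": "intake", "date": date(2016, 4, 1)},
    {"code": "knee_replacement", "date": date(2016, 5, 10)},
]

history = truncate_at_surgery(activities, date(2016, 5, 10))
```

Applying the cut-off before any feature is computed prevents information from after the surgery from leaking into the prediction.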


3.6 Predictive models

A total of five traditional predictive models were included in the benchmark. This includes some of the models found in the literature as shown in Table 2.1 in Chapter 2. Unfortunately, not all predictive models presented in Table 2.1 could be implemented and used in the benchmark. This is due to the use of other types of data (such as time series) or incomplete explanations of the exact predictive model. A linear regression model, a decision tree regressor, a random forest regressor, and a support vector regressor were implemented with Python version 3.5 using Scikit-Learn [11]. In addition, a gradient boosting regressor was implemented using the XGBoost package [2]. In order to make comparisons between models fairer, we tune the predictive models. For linear regression we use a backward stepwise selection of variables. For the other models we found the optimal hyperparameter configurations using exhaustive grid search with predefined hyperparameters. This results in a better fit with the data set. The list of predefined hyperparameters can be found in Appendix E in Table 3.

Attention based recurrent deep neural network

In addition to the conventional models found in the literature and other traditional machine learning models, we developed a deep neural network. This section will explain the model itself and the rationale behind some design choices that were made.

Further inspection of the data reveals that the data consists of three levels of interest as shown in Figure 3.2. Care activities can be seen as the first level of interest. Care activities are simple time-stamped activities performed by a medical professional. More complex actions such as operations consist of multiple activities. This data is rich in volume, since many activities are performed within a care trajectory. Care trajectories are the middle level of interest. A care trajectory consists of multiple care activities in chronological order. Care trajectories are always related to a health problem of the patient, coded as a diagnosis. The highest level of interest is the patient, who has multiple care trajectories. The three proposed levels all have their own set of variables which may have predictive value for patient level outcomes.


The rationale behind our model is to use these three levels of interest. The field of representation learning as described in Chapter 2 concerns itself with learning representations of complex structures. The most used example can be found in natural language processing, which also has multiple levels of interest. For example, word level, sentence level, and document level are commonly used levels of interest. A commonly used paradigm is the “embed, encode, attend, and predict” paradigm [7]. The embed step transforms a concept, such as a word, into a numerical vector representation. The encode step combines the vectors of all concepts that together make up a higher level concept, such as a sentence, into a matrix representation. The attend step reduces this matrix to a single numerical vector. During this process the importance of each lower level concept is quantified in the resulting numerical vector. The predict step takes the numerical vector from the attend step as input and produces the final prediction. The steps can be repeated to represent even more complex and nested data structures. We used this paradigm to conceptualize our idea and to develop our model. Compared with a task such as document classification, the care activities can be seen as words, the care trajectories as sentences, and the patient as a document.

Starting at the first level of interest, we used a predictive model to embed the individual care activities into vectors. A skip-gram model is trained over the whole collection of care trajectories, from which it learns representations for care activities [9]. We chose to use the parameters recommended in the paper by Mikolov et al. The resulting model was qualitatively evaluated by a medical professional. This was done by selecting a few care activities and inspecting the most related care activities according to the trained model. This resulted in an adjustment of the window parameter to 5 care activities: related activities can be separated by some time, but if the time between care activities is too large, the model suggests irrelevant care activities for the care activity at hand. With the embedding model we are now able to effectively embed care activities into vectors of fixed length. This means we can represent our first level of interest.
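The role of the window parameter can be illustrated by generating the (target, context) pairs on which a skip-gram model is trained. This is a simplified sketch: it uses a fixed symmetric window and hypothetical activity codes, whereas practical implementations typically also subsample frequent items and vary the effective window.

```python
def skipgram_pairs(trajectory, window=5):
    """Return (target, context) pairs for one chronologically ordered trajectory."""
    pairs = []
    for i, target in enumerate(trajectory):
        lo = max(0, i - window)
        hi = min(len(trajectory), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, trajectory[j]))
    return pairs

# Hypothetical care activity codes within a single care trajectory.
trajectory = ["intake", "xray", "surgery", "physio"]

narrow = skipgram_pairs(trajectory, window=1)
wide = skipgram_pairs(trajectory, window=5)
```

With `window=1` only directly adjacent activities are paired; with `window=5` every activity in this short trajectory is paired with every other one, which shows how a larger window relates activities that lie further apart in time.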

To effectively represent our second level of interest we need to combine the care activity vectors into a single vector. This can be done in multiple ways. A straightforward method is to use simple arithmetic on the vectors, such as taking their average. However, this leads to a loss of relational information between the care activity vectors. For example, the chronological order is lost, and care activities that are common and performed multiple times a day are overrepresented. To tackle this problem, we use an attention mechanism in combination with a bidirectional Gated Recurrent Unit (GRU) [3, 13, 16]. First, the matrix of care activity vectors is passed to the bidirectional GRU layer. Because of its bidirectional property, this layer “reads” the matrix both from start to end and from end to start. For every vector, the GRU layer creates a new vector by using the relevant features it memorizes from earlier seen vectors. The result is again a matrix representation, which is passed to the attention mechanism. An attention mechanism scores the vectors by using a linear layer. The scores are transformed with a tanh and a softmax function, resulting in attention weights within the range of zero and one. The attention mechanism we use is shown in Equation 3.2, Equation 3.3, and Equation 3.4. The attention weights are used to combine the GRU outputs into a single numerical vector, which is input to the prediction network.

$$\alpha_t = \mathrm{softmax}(e_t) \tag{3.3}$$

$$o = \sum_t \alpha_t h_t \tag{3.4}$$
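The attention step can be sketched in plain Python. Equations 3.3 and 3.4 are followed as given; the scoring step is an assumption of this sketch, namely a linear layer followed by a tanh, e_t = tanh(w · h_t + b), as suggested by the description in the text. The weights and hidden states are made-up numbers.

```python
import math

def attention(hidden_states, weights, bias=0.0):
    """Combine a sequence of vectors h_t into one vector o.

    Assumed scoring step: e_t = tanh(w . h_t + b).
    Equation 3.3: alpha_t = softmax(e_t).
    Equation 3.4: o = sum_t alpha_t * h_t.
    """
    scores = [
        math.tanh(sum(w * h for w, h in zip(weights, h_t)) + bias)
        for h_t in hidden_states
    ]
    # Numerically stable softmax over the time steps.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the hidden states, dimension by dimension.
    dim = len(hidden_states[0])
    output = [
        sum(a * h_t[d] for a, h_t in zip(alphas, hidden_states))
        for d in range(dim)
    ]
    return alphas, output

# Hypothetical GRU outputs for three care activities, each of dimension 2.
hidden_states = [[0.1, 0.3], [0.9, -0.2], [0.4, 0.4]]

alphas, output = attention(hidden_states, weights=[1.0, 0.5])
uniform_alphas, mean_vector = attention(hidden_states, weights=[0.0, 0.0])
```

With all-zero weights every score is equal, so the attention weights become uniform and the output reduces to the plain average that the attention mechanism is meant to improve upon.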

With the attention vector we have an input to make a final prediction. This is done in the prediction network which consists of two layers. The first layer has a number of neurons equal to the input while the second and last layer has only one neuron to make the final prediction. In between these layers we put a rectified linear function (ReLU) which is a common nonlinear transformation function.

The prediction model was trained in batches of 500 patients for each epoch with a total of 1000 epochs. A loss function using the MSLE as shown in Equation 3.1 was implemented and used for training and validation. We employed early stopping when the model did not improve after 15 epochs. Clipping of the gradient by 25% was used to prevent the vanishing or exploding gradient problem. The whole model is shown in Figure 3.3.
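The early stopping rule can be sketched independently of the network itself. The function below is a hypothetical helper that operates on a recorded list of validation losses, one per epoch, and reports where training would have stopped with a patience of 15 epochs; the loss values are made up.

```python
def early_stopping_epoch(validation_losses, patience=15):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has not improved for `patience`
    consecutive epochs; epochs are counted from zero.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(validation_losses) - 1, best_epoch

# Made-up validation losses: improvement for three epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.9] * 20

stop_epoch, best_epoch = early_stopping_epoch(losses, patience=15)
```

In practice the model parameters from the best epoch, rather than from the stopping epoch, would be the ones to keep.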


Bayesian approach to quantify uncertainty

When making predictions for future events there is always uncertainty. In general, this prediction uncertainty can be divided into two types, namely model uncertainty and inherent noise. Model uncertainty is defined as the uncertainty of the model parameters and can be reduced with more data. Inherent noise is defined as uncertainty in the data generation process and can therefore not be reduced. This section presents the methods we use in this thesis to quantify prediction uncertainty using a Bayesian approach.

Bayesian methods can be useful to quantify prediction uncertainty by using infer-ence. Instead of points estimates as model parameters, we specify a parameterized family of distributions for the resulting generative model. The generative model is then optimized by moving the specified model prior towards the posterior using the observed data. The optimization is automatized by using automatic differentiation. In order to address the uncertainty, we retrained the two final linear layers of the model using a Bayesian approach and we add a softplus activation function to the final layer to ensure results greater than zero. Instead of point estimates of weights we used a Gaussian distribution of weights which results in a distribution as output. We used a mean of zero and a scale of 10 as prior for the Gaussian distributions. With a wide prior the search space becomes large and the possibility to find an optimal solution increases. We modeled the output distribution as a Poisson distribution since the LoS has a skewness to the right. In other words: for each patient a distribution of LoS days is predicted instead of just a number of days. With this technique we can say something about the uncertainty of the prediction. For example, with a Bayesian model we can predict a value of 5 days with a lower bound of 4 and a higher bound of 7 days. This Bayesian model is optimized with the evidence lower bound (ELBO) loss as shown in Equation 3.5 with a learning rate of 0.001 and a batch size of 500 patients for each epoch. A total of 1000 epochs was used to train the model. The ELBO loss function can be easily estimated using Monte Carlo estimates from the prior distributions. Equation 3.6 shows that the gap between the log evidence and the ELBO is given by the Kullback-Leibler divergence between the prior and the posterior. The Kullback-Leibler divergence is a popular metric to quantify similarity between two distributions. 
The prediction model described in Figure 3.3 was implemented using PyTorch [10]. The Bayesian variant of the attention-based recurrent neural network was built with Pyro [15], a probabilistic programming framework on top of PyTorch.
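To illustrate how Gaussian weights and a Poisson output lead to a predicted range rather than a single number, the following minimal sketch draws weights, applies the softplus, and samples LoS counts. It uses only the Python standard library and is not the Pyro implementation used in this thesis; the feature values, weight means, weight scales, and bias are hypothetical, and the function names are our own. The sketch also splits the predictive variance into a noise part and a model part, using the fact that the variance of a Poisson distribution equals its rate.

```python
import math
import random
import statistics


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def predictive_summary(features, weight_means, weight_scales, bias,
                       n_samples=5000, seed=0):
    """Summarize the predictive LoS distribution for one (hypothetical) patient."""
    rng = random.Random(seed)
    rates, counts = [], []
    for _ in range(n_samples):
        # One draw of the Gaussian weights corresponds to one plausible model.
        weights = [rng.gauss(m, s) for m, s in zip(weight_means, weight_scales)]
        rate = softplus(sum(w * x for w, x in zip(weights, features)) + bias)
        rates.append(rate)
        # The Poisson draw adds the inherent noise on top of the model uncertainty.
        counts.append(sample_poisson(rate, rng))
    counts.sort()
    return {
        "median": statistics.median(counts),
        # Empirical 5% and 95% quantiles of the sampled LoS counts.
        "interval": (counts[int(0.05 * n_samples)], counts[int(0.95 * n_samples) - 1]),
        # Law of total variance for a Poisson output:
        # inherent noise = E[rate], model uncertainty = Var[rate].
        "noise_variance": statistics.fmean(rates),
        "model_variance": statistics.pvariance(rates),
    }


# Hypothetical inputs, chosen only so that the expected LoS is around 5 days.
summary = predictive_summary(
    features=[1.0, 0.5],
    weight_means=[3.0, 2.0],
    weight_scales=[0.3, 0.3],
    bias=1.0,
)
print(summary)
```

In this sketch the model variance shrinks when the weight scales shrink, which mirrors the statement that model uncertainty can be reduced with more data, while the noise variance stays tied to the predicted rate.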

$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(z)}\left[\log p_\theta(x, z) - \log q_\phi(z)\right] \tag{3.5}$$
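As a small numerical illustration of Equation 3.5, the sketch below estimates the ELBO by Monte Carlo sampling for a toy model in which the log evidence is known in closed form. The toy model (a standard normal latent variable z and an observation x that is normally distributed around z with unit variance) is our own assumption and is unrelated to the LoS model; it is chosen because its posterior is exactly Gaussian with mean x/2 and variance 1/2. The gap between the log evidence and the ELBO vanishes when the variational distribution equals this posterior and is positive otherwise.

```python
import math
import random


def log_normal(value, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (value - mean) ** 2 / (2 * std ** 2)


def elbo_estimate(x, q_mean, q_std, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)], with z sampled from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(q_mean, q_std)
        # Toy joint density: z ~ N(0, 1) and x | z ~ N(z, 1).
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        total += log_joint - log_normal(z, q_mean, q_std)
    return total / n_samples


x = 1.5  # hypothetical observation
# Marginally x ~ N(0, 2), so the log evidence is available exactly.
log_evidence = log_normal(x, 0.0, math.sqrt(2.0))

# Variational distribution equal to the prior: a poor approximation.
gap_poor = log_evidence - elbo_estimate(x, q_mean=0.0, q_std=1.0)
# Variational distribution equal to the exact posterior N(x / 2, 1 / 2).
gap_exact = log_evidence - elbo_estimate(x, q_mean=x / 2, q_std=math.sqrt(0.5))

print(f"gap with q = prior:     {gap_poor:.3f}")
print(f"gap with q = posterior: {gap_exact:.3f}")
```

Maximizing the ELBO over the parameters of the variational distribution therefore pushes that distribution towards the posterior, which is the optimization performed when training the Bayesian layers.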


References

[1] Douglas G Altman et al. “Prognosis and prognostic research: validating a prognostic model”. In: BMJ 338 (2009), b605.

[2] Tianqi Chen and Carlos Guestrin. “XGBoost: A scalable tree boosting system”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2016, pp. 785–794.

[3] Kyunghyun Cho et al. “Learning phrase representations using RNN encoder-decoder for statistical machine translation”. In: arXiv preprint arXiv:1406.1078 (2014).

[4] Alexandre De Brebisson et al. “Artificial neural networks applied to taxi destination prediction”. In: arXiv preprint arXiv:1508.00021 (2015).

[5] Timothy A Dobbins et al. “Assessing measures of comorbidity and functional status for risk adjustment to compare hospital performance for colorectal cancer surgery: a retrospective data-linkage study”. In: BMC Medical Informatics and Decision Making 15.1 (2015), p. 55.

[6] Cheng Guo and Felix Berkhahn. “Entity Embeddings of Categorical Variables”. In: arXiv preprint arXiv:1604.06737 (2016).

[7] Matthew Honnibal. Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models. https://explosion.ai/blog/deep-learning-formula-nlp. (Retrieved April 2018). 2016.

[8] Alistair EW Johnson et al. “MIMIC-III, a freely accessible critical care database”. In: Scientific data 3 (2016).

[9] Tomas Mikolov et al. “Efficient estimation of word representations in vector space”. In: arXiv preprint arXiv:1301.3781 (2013).

[10] Adam Paszke et al. “Automatic differentiation in PyTorch”. In: NIPS-W. 2017.

[11] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[12] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2017. url: https://www.R-project.org/.

[13] Colin Raffel and Daniel PW Ellis. “Feed-forward networks with attention can solve some long-term memory problems”. In: arXiv preprint arXiv:1512.08756 (2015).

[14] Newel Salet et al. “Is Textbook Outcome a valuable composite measure for short-term outcomes of gastrointestinal treatments in the Netherlands using hospital information system data? A retrospective cohort study”. In: BMJ open 8.2 (2018), e019405.

[15] Uber. Pyro: Deep Universal Probabilistic Programming. http://pyro.ai/. (Retrieved April 2018). 2017.

[16] Ashish Vaswani et al. “Attention is all you need”. In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.


Chapter 4


In this chapter we present the results of our study. We first present some general findings, followed by the factors that are independently associated with the LoS. The final section shows the results of the comparison of the predictive models.

4.1 General findings

A total of 34,393 patients from 54 hospitals in the Netherlands were included in this study. We summarize the tabular patient data in Table 4.1. First, we notice that the LoS has a skewed distribution. The LoS has a median of 4.00 days with an interquartile range of 3.00 to 5.00 days, while the quarter of patients with the longest stays has a LoS of 5.00 days or more. This is visualized in Figure 4.1, where the tail to the right is clearly visible. The distribution is visually similar to Figure 1.1 shown in Chapter 1. In addition to the LoS, we notice that the ASA score has two frequently occurring categories: patients with an ASA score of 0 (28.4%) and patients with an ASA score of 2 (48.8%) together make up the majority of the group. The number of care activities before the operation varies widely. Some patients are operated on quickly, while others undergo a wide range of blood tests or other diagnostic tests before surgery. In most of the surgeries (81.7%) two prostheses were placed. The Charlson score is not normally distributed and has a median of 2, which indicates that most patients have some kind of comorbidity. The average age of patients undergoing knee replacement surgery is 71.56 years and most patients are female (64.2%). A last observation from Table 4.1 concerns the distribution of surgeries over time. Surgeries are spread almost equally over the months and over the working days, whereas almost no surgeries are scheduled on Saturday (0.2%) or Sunday (0.6%).


Table 4.1: Characteristics of obtained data

Variable                                                    Dataset (n = 34393)
--------------------------------------------------------------------------------
ASA (%)
  ASA 0                                                     9751 (28.4%)
  ASA 1                                                     3050 (8.9%)
  ASA 2                                                     16791 (48.8%)
  ASA 3                                                     4732 (13.8%)
  ASA 4                                                     68 (0.2%)
  ASA 5                                                     1 (0.0%)
Operation time (mean (sd))                                  102.63 (26.88)
Number of care activities before surgery (median [IQR])     16.00 [11.00, 22.00]
Number of physician assistants (median [IQR])               0.00 [0.00, 0.00]
Number of assistants (median [IQR])                         0.00 [0.00, 4.00]
Number of prostheses (%)
  1                                                         6295 (18.3%)
  2                                                         28098 (81.7%)
Number of care trajectories before surgery (median [IQR])   4.00 [2.00, 6.00]
Presence of anesthesiologist (%)
  Not present                                               938 (2.7%)
  Present                                                   33455 (97.3%)
Charlson score (median [IQR])                               2.00 [0.00, 6.00]
Age (mean (sd))                                             71.56 (9.20)
Lifetime standardized treatment time (median [IQR])         450.00 [331.00, 675.00]
Month of surgery (%)
  January                                                   3589 (10.4%)
  February                                                  2760 (8.0%)
  March                                                     3116 (9.1%)
  April                                                     2562 (7.4%)
  May                                                       2599 (7.6%)
  June                                                      2770 (8.1%)
  July                                                      2238 (6.5%)
  August                                                    2128 (6.2%)
  September                                                 3192 (9.3%)
  October                                                   3222 (9.4%)
  November                                                  3293 (9.6%)
  December                                                  2924 (8.5%)
Standardized treatment time (median [IQR])                  330.00 [285.00, 375.00]
Number of inpatient days before surgery (median [IQR])      0.00 [0.00, 0.00]
Sex (%)
  Male                                                      12306 (35.8%)
  Female                                                    22087 (64.2%)
Cutting time during surgery (mean (sd))                     70.05 (20.07)
Time from first care activity to surgery (median [IQR])     0.00 [0.00, 43.00]
Day of the week (%)
  Monday                                                    7418 (21.6%)
  Tuesday                                                   7923 (23.0%)
  Wednesday                                                 6749 (19.6%)
  Thursday                                                  6810 (19.8%)
  Friday                                                    5199 (15.1%)
  Saturday                                                  83 (0.2%)
  Sunday                                                    211 (0.6%)
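
The percentages in Table 4.1 follow directly from the reported counts. As a minimal sketch, the snippet below recomputes the day-of-the-week shares from the counts in the table and derives the combined weekend share; the variable names are our own and the snippet is not part of the original analysis, which may have been produced differently.

```python
# Counts per day of the week, copied from Table 4.1.
day_counts = {
    "Monday": 7418,
    "Tuesday": 7923,
    "Wednesday": 6749,
    "Thursday": 6810,
    "Friday": 5199,
    "Saturday": 83,
    "Sunday": 211,
}

total = sum(day_counts.values())

# Share of surgeries per day, rounded to one decimal as in the table.
shares = {day: round(100 * n / total, 1) for day, n in day_counts.items()}

# Combined share of surgeries performed during the weekend.
weekend_share = 100 * (day_counts["Saturday"] + day_counts["Sunday"]) / total

print(total)
print(shares)
print(f"weekend share: {weekend_share:.2f}%")
```

The counts add up to the full dataset of 34,393 patients, and fewer than 1% of the surgeries take place during the weekend.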
