Using financial data to automatically prefill a clinical registration


Author: Berta Esteve Soler (Student Number: 10736662)

Supervisor: Danielle Sent (Department of Medical Informatics)

Universiteit van Amsterdam

The Netherlands

A thesis submitted in fulfillment of the requirements for the degree of Master of Science. In collaboration with X-IS Waardeert Zorg, with Wouter van Dijk and Jan van der Eijk as mentors.


Introduction: Nowadays, doctors are requested to register a lot of data in order to compute quality indicators that monitor and improve the quality of the delivered care. This computation is mostly performed manually, which is time-consuming, expensive and error-prone. Other existing registrations, for instance those used for financial purposes, could form a basis to (partially) automate the computation of the quality information. The objective of this scientific research project is to investigate this possibility. The main research question is therefore: can financial data be used to partially prefill a clinical registration?

Methods: Information on patients suffering from colorectal cancer, provided by the Dutch Institute for Clinical Auditing, and financial data provided by Performation (a company that offers benchmarking products to improve hospital productivity) were used to assess whether the clinical registration of the Dutch Surgical Colorectal Audit could be automatically prefilled. For some variables in the medical registration, the value was estimated with a classification algorithm applied to the financial data. Afterwards, those variables were grouped according to the accuracy of the prediction.

Results: Three variables could be computed reliably based on the financial data. For the other three, the relationship between both data sources was either not significant enough or non-existent, so the prediction could not be accepted.


Conclusion: Part of the clinical audit can be prefilled, with a certain error, using financial data. The approach used in this master's thesis can serve as a basis to start reducing the administrative burden. In future studies, the method can be improved in order to prefill the registration with higher quality, which can ultimately serve as a basis for doctors not only to monitor but also to deliver the best possible care.

Key words: registrations, patient data, automatizing, financial information


quality indicators that monitor the delivered care and aim to improve its quality. This computation is mostly performed manually, which is time-consuming, expensive and very error-prone. Besides the clinical registrations there are also other registrations, for example those for financial purposes, that could serve as a basis to compute the quality indicators (partially) automatically. The aim of this project is to investigate this possibility. The main research question is therefore: can financial data be used to partially prefill a medical registration?

Methods: Data of people with colorectal cancer, provided by the Dutch Institute for Clinical Auditing (DICA), and financial data, provided by Performation, a company that offers benchmarking products to increase hospital productivity, were used to determine whether the medical registration of the Dutch Surgical Colorectal Audit (DSCA) could be prefilled automatically. The value of some variables in the medical registration was determined with a classification algorithm trained on the financial data. These variables were then grouped according to the accuracy of the prediction.

Results: Three of the variables could be computed reliably from the financial data. For the other three, the relationship between the two data sources was either not significant enough or non-existent, so the prediction could not be accepted.

Conclusion: Using the financial data, the clinical audit can be partially prefilled, with a certain error. The approach of this


in order to prefill the registration with even higher quality. Such a reliable registration could ultimately serve as a basis from which doctors can not only monitor the delivered care, but also deliver the best possible care.

Key words: registrations, patient data, automation, financial information


First, I offer my sincerest gratitude to my supervisor, Danielle Sent, who has supported me throughout my scientific research project with her patience and knowledge whilst allowing me the room to work in my own way. She has impressed me with her interest in the topic and was definitely a source of motivation during the whole process. I attribute the level of my Master's degree to her encouragement and effort, and without her this scientific research project would not have been completed or written. One simply could not wish for a better supervisor. Furthermore, I would like to thank Wouter van Dijk, Jan van der Eijk and Lilli Gong for their huge support along the way. Wouter and Jan offered me the possibility to work for X-IS and introduced me to the topic. They provided me with a quality work space where much of the work was done. Wouter van Dijk followed me through the whole period, making sure I was on schedule, and he was the critical eye that always found room for improvement. Jan van der Eijk has impressed me with his enormous knowledge and has taught me a lot of data analysis techniques. They both helped me to use the data and gave useful guidance, for example by teaching me how to use SQL studio. Lilli Gong has fascinated me with her ability to use computer systems and I appreciate her unconditional willingness to help. I want to thank Ameen Abu-Hanna because he has offered me a lot of advice in order to obtain the results of this scientific research project. Mr. Zwinderman has also provided


who have willingly shared their precious time with me. Those participants are: Annelotte van Bommel, Julia van Groningen, Esmee van der Willik, Pauline Spronk and Lieke Gietelink. I would also like to mention that this scientific research project would not have been developed without the support of DICA, Performation and MRDM, which work hard every day in order to provide the patient data. Finally, I thank my parents for supporting me throughout all my studies and for providing a home in which to complete my writing up. In many ways I have learnt much from all of you. Thanks.


Abstract
Acknowledgments
Contents
List of figures
1 Introduction
2 Preliminaries
2.1 The medical condition
2.1.1 What is cancer?
2.1.2 What is colorectal cancer?
2.1.3 Causes for colorectal cancer
2.1.4 Treatment for colorectal cancer
2.1.5 Some information about this medical domain in The Netherlands
2.2 Machine learning
2.2.1 Logistic regression
2.2.2 Overfitting
3 Methods
3.1 The data sources
3.1.1 Classification of the DSCA variables
3.1.2 Selection of variables
3.2 Matching the patient data
3.3 Study of a small sample of patients
3.4 Building the predicting model for categorical variables
3.4.1 Preparing the data
3.4.2 Selecting the predictors
3.4.3 Univariate analysis with logistic regression
3.4.4 Splitting of the dataset
3.4.5 Logistic regression
3.4.6 Backward selection
3.4.7 Multiple categories
3.5 Predicting outcome
3.5.1 Steps of the prediction
4 Results
4.1 Descriptive analysis of datasets
4.1.1 The DSCA
4.1.2 Classification of the variables
4.1.3 The financial dataset
4.2 Ten patient study
4.3 Outcomes of the prediction for categorical variables
4.4 Color classification
5 Discussion
5.1 Main findings of the study
5.2 Limitations and directions for future research
5.2.1 Data quality
5.2.3 Hospital variability
5.2.4 Classification based on true positives
5.2.5 Direct variables
6 Conclusions
7 Appendix
7.1 List of abbreviations
7.2 10 patient study
7.3 Scripts used in R studio and SQL studio

Figure 2.1: Underfit, fit and overfit
Figure 3.1: Left join
Figure 3.2: Confusion matrix
Figure 3.3: ROC curve


1 Introduction

In the last 10 years the number of health-related registries has increased due to the interest in measuring quality of care within many institutions [1,2,3]. Clinical data holds the potential to help transform health care systems, offering the opportunity to accelerate progress and to monitor clinical effectiveness more usefully [4]. In a competitive health care market, hospitals have to focus on ways to streamline their processes in order to deliver high quality care while at the same time reducing costs [5]. Furthermore, the government and the health insurance companies put more and more pressure on hospitals to work as efficiently as possible, while an increase in the demand for care is expected in the future [6]. It is fair to claim that in the Netherlands, managed competition for providers and insurers is a major driver in the health care system. Insurers now negotiate with providers on price and quality, and patients choose the provider they prefer and take out the health insurance policy which best fits their situation [7]. Thus, whether legally mandatory or voluntary, patient outcomes are being registered and health care professionals have found ways to integrate the data collection process into their daily clinical routine. Collecting high quality patient data is relevant to drive tangible improvement in health care. Patients, physicians, insurers and government can benefit from reporting data outcomes in multiple ways: decision making, evaluation of performance, targeting improvement, increasing transparency, research, etc. [8].

An important registration in the Netherlands is the DOT system, which stands for Diagnosis Treatment Combination. The DOT system forces hospitals to provide an overview of the total costs of each treatment, from the first consultation until the final follow-up check after the treatment [7]. It was created to gain more insight into the care hospitals provide and is the basis upon which insurance companies negotiate contracts with hospitals [9]. Hospital care providers are obliged to provide their DOT data to the DOT information system in order to be reimbursed for the delivered care [7]. The DOTs are based on a group of activities (in Dutch: zorgactiviteiten) [45]. Examples of activities are: an MRI (magnetic resonance imaging) scan, a day in the intensive care, receiving anesthesia, undergoing surgery, etc. An activity code identifies what was done, a date tells when it was done and an amount tells how many times the activity was done [9]. The activities are the basis of funding and allow health care facilities to register the care, so that it can eventually be declared to the patient or health care provider. Based on that information, providers and purchasers can negotiate on the quality, price and number of treatments [10].
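As an illustration only, an activity record of this kind can be sketched as a small data structure. The field names (`code`, `performed_on`, `amount`, `description`) and the helper `total_amount` are assumptions made for this example and are not the actual column names of the DOT registration. The codes and descriptions follow Table 1, while the amounts and the second inpatient date are invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Activity:
    """One registered care activity (hypothetical field names)."""
    code: str           # activity code: what was done
    performed_on: date  # date: when it was done
    amount: int         # how many times the activity was done
    description: str = ""


def total_amount(activities, code):
    """Sum the registered amount of one activity code for a patient."""
    return sum(a.amount for a in activities if a.code == code)


# Codes and descriptions follow Table 1; amounts and the second
# inpatient date are illustrative assumptions.
history = [
    Activity("190204", date(2012, 8, 19), 3, "Inpatient day"),
    Activity("87042", date(2012, 7, 12), 1, "CT abdomen"),
    Activity("190204", date(2012, 8, 22), 2, "Inpatient day"),
]
print(total_amount(history, "190204"))
```

Keeping the code as a string rather than a number is a design choice of this sketch: it avoids losing leading zeros should some codes have them.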


As an example, one can zoom into the clinical history of a patient suffering from colorectal cancer and observe that multiple activities are performed and recorded, for instance laboratory tests, surgeries, days in the intensive care unit, treatments, etc. Every patient is unique and there are thousands of different activities, which can occur at different moments in time [9]. Table 1 shows some activities based on real data of patients suffering from colorectal cancer.

Table 1: Examples of activities

Description                                 Code     Date
IC respiration supplement                   190127   2011-01-15
Endoscopic resection of the rectosigmoid    35025    2011-03-08
Emergency department                        190015   2011-10-15
Cysto urethroscopy                          39859    2011-12-12
CT abdomen                                  87042    2012-07-12
Inpatient day                               190204   2012-08-19
MRI brain                                   81092    2013-06-21
Ultrasound of the heart                     39494    2013-10-28
Comprehensive Geriatric Assessment          39577    2013-11-21
Urological treatment and catheterization    39881    2014-06-12
Anesthesia monitoring                       39607    2014-11-12

Apart from this registration of financial data, there is also an important institution responsible for collecting and analyzing quality data. This institution is the Dutch Institute for Clinical Auditing (DICA), which studies specific medical domains with the objective of achieving quality improvement. DICA is the leading institute in the Netherlands that holds a valid measurement system and provides insight into the quality of medical care. The Dutch Surgical Colorectal Audit (DSCA) is one of the registries of the DICA and concerns colorectal cancer. It was initiated by


record reliable data during the process of care [11]. The DSCA was set up in 2009 to measure and improve the quality of colorectal cancer surgery, serving as both a national and an international model. All Dutch hospitals that perform colorectal cancer surgery are obliged to submit data to the DSCA register. Surgeons enter the information manually using a web form [12]. Because of the need to record information, medical practice routines need to be adjusted and health care professionals must spend part of their valuable time on the registration process [36]. More and more time is spent on administrative tasks rather than on patient care [13]. However, these registrations are requested in order to help determine the quality of care and should ensure that data is available, accessible and usable for further use [12].

In this scientific research project the activities from the financial dataset of patients having colorectal cancer were used to investigate the possibility of predicting an outcome for some variables in the clinical registration. Performation was the source of the financial data that was used to fill the DSCA. This company, located in the Netherlands (http://www.performation.com), assists health care institutions in meeting their economic and quality goals. Their mission is to support health care providers in efficiently delivering the highest quality of care per euro spent, providing patient-level costing and benchmarking products that improve processes and outcomes for successful hospital productivity. This study aims to assess whether the set of variables of the DSCA can be computed automatically based on financial data and to investigate with which accuracy the registration can be prefilled. If the information could be submitted to the doctors semi-automatically, with part of the registration already prefilled, they would only have to check it for correctness and fill in the information that could not be determined. The time saved in that way could be used for other purposes.


This thesis is organized as follows. Chapter 2 gives the preliminaries, with a background description of key concepts such as the medical domain, specific techniques and related work. The methodologies used to achieve the goal of this master's thesis are described in Chapter 3. The results, including the outcome of the prediction model built for prefilling the DSCA, are described in Chapter 4. Chapter 5 discusses the initial hypothesis that financial data can be used for other purposes than only measuring the cost of care. It also explains the main findings, the limitations of the study and recommendations for future research. Finally, the conclusion gives a summary, including the meaning of this work to others.


2 Preliminaries

In this scientific research project data of patients having colorectal cancer was used. For this reason, this medical domain is explained in this chapter. In addition, some background in machine learning is provided.

2.1 The medical condition

Colorectal cancer is the third leading cause of cancer-related deaths in the United States when men and women are considered separately, and the second leading cause when both sexes are combined. Deaths from colorectal cancer have decreased with the use of colonoscopies and fecal occult blood tests, which check for blood in the stool. Tests that examine the colon and rectum are used to detect and diagnose colon and rectum cancer [16]. It is expected that it will cause about 49,190 deaths during 2016 [14].

2.1.1 What is cancer?

Cancer is ultimately the result of cells that grow uncontrollably and do not die. Normal cells in the body follow an orderly path of growth, division, and death. Programmed cell death is called apoptosis, and when this process breaks down, cancer results. Colon cancer cells do not undergo programmed cell death, but instead continue to grow and divide.

2.1.2 What is colorectal cancer?

Colorectal cancer is cancer that starts in the colon or rectum. The colon and the rectum are parts of the large intestine, which is the lower part of the body's digestive system. Most colorectal cancers are adenocarcinomas, cancers that begin in cells that make and release mucus and other fluids. Colorectal cancer often begins as a growth called a polyp, which may form on the inner wall of the colon or rectum. Some polyps become cancer over time [14].

2.1.3 Causes for colorectal cancer

Scientists do not know exactly what causes these cells to behave this way [15]. However, according to the American Cancer Society (http://www.cancer.org), there are several risk factors that can be associated with the development of the disease, such as: age, having certain kinds of polyps, having a history of ulcerative colitis, Crohn disease, type two


can also influence [14].

2.1.4 Treatment for colorectal cancer

The main treatment option is surgery and the chance of recovery of the patient depends on the following [16]:

- Stage of the cancer (whether the cancer is in the inner lining of the colon only, has spread through the colon wall, or has spread to lymph nodes or by haematogenous metastasis to other parts of the body).

- Whether the cancer has blocked or made a hole in the colon.

- Whether there are any cancer cells left after surgery.

- Whether the cancer has recurred.

- The patient's general health.

- Age.

- Blood levels of carcinoembryonic antigen (CEA) before treatment begins. CEA is a substance in the blood that may be increased when cancer is present.

2.1.5 Some information about this medical domain in The Netherlands

- At the moment roughly 570,000 persons have cancer or have survived cancer in the Netherlands (20-year prevalence).

- In 2012, 43,666 persons died of cancer.

- In 2012, 101,210 persons got cancer: 52,735 men and 48,475 women.

- The five-year survival rate differs between the types of cancer. The overall five-year survival rate has increased up to 61 percent (2004-2009) [49].


2.2 Machine learning

Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. It is concerned with the question of how to construct algorithms that automatically improve with experience. For example, a model can be built from existing data to make predictions about future data. Machine learning algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions [17]. It is possible to discover interesting results, such as patterns, associations, changes, anomalies and significant structures, from large amounts of data stored in databases, data warehouses, or other information repositories [18].

Machine learning also refers to a set of tools for modeling and understanding complex datasets [19]. Due to the wide availability of huge amounts of data in electronic form, and the pressing need to turn such data into useful information and knowledge, machine learning has attracted a great deal of attention in the information industry in recent years [18].

Most machine learning approaches fall into one of two categories: supervised or unsupervised. Unsupervised learning uses an algorithm to draw inferences from datasets consisting of input data without labeled responses [47]. In contrast, supervised learning is the machine learning task of inferring a function from labeled training data. It fits a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference) [39]. Many classical statistical learning methods, such as logistic regression, operate in the supervised learning domain [19].


2.2.1 Logistic regression

Logistic regression is a method for classification problems, one of the main tasks in machine learning. It is mainly used to study qualitative variables [19]. For example, eye color is qualitative, taking on the values blue, brown or green. Qualitative variables are often referred to as categorical. The approach for predicting qualitative responses is known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category or class. However, the methods used for classification often first predict the probability of each of the categories of a qualitative variable, as the basis for making the classification [19]. Classification analyses a set of training data and constructs a model for each class based on the features in the data. Such a process can generate, for instance, a decision tree or a set of classification rules, which can be used for a better understanding of each class in the database and for the classification of future data. One of the most used classifiers is logistic regression [19]. When using this method, the data is usually divided into a training set and a test set: part of the data is used to train a model and the rest to test it [30,39]. In other words, the logistic regression model is tested on some data after being trained on the remaining data [48]. Given a set of training observations, a classifier can be built. This classifier has to perform well not only on the training data, but also on test observations that were not used to train the classifier.
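The two-step idea described above, first estimating a probability and then assigning a class, can be sketched as follows. This is a minimal illustration and not the model fitted in this study: the intercept, the coefficients and the 0.5 cut-off are made-up values, and the two binary predictors are hypothetical.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


def classify(probability, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if probability >= threshold else 0


# Hypothetical model with two binary predictors, e.g. "reintervention
# registered" and "intensive care day registered" (assumed, not estimated).
intercept = -2.0
coefficients = [1.5, 1.0]

for features in ([0, 0], [1, 0], [1, 1]):
    p = predict_probability(intercept, coefficients, features)
    print(features, round(p, 3), classify(p))
```

With these assumed values only the observation with both predictors present exceeds the cut-off; moving the threshold trades false positives against false negatives.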

2.2.2 Overfitting

Sometimes data models can lead to a phenomenon known as overfitting. When a model yields a small error on the training set but a large error on the test set, it is said that the data is being overfitted. This happens because the statistical learning procedure is working too hard to find patterns in the training set, and may be picking up some patterns that are just caused by random chance rather than by true properties. Thus, the supposed patterns that the method found in the training data might not exist in the test data.

Figure 2.1: Underfit, fit and overfit

At the other extreme, when a model is too simple to capture the structure of the data, the model itself is biased, and this is reflected in the fact that the data is underfitted (see a graphical example in Figure 2.1). When the model has too many free parameters, which can be adjusted to fit the training set perfectly, the data is overfitted. If a new point is added to the plot, there is a greater chance that it will be very far from the curve. In this case, it is said that the model suffers from high variance. The reason for this label is that if any of the input points are varied slightly, it could result in an extremely different model. In the middle, there is a good mid-point. It fits the data fairly well, and does not suffer from the bias and variance problems seen in the figures on either side. Bias and variance must be quantitatively identified and avoided in order to determine the best model. The division of the data described before helps to avoid overfitting, since it is unlikely that the same outliers will be present in both sets [22].
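The gap between training error and test error can be illustrated with a toy example. The sketch below is not part of the analysis in this thesis: the data is synthetic, the noise is injected deterministically, and a nearest-neighbour "memorizing" model is used purely as a stand-in for an overly flexible model.

```python
def noisy_label(x):
    """True rule is x >= 50; every 7th point is flipped to act as noise."""
    label = 1 if x >= 50 else 0
    return 1 - label if x % 7 == 0 else label


train = [(x, noisy_label(x)) for x in range(0, 100, 2)]  # even x
test = [(x, noisy_label(x)) for x in range(1, 100, 2)]   # odd x


def simple_model(x):
    """One-parameter model: a single cut-off."""
    return 1 if x >= 50 else 0


def memorizing_model(x):
    """Very flexible model: copy the label of the nearest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def error_rate(model, data):
    """Fraction of observations the model classifies incorrectly."""
    return sum(model(x) != y for x, y in data) / len(data)


for name, model in (("simple", simple_model), ("memorizing", memorizing_model)):
    print(name, error_rate(model, train), error_rate(model, test))
```

The memorizing model reproduces the training labels exactly, noise included, and therefore makes more mistakes on the held-out points than the simple cut-off does. This is the pattern of small training error and larger test error described above.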


2.3 Previous work on reuse of data

The possibility to reuse routinely recorded data has been investigated before. For example, an earlier study tried to assess the quality of the information extracted from an EMR (Electronic Medical Record) in order to compute the indicators of the colorectal cancer surgery audit. With this study, researchers from the Academic Medical Center showed that the EMR data was not suitable to compute quality indicators [12]. The EMR data included information on diagnoses, operations, admissions, encounters, pathology reports, endoscopies and medications. M. Deserno et al. have also tried to integrate image data and medical record information for a rare disease registry in Germany [26]. Both M. Deserno et al. [26] and Dentler et al. [12] found that the studied registration should be designed to guarantee interoperability and reusability of the data. Bianconi et al. [1] designed a web-based management system to integrate different sources and build an oncology network based on cancer registries. With the registration system it was possible to import routinely available data from pathology archives, screening services and hospital discharge records. They conclude that information technology contributed to shortening all phases of cancer registration, reducing the time needed to produce data. Huber et al. [27] created an algorithm to automatically register multiple three-dimensional data sets. These researchers were not the first to try to compute data automatically; it has been shown that the interoperability between datasets and the reuse of data is an important topic within many institutions. For example, Bloomrosen et al. [3] stated that biomedical research increasingly depends on the availability of aggregated health data. It has been shown that the growing volume of health data collected and stored electronically dramatically heightens the importance of data issues [3]. Health data often involves many different data custodians, such as hospitals, public health clinics, laboratories, physicians, research centers, pharmaceutical companies, third party payers, employers, registries, health information producers, and federal, state, and local government agencies [3]. While various state and federal regulations address many of these concerns, they cannot adequately deal with the full range of current or contemplated data reuse scenarios. The American Medical Informatics Association (https://www.amia.org/about-amia) has made multiple recommendations to improve health data collection, storage and exchange [3]. These show the current increased interest in the reuse of health data. This information was found in the PUBMED database using the key words: reuse of data, automatizing, registrations.


3 Methods

In this chapter the methodologies used to fulfill the goal of this scientific project are described. The aim is to assess whether the clinical registration of the DSCA can be partially prefilled based on financial data. Chapter 3 includes the description of the two datasets used: the financial dataset, containing the activities of the DOT and provided by Performation, and the medical dataset, containing the DSCA variables and provided by DICA. X-IS offered the possibility to establish the connection between both companies as well as with MRDM, which was responsible for encrypting the data.


3.1 The data sources

The data of 5140 patients suffering from colorectal cancer was used for this study. They were treated in 11 different Dutch hospitals. The information on these patients is entered by surgeons through a web form to create the DSCA registry. This registration process takes 15 to 20 minutes per patient [12]. After the data entry task, the DICA monitors the reliability of the DSCA by an annual comparison with the dataset of the Dutch Cancer Registry. For this scientific research project, the DSCA was assumed to be the reference. It nevertheless has to be considered a "silver" standard, since there is always the possibility of errors due to manual data entry. The DSCA dataset consists of 212 variables, including demographic information, diagnoses, procedures, results of pathological examinations and clinical outcome (see the data dictionary in the Appendix). The variables are a mix of categorical and numerical attributes of a patient (textual information and numerical data). Each variable in the dataset has its own optionset with values and labels. For example, the variable 'status' has the following optionset: 0 = in leven (alive), 1 = overleden (dead), 9 = onbekend (unknown). Other values are not valid for this variable. Different variables have different optionsets [11]. For this study, 96 variables were used; both subvariables and variables related to a second tumor of the patient were excluded. See some examples of variables of the DSCA in Table 2.
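To make the idea of an optionset concrete, the sketch below checks whether a value is allowed for a variable. Only the 'status' optionset is taken from the text; the dictionary layout and the function name `is_valid` are assumptions made for this illustration and do not describe how the DSCA web form is implemented.

```python
# Optionset of the DSCA variable 'status' as quoted in the text.
# The dictionary layout and function name are assumptions for illustration.
OPTIONSETS = {
    "status": {
        0: "in leven (alive)",
        1: "overleden (dead)",
        9: "onbekend (unknown)",
    },
}


def is_valid(variable, value):
    """Return True when the value belongs to the variable's optionset."""
    return value in OPTIONSETS.get(variable, {})


print(is_valid("status", 1))  # value present in the optionset
print(is_valid("status", 2))  # value absent from the optionset
```

A check of this kind could be applied to every predicted value before it is offered as a prefilled entry, so that a model can never propose a code that the registration does not accept.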


Table 2: Examples of variables of the DSCA

Variable                                         Type
Sex                                              Integer
Associated comorbidity                           Boolean
Location of the primary tumor                    Integer
Degree of spread to regional lymph nodes         Integer
Colonoscopy                                      Boolean
Date of operation                                Date
Type of operation                                Integer
Urgency of the operation                         Integer
Conversion from laparoscopy to open procedure    Integer
Complication                                     Boolean
Number of positive lymph nodes                   Integer
Radiotherapy schedule                            Integer

3.1.1 Classification of the DSCA variables

As seen in Table 2, the DSCA registration contains very diverse variables, such as dates, procedures, medical information, number of days in the hospital, and patient characteristics like age or gender. All of them were treated with the same importance and classified into three different groups: direct, proxy and not linking. An example of each group is given in the following list.

1. Direct: date of operation

2. Proxy: complication

3. Not linking: weight

A variable was considered direct when the data might be available in the activities in a structured format and could be used directly to fill in the DSCA automatically. For example, the date of operation should match the registered date of an activity that describes the operation procedure.

Unfortunately, not all the information required to prefill the DSCA registry can be found that easily, and many other variables were considered proxies. In statistics, a proxy is a variable that is not in itself directly relevant, but serves in place of an unobservable or immeasurable variable [28]. According to this definition, many activities within the financial dataset might not be relevant in themselves but can be used to predict a variable of the DSCA. A clear example of a proxy is determining whether or not a patient had a complication; the information is not directly available, but this medical event can be predicted by means of activities. Activities that can be related to a complication are: reintervention, a long stay in the hospital, the use of certain treatments, the length of stay in the intensive care, and so on. If the variable depends on one activity, it was considered a simple proxy. All variables with an optionset of more than two categories, or depending on multiple activities, were considered complex proxies. When there is no possibility to derive the variable from the financial data, the variable was classified as not linking (example: weight). Of the 96 studied variables of the DSCA, 19 are direct variables, 42 are proxies and 35 are not linking variables.
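The grouping rules above can be written down as a small decision function. This is only a sketch of the rules as stated in the text: in the study the classification was made by inspecting the variables, and the three inputs used here (`directly_available`, `related_activities`, `n_categories`) are hypothetical summaries introduced for the example.

```python
def classify_variable(directly_available, related_activities, n_categories):
    """Assign a DSCA variable to one of the groups described in the text.

    directly_available: the value can be read from an activity as-is
    related_activities: number of activities the variable depends on
    n_categories: number of values in the variable's optionset
    """
    if directly_available:
        return "direct"
    if related_activities == 0:
        return "not linking"
    if related_activities == 1 and n_categories <= 2:
        return "simple proxy"
    return "complex proxy"


# Illustrative inputs (assumed values, not taken from the thesis data).
print(classify_variable(True, 1, 0))    # e.g. date of operation
print(classify_variable(False, 4, 2))   # e.g. complication, several activities
print(classify_variable(False, 0, 0))   # e.g. weight
```

The order of the checks matters: a variable is only considered a proxy once it is known that it cannot be filled in directly.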

3.1.2 Selection of variables

In this scientific research project a total of six proxies were selected to perform the study: complication (COMPL), vein invasion (VENINV), comorbidity (COMORB), operation urgency (TYPOK), chemotherapy (CHEMO1) and score of lymph node spreading (CNSCORE). Two direct variables were also studied: date of operation (DATOK) and days in the intensive care (ICDG). As the dataset contains


variables. First, the variables were randomly selected. Afterwards, the sample was approved by an expert of the DICA who had investigated whether the sample was sufficient and representative for the study.

3.2 Matching the patient data

The financial data was encrypted by the Medical Research Data Management (MRDM) and all patient-identifiable information was removed. MRDM encrypts the patient number in such a way that the data from DICA and Performation can still be combined. For this reason, the patients for whom data had been submitted by DICA were matched with the activities of the financial data source based on the patient number.

In general, when tables of information are combined or compared, a ‘join’ is used. There are several ways of applying this method, depending on which data need to be included [46]. In this case, all the information of the patients in the medical dataset was combined with the activities of the financial dataset. The financial data was added to the DSCA by means of a left join using SQL (Structured Query Language). A left join returns all the rows from the first table together with the matching rows of a second table; the result is NULL on the right side when there is no match [46]. Here, the first table was the medical data from the DSCA and the second table contained the activities from the financial database. This method is illustrated in Figure 3.1. Circle A represents all the patients within the DSCA registry and circle B the financial data containing all the activities. The red area includes all patients from circle A, among them the patients of the DSCA registry that can be identified in the activity data by means of the patient number (in total 4761). The DSCA patients without a matching activity have NULL in the activity columns.

Figure 3.1: Left join

Table 3, Table 4 and Table 5 can serve as an example of the left join method described previously. Table 3 contains information from the clinical audit and Table 4 contains the available activities for the same patients. In Table 5 it can be seen that two patients have been matched with two activities and a third one does not have a match and therefore has a NULL in the activity column.

Table 3: Clinical data

PN  H   Sex     Comorb  Loc            ScoreCN  Col  Date
1   H1  male    Yes     Rectum         0        Yes  20/02/14
2   H2  female  Yes     Sigmoid colon  0        Yes  3/4/15
3   H3  female  No      Rectum tumor   0        Yes  21/6/14

PN=Patient Number H=Hospital Comorb=Comorbidity Loc=Location of the tumor Col=Colonoscopy Date=Date of operation

Table 4: Financial data

PN  CTA  ID  MRIB  ED
1   0    1   0     1
3   1    1   0     0

CTA=Computed Tomography Abdomen ID=Inpatient day MRIB=Magnetic Resonance Imaging Brain ED=Emergency Department

Table 5: Result

PN  H   Sex     Comorb  Loc            ScoreCN  Col  Date      Activity
1   H1  male    Yes     Rectum         0        Yes  20/02/14  ID
1   H1  male    Yes     Rectum         0        Yes  20/02/14  ED
2   H2  female  Yes     Sigmoid colon  0        Yes  3/4/15    NULL
3   H3  female  No      Rectum tumor   0        Yes  21/6/14   CTA
3   H3  female  No      Rectum tumor   0        Yes  21/6/14   ID
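The left join of Tables 3 to 5 can be reproduced on a small scale. The following sketch is only an illustration in Python with the standard sqlite3 module; the table and column names (dsca, activities, pn) are assumptions made for the example and are not the names used in the actual databases, and the financial data is stored with one row per registered activity.

```python
import sqlite3

# Hypothetical table and column names; the rows mirror Tables 3 to 5.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dsca (pn INTEGER, hospital TEXT, sex TEXT);
    CREATE TABLE activities (pn INTEGER, activity TEXT);
    INSERT INTO dsca VALUES (1, 'H1', 'male'), (2, 'H2', 'female'), (3, 'H3', 'female');
    INSERT INTO activities VALUES (1, 'ID'), (1, 'ED'), (3, 'CTA'), (3, 'ID');
""")

# LEFT JOIN keeps every DSCA patient; a patient without any activity
# gets NULL (None in Python) in the activity column.
joined = conn.execute("""
    SELECT d.pn, d.hospital, d.sex, a.activity
    FROM dsca AS d
    LEFT JOIN activities AS a ON a.pn = d.pn
    ORDER BY d.pn, a.rowid
""").fetchall()

for row in joined:
    print(row)
```

Patient 2 has no registered activity and therefore appears once, with an empty activity, as in Table 5.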

3.3 Study of a small sample of patients

Before using the complete dataset to build the model, the information of 10 randomly selected patients was explored by hand, in order to gain a better knowledge of both registrations. The possibility of predicting a medical event using the activities of the financial registration was analysed on a small scale. The activities were followed chronologically in order to build the medical history of those patients, taking the variables of the DSCA as a reference. With the purpose of seeing whether both registrations contained the same information, special attention was paid to specific activities such as tests, operations and treatments. These ten patients cover, roughly speaking, all available patient profiles. See the medical history of these 10 patients in the Appendix.

3.4 Building the predicting model for categorical variables

As described in Chapter 2, machine learning is a technique that makes it possible to build a model based upon existing data in order to make predictions about future data. This technique was used to build the predictions for the sample of DSCA variables described before: presence or absence of complication (COMPL), presence or absence of comorbidity (COMORB), presence or absence of extramural vein invasion (VENINV), chemotherapy schema (CHEMO1), degree of spread to regional lymph nodes (SCORECN) and urgency of surgery (TYPOK). All these variables are categorical; thus, it was preferable to use a classification method that is suited for qualitative response values [19].

3.4.1 Preparing the data

The data obtained by means of the left join was converted into a table in which every row contained the medical information of the DSCA and a count of the activities related to the corresponding patient. This was done with SQL studio and exported to R studio.
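As an illustration of this preparation step, the following Python sketch turns the joined rows into one row of activity counts per patient. It is not the SQL that was actually used; the function name and the input layout (pairs of patient number and activity) are assumptions made for the example.

```python
from collections import Counter, defaultdict

def activity_counts(joined_rows):
    """Turn (patient_number, activity) pairs into one Counter of activities per patient."""
    counts = defaultdict(Counter)
    for pn, activity in joined_rows:
        counts[pn]  # touching the key keeps unmatched patients as an empty row
        if activity is not None:
            counts[pn][activity] += 1
    return counts

# Hypothetical joined rows: patient 2 has no matching activity (NULL / None).
rows = [(1, "ID"), (1, "ED"), (1, "ID"), (2, None), (3, "CTA")]
table = activity_counts(rows)
print(dict(table))
```

A Counter returns 0 for an activity a patient never had, so every patient can be read as a full row of counts.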

3.4.2 Selecting the predictors

The focus of this study was to assess the possibility of prefilling the different values or categories of the DSCA variables. For this reason, only the activities of patients with a certain value for the studied variable of the DSCA were used as possible predictors. For example, when studying the presence (COMPL=1) or absence (COMPL=0) of a complication, the activities of the patients having a complication (COMPL=1) were used to assess the possibility of predicting that a patient has had a complication. As the number of activities that a patient can have is very large, only the most frequent activities were selected (200 per category). The less frequent activities are only available in the dataset for a small number of patients, which makes it logical not to include them in the predicting model, so that the studied sample of activities stays small. See the different categories of the studied variables in the data dictionary of the DSCA in the Appendix.
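The selection of the most frequent activities per category can be sketched as follows. This Python example is only an illustration: the function name and data layout are hypothetical, and it assumes that "most frequent" means registered for the largest number of patients in the category, which the text does not state explicitly.

```python
from collections import Counter

def most_frequent_activities(patients, category_value, top_n=200):
    """patients: list of (variable_value, activities) pairs.
    Return the top_n activities registered for the most patients
    whose DSCA variable equals category_value."""
    freq = Counter()
    for value, activities in patients:
        if value == category_value:
            freq.update(set(activities))  # count each activity once per patient
    return [name for name, _ in freq.most_common(top_n)]

# Hypothetical example: COMPL value and activity codes per patient.
patients = [(1, {"A", "B"}), (1, {"A"}), (0, {"C"}), (1, {"A", "B", "D"})]
print(most_frequent_activities(patients, category_value=1, top_n=2))
```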

3.4.3 Univariate analysis with logistic regression

As the number of predictors (activities) far exceeded the number of observations (category of a DSCA variable), a univariate analysis with logistic regression was performed before building the predicting model. Univariate screening is a filter method that is particularly useful to select a limited number of predictors, in this case the most relevant activities [29]. By means of the R function lapply [38], a logistic regression was fitted for each predictor separately, so that the relation between each predictor and the outcome was studied. The p-value of each predictor was calculated and classified based on the level of significance (p-values lower than 0.2 were considered significant [19]).
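The univariate screening can be sketched as below. This is a Python illustration and not the R code of the thesis: it fits a logistic regression with a single predictor by Newton-Raphson and assumes that the reported p-value is the two-sided Wald p-value of the slope. The function names and the toy data are hypothetical, and no guard is included for constant predictors or perfectly separated data.

```python
import math

def univariate_logit_pvalue(x, y, iterations=25):
    """Fit y ~ b0 + b1*x by Newton-Raphson; return the two-sided Wald p-value of b1."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Entries of the 2x2 information matrix [[a, b], [b, c]].
        w = [pi * (1.0 - pi) for pi in p]
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        c = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * c - b * b
        b0 += (c * g0 - b * g1) / det
        b1 += (a * g1 - b * g0) / det
    std_error = math.sqrt(a / det)  # from the information matrix of the last iteration
    z = b1 / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))

def screen(activities, outcome, alpha=0.2):
    """Keep the activities whose univariate p-value is below alpha."""
    return [name for name, x in activities.items()
            if univariate_logit_pvalue(x, outcome) < alpha]

# Toy data: 20 patients; 'related' co-occurs with the outcome, 'unrelated' does not.
outcome = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
activities = {
    "related": [1] * 10 + [0] * 10,
    "unrelated": [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
}
print(screen(activities, outcome))
```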

3.4.4 Splitting of the dataset

As mentioned in Chapter 2, logistic regression is one of the most common methods used for classification. The algorithm used in this SRP to make predictions for the DSCA categorical variables was based on the logistic regression approach. The algorithm uses 70 percent of the data for training the model and the remaining 30 percent to test the model. With this division, the model can be validated with a subset of the total data at the same time that it is built [30,39,48]. The splitting of the dataset into these two groups was done randomly.
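A random 70/30 split of this kind can be sketched in a few lines. The Python function below is an illustration only; the function name and the fixed seed are assumptions, added so that the example is reproducible.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and cut them into a training and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_train_test(range(100))
print(len(train), len(test))
```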


Figure 3.2: Confusion matrix

3.4.5 Logistic regression

Logistic regression is mainly used to predict probabilities [37,40]. In this case, a logistic regression model was fitted to the training set, and the probability that a patient, given the registered activities, belongs to a particular category of a variable of the DSCA was calculated [19]. The algorithm iterates over the activities and fits the model for a specific category of a variable. It uses a threshold based on specificity and sensitivity, and the iteration stops when the maximal Area Under the Curve (AUC) is found.

For example, if having a complication is considered positive (1) and not having a complication negative (0), the distribution of predictions differs depending on the threshold. If this threshold increases, the number of false positive (FP) results is lowered, while the number of false negative (FN) results increases [48]. The positive cases should have a higher probability than the negative cases. However, when more cases are predicted positive, the number of false positives increases as well. Figure 3.2 shows the 4 possibilities depending on the threshold.
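The effect of the threshold on the four cells of the confusion matrix can be made concrete with a small example. The Python sketch below uses hypothetical predicted probabilities and labels; it is not taken from the thesis data.

```python
def confusion_counts(probabilities, labels, threshold):
    """Return (TP, TN, FP, FN) when probabilities >= threshold are predicted positive."""
    tp = tn = fp = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        elif predicted == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

# Hypothetical predictions for six patients (1 = complication).
probabilities = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for threshold in (0.2, 0.5, 0.7):
    print(threshold, confusion_counts(probabilities, labels, threshold))
```

Raising the threshold from 0.2 to 0.7 removes the false positives in this example, at the price of one false negative.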


Figure 3.3: ROC curve

The impact of that threshold on the balance between false positives and false negatives can be visualized and quantified with a receiver operating characteristic (ROC) curve. The ROC curve is the interpolated curve made of points whose coordinates, the false positive rate and the true positive rate, are functions of the threshold. The optimal point on the ROC curve is (FP rate, TP rate) = (0, 1): no false positives and all true positives. The curve is by definition monotonically increasing, and moving the threshold (blue line in Figure 3.3) up and down moves the operating point along the curve. Furthermore, the ROC curve of any reasonable model is located above the identity line, because any point below it implies a prediction performance worse than random. The area below the ROC curve is the AUC. The closer the ROC curve gets to the optimal point of perfect prediction, the closer the AUC gets to 1 [48]. See a graphic example in Figure 3.3.
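The construction of the ROC curve and the AUC can be illustrated with a minimal Python sketch. The data are hypothetical, and the AUC is computed with the trapezoidal rule over the ROC points, which is one common way of computing it and not necessarily the one used by the R functions of this study.

```python
def roc_points(probabilities, labels):
    """Return the ROC points (false positive rate, true positive rate), one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(probabilities), reverse=True):
        tp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 1)
        fp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical predictions for six patients (1 = complication).
probabilities = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(probabilities, labels)
print(points)
print(round(auc(points), 3))
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, which corresponds to an AUC of 8/9.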


3.4.6 Backward selection

As explained before, only the activities with a p-value lower than 0.2 were included in the predicting model. Moreover, the logistic regression was performed using backward selection [42]. This method is based on the AIC (Akaike Information Criterion), which estimates the relative quality of each model in a given collection. It selects the combination of activities that will be the best predictors and, at the same time, guards against overfitting (see the definition in Chapter 2) [31]. Backward stepwise regression starts with all, or a preselection of, candidate activities. Thus, step 0 corresponds to the model containing all the variables. Then, step by step, one variable is excluded: from all n-1 models, the best model is selected, and again one variable is dropped. This backward stepwise selection stops when the model cannot be improved anymore.
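The backward elimination loop can be sketched independently of the model that is being fitted. In the Python illustration below, aic_of is a hypothetical callable that returns the AIC of a model containing the given activities; in a real analysis it would refit the logistic regression each time. Here it is replaced by a toy function built on AIC = 2k - 2 ln(L), with invented log-likelihood gains per activity, so the activity names and numbers are assumptions and not results of this study.

```python
def backward_select(candidates, aic_of):
    """Drop one variable at a time for as long as the AIC decreases."""
    current = list(candidates)
    best_aic = aic_of(current)
    while current:
        # All models with exactly one variable removed from the current model.
        trials = [[v for v in current if v != dropped] for dropped in current]
        scored = [(aic_of(trial), trial) for trial in trials]
        trial_aic, trial = min(scored, key=lambda pair: pair[0])
        if trial_aic >= best_aic:
            break  # no single removal improves the model
        current, best_aic = trial, trial_aic
    return current, best_aic

# Toy stand-in for refitting a model (hypothetical log-likelihood gains).
GAINS = {"reintervention": 12.0, "icu_day": 5.0, "parking": 0.2, "coffee": 0.1}

def toy_aic(variables):
    log_likelihood = -100.0 + sum(GAINS[v] for v in variables)
    return 2 * len(variables) - 2 * log_likelihood

selected, best = backward_select(list(GAINS), toy_aic)
print(selected, best)
```

Because every extra variable costs 2 AIC points, the two activities with a negligible gain are dropped and the two informative ones are kept.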

3.4.7 Multiple categories

For the categorical variables that were not boolean, as many predicting models as possible outcomes were analysed. After estimating the results, the most likely one was selected. For example, the operation urgency variable (TYPOK) has 4 different possible categories; thus, 4 different models were fitted. In the first model the value was either 1 for elective or 0 (the rest of the possible categories), in the second model the value was 1 for urgent and otherwise 0; the process was repeated until the regression had been performed for all the possibilities. It was assumed that the highest probability of the 4 models corresponded to the most likely outcome, which was used to impute the prediction of the variable.
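This one-model-per-category strategy amounts to picking the category whose model returns the highest probability. A minimal Python sketch, with invented probabilities and with placeholder names for the two TYPOK categories that are not named in the text:

```python
def most_likely_category(probabilities_by_category):
    """Return the category whose one-vs-rest model gives the highest probability."""
    return max(probabilities_by_category, key=probabilities_by_category.get)

# Hypothetical outputs of the 4 one-vs-rest models for one patient.
typok_probabilities = {
    "elective": 0.81,
    "urgent": 0.22,
    "category_3": 0.05,  # placeholder name
    "category_4": 0.02,  # placeholder name
}
print(most_likely_category(typok_probabilities))
```

The four probabilities come from separate models and therefore do not need to sum to 1.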


3.5 Predicting outcome

As described before, by means of logistic regression the best possible model can be found based on sensitivity and specificity. The accuracy (ACC), the positive predictive value (PPV) and the negative predictive value (NPV) were also calculated. Afterwards, the six selected variables were sorted into different groups based on the PPV. Higher importance was given to the true positives because the initial goal of this master thesis was to discover whether an event can be detected by means of financial data. For this reason, the PPV was used for grouping.

A color strategy was used to classify the variables into three different groups. Color classification has been used in the health care domain with the objective of grouping patients with the same characteristics. For example, it was used to determine the need for urgent care and hospitalization among patients who visit the emergency department [32]. The main objective of this classification is to use colors to guide the doctor through the registration process. Not only might the time spent in this process be reduced, but the quality of the information might also be improved, because the different colors emphasize which information should be checked to assure correctness. See the characteristics of the color classification below [33]:

Green: PPV greater than 0.75. The expected result is prefilled. It is possible to select another answer if it is incorrect.

Orange: PPV between 0.25 and 0.75. The best possible answer is prefilled. It is possible to select another answer if it is incorrect; otherwise, the doctor must tick a check box.

Red: PPV less than 0.25. The doctor must select the answer. Optionally, the predicted value can also be shown to the doctor.
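The measures and the color rule can be written down compactly. The following Python sketch is an illustration with hypothetical function names; the counts used in the example are those reported for COMPL and CHEMO1 in Table 10, and a PPV of exactly 0.25 or 0.75 is assigned to orange here, which is an assumption because the text does not specify the boundary cases.

```python
def predictive_values(tp, tn, fp, fn):
    """Accuracy, predictive values, sensitivity and specificity from the confusion counts."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def color_for(ppv):
    """Green above 0.75, red below 0.25, orange in between."""
    if ppv > 0.75:
        return "green"
    if ppv >= 0.25:
        return "orange"
    return "red"

compl = predictive_values(tp=311, tn=619, fp=387, fn=138)
chemo1 = predictive_values(tp=1128, tn=128, fp=144, fn=55)
print(round(compl["PPV"], 2), color_for(compl["PPV"]))
print(round(chemo1["PPV"], 2), color_for(chemo1["PPV"]))
```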


3.5.1 Steps of the prediction

The following steps clarify the order in which the methodologies described before were performed:

Step 1: Two hundred activities were used to perform a univariate analysis with logistic regression for each possible category within a variable. The "lapply" function in R was used.

Step 2: The previous step served to select the significant activities (p-value lower than 0.2) to fit into a logistic regression model, including evaluation by means of splitting the data into a training and a testing set. The logistic regression was done using the "glm" (generalized linear model) function in R.

model <- step(glm(f, data = traindata[, -targindex], family = "binomial", maxit = 50), direction = "backward")

Step 3: A backward stepwise method was included in the previous regression function.

Steps 1, 2 and 3 were performed for each possible category of the six DSCA variables.

Step 4: The outcome of the logistic regression, which finds the best possible prediction, was used to do a color classification for each variable.

4

Results

In this chapter the main findings of the analysis of the datasets are described, as well as the outcome of the prediction model built for prefilling the clinical registration of the DSCA. As explained in Chapter 3, six representative variables were selected in order to perform this study, and the outcome of this selection is shown in this chapter. However, the results related to the direct variables are explained later in Chapter 5 because another approach for prefilling is needed.

4.1 Descriptive analysis of datasets

4.1.1 The DSCA

The clinical dataset used for this SRP contains data from 5140 individual patients: 55 percent are men (2845 individuals) and 45 percent are women (2295 individuals). These patients were treated in 11 different hospitals in the Netherlands between 2007 and 2014. These hospitals are presented anonymously in the text as H1, H2, H3, etc. Table 6 contains the list of the hospitals including general characteristics.

Table 6: General characteristics

H    Num  LM   Compl  Chemo  Urg
H1   995  55   424    253    48
H2   838  34   301    205    52
H3   821  58   275    206    43
H4   549  102  198    159    7
H5   424  23   72     133    25
H6   385  38   123    124    10
H7   351  31   62     100    36
H8   331  30   132    97     47
H9   328  23   58     80     28
H10  80   7    27     18     0
H11  38   4    14     10     1

H = Hospital Num = Number of patients LM = Liver Metastasis Compl = Complication Chemo = Chemotherapy Urg = Urgent operation

The age of the patients varies from 49 to 94 years and each patient had a resection due to a tumor either in the colon or in the rectum. Specifically, 71 percent of the patients were diagnosed with colon cancer and 29 percent with rectal cancer. The most common metastasis was located in the liver, with a total of 405 cases. There were 5059 operations: 2117 were open and 2942 were laparoscopic procedures. The most common resections were low anterior resection (2334 patients) and extended right hemicolectomy (1691). Only 12 percent of those operations were urgent and the rest were elective. A total of 1686 patients had a complication within 30 days after surgery and 283 reinterventions were performed. Chemotherapy was given to 1385 patients.

4.1.2 Classification of the variables

The 96 studied variables of the DSCA can be numerical, categorical (fifty percent have two classes and the rest more than two), text or dates. The number of variables in each of these groups is given in Table 7. Table 8 shows the number of variables that are direct, proxies or not linking.

Table 7: Classification of the variables 1

Type of variable  Amount
Numerical         15
Categorical       65
Text              6
Date              10

Table 8: Classification of the variables 2

Type of variable  Amount
Direct            19
Proxy             42
Not linking       35

Predicting a direct variable such as the date of operation, which might be fully computable from the financial data, is not the same as predicting whether a patient has had a complication. In the latter case, the activities of patients having a complication, contained in the financial registration, were used to build the predicting strategy.

4.1.3 The financial dataset

After doing the left join, the resulting data matrix has a total of 1,594,470 elements. As explained before, this matrix contains the DSCA patient data linked to the registered activities by means of the patient number. Three hundred seventy-nine patients from the DSCA dataset could not be matched to any activity because the financial data was unavailable (the stable database for this SRP was created with the information available to X-IS at a specific moment). However, it was possible to match 4761 of the DSCA patients with the financial dataset. For this sample of patients with colorectal cancer, 1634 different activities were reported.

4.2 Ten patient study

The prediction algorithm was based upon all patients. However, before analysing the whole patient group, a small sample of 10 patients was studied in detail. There were matches between both registrations in several activities, such as diagnostic tests, treatments and operation procedures. Looking at individual cases was very helpful to better understand the DSCA variables and to find out whether the equivalent information was available in the financial data. Manual inspection showed that relevant information might be extracted from the financial dataset and that there is a certain overlap. See in the Appendix the clinical information of the ten patients, created while contrasting the information of both datasets.

4.3 Outcomes of the prediction for categorical variables

As described in the methods, stepwise logistic regression was used in order to select the activities for modeling the prediction of the target variables of the DSCA. For each of these variables, there was a different number of significant activities to include in the model. When a variable had more than two possible categories, the model was fitted for all the options and the one with the highest probability was assumed to be the most likely outcome. Table 9 shows the studied category of the DSCA variables and the corresponding number of patients belonging to that category.

Table 9: Categories

Variable  Predicting outcome                        Amount  Percentage
COMPL     The patient had a complication            1686    33
COMORB    The patient had a comorbidity             3708    72
VENINV    The patient had extramural invasion       789     15
CHEMO1    The patient did not receive chemotherapy  4151    81
SCORECN   The patient had a SCORECN equal to 0      3953    77
TYPOK     The operation was elective                4428    86

Tables 10, 11 and 12 show the statistical results based on the best model found for each variable. Each row of Table 10 sums up to 1455, which corresponds to the sum of true positives, true negatives, false positives and false negatives of the test set of the model with the highest AUC. Table 11 shows the results of the same model concerning sensitivity and specificity, and Table 12 shows the positive and negative predictive values.

Table 10:

Variable  TP    TN    FP   FN
COMPL     311   619   387  138
COMORB    1021  24    403  7
VENINV    87    1042  200  126
CHEMO1    1128  128   144  55
SCORECN   1114  54    256  31
TYPOK     1246  16    187  6

TP=True Positive TN=True Negative FP=False Positive FN=False Negative

Table 11: AUC

Variable  AUC   ACC   Sensitivity  Specificity
COMPL     0.70  0.64  0.69         0.62
COMORB    0.73  0.72  0.99         0.06
VENINV    0.66  0.78  0.41         0.84
CHEMO1    0.81  0.86  0.95         0.47
SCORECN   0.79  0.80  0.97         0.17
TYPOK     0.72  0.87  0.99         0.07

AUC = Area Under Curve ACC = Accuracy

Table 12: Prediction

Variable  PPV   NPV
COMPL     0.45  0.82
COMORB    0.72  0.77
VENINV    0.30  0.89
CHEMO1    0.89  0.67
SCORECN   0.81  0.63
TYPOK     0.87  0.73

4.4 Color classification

Based on the statistical results, the variables were classified depending on the PPV. The variables chemotherapy (CHEMO1), score of lymph nodes spreading (SCORECN) and operation urgency (TYPOK) could be computed reliably (PPV greater than 0.75) based on the financial data and were assigned the green color. For the rest of the variables, the relationship between the DSCA and the activities was weaker and they were assigned the orange color. Table 13 shows the classification based on PPV.

Table 13: Classification

Variable  Color
COMPL     Orange
COMORB    Orange
VENINV    Orange
CHEMO1    Green
SCORECN   Green
TYPOK     Green

5

Discussion

In this chapter the main findings are discussed. Moreover, the problems encountered during this scientific research project are explained with recommendations for future studies.

5.1 Main findings of the study

The DSCA contains different types of variables. Depending on their nature, the accuracy of prediction can be different. In the financial data, the activities show whether a certain event happened in a hospital for a patient. However, the results of that event are not available; thus, the variables that strictly depend on medical information were harder to predict. In some cases, prediction could not even be considered. A clear example is determining the size of the tumor: no activity was found to predict the tumor size reliably.

Nevertheless, the existence or the frequency of some activities in the financial data can be used to partially prefill the clinical registration of the DSCA. The results of this scientific research project show that the financial data contains quality information that can be used to fill in the clinical registration, but the accuracy of prefilling differs depending on whether the variable is process related (depends more on administrative information) or describes a medical characteristic (usually obtained by means of exploration). Variables of the first kind are the most likely to be prefilled.

In order to reflect this difference, the variables of the DSCA were assigned to different colors (green, orange or red) depending on the accuracy of prefilling. With this strategy the registration process can be easier for the doctor, because part of the information will already be filled in and another part will only need to be reviewed. The doctor or health care professional responsible for the registration will only need to register the variables that could not be prefilled based on the activities.

The variables of the DSCA were classified into three different groups: direct, proxy and not linking. An initial assumption was that the variables for which the data was available in the financial dataset could be prefilled. However, after performing the analysis of the selected sample, it can be stated that the percentage of variables which can be prefilled accurately can only be determined if the method used in this thesis is applied to each variable of the DSCA. Only after that will it be possible to sort the whole DSCA by color. Beforehand, it can only be stated that for the 19 direct variables, the green color will be assigned when the information is correctly available. For the 42 proxies, the classification in the green, orange or red group depends on the statistical results. The red color can be assigned to the 35 variables for which the information cannot be found in the financial data. Despite the fact that it is not possible to estimate the accuracy of prefilling for all the variables, this scientific research project served to design a method that can be reused for the whole dataset.

Figure 5.1: Graphical representation of the color classification

The following example shows that the results cannot be generalized to the whole DSCA:

The date of first treatment with radiotherapy was considered a direct variable because the information required in the DSCA is available in the same format in the financial data. But this does not mean that this variable can be classified as green and automatically prefilled. It might be that the percentage of matches between both datasets for this variable is very low, and then it needs to be classified as orange or red.

It has been shown that, by means of machine learning techniques, valuable relationships can be found between administrative and clinical patient data. For example, the occurrence of a specific laboratory test served as a predictor of whether a patient had a complication or not. Such a relationship might not be found using the knowledge of an expert. However, using a model like the one designed for this scientific research project, it can be shown that such a laboratory test is significant for the prediction. The most frequent activities, with the highest counts, are most likely not the ones discriminating between the outcomes of a variable, since the distribution of counts over the possible outcomes is rather equal. Less frequent activities might be too rare or uncommon to have a predictive impact. For this reason, before creating the predicting model, the significance of the 200 most frequent activities was measured.

5.2 Limitations and directions for future research

Based on the experience acquired during this SRP, recommendations can be made to improve the quality of the predicting model built during this research.

5.2.1 Data quality

The DSCA registration was considered the standard. However, the validity of the 2 registrations used in this SRP was not studied. In the future, the possibility of improving the quality of the DSCA registration can be researched. It might be that the "gold" standard is somewhere in between the financial and the quality registration.

5.2.2 Constraints for the financial data

Another point to remark is the impact of using only the activities performed during a specific period of time. Using the activities that happen before or after a specific moment in the cycle of care of a patient (for example, during the week after surgery) might influence the prediction of a medical event. For this SRP all the activities registered between 2007 and 2014 were used. Moreover, this time constraint can be different depending on the studied variable. Further studies can investigate whether the method can be improved using time intervals. The cost of an activity or the department where it is performed could also serve to set constraints on the financial dataset. For example, a patient with a complication might be more expensive than one without, and activities registered in a specific department may indicate that a patient has a comorbidity.

5.2.3 Hospital variability

Future research should also consider hospital variability and investigate the differences in the prediction. In this scientific research project the variability between hospitals was studied for the variable with the lowest accuracy (COMPL). Table 14 shows the results for the variable complication, per hospital. It can be seen that, with fairly constant values of AUC, the PPV oscillates considerably depending on the studied hospital.

Table 14: Hospital variability 1

Hospital  AUC   PPV   NPV
H1        0.59  0.93  0.14
H2        0.69  0.54  0.84
H3        0.65  0.48  0.77
H4        0.53  0.55  0.57
H5        0.60  0.36  0.92
H6        0.60  0.55  0.66
H7        0.57  0.4   0.73
H8        0.66  0.57  0.74
H9        0.58  0.35  0.82
H10       0     0     0
H11       0.61  0.67  0.34

The model might be improved if the hospitals that do not report their data correctly are excluded from the study. Table 15 shows the activities registered per hospital. Excluding the hospitals where the number of activities per patient is low might also have a positive impact.

Table 15: Hospital variability 2

Hospital  Activities  Patients  Average
H1        299867      995       301
H2        170132      838       203
H3        425137      821       517
H4        179646      549       327
H5        37257       424       87
H6        130580      385       339
H7        160803      351       458
H8        68963       331       208
H9        121119      328       369
H10       302         80        3
H11       664         38        17

5.2.4 Classification based on true positives

As shown in the results section, the color classification of the variables was based on the PPV, that is, on the proportion of positive predictions that are true positives. It was assumed that it is more important to discover, by means of the financial information, that an event did occur. Thus, a higher importance was given to the true positives. Nevertheless, the prefilling could also be done using the true negatives. For some variables, the NPV was higher than the PPV. For example, for VENINV the PPV was 0.30 and the NPV was 0.89. This example shows that a different approach is possible, in which the NPV is used for prefilling.

Unexpectedly, the variables with only two possible categories (COMPL, VENINV and COMORB) had a PPV lower than 0.75 and were therefore classified as orange. However, the variables with multiple possible categories were assigned the green color due to a PPV higher than 0.75.

the activities might have an influence in the end result.

5.2.5 Direct variables

During this master thesis it was also briefly studied whether the information of two direct variables coincided in both registrations. These two variables were the date of operation (DATOK) and the days in the intensive care (ICDG). Eighty-six percent of the dates registered for the most frequent activities used to describe an operation procedure coincide with the DATOK (operation date) of the DSCA. Moreover, if a wider time interval around the date is allowed (for example, adding and subtracting 5 days), the percentage improves.
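The comparison of dates with a tolerance can be sketched as follows. This Python example is an illustration only: the function name is hypothetical and the date pairs are invented, not taken from the DSCA or the financial data.

```python
from datetime import date

def match_rate(pairs, tolerance_days=0):
    """Fraction of (DSCA date, activity date) pairs that differ by at most tolerance_days."""
    hits = sum(1 for dsca_date, activity_date in pairs
               if abs((dsca_date - activity_date).days) <= tolerance_days)
    return hits / len(pairs)

# Hypothetical pairs: exact match, 2 days apart, 10 days apart.
pairs = [
    (date(2014, 2, 20), date(2014, 2, 20)),
    (date(2015, 4, 3), date(2015, 4, 5)),
    (date(2014, 6, 21), date(2014, 7, 1)),
]
print(match_rate(pairs), match_rate(pairs, tolerance_days=5))
```

Widening the interval from 0 to 5 days raises the share of matching dates in this toy example from one third to two thirds.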

Concerning the ICDG, the number of days that a patient spends in the intensive care, a match of fifty percent was found when comparing both datasets. However, for some patients the difference was only 1 or 2 days. Thus, the exact number may not be predictable in all cases, but an estimate can be filled in by means of the activities. When the total number of days in the intensive care is summed for 2014, a difference of only 17 days is found for the whole dataset: 262 days were registered in the activities and 245 days in the DSCA. Specific rules must be set for each direct variable in order to determine the most adequate technique to prefill this kind of variable.

6

Conclusions

In this study, data from 11 hospitals were included. For this sample, the financial data can be used to partly prefill the medical registration of the DSCA. The prediction carries a certain degree of error, but the method can still save doctors time: instead of typing in all the information, for almost all variables they would only have to check whether the suggested option is correct. In conclusion, part of the DSCA can be predicted using the financial data, although information that strictly depends on the result of a medical procedure cannot be found. As the studied sample is large, this study is generalisable to, at least, other hospitals in the Netherlands. However, the model must be improved before it can be used in practice: first, the recommendations for improvement explained in Chapter 5 must be taken into consideration, and afterwards the model could be applied to new data and validated by doctors.

One of the goals of this study was to break down the barriers to a successful registration process in Dutch hospitals; multiple data entry is unnecessary, error-prone, tedious and time-consuming. For this reason, the model should be tested in clinical practice to see whether it is successful. If possible, data should be recorded only once and in adequate quality [12]. The DSCA variables should be prefilled automatically from the financial data, checked by the person responsible and submitted to the DICA. Health care institutions must take advantage of the possibility of reusing data and create strategies to increase the quality of the registrations, which can ultimately serve as a basis for doctors, not only to monitor but also to deliver the best possible care [8]. The latest news indicates that the quality of the health registries must be improved, but also that the annual burden of the registration process in the Netherlands adds up to 80 million euros [34]. Thus, it is clear that costs should decrease at the same time as quality increases; this master thesis is a first attempt to change the current situation. In the near future, the prediction model will be improved and applied to all the variables of the DSCA to investigate the impact it can have on the registration process.

7 Appendix

7.1 List of abbreviations

The following table lists the abbreviations used in this thesis, their meaning, and the page on which each one is first used.

Abbreviation  Meaning                                       Page
CEA           carcinoembryonic antigen                      8
DSCA          Dutch Surgical Colorectal Audit               3
EMR           Electronic Medical Record                     13
SQL           Structured Query Language                     19
DICA          Dutch Institute for Clinical Auditing         3
MRDM          Medical Research Data Management              19
DOT           Diagnose Behandeling Combinatie               2
CT            Computed Tomography                           3
CTA           Computed Tomography Abdomen                   3
PN            Patient Number                                20
H             Hospital                                      20
Loc           Location of the tumor                         20
MRIB          Magnetic Resonance Imaging Brain              20
ED            Emergency Department                          20
COMORB        comorbidity                                   18
COMPL         complication                                  18
VENINV        extramural vein invasion                      18
CHEMO1        chemotherapy                                  18
SCORECN       cN score                                      18
TYPOK         operation urgency                             18
SRP           Scientific Research Project                   23
AUC           Area Under the Curve                          23
AIC           Akaike Information Criterion                  24
ACC           accuracy                                      25
PPV           positive predictive value                     25
NPV           negative predictive value                     25
ASA           American Society of Anesthesiologists         29
TNM           Tumor Nodes Metastasis                        32
HIPEC         Hyperthermic Intraperitoneal Chemotherapy     31
CAPOX         Combination of Capecitabine and Oxaliplatin   31

7.2 10 patient study

The clinical descriptions of the ten patients below were created by contrasting the information in the quality registration with that in the financial registration.

Patient 1: 79-year-old man with no comorbidities and ASA score II (the ASA score is a physical status classification system for assessing the fitness of patients before surgery). He was diagnosed with colon cancer, specifically an adenocarcinoma in the sigmoid colon. An elective laparoscopy was performed to carry out a low anterior sigmoid resection with removal of 13 lymph nodes. An early conversion was performed during the operation because of the extent of the tumor, and finally a terminal colostomy was created. He had neurological complications within 30 days after the operation. The pathological tumor (T) score was classified as 3.

Patient 2: 75-year-old woman with no comorbidities and ASA score I. She was diagnosed with colon cancer, specifically an adenocarcinoma in the sigmoid colon. Surgery was performed before the resection of the tumor, and she had a preoperative complication, namely an ileum obstruction. An elective laparoscopy was performed to carry out a low anterior sigmoid resection with intestinal anastomosis and removal of 14 lymph nodes. The pathological T was classified as 4.

Patient 3: 66-year-old man with comorbidities and ASA score II. He was diagnosed with rectal cancer, specifically an adenocarcinoma. He had preoperative chemotherapy and radiotherapy. An elective laparoscopy was performed to carry out an abdomino-perineal resection with removal of 19 lymph nodes. Finally, a terminal colostomy was created. He had a second tumor in the liver. The pathological T was classified as 2.

Patient 4: 66-year-old man with comorbidities and ASA score II. He was diagnosed with rectal cancer, specifically an adenocarcinoma. An elective open procedure was performed to carry out a low anterior sigmoid resection with intestinal anastomosis and removal of 9 lymph nodes. A reintervention, consisting of a laparotomy and a terminal colostomy, was necessary because he suffered a complication (anastomotic leakage). He spent 2 days in intensive care. The pathological T was classified as 1.

Patient 5: 82-year-old woman with comorbidities and ASA score II. She was diagnosed with colon cancer, specifically an adenocarcinoma in the ascending colon. An elective open procedure was performed to carry out an extended right hemicolectomy with intestinal anastomosis and removal of 25 lymph nodes, of which 2 were positive. The pathological T was classified as 4 and N as 1. She had extramural venous invasion and received adjuvant postoperative chemotherapy (Capecitabine).

Patient 6: 75-year-old woman with comorbidities and ASA score I. She was diagnosed with colon cancer, specifically an adenocarcinoma in the transverse colon. An elective laparoscopy was performed to carry out a subtotal colectomy with intestinal anastomosis and removal of 19 lymph nodes. A reintervention was necessary to create a diverting ileostomy because she suffered a surgical complication. She spent 6 days in intensive care. The pathological T was classified as 3.

Patient 7: 60-year-old woman with comorbidities and ASA score II. She was diagnosed with colon cancer, specifically a signet ring cell tumor in the sigmoid colon. She received preoperative chemotherapy. An elective open procedure was performed to carry out a low anterior sigmoid resection with removal of 21 lymph nodes, of which 9 were positive. A reintervention was necessary to perform: a terminal colostomy, an additional extensive resection