Academic year: 2021
Medical Informatics

Inappropriate prescribing in elderly people and its impact on

clinically relevant outcomes.

Using the STOPP/START criteria

Mirna L. Dees - Master’s Thesis

Università di Pavia, Centre for Health Technologies
Via Ferrata 5, 27100 Pavia
Study period: December 2018 - July 2019


Medical Informatics

Student: Mirna L. Dees
Student number: 11394684

Mentor: prof. Riccardo Bellazzi
Tutor: prof. Ameen Abu-Hanna

Università di Pavia, Centre for Health Technologies
Via Ferrata 5, 27100 Pavia
Tel.: +382 989898


I received my previous education at a Dutch University of Applied Sciences. Although I successfully completed the pre-Master's program before enrolling in the Master's program, I would often feel insecure about my capabilities and feel the need to prove myself. For this reason it feels like a small victory to hand in my Master's thesis. I am very grateful I was given the opportunity to gain knowledge about inappropriate prescribing in elderly patients and machine learning, and on top of that to explore Italy. I can honestly say I have learned a lot during this period.

I could not have done this alone, and I would like to express my gratitude to some people. First, I would like to thank professor Bellazzi for his warm welcome and for giving me the opportunity to write my Master's thesis at the computer science and systems engineering laboratory for ICSM, and professor Abu-Hanna for his guidance and advice.

Next, I would like to thank Alberto for the pleasant collaboration; your help is much appreciated. Furthermore, I would like to thank everyone at the lab for the collegial support and countless cups of coffee. I am very happy that I got to learn about bees and chickens, in addition to medical informatics topics.

In general, I would like to thank everyone that made my time in Pavia unforgettable. I am curious to see where we will meet again.

Stef, thank you for your never-ending support and for being such a good babysitter for my plants (only three of them died). Justin and Julia, I had no doubt that you guys would pull this off again. Unfortunately, the number of diplomas we obtain does not necessarily reflect in our sense of humor.

I will end my preface here; I hope you will enjoy reading my Master's thesis.

Delft, July 2019 Mirna Dees


Preface i

Nederlandse samenvatting vii

English summary ix

1 Introduction 1

1.1 Introduction . . . 1

1.2 Inappropriate prescribing in elderly people . . . 1

1.3 Identifying Potential Inappropriate Medications and Potential Prescribing Omissions . . . 2

1.4 STOPP/START criteria . . . 2

1.5 Prediction models . . . 2

1.6 Clinical Scientific Institute Maugeri . . . 3

1.7 Previous research . . . 3

1.8 Objectives . . . 4

1.9 Organization of chapters . . . 4

2 Methods 5

2.1 Extraction of the data . . . 5

2.2 Preprocessing of the clinical databases . . . 6

2.2.1 Preprocessing the STOPP database . . . 6

2.2.2 Preprocessing the START database . . . 6

2.3 Detection of inappropriate medications . . . 7

2.3.1 STOPP criteria detection method . . . 7

2.3.2 START criteria detection method . . . 7

2.4 Analysis of the adverse outcomes . . . 8

2.4.1 Model selection & variable selection . . . 8

2.4.2 Cross-validation . . . 8

2.4.3 Variable selection . . . 9

2.4.4 Regression models . . . 9

2.4.4.1 Linear regression . . . 10

2.4.4.2 Logistic regression . . . 10

2.4.4.3 Mixed effect models . . . 11

2.4.5 Performance measures . . . 12

2.4.5.1 Performance measures for continuous outcomes . . . 12

2.4.5.2 Performance measures for binary outcomes . . . 12

2.5 Occurrence of falling . . . 14

2.5.1 Classification trees . . . 15

2.5.2 Random forest . . . 15

3 Results from the analysis of the STOPP criteria 17

3.1 Extraction of the data . . . 17

3.2 Preprocessing of the clinical databases . . . 18

3.3 STOPP criteria detection method . . . 20


3.3.1.1 STOPP compliance per rule . . . 20

3.3.1.2 STOPP compliance per category . . . 22

3.3.1.3 STOPP compliance per year . . . 22

3.4 Analysis of clinical outcomes of interest . . . 22

3.4.1 Preprocessing . . . 24

3.4.2 Model selection . . . 24

3.4.3 Variable selection . . . 26

3.4.4 Hospitalization duration . . . 27

3.4.5 Hospitalization duration exceeding the expected length-of-stay . . . 30

3.4.6 Occurrence of falls during hospitalization . . . 33

3.4.6.1 Results from logistic regression . . . 35

3.4.6.2 Results from classification trees . . . 36

3.4.6.3 Results from random forest . . . 37

3.4.7 Variables influencing the probability of non compliance . . . 39

4 Results from the analysis of the START criteria 43

4.1 Extraction of the data . . . 43

4.2 Preprocessing of the clinical data set . . . 43

4.3 START criteria detection method . . . 45

4.3.1 START criteria evaluation . . . 45

4.3.1.1 START compliance per rule . . . 46

4.3.1.2 START criteria per category . . . 48

4.3.1.3 START compliance per year . . . 48

4.4 Analysis of clinical outcomes of interest . . . 49

4.4.1 Preprocessing . . . 49

4.4.2 Model & variable selection . . . 51

4.4.3 Hospitalization duration . . . 52

4.4.4 Hospitalization duration exceeding the expected length-of-stay . . . 55

5 Discussion 59

5.1 Main findings . . . 59

5.2 Strengths and Limitations . . . 59

5.3 Comparison to other studies . . . 60

5.4 Future work . . . 60

5.5 Conclusion . . . 61

A STOPP/START criteria 69

A.1 STOPP criteria mapping . . . 69

A.2 Discarded STOPP rules . . . 74

A.3 START criteria mapping . . . 74

B Results per rule 77

B.1 STOPP results per rule . . . 77


Introduction The first goal of this study was to identify potentially inappropriate medications and potential prescribing omissions in elderly patients at the ICSM (Clinical Scientific Institute Maugeri) in Pavia, Italy, using the STOPP/START criteria as the instrument for medication review. The second goal was to investigate whether there is an association between inappropriate prescriptions and adverse outcomes for the patient.

Methods For both the STOPP and the START criteria, two data sets were created (one with medication information and one with hospital information) containing the patients who have a medication or a diagnosis that occurs in the criteria. R code was developed that, for every rule, builds a set with all corresponding medications and a set with all corresponding diagnoses. These two sets were then checked for the presence of both elements of the rule. Where necessary, other conditions (such as time, dose, etc.) were included in this check as well.

To examine a possible association between inappropriate prescriptions and adverse outcomes, several data analysis techniques were used, including linear and logistic regression models. The following outcomes were selected: hospitalization duration (in days), exceeding the expected hospitalization duration, and a fall incident during hospitalization.

Results It was possible to apply 55 of the 65 STOPP criteria. In addition, 21 of the 22 START criteria were applied. The STOPP criteria were applicable 144,163 times in this data set; of these cases, 82.2% were found compliant, 17.1% non-compliant and 0.7% uncertain. The START criteria were applicable 21,481 times; of these cases, 47.3% were found compliant, 45.1% non-compliant and 7.6% uncertain.

An association was found between non-compliance with the STOPP criteria and the selected outcomes. However, an association was also found in an analysis in which the presence of a non-compliant rule was used as the outcome and the hospitalization duration as a variable.

No significant association was found between non-compliance with the START criteria and the selected outcomes. An association was found, however, between a fall incident while using medication and the selected outcomes.

Conclusion This study underlines that the procedures concerning inappropriate prescribing in the elderly could be tightened, not only because the percentage of inappropriate prescriptions is fairly high, but also because it can demonstrably have adverse consequences.


Introduction The first goal of this study was to identify potentially inappropriate medications (PIMs) and potential prescribing omissions (PPOs) in elderly patients at the ICSM (Clinical Scientific Institute Maugeri) in Pavia, Italy. The STOPP/START criteria were used as the instrument for this assessment. The second goal was to analyze whether there is an association between inappropriate prescriptions and adverse outcomes for the patient.

Methods To apply the rules, two data sets were created for both the STOPP and the START criteria: one with medication information and one with (other) hospital information. Only patients who have a medication or diagnosis that is present in the criteria were selected for the sets. R code was developed to generate, for every rule, a set with all the corresponding medications and a set with the corresponding diagnoses. Thereafter, these two sets were checked for the presence of both elements of the STOPP/START rule. If necessary, other conditions (such as time, dose, etc.) were also checked.

To analyze possible associations between inappropriate medications and adverse outcomes, several data analysis techniques were used, including linear and logistic regression models. The following outcomes were selected for analysis: hospitalization duration (in days), exceeding the expected hospitalization duration and the occurrence of falling during hospitalization.

Results It was possible to implement 55 out of the 65 STOPP criteria. In addition, 21 out of the 22 START criteria were implemented. A STOPP criterion was applicable 144,163 times; of these cases, 82.2% were found to be compliant, 17.1% potentially not compliant and 0.7% uncertain. A START criterion was applicable 21,481 times; of these cases, 47.3% were found to be compliant, 45.1% not compliant and 7.6% uncertain.

Statistically significant associations were found between non-compliance with the STOPP criteria and the selected adverse outcomes.

No significant association was found between non-compliance with the START criteria and the selected adverse outcomes.

Conclusion In conclusion, this research describes the distribution of potentially inappropriate prescriptions and omissions in ICSM Pavia (Italy) according to the STOPP/START criteria, and analyzes their potential consequences. Results from multivariate models showed that the presence of potentially inappropriate prescriptions and omissions can have an impact on clinically relevant outcomes.


1

Introduction

1.1

Introduction

In this thesis, the subject of inappropriate medication in elderly people and its possible relation to adverse outcomes is explored. The problem is present in both primary care and hospitalized patients; this research focuses on hospitalized patients. In this chapter, an introduction to the problem is given, after which the location and setting of this project are described. This is followed by the objectives of the research and the organization of the chapters of this thesis.

1.2

Inappropriate prescribing in elderly people

Elderly people are more likely to have multiple morbidities. They are therefore more likely to be treated by multiple practitioners and to use more kinds of medication [1–3]. In addition, elderly people may respond differently to drugs than younger people. This combination of factors makes elderly people more susceptible to Adverse Drug Reactions (ADRs) [2, 4]. A complete overview of the medication history is often lacking, which makes correct prescribing even more difficult for practitioners. This poses a greater risk of polypharmacy or inappropriate prescribing [5–7]. Polypharmacy is very common in the elderly population and is defined as using five or more kinds of medication at the same time [2, 5, 6, 8]. A Swedish study concluded that almost half of the Swedish population over 75 years is exposed to polypharmacy [8].

Inappropriate prescribing encompasses the twofold problem of potentially inappropriate medications (PIMs) and potential prescribing omissions (PPOs) [9]. In the case of PIMs, drugs are prescribed although the possible risks outweigh the potential benefits [10], for example drugs with an undesirable drug-drug interaction, overuse of drugs or ADRs [1]. In the case of PPOs, drugs are omitted although they are indicated [10].

ADRs are a well-known problem in elderly patients [7, 9, 11]. Oscanoa et al. estimated in a meta-analysis that 8.7% of the hospital admissions in geriatric units were induced by ADRs [3]. The risks of inappropriate prescribing are numerous, including longer duration of hospitalization, falls, mortality and other poor health outcomes [10, 12–15]. PIMs are frequent: Di Giorgio and Piera observed that 27% of the analyzed patients had one or more inappropriate medications prescribed [16]. Another example is provided by Gallagher et al., who reported that about 35% of their cohort was prescribed at least one PIM [4].


non-existent, but mostly it is more difficult to detect PPOs [1, 17]. Although it may seem contradictory, polypharmacy and underprescribing can occur in the same patient [18]. Research by Di Giorgio and Piera showed that 33% of the analyzed patients had one or more appropriate medications omitted during their hospital visit [16]. In the research of Barry et al., 57.9% of the study cohort had at least one PPO [1]. Failing to prescribe indicated medication can affect disease prevention and can have a substantial impact on health outcomes [1, 15].

1.3

Identifying Potential Inappropriate Medications and Potential

Prescribing Omissions

Between 1991 and 2015, about fourteen different methods were developed to identify PIMs and PPOs in elderly people [7, 19]. Well-known sets of criteria for identifying PIMs are the Beers criteria [13] and the Screening Tool of Older People's Prescriptions (STOPP). A well-known set of criteria for identifying PPOs is the Screening Tool to Alert to Right Treatment (START) [7].

The Beers criteria were developed in the United States by an interdisciplinary panel of geriatric experts. They were among the first criteria to broach the subject and introduced the term "Potentially Inappropriate Medication" [13, 20]. The Beers criteria focus on detecting PIMs but are not designed to detect PPOs [21]. Even though the Beers criteria are very popular in the United States, some of the drugs they describe are rarely used in Europe, which makes them more difficult to apply outside of the United States [20, 22, 23]. Others criticize the illogical ordering of the criteria, which makes them difficult to use [20].

The STOPP/START criteria were created in 2008 by 18 Irish geriatricians, using the Delphi consensus method. The criteria consist of 65 STOPP criteria and 22 START criteria [4, 7]. The primary objective of the STOPP/START criteria is to reduce inappropriate prescribing [16].

In 2015, because of practical relevance, an updated version of the STOPP/START criteria was published [24].

A strong asset of the STOPP/START criteria is that they address both sides of inappropriate prescribing, considering both inappropriate medications and prescribing omissions [9]. The STOPP criteria are used to identify PIMs [7]; an example of a STOPP criterion is "Non-cardioselective beta-blocker with Chronic Obstructive Pulmonary Disease" [25]. The START criteria are used to identify PPOs [7]; an example of a START criterion is "Warfarin in the presence of chronic atrial fibrillation" [25].

When comparing STOPP/START to other existing criteria, STOPP/START appears the most beneficial in terms of flexibility, clinical correctness and usability in different parts of the world [20]. Therefore, the STOPP/START criteria were used in this research to detect inappropriate prescribing.

1.4

STOPP/START criteria

Because of their practical relevance, the criteria were updated in 2015 [4, 7]. However, the first version is still being used today (e.g. in the 2012 Dutch multidisciplinary guideline "Polypharmacy in the elderly" [26, 27]).

1.5

Prediction models

As mentioned, inappropriate prescribing can have adverse effects on elderly patients. It is therefore desirable to investigate whether inappropriate prescribing indeed affects clinical adverse outcomes in elderly patients. Prediction models and other statistical techniques, such as regression analysis, can be used to study relations between an outcome and predictive variables [28].

Supervised machine learning can be defined as: “a computerized mechanism that is used to acquire knowledge about a task” [29]. In supervised machine learning, the goal is to train a computer on a set of


Category Number of rules

Cardiovascular system 17

Central nervous system 13

Gastro-intestinal system 5

Musculoskeletal system 8

Respiratory system 3

Urogenital system 6

Endocrine system 4

Drugs that adversely affect falling 5

Analgesics 3

Duplicate drug classes 1

Table 1.1: Categories and number of rules of the STOPP criteria [25]

Category Number of rules
Cardiovascular system 8
Respiratory system 3
Central nervous system 2
Gastro-intestinal system 2
Musculoskeletal system 3
Endocrine system 4

Table 1.2: Categories and number of rules of the START criteria [25]

labeled objects to recognize or predict certain data values, patterns or events by using parametric or non-parametric models [28, 30, 31]. In parametric models, such as linear and logistic regression, the form of the relationship between the outcome and the predictors is pre-specified; training these models amounts to optimizing the parameters in this pre-specified relationship [28, 32]. Non-parametric models, such as classification trees or neural networks, do not require previous knowledge about the specific relationship between predictors and outcomes. These kinds of models can express richer relationships but require more complex training strategies than parametric models [32].
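To make the distinction concrete, here is a toy sketch (in Python for illustration; the thesis work itself used R) contrasting a parametric model, whose functional form is fixed in advance, with a non-parametric one that keeps the training data instead:

```python
# Parametric: simple linear regression y = a + b*x, fitted by closed-form least squares.
# Training only tunes the two parameters a and b of a pre-specified relationship.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

# Non-parametric: 1-nearest-neighbour prediction; no functional form is assumed,
# the "model" is essentially the training data itself.
def fit_1nn(xs, ys):
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]  # exactly y = 2x
linear, nn = fit_linear(xs, ys), fit_1nn(xs, ys)
print(linear(5))  # 10.0: extrapolates along the fitted line
print(nn(5))      # 8: returns the label of the nearest training point
```

The contrast at x = 5 shows the trade-off: the parametric model extrapolates through its fixed form, while the nearest-neighbour model can only echo observed data.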

1.6

Clinical Scientific Institute Maugeri

The Clinical Scientific Institute Maugeri (ICSM) is an accredited research hospital with different locations throughout Italy. The first location was established in Pavia [33].

The main focus of ICSM is occupational and rehabilitative medicine. Occupational medicine is aimed at diagnosing and treating occupational diseases and preventing risks; rehabilitative medicine provides medical rehabilitation for people who suffer from cardiovascular, respiratory or neurological illness, or from acute and chronic invalidity after disease or cancer [33].

The location in Pavia hosts students from the Medical School of University of Pavia and is a provider of continuing medical education through the Centro Studi Maugeri. Their mission is to improve healthcare and quality of life for citizens [33].

1.7

Previous research

The focus of the ICSM is patients' rehabilitation; consequently, it is focused more on patients with chronic diseases than on patients with acute illness. Chronic diseases and other morbidities are more prevalent in elderly patients [1–3]. This makes inappropriate prescribing in elderly patients relevant for ICSM and, therefore, makes the STOPP/START criteria interesting to use. Hence, the long-term vision is to be able


to implement a tool that uses these criteria to support the clinicians in their prescribing and thereby improve clinical outcomes.

To realize this, research was done to automatically detect inappropriate drug administrations in elderly patients. This research used the first version of the STOPP criteria and included an analysis of possible associations between inappropriate drug prescriptions and clinical outcomes, including in-hospital falls. The result of this research was a tool that allowed retrospective detection of prescriptive appropriateness according to the STOPP criteria. In total, 55 out of the 65 STOPP criteria were implemented in this tool.

1.8

Objectives

In previous research, the foundation for a method to identify inappropriate prescriptions related to the STOPP criteria was developed and applied to ICSM data. It is desired to identify all the inappropriate prescriptions (including prescribing omissions) in a reproducible way and to automatically detect drug inappropriateness. This knowledge can be used to improve clinical outcomes and patient management at ICSM.

The first aim of this research is to improve previously developed methods and apply them to an updated clinical data set to identify inappropriate prescribing in elderly patients. The second aim is to analyze whether violations of these criteria are related to adverse outcomes.

The research questions selected for this project are formulated as follows:

• What is the frequency of PIMs and PPOs in ICSM Pavia?

• Does violating the STOPP/START criteria have an impact on the occurrence of adverse outcomes?

1.9

Organization of chapters

Chapter 2 of this document provides in-depth information about the methods used for this research. Chapter 3 presents the results for the STOPP criteria and Chapter 4 the results for the START criteria. Chapter 5 concludes this thesis with a discussion of this work and a conclusion.


2

Methods

2.1

Extraction of the data

The clinical database was created by gathering information from the Health Information System of the ICSM located in Pavia, querying the system for hospitalizations with diagnoses and drug administrations that are mentioned in the STOPP/START criteria.

The pharmacy service at the ICSM provided the Anatomical Therapeutic Chemical (ATC) codes [34] from the international classification system that corresponded to the drugs in the STOPP/START criteria. In parallel, a team of clinicians was asked to link the pathologies present in the STOPP/START rules to the corresponding ICD-9 classifications [35], so that information could be gathered from the Health Information System for all the departments in the hospital.

The information that was extracted to check for inappropriate prescribing differed for the STOPP and START criteria, because the corresponding ICD-9 and ATC codes differ and the rules are constructed in a different way. For the STOPP criteria, the data related to the drug administrations mentioned in the criteria was extracted first. Next, the hospitalizations with patient information (including diagnoses) connected to these administrations were extracted. In this way, it could be checked for every related medication whether a conflicting diagnosis was present.

The extraction of the START data was different. First, the hospitalizations with the diagnoses mentioned in the START criteria were extracted. Afterwards, the administrations connected to those hospitalizations were selected to check which medication was administered to the patient.

For both the STOPP and the START criteria, the extracted information consisted of a set of hospitalizations including patient information and a set of therapies with specific drug administrations. The patient information consisted of sex, date of birth and given diagnoses, while the hospitalization data contained clinical procedures, duration of the hospitalization, information about the department where the patient was hospitalized, discharge mode (to home or another institution) and Diagnosis Related Group (DRG).

The DRG code is a classification system that standardizes the anticipated payment to hospitals with respect to the treatments and length-of-stay of a patient, based on the given diagnosis codes [36]. This is an important measure in Italian healthcare because it is connected to the payment that a hospital will receive [37].

Information about falls during hospitalization was available from another database and had to be retrieved from a hand-curated registry collecting information about hospitalizations during the years 2009-2018.


2.2

Preprocessing of the clinical databases

Because the intents of the STOPP and START criteria differ, the preprocessing steps differ slightly from each other. The differences are described in the upcoming paragraphs. A visual representation of the preprocessing steps can be found in Figure 2.1.

For both the STOPP and START data sets, only patients above 65 years (the elderly patients) were kept and duplicate rows were deleted. The drug administrations that were registered before or after the hospitalization and the hospitalizations without a discharge date were deleted as well, because it is not possible to administer drugs to people who are not present in the hospital; these administrations were therefore considered untrustworthy.

The open source program R was used for the preprocessing of the data sets. The R packages used in this part are dplyr [38], Hmisc [39] and plyr [40].
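The shared filtering steps can be sketched as follows (a Python illustration with hypothetical field names such as `hosp_id`, `age` and `given`; the actual implementation used R with the packages above):

```python
from datetime import date

# Hypothetical minimal records; the real data sets contain many more fields.
admins = [
    {"hosp_id": 1, "age": 70, "drug": "A01", "given": date(2018, 3, 5)},
    {"hosp_id": 1, "age": 70, "drug": "A01", "given": date(2018, 3, 5)},  # duplicate row
    {"hosp_id": 1, "age": 70, "drug": "B02", "given": date(2018, 6, 1)},  # outside the stay
    {"hosp_id": 2, "age": 60, "drug": "C03", "given": date(2018, 3, 6)},  # not elderly
]
stays = {1: (date(2018, 3, 1), date(2018, 3, 10)), 2: (date(2018, 3, 1), date(2018, 3, 10))}

def preprocess(rows, stays):
    seen, kept = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        start, end = stays[r["hosp_id"]]
        # keep only elderly patients, unique rows, and administrations during the stay
        if r["age"] > 65 and key not in seen and start <= r["given"] <= end:
            seen.add(key)
            kept.append(r)
    return kept

print(len(preprocess(admins, stays)))  # 1
```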

[Figure: two parallel preprocessing pipelines]
Generated STOPP data set: make subset for patients over 65 years → delete duplicate rows → delete the drug administrations outside of hospitalization → calculate amount & duration of drug administration → cleaned STOPP data sets
Generated START data set: make subset for patients over 65 years → delete duplicate rows → delete the drug administrations outside of hospitalization → check for presence of the hospitalization identifier in both data sets → cleaned START data sets

Figure 2.1: Visualization of the preprocessing steps of the extracted data sets for both the STOPP and START data set

2.2.1

Preprocessing the STOPP database

Some of the STOPP rules require evaluation of the amount of drug taken, or of the number of consecutive days or the duration that a drug was taken. For this reason, for every drug administered, the amount and the number of consecutive days of administration were calculated and stored in a separate data set, which was used for detecting inappropriate medications.
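Counting consecutive administration days can be sketched like this (Python; a simplified stand-in for the R code actually used):

```python
from datetime import date, timedelta

def consecutive_days(dates):
    """Longest run of consecutive administration days for one drug."""
    ds = sorted(set(dates))
    if not ds:
        return 0
    best = run = 1
    for prev, cur in zip(ds, ds[1:]):
        # extend the run on a one-day step, otherwise start a new run
        run = run + 1 if cur - prev == timedelta(days=1) else 1
        best = max(best, run)
    return best

given = [date(2018, 3, d) for d in (1, 2, 3, 5, 6)]  # gap on March 4
print(consecutive_days(given))  # 3 (March 1-3)
```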

2.2.2

Preprocessing the START database

Even though it is known that every hospitalized patient will receive some kind of medication, the registration of administered medication is not always computerized. Therefore, for the START databases, it was


necessary to check the presence of a medication record for every hospitalization. This was accomplished by deleting the hospitalizations and administrations for which the hospitalization identifier was not present in both data sets.
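That consistency check amounts to keeping only identifiers present in both sets, as in the Python sketch below (the field name `hosp_id` is a hypothetical placeholder; the actual work used R):

```python
def keep_shared(hospitalizations, administrations):
    """Drop records whose hospitalization identifier is not present in both sets."""
    shared = ({h["hosp_id"] for h in hospitalizations}
              & {a["hosp_id"] for a in administrations})
    return ([h for h in hospitalizations if h["hosp_id"] in shared],
            [a for a in administrations if a["hosp_id"] in shared])

hosp = [{"hosp_id": 1}, {"hosp_id": 2}]  # stay 2 has no recorded administrations
adm = [{"hosp_id": 1}, {"hosp_id": 3}]   # administration 3 has no stay record
h, a = keep_shared(hosp, adm)
print(len(h), len(a))  # 1 1
```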

2.3

Detection of inappropriate medications

The practitioners at ICSM use the first version of the STOPP/START criteria as a guideline in daily practice; therefore, this version was used. The exact STOPP/START criteria used are described in detail in the paper by Nobili [25].

The mapping of the STOPP/START criteria and the R code available from earlier research were used as a basis to detect inappropriate medications. This code was improved and expanded to develop the detection method.

2.3.1

STOPP criteria detection method

The construction of a STOPP criterion is as follows: patients with a certain condition should not be using a certain kind of medication. An example of a STOPP rule is: “Glibenclamide or chlorpropamide in type II diabetes mellitus: risk of prolonged hypoglycaemia” [25].

A mapping of the STOPP criteria was used to evaluate the patient conditions. This mapping determines which conditions need to be checked for each rule. The detailed mapping can be found in Appendix A.

The mapping indicates whether a STOPP criterion is a chronic or a non-chronic rule. For a chronic rule, all hospitalizations of a patient (after the diagnosis has been given) need to be checked for compliance; for a non-chronic rule, only the related hospitalization needs to be checked. Second, all the STOPP criteria were divided into five categories: Simple rules, Time rules, Dose, Association and Non-condition.

The Simple rules are rules where a prescription is inappropriate if the mentioned diagnosis and drug therapy are both present. In the Dose rules, the administration is appropriate when the maximum amount of drug is respected. In the Time rules, the prescription is appropriate if the drug administration stays under a certain time threshold. In the case of an Association rule, there are two options: the drug administration is appropriate if the drug is taken in combination with other specific drugs, or the drug administration is appropriate if the drug is taken in the absence of certain drugs. The Non-condition rules are the rules without a condition set, for example: drugs that affect falling.

Each STOPP criterion was evaluated to verify the adequacy of using certain medication given the patient's conditions. For every rule, each mentioned administered drug was mapped into a therapy set and the related hospitalization information into a diagnosis set. The drugs and diagnoses were identified by using the ICD-9 codes for the diagnoses and the ATC codes for the therapies. With this information it was possible to detect each time that a rule was applicable to a hospitalization and to label it. The rules applicable to the hospitalizations were labeled "compliant", "potentially not compliant" or "uncertain".

The "uncertain" label was created because certain rules require checking information which is not available or uncertain. When a hospitalization receives this label, it should be evaluated manually for compliance or non-compliance.

The open source program R was used for detecting the STOPP criteria. The R packages used in this part are dplyr [38], Hmisc [39] and plyr [40].
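As an illustration, a Simple STOPP rule check might look like the following Python sketch (the thesis implementation was in R; the rule encoding, the field names and the `needs_manual_info` flag are hypothetical, and the codes shown merely illustrate the glibenclamide example above):

```python
def check_simple_stopp(rule, hosp):
    """Label one hospitalization against one Simple STOPP rule."""
    # the rule is applicable only if one of its drugs was administered
    if not any(a.startswith(tuple(rule["atc"])) for a in hosp["atc_codes"]):
        return "not applicable"
    # some rules need information that is missing from the data set
    if rule.get("needs_manual_info"):
        return "uncertain"
    # a conflicting diagnosis makes the administration potentially inappropriate
    conflict = any(i.startswith(tuple(rule["icd9"])) for i in hosp["icd9_codes"])
    return "potentially not compliant" if conflict else "compliant"

# Hypothetical encoding of "glibenclamide in type II diabetes mellitus"
rule = {"atc": ["A10BB01"], "icd9": ["250"]}
print(check_simple_stopp(rule, {"atc_codes": ["A10BB01"], "icd9_codes": ["250.00"]}))
print(check_simple_stopp(rule, {"atc_codes": ["A10BB01"], "icd9_codes": ["401.9"]}))
```

Real code matching would have to respect the ICD-9 and ATC hierarchies rather than the simple prefix test used here.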

2.3.2

START criteria detection method

The construction of a START criterion is as follows: patients with a certain condition should start a certain kind of medication. An example of a START rule is: "Warfarin in the presence of chronic atrial fibrillation". Some of the START criteria require checking specific information beyond the medication list. To distinguish these rules, a mapping of the START criteria has been made. The detailed mapping of the START


criteria can be found in Appendix A.

To make this subdivision, similarly to the STOPP criteria detection method, it was determined for every START criterion whether it was a chronic or a non-chronic rule. Second, all START criteria were subdivided into four categories: Simple rules, Time rules, Complete association and Exception.

The Simple rules comprise the majority of the START criteria and are the most straightforward, involving just the presence of the condition: the prescription is appropriate when the diagnosis and therapy are both present. The Time rules add an extra condition for appropriateness: the patient has to start a specific drug after a specific amount of time, for example: "antidepressant drugs in the presence of moderate-severe depressive symptoms lasting at least 3 months". A Complete association involves two types of drugs that need to be taken at the same time. The Exception rules involve another kind of exception that needs to be checked.

Each START criterion was evaluated to verify whether the patient's conditions made it inadequate to omit certain drugs. For every rule, each corresponding diagnosis was mapped into a diagnosis set by using the ICD-9 codes for identification. Second, the ATC codes were used to identify all the correctly administered drugs, which were mapped into a therapy set. As with the STOPP criteria, all detected cases where a criterion was found applicable were labeled "compliant", "potentially not compliant" or "uncertain".
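A Simple START check mirrors the STOPP check with the roles reversed: the diagnosis triggers the rule and the absence of the indicated drug is flagged. A hedged Python sketch (the real code was in R; the field names and rule encoding are hypothetical, illustrating the warfarin example above):

```python
def check_simple_start(rule, hosp):
    """Label one hospitalization against one Simple START rule."""
    # the rule is applicable only if the indicating diagnosis is present
    if not any(i.startswith(tuple(rule["icd9"])) for i in hosp["icd9_codes"]):
        return "not applicable"
    # compliant if the indicated drug was actually started
    started = any(a.startswith(tuple(rule["atc"])) for a in hosp["atc_codes"])
    return "compliant" if started else "potentially not compliant"

# Illustrative encoding of "warfarin in the presence of chronic atrial fibrillation"
rule = {"icd9": ["427.31"], "atc": ["B01AA03"]}
print(check_simple_start(rule, {"icd9_codes": ["427.31"], "atc_codes": []}))
print(check_simple_start(rule, {"icd9_codes": ["427.31"], "atc_codes": ["B01AA03"]}))
```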

2.4

Analysis of the adverse outcomes

After the evaluation, the data was transformed into two data frames (one for the STOPP criteria and one for the START criteria) containing all the applicable hospitalizations, the number of compliant and potentially not compliant rules and information related to the adverse outcomes. In these sets, each row was a hospitalization. The adverse outcomes selected for evaluation are the hospitalization duration and the hospitalization duration exceeding the expected length-of-stay. The latter outcome is based on the expected number of hospitalization days according to the DRG code; this threshold indicates whether a hospitalization exceeded the expected number of days. Both outcomes were used as indicators of adverse outcomes, as a longer hospital stay is associated with a worse patient state.

The following section explains the methods used as well as some theory behind these methods. The open source program R has been used for the analysis of the adverse outcomes. The packages used are: lme4 [41], Metrics [42], nlme [43], PRROC [44], Caret [45], MLmetrics [46] and icd [47].

2.4.1

Model selection & variable selection

To decide which model best predicted the adverse outcomes, a function in R was created to test different candidate models on the training set using 10-fold cross-validation. Selecting the correct model for prediction is important, because a wrongly chosen model or set of variables can lead to misleading conclusions or disappointing predictions [30, 31, 48].

2.4.2

Cross-validation

Training and evaluating a model on the same data can lead to overoptimistic results. Cross-validation originated from the idea that only a limited amount of data is available for evaluation [31]. Even though cross-validation is widely used because of its simplicity and universality, its downside is that it is computationally expensive [31, 49].

In cross-validation, the data is split randomly into n folds that are re-used multiple times [30, 31, 49]. After creating the folds, n − 1 folds are used to train the model, and the remaining fold is used to validate the trained model. This is repeated until every fold has been used for validation once. The model with the best performance is selected, trained on the entire set and, if possible, tested on a separate test set [30, 49].


In this research, the set was first divided into a 70% training set and a 30% test set. Ten-fold cross-validation was performed on the training set, and the model with the best performance was applied to the test set.
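The split-train-validate loop described above can be sketched in a few lines. The actual analysis was carried out in R; this is a minimal Python illustration in which the "model" (a training-set mean scored by mean squared error) is a toy stand-in for the regression models used in the thesis.

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, score, xs, ys, k=10):
    """Average validation score over k train/validate splits."""
    folds = kfold_indices(len(xs), k)
    scores = []
    for i, val in enumerate(folds):
        # Train on the other k-1 folds, validate on the held-out fold
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit([xs[j] for j in train], [ys[j] for j in train])
        scores.append(score(model, [xs[j] for j in val], [ys[j] for j in val]))
    return sum(scores) / k

# Toy demonstration: a "model" that predicts the training mean, scored by MSE
fit_mean = lambda X, Y: sum(Y) / len(Y)
mse = lambda m, X, Y: sum((y - m) ** 2 for y in Y) / len(Y)
xs = list(range(20))
ys = [5.0] * 20                                # constant outcome: mean is perfect
print(cross_validate(fit_mean, mse, xs, ys))   # 0.0
```

In the thesis this scheme was applied only to the 70% training set; the final, best-performing model was then refitted on the full training set before being evaluated on the 30% test set.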

2.4.3

Variable selection

The goal of variable selection is to identify the most informative set of variables which have impact on the prediction and to create the most optimal performing model [48, 50–52]. Putting too many variables in a model can cause overfitting [51, 53]. At the same time, not all variables have an equal impact on the outcome, which makes it possible to neglect some of them [54]. In theory, all possible sets have to be tested in order to decide on the best performing set of variables [51].

Classical methods for variable selection use p-values or an information criterion to compare performance between candidate variable sets [55]. An information criterion that can be used for regression, both linear and logistic, is the AIC (Akaike Information Criterion). The AIC estimates the out-of-sample prediction loss through the (penalized) in-sample prediction loss [50]. The model that leads to the smallest AIC is considered the best performing [56].
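A worked example may help. For a Gaussian least-squares fit, the AIC can be written (up to an additive constant) as n·ln(RSS/n) + 2k, where RSS is the residual sum of squares and k the number of estimated parameters. The sketch below, with made-up toy data, compares an intercept-only model against an intercept-plus-slope model; it is an illustration of the criterion, not the thesis analysis.

```python
import math

def aic_ls(rss, n, k):
    """AIC of a Gaussian least-squares fit, up to an additive constant:
    n*ln(RSS/n) + 2k, where k is the number of estimated parameters."""
    return n * math.log(rss / n) + 2 * k

# Toy data with a clear linear trend plus small alternating "noise"
xs = list(range(10))
ys = [2 * x + 1 + 0.5 * (-1) ** x for x in xs]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Model 1: intercept only (prediction = mean of y)
rss1 = sum((y - mean_y) ** 2 for y in ys)

# Model 2: intercept + slope, closed-form least squares
beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs))
alpha = mean_y - beta * mean_x
rss2 = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))

aic1 = aic_ls(rss1, n, k=1)
aic2 = aic_ls(rss2, n, k=2)
print(aic2 < aic1)  # True: the extra slope parameter is worth its 2k penalty
```

Note that the penalty term 2k is what prevents the criterion from always preferring the larger model: a variable only "pays for itself" if it reduces the residual sum of squares enough.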

2.4.4

Regression models

The models tested are regression models with and without mixed effects, with different combinations of predictors as fixed effects and variables. The objective is to identify the most informative model in terms of fixed and random effects. Mixed effect models are described in Section 2.4.4.3.

For the continuous outcome, Linear Mixed-Effect Models (LMM) and Linear Models (LM) are fitted, while for the binary outcomes, Generalized Linear Mixed Effect models (GLMM) and Generalized Linear Models (GLM) are used. The first two are linear models while the latter two are logistic models. After the best model was chosen, the model was fitted on the entire training set and validated on the test set.

Regression analysis is a parametric technique used for modelling and analyzing the relationship between an outcome variable y (for example: cardiac event) and one or more explanatory variables x1, x2, ..., xk (for example: age, weight, length, etc.) in a data set [57–59]. Its goal is to find a function that describes the relationship between the outcome variable and the explanatory variables as closely as possible [57]. Regression analysis allows to:

• Describe the relationship between the outcome variable and explanatory variables, as well as its statistical significance

• Estimate the effects of the explanatory variables

• Determine an individual prognosis based on the identified risk factors [60]

The first task of regression analysis is to estimate the analytic relationship between the outcome variable and explanatory variables and to form an equation based on the information in the data set [48, 59]. When using this technique, it is assumed that a relationship between the outcome and explanatory variables actually exists [61].

The equation that is formulated can be used to predict the outcome y on a new data set with the same variables [28, 48, 57]. The interpretation and statistical evaluation helps to find the variables that are truly important to explain the outcome variable [60].

An assumption for using this technique is that the explanatory variables are not redundant. For example, BMI (Body Mass Index) and weight should not be included in the same model because they are highly correlated [62].


2.4.4.1 Linear regression

Linear regression is a modelling technique which can be used to predict a continuous outcome using a linear function of the given explanatory variables [53, 57]. Estimating the relationship between the outcome variable and explanatory variables is equivalent to estimating the linear function that best fits the given information [59]. In linear regression, the outcome variable is assumed to be continuous, but the explanatory variables can be categorical, binary or continuous [60, 61].

When using this technique, it has to be assumed that every data point is equally important or “information-rich” and therefore contributes equally to the regression line [61]. Another assumption is that the data does not contain too many outliers, as outliers can have a very strong impact on the overall accuracy [62]. Additionally, it has to be assumed that the data points are uncorrelated and cannot influence each other [61, 62].

A formula for linear regression describes a straight line and is expressed in Equation 2.1 [61]. In this formula, y is the outcome variable, α is the intercept, the xi are the explanatory variables, βi is the regression coefficient of variable xi and ε expresses the error [28, 48, 59–61]. The error is assumed to have a mean of zero and is caused by factors that are not observable [59, 61].

y = α + β1x1 + β2x2 + ... + βkxk + ε (2.1)

The regression coefficient βi of a variable indicates its contribution to y. Correct interpretation of these coefficients for continuous variables requires paying attention to the unit of measurement [53, 60]. An example could be a model where the outcome variable is weight and the predictor variable is height: the height in centimeters has an impact on the weight in kilograms. The interpretation of categorical or binary variables depends on the encoding of those variables [60]. The intercept of the model indicates the value of y when all explanatory variables xi are equal to zero; when zero is not a meaningful value for the explanatory variables, the intercept does not have an inherent meaning [63]. It is important to note that the explanatory variables xi are assumed to be fixed, without any random component (including measurement error) [61]. It is also important to assess the residuals of the model (the differences between the fitted line and the observed points) because they show the error of the model. The residuals are desired to be normally distributed, as this indicates the real-world randomness that is difficult to capture in a statistical model [61].

2.4.4.2 Logistic regression

Imagine a regression model which tries to predict whether someone will fall or not. Because this outcome variable is binary (there are only two possibilities, falling or not falling), instead of a linear function the model has to use the natural logarithm of the odds (logit) to model the probability of the event [53, 58, 60, 64]. In other words, logistic regression has largely the same features and assumptions as linear regression, but applied on the logit scale [62]. The reason for using the odds instead of probabilities is that probabilities are restricted to numbers between 0 and 1, while the odds range from 0 to infinity [53, 65]. For continuous explanatory variables (for example age), it is assumed that they have a linear relationship with the logit-transformed outcome [62].

The odds is the ratio between an event and a non-event (the probability of someone falling against the probability of someone not falling). The formula for the odds is given in Equation 2.2, where P represents the probability of the event happening [65].

Odds = P / (1 − P) (2.2)

A formula for logistic regression is expressed in Equation 2.3, where Pr represents the probability of outcome y, α represents the intercept, the xi represent the explanatory variables and βi expresses the regression coefficient of xi [53, 64].

ln( Pr(y = 1) / (1 − Pr(y = 1)) ) = α + β1x1 + β2x2 + ... + βkxk (2.3)

Interpretation of the regression coefficients of a logistic model is not as straightforward as for linear regression, because the model is expressed in terms of odds ratios [53, 66]. The odds ratio is the ratio between two odds (an example could be: the odds of falling against the odds of not falling) and equals the exponential function of the regression coefficient, in this case eβi [62, 65, 66]. A descriptive formula can be found in Equation 2.4.

Odds Ratio = Odds(falling) / Odds(not falling) (2.4)

When there is more than one explanatory variable in a model, the odds ratios are “adjusted”. This means that the unique contribution of each variable to the outcome is described after adjusting for (or subtracting) the effects of the other variables in the model [62].

When interpreting odds ratios, it is important to keep in mind that an odds ratio bigger than 1 increases the likelihood of the event, while an odds ratio of less than 1 decreases the likelihood of the event [65]. Interpreting odds ratios for continuous explanatory variables requires extra attention, because a meaningful unit of measurement has to be identified to express the degree of change [62]. For example, the odds ratio for age derived from a logistic model only displays the change per year, which is not the most meaningful scale; a scale of, for example, 10 years could be more meaningful when estimating risks.
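The rescaling works because the coefficient lives on the log-odds scale: a c-unit change multiplies the odds by exp(c·β). The sketch below illustrates this with a made-up age coefficient of 0.05 (purely for illustration, not an estimate from the thesis data).

```python
import math

# Equation 2.2: odds from a probability
p = 0.2
odds = p / (1 - p)              # 0.25

# Hypothetical age coefficient from a logistic model (log-odds per year);
# the value 0.05 is invented for this example.
beta_age = 0.05
or_per_year = math.exp(beta_age)          # odds ratio per 1-year increase
or_per_decade = math.exp(10 * beta_age)   # same effect on a 10-year scale

print(round(odds, 2))            # 0.25
print(round(or_per_year, 3))     # 1.051 -> hard to read per single year
print(round(or_per_decade, 3))   # 1.649 -> clearer on a 10-year scale
```

Note that the 10-year odds ratio is the per-year odds ratio raised to the tenth power, not ten times it.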

2.4.4.3 Mixed effect models

Most analysis techniques assume that the data points are independent from each other [61, 62]. However, if there is more than one measurement of a patient these measurements are not independent from each other [56]. An example would be taking six blood pressure measurements of the same patient within half an hour. These measurements would be correlated.

Analyzing non-independent data, such as the data in this research, is challenging. Mixed effect models are an extension of regular regression models that deal with non-independence [52, 67, 68]. In these models, random effects take the non-independent structure of the data into account [56]: a random variable is used to represent the unit on which correlated measurements are taken [56]. The term “random” refers to the randomness in the probability model for the group-level coefficients [53]. The random variable itself is not of particular interest for the outcome, but the between-individual and within-individual variability needs to be taken into account [56]. In the example given, this random variable would be represented by the patient. The fixed effects, which are the variables of interest, explain the overall relation [67].

There are no set rules for identifying random effects, because this is strongly affected by the goal of the research [52, 53]. Ideally, a random effect needs to have at least five different groups; with fewer group levels, the model might not be able to estimate the between-individual variance properly [52].

There are different reasons to use mixed effect models:

• Modeling variation among individual-level regression coefficients. Multilevel modeling is convenient when it is desired to model the variation of the coefficients across groups, make predictions for new groups, or account for group-level variation in the uncertainty of individual-level coefficients.

• Estimating regression coefficients for particular groups. An example could be: estimating the number of falls in different hospitals. When a multilevel model is used, it is possible to estimate these numbers even for hospitals with little information available.

• Accounting for individual- and group-level variation in estimating group-level regression coefficients. An example could be: patients that have stayed at different departments in the hospital [53].

The reason for using mixed effect models in this study was to take the dependence within departments and patients into account. The data spans multiple years, which means some patients were hospitalized multiple times. The data of a patient with multiple hospitalizations is not independent and will probably show more similarities than data from another patient. Likewise, every hospital department treats a specific kind of patients, and the characteristics among departments vary: patients from the same department may have similar characteristics compared to patients from different departments. Therefore, it was desirable to incorporate these variables as random effects.

A formula for a mixed effect model is displayed in Equation 2.5, where y represents the outcome variable, X represents the variables for the fixed effects and Z represents the random effects. β and b represent the unknown vectors of the fixed effects and random effects, and ε represents the error [67]. With our data in mind, i in this formula represents the patient and j = 1, 2, 3, ..., n represents the department.

yij = Xij β + Zij bj + εij (2.5)

For mixed models, the fixed effect coefficients are the same for all the groups, but the random effect coefficients vary per group. The random effects can be seen as random variables and are therefore more comparable to the error term than to a regression coefficient [69]. Because the random effects are only used to control for the dependence in the data, they do not need to be interpreted.
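The intuition behind the random intercept (measurements within one department share a department-level component and are therefore correlated) can be made concrete with a small simulation. The sketch below mimics the structure of Equation 2.5 in Python with invented variance parameters and estimates the intraclass correlation; it illustrates the concept only and is not the model fitted in this thesis.

```python
import random

rng = random.Random(42)

# Simulate the grouped structure of Equation 2.5: each department j gets its
# own random intercept b_j, shared by all its hospitalizations.
sd_between, sd_within = 2.0, 1.0       # invented standard deviations
n_departments, n_per_department = 40, 25

residuals = []                         # (department, b_j + e_ij) pairs
for j in range(n_departments):
    b_j = rng.gauss(0, sd_between)     # department-level random intercept
    for _ in range(n_per_department):
        residuals.append((j, b_j + rng.gauss(0, sd_within)))

values = [r for _, r in residuals]
grand_mean = sum(values) / len(values)
total_var = sum((v - grand_mean) ** 2 for v in values) / len(values)

dept_means = []
for j in range(n_departments):
    group = [r for d, r in residuals if d == j]
    dept_means.append(sum(group) / len(group))
between_var = sum((m - grand_mean) ** 2 for m in dept_means) / n_departments

# Intraclass correlation: fraction of variance sitting between departments.
# With sd_between = 2 and sd_within = 1 the true value is 4/5 = 0.8.
icc = between_var / total_var
print(icc > 0.3)  # True: within-department values are far from independent
```

A plain regression that ignores this grouping treats all 1000 rows as independent and understates the uncertainty; the random intercept absorbs the shared department component instead.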

2.4.5

Performance measures

When a model makes predictions, it is necessary to know how good these predictions are compared to the true outcomes. This section will describe various metrics to measure the performance of a model.

2.4.5.1 Performance measures for continuous outcomes

The Root Mean Square Error (RMSE) is a measure used to estimate the error between the values predicted by a model and the actual values [70]. For every data point, the difference between the observed and predicted value is squared; the sum of these values is divided by the number of observations, and the square root of the result is the RMSE [71]. The equation for the RMSE can be found in Equation 2.6, where di represents the predicted value for subject i, fi represents the observed value and n represents the number of observations.

RMSE = √( (1/n) Σ i=1..n (di − fi)² ) (2.6)

The RMSE is considered to be very important because it estimates the (in)accuracy of a model [72]. The formal definition is: “standardized variables as the square root of the mean of the squared differences between corresponding elements of the forecasts and observations” [70]. Ideally, a good performing model has a low error and therefore a low RMSE.

Regression models are mostly fitted by a method minimizing the sum of squared errors between the fitted and observed values [73]. To determine the correlation between the actual and predicted values, the “cor” function in R was used [74]; this function computes the Pearson correlation, whose square gives R². R² is a measure that ranges from 0 (no fit) to 1 (perfect fit) and uses the residual sums of squares to measure the goodness-of-fit of a model [57, 73]. The formula for R² can be found in Equation 2.7, where SSy represents the variation of the yi around their mean ȳ and SSE represents the residual variation of the yi [75].

R² = (SSy − SSE) / SSy (2.7)
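Equations 2.6 and 2.7 translate directly into code. The sketch below is an illustration in Python (the thesis used R) with invented observed and predicted values.

```python
import math

def rmse(observed, predicted):
    """Root Mean Square Error, as in Equation 2.6."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """R^2 = (SSy - SSE) / SSy, as in Equation 2.7."""
    mean_o = sum(observed) / len(observed)
    ss_y = sum((o - mean_o) ** 2 for o in observed)        # total variation
    ss_e = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # residual
    return (ss_y - ss_e) / ss_y

obs = [3.0, 5.0, 7.0, 9.0]
pred = [2.5, 5.5, 6.5, 9.5]          # every prediction off by 0.5
print(rmse(obs, pred))               # 0.5
print(r_squared(obs, pred))          # 0.95
```

With every prediction off by 0.5, the RMSE is exactly 0.5; the residual variation (SSE = 1) is small compared to the total variation (SSy = 20), giving R² = 0.95.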

2.4.5.2 Performance measures for binary outcomes

Because the predictions of classification models are probabilistic, interpreting the outcomes is not as straightforward as for continuous outcomes. For example, how do we compare a predicted probability of 0.55 for an event for an individual to the outcome itself of 0 or 1 [53]? Predictions from classification methods represent the probability that a specific example in the test set belongs to a specific class (e.g. affected). A probability threshold is applied to divide the predictions into two categories [76].

For the models with a binary outcome, Equation 2.8 was used to find an appropriate value for the probability threshold. In this equation, R expresses that a false negative finding has an R times higher penalty than a false positive. This method makes it possible to estimate the probability threshold without using training data [77].

Probability threshold = 1 − (R/(1 + R)) (2.8)

After discretizing the output, predictions can be divided into four categories [76, 78]:

• The True positives (TP) are the number of positive values correctly predicted as positive.
• The True negatives (TN) are the number of negative values correctly predicted as negative.
• The False negatives (FN) are the number of positive values incorrectly predicted as negative.
• The False positives (FP) are the number of negative values incorrectly predicted as positive.

These categories can be derived from a confusion matrix, as shown in figure 2.2. A confusion matrix is a representation of the predicted values of a model and actual values of a set and shows all the information about the performance of the predictor [78, 79].

For logistic regression, a confusion matrix is only valid for one specific probability threshold [76]. The matrix forms the basis for the calculation of common performance metrics such as sensitivity, specificity, balanced accuracy and positive predictive value (PPV) [76, 79].

                        Actual positive    Actual negative    Total
Predicted positive      True positive      False positive     P
Predicted negative      False negative     True negative      N
Total                   P′                 N′

Figure 2.2: Confusion matrix. P′ & N′ represent the total number of positive and negative cases in the data set. P & N represent the total predicted number of positive and negative cases resulting from the model.

Sensitivity is obtained by dividing the correctly classified positives by the total number of actual positives [76, 79, 80]. Specificity, on the other hand, is obtained by dividing the correctly classified negatives by the total number of actual negatives [79, 80]. Naturally, for both performance measures, the value is desired to be as high as possible. The formulas for sensitivity and specificity are given in Equation 2.9 and Equation 2.10.

Sensitivity = TP / (TP + FN) (2.9)

Specificity = TN / (TN + FP) (2.10)

The precision of the model can be described with the positive predictive value (PPV). The PPV shows how many of the values predicted as positive indeed had the event [76, 81]. It is desirable that the PPV is as high as possible. The formula for the PPV is given in Equation 2.11.

PPV = TP / (TP + FP) (2.11)

The average accuracy can be seen as the number of correctly predicted values divided by the total number of values [76]. When a set is not balanced in terms of the outcome distribution, the predictor might have a preference for the more frequent class; in this case, the average accuracy is not a trustworthy measure. The balanced accuracy (BA) can be described as “the average accuracy obtained by either class” and gives a clearer idea of the performance of the model [82]. The balanced accuracy of a model should be as high as possible. The formula for balanced accuracy is given in Equation 2.12, where P represents the total number of actual positives and N represents the total number of actual negatives.

BA = (1/2) (TP/P + TN/N) (2.12)
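The metrics above, and the threshold of Equation 2.8, can be computed directly from the four confusion-matrix cells. The sketch below uses invented counts for an unbalanced example (10 events among 100 hospitalizations) to show why the balanced accuracy is more informative than the plain accuracy.

```python
def binary_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, PPV and balanced accuracy from the four
    confusion-matrix cells (Equations 2.9-2.12)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    return sensitivity, specificity, ppv, (sensitivity + specificity) / 2

# Equation 2.8: a false negative 4 times as costly as a false positive
R = 4
threshold = 1 - (R / (1 + R))   # = 1/(1+R) = 0.2

# Hypothetical unbalanced data: 10 events among 100 hospitalizations, and a
# classifier that misses half the events but raises no false alarms
sens, spec, ppv, ba = binary_metrics(tp=5, fn=5, fp=0, tn=90)
print(sens)            # 0.5
print(spec)            # 1.0
print(ba)              # 0.75
print((5 + 90) / 100)  # plain accuracy 0.95 looks deceptively good
```

The plain accuracy of 0.95 hides the fact that half the events are missed; the balanced accuracy of 0.75 exposes it.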

It is only possible to pick the best threshold for a discrete classifier when there is an indication of how the performance measures (sensitivity and specificity) will vary [79].

A confusion matrix can be created for every possible prediction threshold [76, 79]. The resulting variation between sensitivity and specificity can then be plotted in a Receiver Operating Characteristic (ROC) graph [76, 79]. The ROC graph originates from signal detection, where it was used to find the trade-off between hit rates and false alarm rates. For model prediction, these two-dimensional graphs display the true positive rate (sensitivity) and the false positive rate (1 − specificity) on the axes, and the ROC curve itself shows the trade-off between sensitivity and specificity for every threshold of the discrete classifier [76]. An illustrative example of a ROC graph can be found in Figure 2.3.

[Plot of sensitivity against 1 − specificity, both axes ranging from 0 to 1]

Figure 2.3: Example of a ROC graph

The area under the ROC curve (AUC) is widely used as a measure of the discriminatory performance of a model [79, 83]. The AUC is equal to the probability that the model assigns a higher predicted probability to a randomly drawn subject with the event than to a randomly drawn subject without the event [76]. The maximum value of the AUC is 1, which indicates that every subject with the event (a positive case) receives a higher probability than any subject without the event [83]. Therefore, this value is desired to be as high as possible.
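This probabilistic interpretation gives a direct (if brute-force) way to compute the AUC: count, over all (positive, negative) pairs, how often the positive case received the higher score. The sketch below uses invented predicted probabilities for illustration.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive case gets a higher
    predicted probability than a random negative case (ties count half)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted probabilities for cases with/without the event
pos = [0.9, 0.8, 0.6]
neg = [0.7, 0.4, 0.3, 0.2]
print(auc(pos, neg))  # 11/12 ~ 0.917: only the pair (0.6, 0.7) is misordered
```

Of the 12 positive-negative pairs, only one is ranked the wrong way round, so the AUC is 11/12. Libraries such as PRROC compute the same quantity from the ROC curve rather than by pairwise comparison.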

2.5

Occurrence of falling

The adverse outcome “occurrence of falling” was only analyzed for the STOPP data. The reason is that there is a specific category in the STOPP criteria dedicated to “drugs that affect falling”, which implicitly mentions falling as an adverse outcome of non-compliance with the criteria. Because this is not the case for the START criteria, it was considered more appropriate to perform this analysis only for the STOPP data.

Registration of falls during hospitalization is not yet computer-coded at ICSM, but is done by maintaining a manually recorded information registry. In this registry, every fall of a patient during hospitalization is registered together with the date of the fall. Some patients in the registry have fallen more than once.

In comparison to the other adverse outcomes, the information related to the occurrence of falls is less extensive and the data is more unbalanced. Therefore, descriptive analysis was preferred over predictive analysis with mixed effect models.

If a patient fell during hospitalization, the number of days since the start of the hospitalization was calculated. In addition, it was determined whether the fall took place during or after the administration of a potentially not compliant drug.

To be able to use different kinds of techniques without random effects, the data had to be simplified. This was done by making a subsample of the data in which each row represents one treatment, using the subset function in R [84].

To identify variables with a significant impact on the outcome, backward selection with a logistic model was performed, both to identify the important variables and to estimate their importance. In backward variable selection, the chi-square statistic, p-value or AIC is calculated for every variable in the model, and the variable with the worst value is discarded. This is repeated until the best performing model remains [85, 86].
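The elimination loop itself is simple and can be sketched independently of the fitting machinery. In the sketch below, the `criterion` callable stands in for refitting the model and computing its AIC; the `toy_aic` function and the variable names are invented purely to make the example runnable, and are not the thesis variables.

```python
def backward_select(variables, criterion):
    """Backward elimination: repeatedly drop the variable whose removal
    improves (lowers) the criterion the most, until no removal helps."""
    current = set(variables)
    best = criterion(current)
    while len(current) > 1:
        trials = {v: criterion(current - {v}) for v in current}
        worst = min(trials, key=trials.get)   # variable cheapest to drop
        if trials[worst] >= best:
            break                             # no removal improves the fit
        best = trials[worst]
        current = current - {worst}
    return current

# Toy stand-in for a model's AIC (purely illustrative): two informative
# variables each lower it by 10, and every included variable costs 2.
INFORMATIVE = {"age", "n_drugs"}

def toy_aic(subset):
    return 100 - 10 * len(subset & INFORMATIVE) + 2 * len(subset)

chosen = backward_select({"age", "n_drugs", "weight", "height"}, toy_aic)
print(sorted(chosen))  # ['age', 'n_drugs']
```

With a real model, `criterion` would refit the logistic regression on the candidate subset and return its AIC (or a p-value-based score), exactly as the stepwise procedures in R do.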

On top of this, classification trees and random forests were applied to identify informative interactions between variables. These techniques were used because there is no specific knowledge about possible interactions in the data, and both methods make it possible to identify interactions without making assumptions. Another reason is that their results are more intuitive for finding interactions than those of regression models.

The packages used in this part are: rpart [87], randomForest [88], OOBcurve [89], rattle [90], icd [47], dplyr [38] and plyr [40].

2.5.1

Classification trees

The classification trees were fitted using the same variables as for logistic regression.

Classification trees originate from the fields of statistics and decision theory, where they are used to indicate whether a decision is optimal [91, 92]. This type of analysis has found application in almost every field where decisions have to be made, and classification trees have proven to be very useful in machine learning [91, 92]. Classification trees consist of nodes and branches, where a node represents a choice and a branch represents an outcome [93].

Creating classification trees for machine learning requires a training set with classified examples [94]. The approach is to repeatedly split the training set into subsets using the variable that divides the classes best [91, 94]. The nodes are represented by the splitting variables, using a single variable at each split [91]. The created tree can be used to predict on the test set. An illustrative example of a classification tree can be found in Figure 2.4.
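The "variable that divides the classes best" is typically found by minimizing an impurity measure such as the Gini index, which is what rpart does by default. The sketch below shows the core of a single split search in Python, with invented ages and fall outcomes; it is an illustration of the splitting idea, not the rpart algorithm.

```python
def gini(labels):
    """Gini impurity of a set of binary class labels (0/1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n            # fraction of class 1
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Pick the threshold on one variable that minimizes the size-weighted
    Gini impurity of the two resulting subsets."""
    best_t, best_g = None, float("inf")
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Hypothetical ages and fall outcomes (1 = fell): falls concentrate above 80
ages = [66, 70, 74, 78, 82, 86, 90, 94]
fell = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_split(ages, fell))  # (82, 0.0): age >= 82 separates perfectly
```

A tree is grown by applying this search recursively to each resulting subset, over all candidate variables rather than one.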

2.5.2

Random forest

The same variables used for the logistic regression were used to create a random forest. To obtain the best performing random forest, different tuning parameters were compared. The best performing parameter settings were used to train the final model.

Figure 2.4: Example of a classification tree. The root node splits on stroke, then on age < 80, leading to the leaves “Appr” and “Not appr”.

Random forest is a non-parametric method that grows a combination of classification or regression trees and lets them vote for the most popular class [95]. Training a random forest involves two sources of randomness. First, the set is usually sampled using the bootstrap, keeping about 66% of the original data while creating a set with the size of the original data set; the remaining 33% of the data forms the test set and is referred to as the Out-Of-Bag (OOB) observations [96, 97]. Second, to decide on the best variable split for each node, a randomly chosen subset of variables of size “mtry” (a tuning parameter) is used [96]. This subset is much smaller than the original set of variables, and only these variables are used for splitting [97].

When testing the random forest, the OOB observations are used. The error between the OOB observations and the predicted values is called the OOB error (Out-Of-Bag error). It is common to use this error to estimate the predictive performance [97].
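The "about 66% in-bag / 33% out-of-bag" figures follow directly from the bootstrap mechanics: each observation is left out of a size-n resample with probability (1 − 1/n)ⁿ, which tends to 1/e ≈ 0.368. The sketch below checks this empirically; it illustrates the sampling step only, not the randomForest implementation.

```python
import random

rng = random.Random(7)
n = 10000

# One bootstrap sample of size n, drawn with replacement from n observations
in_bag = [rng.randrange(n) for _ in range(n)]
oob = set(range(n)) - set(in_bag)

# Each observation is left out with probability (1 - 1/n)^n -> 1/e ~ 0.368,
# i.e. roughly the "33%" of Out-Of-Bag observations mentioned above.
oob_fraction = len(oob) / n
print(oob_fraction)  # typically close to 1/e ~ 0.368
```

Because every tree in the forest gets its own bootstrap sample, each observation is out-of-bag for roughly a third of the trees, and those trees provide its OOB prediction.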


3

Results from the analysis of the STOPP criteria

3.1

Extraction of the data

The data was extracted from the hospital information system with the Pentaho Data Integration software [98]. Because ICSM in Pavia started registering hospitalizations in the Health Information System in 2012, the hospitalizations considered were from January 2012 to the 7th of June 2018. The collected data originates from all the units in ICSM.

The data extraction strategy consisted of: 1) extracting all the data with the ATC codes mentioned in the STOPP criteria and 2) collecting all the hospitalizations connected to this data through the hospitalization code. In order to make it possible to analyze the STOPP rules that involve chronic diseases, the patient identification code was used to extract all the hospitalizations of the patients present in the hospital information system.

The STOPP criteria mention medication classes rather than specific medications in order to leave room for interpretation. When extracting the data, in order not to miss useful information, the hospital information system was queried twice: first using all the ATC codes of the medication classes and afterwards using specific ATC medication codes. Because the hospitalization information and medication information were stored separately, this resulted in four data sets: a “complete” diagnosis and therapy set from the query with the specific medication codes, and a “class” diagnosis and therapy set from the query with the medication classes.

The diagnosis sets consider the hospitalization information and all the diagnoses found of a patient. In these sets, each row is a diagnosis. The therapy sets regard all the drug administrations carried out to a patient. In these sets, each row is a drug administration. The detailed information considering these sets can be found in Table 3.1.

                   Complete therapy   Complete diagnosis   Class therapy   Class diagnosis
Rows               543074             324113               2869808         444659
Hospitalizations   13696              27558                21744           39252
Patients           10060              9980                 15099           14972

Table 3.1: Characteristics of the extracted STOPP data sets. In this table, the complete sets are the sets resulting from querying the database for all the medication codes. The class sets are the sets resulting from querying the database for the medication classes.


The complete therapy set contained 543074 rows and the complete diagnosis set contained 324113 rows. The class therapy set contained 2869808 rows and the class diagnosis set contained 444659 rows.

3.2

Preprocessing of the clinical databases

The preprocessing steps are described in Figure 3.1, where the numbers indicate the number of rows in each set. After extraction, the first step was to merge the four data sets into two (one therapy set and one diagnosis set) and to discard duplicates between the class and complete sets. The second step was to delete the hospitalizations without a discharge date or diagnosis description. Because one hospitalization could have multiple different diagnosis descriptions, and therefore possibly multiple rows with the same information, it was also necessary to delete duplicates. The third step was to subset the therapy set, keeping only patients over 65 years. The fourth step was to discard the therapies that took place before or after the hospitalization. The last step was to select the patients for whom both medication and hospitalization information was present.

The characteristics of the cleaned STOPP databases are described in Table 3.2. The cleaned clinical data sets used to study compliance with the STOPP criteria consider 25838 hospitalizations of 9558 patients. The mean patient age at hospitalization was 75.6 years (SD = 6.7), ranging from 65 to 102 years. Of this group, 46% were female. The hospitalizations and administrations considered are from 10-1-2012 to 18-12-2018, and the total number of drug treatments studied was 2020644.

                                  STOPP database
Number of hospitalizations        25838
Number of patients                9558
Mean age (SD)                     75.6 (6.7)
Age range                         65 - 102
% females                         46%
Number of drug administrations    2020644

Table 3.2: Characteristics of the cleaned STOPP database.


3.3

STOPP criteria detection method

The list of STOPP criteria is reported in Appendix A. The developed code in R was used to detect each time a STOPP rule was applicable based on the ATC and ICD-9 codes.

For every STOPP rule, a diagnosis set and a therapy set were created containing all the patient records with the ICD-9 codes mentioned in the rule and the connected therapies. When a STOPP rule was considered a “chronic rule”, all the previous hospitalizations of the patient were collected; when it was considered a “non-chronic rule”, only the current hospitalization was studied. Both sets were then analyzed for the presence of the therapy codes mentioned in the STOPP rule. If the STOPP rule required checking other conditions besides the presence of medication (such as time conditions or medication dose), this was also performed. Every time a rule was applicable, the corresponding administrations received one of the following labels: “compliant” when the ICD-9 code was not present, “potentially uncompliant” when the ICD-9 code was present, or “uncertain” when compliance with the STOPP rule could not be established.

The “uncertain” label was only actively used for a few STOPP rules, in cases where compliance was not straightforward enough to award the “potentially uncompliant” label. When STOPP rules require checking the dose or timing of a prescription, these cannot be labeled compliant by default, because the required information may not be stored in this database but elsewhere. The same applies to the STOPP rules that require checking for an association with another type of medication: the medication might not be registered in this database, yet this does not mean that the patient is not taking it. Therefore, these rules require manual checking for compliance.
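The labelling logic described above can be sketched as a single decision function. This is not the thesis' R implementation; it is a hedged Python illustration in which the rule representation (atc_codes, icd9_codes, needs_extra_check) and the example codes are assumptions.

```python
def label_administration(drug_atc, rule, hospitalization_icd9, extra_checks_passed=None):
    """Assign one of the three compliance labels to a drug administration.
    `rule` is an assumed dict carrying the ATC prefixes and ICD-9 codes the
    STOPP rule refers to; `extra_checks_passed` is None when dose/time or
    co-medication information is not available in the database."""
    # The rule is not applicable if the drug falls outside the rule's ATC codes
    if not any(drug_atc.startswith(atc) for atc in rule["atc_codes"]):
        return None
    # Dose/time/association conditions that cannot be verified -> "uncertain"
    if rule.get("needs_extra_check") and extra_checks_passed is None:
        return "uncertain"
    # ICD-9 code present -> "potentially uncompliant", otherwise "compliant"
    if any(code in hospitalization_icd9 for code in rule["icd9_codes"]):
        return "potentially uncompliant"
    return "compliant"
```

Returning None for non-applicable administrations keeps the three labels reserved for cases where the rule was actually evaluated.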

It was possible to evaluate 55 out of the 65 STOPP criteria. The 10 rules that were not evaluated either lacked a clear definition or required information that was missing. Appendix A provides an overview of the STOPP rules that were not implemented and the reasons why.

3.3.1

STOPP criteria evaluation

In total, 144163 STOPP rule evaluations were performed for this set of hospitalizations. Every evaluation corresponds to a drug treatment, and multiple STOPP rules could be applicable within the same hospitalization. Of these evaluations, 82.2% were found to be compliant, 17.1% potentially uncompliant and 0.7% uncertain.

3.3.1.1

STOPP compliance per rule

In Figure 3.2, the numbers of compliant and potentially uncompliant rules are displayed per category, where the x-axis indicates the rule index; rules that have not been evaluated are set to zero.


Figure 3.2: Overview of all the STOPP rules. For all the graphs, the x-axis indicates the index of the rules and the y-axis indicates the frequency.


As shown in Figure 3.2, the number of times a specific STOPP rule was applicable varied considerably. Rule number 1 of the Associated drug classes is the most frequently occurring STOPP rule.

The STOPP rules for which non-compliance was found in the majority of the cases are:

• Rule 2 of the Musculoskeletal system (58% of the cases)
• Rule 13 of the Central nervous system (60% of the cases)
• Rule 1 of the Associated drug class (61% of the cases)
• Rule 1 of the Endocrine system (63% of the cases)
• Rules 1, 2 and 3 of Rules that affect falling (100% non-compliant)

In all other STOPP rules, the majority of the cases were found to be compliant. The rules found to be 100% compliant were:

• Rule 2 of the Cardiovascular system
• Rules 2, 3 and 4 of the Central nervous system
• Rule 5 of the Gastrointestinal system
• Rules 3 and 6 of the Urogenital system
• Rule 3 of the Endocrine system

The detailed results of compliance per rule can be found in Appendix B.

3.3.1.2

STOPP compliance per category

Table 3.3 provides an overview of the total number of compliant and potentially uncompliant STOPP rules per category.

The most prevalent category is the Cardiovascular system, being the category with the highest number of evaluated STOPP rules (13), followed by Central nervous system (12).

As seen in Table 3.3, the category with the highest percentage of non-compliance is Association classes (61%), followed by Medication that affects falling (53%). In all other categories, the percentage of non-compliance was found to be lower than 50%; the categories regarding Analgesic drugs, the Urogenital system and the Gastro-intestinal system were found to be the most compliant (99%).
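The per-category percentages of the kind reported in Table 3.3 follow directly from tallying the labels. A minimal sketch in Python (the thesis used R; the (category, label) pair representation is an assumption) is:

```python
from collections import Counter

def category_compliance(labelled_rules):
    """Tally labels per STOPP category and return the percentage of
    potentially uncompliant evaluations per category, rounded to 2 decimals.
    `labelled_rules` is an assumed iterable of (category, label) pairs."""
    counts = {}
    for category, label in labelled_rules:
        counts.setdefault(category, Counter())[label] += 1
    return {
        cat: round(100 * c["potentially uncompliant"] / sum(c.values()), 2)
        for cat, c in counts.items()
    }
```

The same tally, grouped on the hospitalization year instead of the category, yields the per-year view shown in Figure 3.3.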

3.3.1.3 STOPP compliance per year

Figure 3.3 shows the compliance per year. The frequency of and compliance with the STOPP rules are of similar proportions every year, except for 2012 and 2018. The reason for the low frequency in 2018 is that the data was extracted in June 2018 and therefore the year was not yet finished.

3.4

Analysis of clinical outcomes of interest

The goal of analyzing the clinical outcomes was to search for possible associations between non-compliance to the rules and adverse clinical outcomes.

The clinical outcomes selected for evaluation are: the hospitalization duration, the hospitalization duration exceeding the expected length-of-stay, and the occurrence of falls during hospitalization. The characteristics of each variable are as follows:


| Category (rules evaluated)              | Compliant rules (%) | Potentially uncompliant rules (%) | Uncertain (%) | Total  |
| Cardiovascular system (13)              | 43342 (91.52%)      | 3556 (7.51%)                      | 458 (0.97%)   | 47356  |
| Central nervous system (12)             | 17841 (93.29%)      | 1067 (5.60%)                      | 205 (1.11%)   | 19113  |
| Gastro-intestinal system (5)            | 11850 (99.68%)      | 28 (0.24%)                        | 10 (0.08%)    | 11888  |
| Musculoskeletal system (8)              | 20242 (90.49%)      | 1821 (8.77%)                      | 163 (0.74%)   | 22226  |
| Respiratory system (3)                  | 3466 (98.81%)       | 898 (21.64%)                      | 62 (1.56%)    | 4426   |
| Urogenital system (5)                   | 2902 (98.81%)       | 31 (1.19%)                        | 0 (0.00%)     | 2933   |
| Endocrine system (2)                    | 30 (45.76%)         | 29 (53.06%)                       | 0 (0.00%)     | 59     |
| Drugs that adversely affect falling (4) | 7007 (46.94%)       | 7921 (53.06%)                     | 0 (0.00%)     | 14928  |
| Analgesics (2)                          | 6268 (99.51%)       | 0 (0.03%)                         | 24 (0.46%)    | 6292   |
| Association classes (1)                 | 5803 (38.84%)       | 9139 (61.16%)                     | 0 (0.00%)     | 14942  |
| Total                                   | 118530 (82.22%)     | 24691 (17.13%)                    | 942 (0.65%)   | 144163 |

Table 3.3: The number of compliant and potentially uncompliant STOPP rules per category. Category represents the categories of the STOPP rules and rules evaluated represents the number of rules evaluated per category; the other columns respectively display the number and percentage of rules per label

Figure 3.3: STOPP compliance per year

• Hospitalization duration exceeding the expected length-of-stay is represented by a binary variable which indicates whether the length-of-stay is longer than expected based on the length-of-stay indicated by the DRG code

• Occurrence of falls is represented by a binary variable which indicates whether someone fell during hospitalization
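The outcome variables above can be derived per hospitalization as sketched below. The thesis analysis was done in R; this stdlib-only Python sketch uses assumed names, and expected_los_days stands in for the expected length-of-stay attached to the DRG code, which the source does not specify further.

```python
from datetime import date

def outcome_variables(admission, discharge, expected_los_days, fell_during_stay):
    """Derive the three clinical outcome variables for one hospitalization:
    duration in days, a binary exceeds-expected-length-of-stay flag, and a
    binary fall-during-stay flag. Names are illustrative assumptions."""
    los = (discharge - admission).days            # hospitalization duration
    exceeds = int(los > expected_los_days)        # 1 if longer than expected
    fall = int(bool(fell_during_stay))            # 1 if a fall occurred
    return {"los_days": los, "exceeds_expected": exceeds, "fall": fall}
```

Encoding the two binary outcomes as 0/1 integers keeps them directly usable as response variables in a subsequent association analysis.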
