
MASTER THESIS

Pattern Detection in Adverse Events reported in Patient Care in a Hospital

Author: Mart Busger op Vollenbroek
First supervisor: Dr. Gosse Bouma
Second supervisor: Dr. Martijn Wieling

A thesis submitted in fulfillment of the requirements for the degree of Master of Arts in the Information Science Research Group, University of Groningen


Declaration of Authorship

I, Mart Busger op Vollenbroek, declare that this thesis titled, "Pattern Detection in Adverse Events reported in Patient Care in a Hospital" and the work presented in it are my own. I confirm that:

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.


Abstract

Introduction: In this thesis, exploratory research has been conducted into the possibilities of (semi-)automated pattern detection in adverse event data in a hospital. The following research question stands central in this thesis: Using machine learning techniques, which patterns can be discovered in patients' adverse event reports? The adverse event data has been collected since 2007 and pertains to day-to-day patient care. To detect patterns, three different machine learning techniques have been used: association rule mining, text clustering and text classification.

Method: The first experiment generated association rules from frequent itemsets based on the textual descriptions of the adverse events. The second experiment used text clustering: adverse events were clustered based on the tf-idf scores of their descriptions. This led to some interesting clusters, but mostly for categories with a sufficient amount of available data; one of the bottlenecks was the shortage of data for some of the adverse event categories. The third and final experiment used text classification. The label to be predicted was a binary class stating whether the CIM had been consulted or not. The idea behind this is that other adverse events could also be labeled as requiring the CIM's consultation, based on their similarity to actual adverse events that needed the CIM's consultation. This means that a high recall is preferred. One of the difficulties that arose during this experiment was the skewed distribution of the data: only 93 of the approximately 8800 labeled adverse events had the label that had to be predicted.

Results: Association rule mining delivered relatively abstract results with little informative value. Text clustering shows potential, with several clusters clearly showing similar concepts within adverse event categories. Text classification results could only be gathered using a more balanced set. Two adverse events were found that could be relevant to the CCP in the UMCG. Accuracy scores around 75% were achieved, but more interesting were the recall scores of around 72%: a higher recall means the system correctly identifies more severe adverse events, and that knowledge can be used to gain new insights into past adverse events.


Dutch Summary

Introduction: This thesis was written to complete the Master Information Science, a track within the degree programme Communication & Information Sciences. The subject of the thesis arose from a request within the UMCG to investigate the recognition of patterns in incident reports in patient care. This request originated from a study by the U.S. Institute of Medicine, titled "To Err is Human", which discusses the handling and possible prevention of incidents in a hospital.

One of the recommendations was to set up a system that collects incident reports and learns from incidents that took place in the past. Within the UMCG, incidents are processed in iTask. The reporter of an incident is presented with an online questionnaire containing the questions relevant to the incident being reported. This means that the reporter of a medication incident does not have to answer questions relating to a fall incident.

After inspection, and after removal of privacy-sensitive information, reported incidents are stored in a database. The incidents can then be consulted through iTask by the members of the Decentralized Incident Reporting Commissions (DIM) and the Centralized Incident Reporting Commission (CIM), and by the reporter of an incident. Because this concerns a large number of incident reports, it is difficult for one person to go through all incident reports and discover patterns. The UMCG therefore recognizes the need to explore possibilities for (semi-)automatically recognizing patterns in incident reports. In this research, a pattern is defined as follows: "A pattern is a collection of incidents that indicates a situation with an elevated risk. The collection of incidents must fall under the same category, and it must be reasonable to assume that the incidents could be harmful to subsequent medical procedures."

The research attempts to answer the following research question: "Which patterns can be discovered in incident reports in patient care by means of machine learning techniques?"


Method: [...] stored, but there are three exceptions: data concerning the time of an incident, data concerning the date of an incident, and the description of an incident. The time is stored as a string containing the hour during which the incident took place ('10.00 - 11.00'). This string is converted to a string indicating whether the incident took place in the morning (6.00 to 11.59), afternoon (12.00 to 17.59), evening (18.00 to 23.59) or at night (0.00 to 5.59). The date is stored in an Excel date format (a numeric value) and is converted to a string with the same value ('11-2-2012'). The description of an incident is a text written by the reporter of the incident. These written texts are processed in three ways: tokenization, word replacement & punctuation removal, and stemming. Tokenizing a document means splitting the document into separate words; this is first done at sentence level, to identify punctuation, and then at word level. After that, the punctuation marks are removed and medication terms are replaced by so-called placeholders to limit the variation in the data. An example is replacing 'propranolol' by 'MED_Circ'. Finally, the descriptions are stemmed, meaning that each word is reduced to its stem. An example is replacing 'krijgen' and 'krijgt' by the stem 'krijg'. This limits the variation even further.

The first machine learning technique used is association rule mining. An association rule mining algorithm searches for rules between variables in a dataset; in this research the FP-growth algorithm is used. In the case of an incident report, several variables are available, for example the description of the incident, where the individual words are the variables, or the date and time of an incident. An association rule has the form X ⇒ Y, which means that the presence of variable X leads to the presence of variable Y. X (also called the 'antecedent') and Y (also called the 'consequent') can in this case be individual words, but also categorical variables, such as a risk estimate or an estimate of the consequences for the patient. An example of a rule extracted from the incident reports could look as follows:

"Time: morning ⇒ Involved employees: nurse"

From this rule it could be concluded that for incidents that take place in the morning, the involved employee is a nurse. The number of incidents for which this rule holds is called the support of the rule. If, for example, a nurse was involved in 100 of the 200 incidents that took place in the morning, the support of the rule is 100. A second measure of the reliability of a rule is its confidence: the percentage of incidents in which both X and Y occur relative to the number of incidents in which X occurs. The search for rules in the incident reports has been done in two ways: based on the descriptions, where the individual words are the variables, and based on the metadata of the incident reports.

The second machine learning technique used in the project is text clustering: dividing documents into groups (clusters) based on similarities between the documents. The goal of this technique is to find relevant clusters of similar incidents. To stay in line with the definition of a pattern, text clustering is performed per incident category.

The similarities between incidents are determined on the basis of the descriptions of the documents, using the so-called term frequency - inverse document frequency (tf-idf) scores of words. These values are determined by taking the number of occurrences of a word and setting this against the number of documents in which the word does not occur. In this way, words that occur often in many documents receive a lower tf-idf score than words that occur often in only a few documents. The underlying idea is that words with a high tf-idf score have more explanatory power than words with a low tf-idf score. For every incident, a long feature vector is then created with the tf-idf scores for all words in the database. In addition, the tf-idf scores for all combinations of two words (bigrams) and all combinations of three words (trigrams) are added. As a result, all vectors have the same length and contain many zeros for the words that do not occur in the incident report in question. Two different algorithms have been used for text clustering: K-Means and Ward's clustering algorithm. K-Means is a 'flat' clustering algorithm, meaning that it creates clusters without any explicit structure, so the clusters can take many different forms. The opposite of flat clustering is hierarchical clustering: Ward's clustering algorithm is a hierarchical clustering algorithm with more structured output, such as a dendrogram.


[...] and learns the labels that belong to the documents. The algorithm then tries to label new documents (the 'test set') based on the knowledge it gained from the training set. For this it is important that the distribution of labels is roughly equal, so that the classifier has enough data to learn from. Unfortunately, the distribution of the label this research tries to predict is not equal. It concerns a binary label (yes / no) indicating whether the CIM has been consulted or not. The dataset consists of approximately 8800 incident reports, of which 93 carry the label 'yes'. It is therefore necessary to work not with the full dataset but with a balanced set. The balanced set is created by taking all 'yes' documents and adding an equal number of randomly chosen 'no' documents. The documents are then shuffled, and each time part of them is used as training data and part as test data. Two classification algorithms are used in this research: Naive Bayes and a linear Support Vector Machine (SVM). The classification is based on the same similarity features (tf-idf features), with the metadata of the incident reports added.

Results: The results of association rule mining are mostly too abstract to reveal clear patterns. Associations between different words are discovered, but no concrete information can be extracted from them. Partly because of this lack of concrete results, an attempt was made to discover rules using bigrams and trigrams, but these rules also turned out to be too abstract.

The results of text clustering are based on two clustering algorithms. Both algorithms produce different clusters, and some clusters are more interesting and more concrete than others. By means of text clustering, interesting groups of incident reports were in some cases discovered within the different categories.

Classifying the incident reports yielded two incident reports that are potentially interesting for the Commissie Calamiteiten Patiëntenzorg (CCP). In addition, accuracy scores of around 75% were achieved when using a balanced dataset. When using a more skewed distribution (33% - 67%), the linear SVM achieves accuracy scores of 80%.


Acknowledgements

I am grateful to all who helped me during my thesis. To both my supervisors from the University of Groningen, Gosse Bouma and Martijn Wieling, whose guidance, support and insight I could not have done without: especially in the final stretch they helped me finish my thesis on time, and with that they helped me graduate within a year after starting my Master's. I would also like to sincerely thank Stella Meijer and Ko Jans, my supervisors at the UMCG, and Jan Pols, my contact between the university and the UMCG. Without their continuous support and optimism I might have lost my confidence that I would succeed.

Lastly, I would like to thank everyone who supported me in any way during my thesis.


Contents

1 Introduction
   1.1 Hospital situation
       1.1.1 UMCG
       1.1.2 CIM & DIM
       1.1.3 Reporting process
   1.2 Problem statement

2 Related Work
   2.1 Data mining & health care data
   2.2 Association rule mining
   2.3 Text clustering
   2.4 Text classification

3 Method
   3.1 Data
       3.1.1 Preprocessing (Tokenizing; Word replacement & punctuation removal; Stemming; Metadata)
   3.2 Resources
       3.2.1 Search tool
       3.2.2 Python libraries
   3.3 Feature selection
       3.3.1 Text based
       3.3.2 Metadata based
   3.4 Association rule mining
       3.4.1 Concepts
       3.4.2 Design
       3.4.3 Implementation
   3.5 Text clustering
       3.5.1 Concepts
       3.5.2 Implementation (K-Means; Ward's clustering algorithm)
   3.6 Text classification
       3.6.1 Concepts
       3.6.2 Implementation (Linear Support Vector Machine (SVM))

4 Results & Interpretation
   4.1 Association rule mining
       4.1.1 Text based results
       4.1.2 Metadata based results
       4.1.3 Interpretation
   4.2 Text clustering
       4.2.1 "Medication" clusters
       4.2.2 "Other" clusters
       4.2.3 "Implementation of care" clusters
       4.2.4 "Coordination of care process" clusters
       4.2.5 "Medical resources / ICT" clusters
       4.2.6 "Falling" clusters
       4.2.7 "Radiotherapy" clusters
       4.2.8 "Unknown" clusters
       4.2.9 "Blood products" clusters
       4.2.10 "Psychiatry aggression" clusters
       4.2.11 "All categories" clusters
   4.3 Text classification
       4.3.1 Classification results (Naive Bayes; Linear SVM)
       4.3.2 Interpretation

5 Conclusion
   5.1 Summary
   5.2 Discussion & further research

A Resources
   A.1 Stop words
   A.2 Medication abbreviations
   A.3 Flowchart reporting process UMCG

B Code
   B.1 Project code
   B.2 Search tool

List of Figures

2.1 KDD framework
3.1 Screenshot Search tool
3.2 Feature vector example
3.3 Scatter plot medication adverse events
3.4 Regular K-Means clustering
3.5 Top-down clustering
3.6 Dendrogram medication adverse events
3.7 Vector space model
4.1 Scatterplot medication
4.2 Dendrogram medication
4.3 Scatter plot other
4.4 Dendrogram other
4.5 Scatter plot implementation of care
4.6 Dendrogram implementation of care
4.7 Scatter plot coordination of care process
4.8 Dendrogram coordination of care process
4.9 Scatter plot medical resources / ICT
4.10 Dendrogram medical resources / ICT
4.11 Scatter plot falling
4.12 Dendrogram falling
4.13 Scatter plot radiotherapy
4.14 Dendrogram radiotherapy
4.15 Scatter plot unknown
4.16 Dendrogram unknown
4.17 Scatter plot blood products
4.18 Dendrogram blood products
4.19 Scatter plot psychiatry aggression
4.20 Dendrogram psychiatry aggression
4.21 Scatter plot all categories
4.22 Dendrogram all categories
A.1 Full reporting process flowchart (1)

List of Tables

1.1 Count of reports since 2007
3.1 Dummy example data
3.2 Number of adverse event reports by category
3.3 Bigrams & trigrams from example document
3.4 Distribution of n-grams in adverse event categories
3.5 Example itemsets
3.6 Example frequent itemsets
3.7 Example association rules
3.8 Distribution of classes
4.1 Medication association rules
4.2 Other association rules
4.3 Implementation of care association rules
4.4 Falling association rules
4.5 Psychiatry aggression association rules
4.6 Medication association rules (bigrams)
4.7 Medication association rules (trigrams)
4.8 Cross-category association rules
4.9 Metadata based association rule example
4.10 Naive Bayes classification results using balanced set
4.11 Naive Bayes confusion matrix using balanced set
4.12 Naive Bayes classification results using skewed set
4.13 Naive Bayes confusion matrix using skewed set
4.14 Most important features using Naive Bayes
4.15 Linear SVM classification results using balanced set
4.16 Linear SVM confusion matrix using balanced set
4.17 Linear SVM classification results using skewed set
4.18 Linear SVM confusion matrix using skewed set
4.19 Most important features using Linear SVM
A.1 List of stop words

Chapter 1

Introduction

May 30th, 2016. RTL Nieuws reports that most of the hospitals in the Netherlands fail to report medical errors that have fatal consequences for patients to the health inspection. RTL Nieuws mentions that a negative culture exists with respect to reporting medical errors in hospitals. Medical specialists are afraid that they will be blamed, that they could lose their job or have lower chances at getting another job.

While this news article had to be corrected because RTL Nieuws compared two different groups of adverse events from different time periods with different definitions of adverse events, it does re-emphasize society's preoccupation with safety in healthcare. This brings us back to the importance of reporting adverse events in day-to-day patient health care. If more employees in health care report errors, certain patterns may arise that might unearth underlying problems, which in turn can lead to a decrease in adverse events in hospitals. While it cannot be said with certainty that hospitals are actually failing to report many of the adverse events, it does draw attention to the importance of reporting errors in health care. Moreover, collecting information about adverse events in reports and analyzing them is a time-consuming activity. Therefore, exploring options to automate this process is a logical next step. To improve the analysis of adverse events, the University Medical Center Groningen (UMCG) has decided to explore the options of using (semi-)automated techniques to process adverse event reports and possibly find otherwise undetected patterns in them.

1.1 Hospital situation

1.1.1 UMCG

[...] adverse events in health care and gave recommendations about how these adverse events could possibly be prevented. One of those recommendations was setting up a system that collects reports from adverse events and learns from these medical errors by building safety precautions and improving processes. This research marked the start for many hospitals, including the UMCG, to improve safety awareness and start collecting data on adverse events. Within the UMCG, adverse events are classified into four groups:

• "Incident": An unintentional event that occurred during patient care and that brought or could have brought damage to a patient.

• “Near-incident”: An incident as defined above, but that did not reach the patient because it was corrected in time.

• "Complication": An unintentional and unwanted outcome during or resulting from a medical intervention. This outcome is followed by a necessary change of procedure or permanent damage to a patient. As a rule, specific complications inevitably occur in a certain percentage of the procedures, and the frequency should be monitored to keep it as low as possible and acceptable.

• "Calamity": Every unintentional event that influences the quality of care and has resulted in severe damage to, or death of, a patient. The severity of the consequences distinguishes calamities from other adverse events.

Complications and calamities are collected in separate systems in the UMCG. In this project only the data pertaining to "incidents" and "near-incidents" will be used; for clarity purposes, in this project these two groups together will be called adverse events. For every adverse event, a report is written with the details of such an event. To manage these data, a digital reporting system called iTask is in place. The details vary per event and for instance contain data about the time and date of an adverse event, the department where the event occurred, a written account of what took place and a risk assessment for re-occurrence and its consequences.

1.1.2 CIM & DIM

The Centralized Incident Reporting Commission (CIM) operates on a hospital-wide basis and works primarily with the Decentralized Incident Reporting Commissions (DIM). The CIM advises DIMs when needed. Besides that, the CIM reviews severe adverse events, searches for patterns in all adverse events and is accountable to the board of directors. Safely reporting an adverse event is made possible through a just culture in the UMCG.

By decentralizing the process of managing and analyzing the large number of reports, analyses are made close to where adverse events occurred and usually also where improvements should be considered. Together with the CIM, the DIMs stimulate hospital employees to report adverse events so they can be evaluated and learned from, which could improve health care overall. By keeping the reporting and analyzing process close to the reporter, the threshold for reporting becomes lower. Looking at the numbers in table 1.1, it seems that this stimulating approach is working: since the start of collecting data on adverse events in 2007, the number of reports has risen almost every year, in contrast to the total number of patient days in the hospital. Given the fact that the majority of the reported adverse events happens to patients while being hospitalized, the number of patient days is chosen to create a context for the number of adverse events. There is no reason to assume that the total number of adverse events (reported and unreported) has risen.

Table 1.1: Count of reports since 2007

Year | Number of reports | Number of patient days | Reports / day ratio
2007 | 853   |         |
2008 | 1.042 |         |
2009 | 3.385 |         |
2010 | 3.452 | 321.245 | 0,011
2011 | 4.249 | 321.478 | 0,013
2012 | 4.914 | 306.045 | 0,016
2013 | 5.743 | 305.632 | 0,019
2014 | 6.112 | 307.420 | 0,020
2015 | 5.566 | 303.356 | 0,018

1.1.3 Reporting process

Several different steps can be distinguished in the process of reporting and settling an adverse event. First, the reporter completes a set of standard questions in the electronic system iTask. This information varies per adverse event, and the reporter only has to report relevant information. The questionnaire used for this is an intelligent form presenting only relevant follow-up questions based on the given answers. An example of this is that a reporter of a medication adverse event does not have to answer questions pertaining to an adverse event that occurred due to a patient falling.

After the questions have been answered, the report is sent to the "preferred coordinator", i.e. the secretary of the DIM, of the department where the adverse event occurred. The report will be checked for completeness and privacy-sensitive information in free text fields (which will be removed if found). The next step is to check if the adverse event has resulted in a possible "calamity" or "complication". If so, the reporter is warned that he might have to inform the head of the department, who, in turn, should inform the medical director. Following that, an assessment is made to see if more information is needed. This information can be provided by the reporter, by other people involved in the adverse event, by people involved in the routine of the involved department, or for instance by the CIM when looking for the frequency of occurrences across the hospital. The final steps are to make an analysis of the adverse event if needed and advise the relevant parties on how to proceed. The report is then closed and held anonymously in archive. The privacy-sensitive information will be kept for five years; after that, only the anonymized reports will be kept. An adverse event should always be discussed with the patient and filed in his dossier, though this seems difficult to maintain, especially in case of "near-incidents". A full representation of the reporting process can be found in the form of a flowchart in appendix A.

1.2 Problem statement

The current situation shows that several steps are being taken to detect patterns of problems within the UMCG, with analyses being made and a CIM that has the best oversight over the adverse events. The CIM is dependent on a DIM for each department for drawing attention to potentially interesting patterns in adverse events. And therein lies the problem for detecting those patterns, because the different DIMs are not informed about the adverse events they separately report. It is a difficult task for a human to sift through all the available data (5500 adverse events in 2015) and detect potentially interesting patterns, especially for DIMs who do not have access to all available data. Therefore the UMCG recognizes the need to explore the possibilities of (semi-)automated techniques to search for patterns in the available data pertaining to patients' adverse event reports.

In the context of this research, a pattern in the data is a collection of adverse events (i.e. more than one adverse event) that could indicate a situation with possible elevated risk for the hospital and, more importantly, its patients. The adverse events should have a similar point of origin (i.e. a medication event or a falling event) and a similar overarching event that categorizes the adverse events together. There should also be an indication that the group of specific adverse events could be harmful to following procedures and are thus not isolated incidents. The aforementioned information leads to research which explores the possibilities of discovering patterns that would otherwise stay undetected. An example of such a pattern could be the reporting of the same faulty syringes in different departments. Because there is no existing (semi-)automated system at the UMCG that searches for patterns in patients' adverse events data, and there is no similar system in other hospitals in the Netherlands, there is no clear endpoint for the research.

Three different machine learning techniques will be used to try to detect patterns in the adverse event data. The first machine learning technique is to use frequent itemsets to find association rules in the data. Two different approaches will be used: one based on itemsets consisting of tokens from the event descriptions, the other based on all other available data from the adverse events. Examples of those other data are the time and date of an adverse event. These association rules follow the format X ⇒ Y, which can be read as an occurrence of X that possibly leads to the occurrence of Y. A possible rule using such a format could be that medication adverse events occur more often in the evening than in the morning, which translates to: Category: Medication ⇒ Time: evening.

The second machine learning technique is text clustering. Text clustering is an unsupervised learning technique, which means that the predicted class for the documents is unknown. Similarities between adverse events will then put them together in groups, called “clusters”.

[...]

To get an insight into the data and give a structured approach to the problem, the following research question will be discussed:

Using machine learning techniques, which patterns can be discovered in patients’ ad-verse event reports?

From this research question, several subtopics can be derived. They are represented by the following sub-questions:

• What properties should a useful pattern have?

• Which machine learning techniques are suited for pattern recognition in incident reports?

• What are the most informative features for detecting patterns?

• How can cross-departmental patterns be detected?

• How can new patterns be distinguished from known patterns?

• How can the system distinguish informative patterns from non-informative ones?

To get a better understanding of the problem and see some previous results, related work will be discussed in chapter 2. Each machine learning technique will be discussed separately.

Chapter 3 will focus on the resources employed in the process and explain the three different machine learning techniques used to try and detect patterns in the adverse event data. The evaluation metrics will also be explained.

Chapter 4 will present the results from the techniques used, using the evaluation metrics explained in chapter 3, and give some examples of detected patterns. Chapter 4 will also give some interpretation of the results, and their usefulness will be discussed.


Chapter 2

Related Work

In this section, the related work for the project will be discussed. The section is divided into four parts: health care data literature that uses data mining, and relevant work for each of the three machine learning techniques: association rule mining, text clustering and text classification. This chapter's goal is to get a better understanding of machine learning with health care data and to summarize different approaches used in other research. This should also illustrate some of the difficulties of machine learning with health care data.

2.1 Data mining & health care data

A previously used approach to extract knowledge from databases is the Knowledge Discovery in Databases (KDD) framework (Marcelino et al., 2015). This framework divides the process of gaining knowledge from data into five steps: selection, processing, transformation, data mining and interpretation / evaluation (see figure 2.1).

In this project the same framework is used, aiming to achieve the best interpretation of the results of the machine learning techniques. Larose (2006) defines data mining as "the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques"; this means the task at hand is essentially a data mining task.

2.2 Association rule mining

The concept of association rule mining has been briefly addressed in chapter 1 and will be explained more thoroughly in chapter 3 with an elaborate example. This section aims to list relevant work that has used association rule mining, and to use that information to devise the strategy explained in chapter 3.

Within the field of association rule mining, two similar algorithms are mostly used. Agrawal and Srikant (1994) introduced the first of the two: the apriori algorithm. This algorithm finds its basis in "market basket analyses", which are used to find association rules in buyer behavior in supermarkets. The apriori algorithm was fundamentally different from previously used association rule mining algorithms and significantly faster. The apriori algorithm first calculates all frequent itemset candidates with length 1 and saves them. Then the algorithm scans the entire database again for frequent itemsets with length 2 and repeats this process until no new frequent itemsets are found. The database is fully scanned in every iteration while saving frequent itemsets at the same time. This is a demanding task for computers, both memory-wise and computationally.

[...] software development. Because they defined these 50 factors beforehand, and only 50 of them exist, reaching the minimum support has been made rather easy. Ordonez, Ezquerra, and Santana (2006) discuss other metrics besides support for evaluating association rules: confidence and lift. Confidence is used to measure reliability, while lift measures the dependence of a certain X on the presence or absence of a certain Y. To properly use those measures, however, potentially interesting rules must first be found.

Srimani and Koti (2014) have used a rough set (RS) approach for optimal rule generation and point out that the quality of the data is of vital importance, because otherwise finding patterns in the data becomes an arduous task. The RS approach tries to discover redundancies and dependencies between given features of a problem to be classified. Association rule mining is not limited to market basket analyses and textual data: Olukunle and Ehikioya (2002) have shown that the method can be applied to medical image data. They use the FP approach and extract features from medical image data beforehand, maintaining a maximum of five features (items) per itemset. This makes the association rule finding process considerably easier, considering there exist only 31 unique frequent itemsets.

2.3 Text clustering

[...] methods are needed. To that end, De Kok, Plank, and Van Noord (2011) have used grafting, a technique initially proposed by Perkins, Lacker, and Theiler (2003). Grafting is used as a feature selection technique that iteratively adds promising features to a feature set A. This is done by selecting a feature from an unused feature set B and testing the new A by optimizing the weights in the feature set. Features are only added if their contribution is higher than their penalty. While the aforementioned approaches all use clustering and the techniques are generally applicable, they do not use text clustering. Lu et al. (2013) use text clustering to detect topics in messages from online health communities. They represent the text as features based on the frequencies of n-grams up to three and frequencies of medical domain-specific words. One of the main conclusions is that the use of those medical domain-specific words increases scores for the chosen evaluation metrics significantly.

2.4 Text classification

Tsumoto and Hirano (2010) conducted research in medical risk management, specifically in risk aversion of adverse events related to nursing and infection control. They have split the problem into three distinctive parts: risk detection, risk clarification and risk utilization. During the risk detection part, they use a decision tree to determine the most important features for nursing adverse events. The emphasis lies on the fact that medical specialists should evaluate the data.

Krishnaiah, Narsimha, and Chandra (2013) researched the possibilities of predicting lung cancer in patients. They used a combination of text classification techniques and rule induction to develop a prototype system. The key point of their research is that machine learning techniques work better when combined with each other, instead of applying them separately. Koh and Tan (2011) outline several applications for data mining in health care and use a decision tree classifier as an example to predict whether patients were diabetic or non-diabetic. Seven predefined variables such as age and body mass index were used in the decision tree, which yielded an average accuracy score of 84,05%, precision of 78,4%, recall of 75,2% and f-score of 76,8%. These metrics will be elaborated on in chapter 3. They do stress the need for clean health care data, and while they did not use text mining, they do realize its potential and recommend further research into it.


Chapter 3

Method

This chapter explains which tools were used, how the experiments were conducted, which settings were chosen and which evaluation metrics were applied. As in the previous chapter, each of the three machine learning techniques is explained in a separate section.

3.1 Data

The data for this research is supplied by the UMCG through iTask, the system that stores all information pertaining to an adverse event when it happens. To use the data, an export to Excel is made in a format where each adverse event is placed on a row and all relevant attributes are placed in the columns of that row. A simplified representation of dummy data can be found in table 3.1. This data entry will serve as an example in this chapter. Due to the confidentiality of the data, no real data can be shown.

Table 3.1: Dummy example data

Meldingsnummer | Datum incident | Gebeurtenis beschrijving | Tijdstip incident | Aard gebeurtenis | Mogelijke gevolgen patiënt
xxxxxx | 01-11-2013 | verwisseling patientgegevens, waardoor ECG onder verkeerde patientnummer en patient kreeg propranolol ipv metoprolol. | 11.00 - 12.00 | |

Because a pattern in the data should have a similar point of origin, the categorical classification of the adverse events in a pattern should be the same. This division of adverse events is made using the predefined categories in which the adverse events were classified when they were reported. Within those categories there should be sufficient data to distinguish patterns from each other. Some categories are better represented in the data than others; this is illustrated in table 3.2.

Table 3.2: Number of adverse event reports by category

Adverse event category (English) | Adverse event category (Dutch) | Number of reports
Medication | Medicatie | 11616
Other | Overig | 6560
Implementation of care | Uitvoering zorg | 5320
Coordination of care process | Coördinatie zorgproces | 4507
Medical resources / ICT | Medische middelen / ICT | 2414
Falling | Vallen | 2259
Radiotherapy | Radiotherapie | 1230
Unknown | Onbekend | 531
Blood products | Bloedproducten | 239
Psychiatry aggression | Agressie psychiatrie | 205

For medication adverse events, an adequate amount of data is available, but the other categories are not represented as well. This skewed distribution could make it more difficult to detect patterns in those categories. Other possible complications are the "Other" and "Unknown" categories, because these categories could potentially contain documents that should have been classified as one of the other categories.

3.1.1 Preprocessing

The data in their current form are raw data, which need to be preprocessed before they can be used. This is important because the reports are compared by their similarities, and similarities are more easily found if the data are equally structured. Each document has different sentences, and those sentences have different lengths, but all sentences consist of words. To create 'clean' documents consisting of only words and, for example, without punctuation, three different preprocessing steps are taken: tokenizing, word replacement & punctuation removal, and stemming.

Tokenizing

First, the document needs to be tokenized. Tokenizing a document is the task of chopping up the document into pieces. This is done at two levels: sentence and word (Manning, Raghavan, and Schütze, 2008). A document is first chopped into sentences and then into words, which are called tokens. A token is therefore an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. To illustrate this, the description from the example is tokenized and stemmed. The result of the tokenization can be found as output 1.

Input 1:

“verwisseling patientgegevens, waardoor ECG onder verkeerde patientnum-mer en patient kreeg propranolol ipv metoprolol.”

Output 1:

verwisseling patientgegevens , waardoor ECG onder verkeerde patientnummer en patient kreeg propranolol ipv metoprolol .

Word replacement & punctuation removal

[...] a language and occur frequently in many texts. Because these words often have little to no meaning for natural language processing tasks, they are filtered out. The full list of stop words can be found in table A.1 in appendix A. The input for the word replacement & punctuation removal step is shown as input 2 and its output as output 2.

Input 2:

verwisseling patientgegevens , waardoor ECG onder verkeerde patientnummer en patient kreeg propranolol ipv metoprolol .

Output 2:

“verwisseling patientgegevens waardoor ECG onder verkeerde patientnummer en patient kreeg MED_Circ ipv MED_Circ”

Stemming

Tokens can be written differently but still have the same semantic meaning. To illustrate this, consider an example with the words 'wachten' and 'wachtte', which are both derivations of the verb 'wachten' and thus have the same semantic meaning. The goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. In this case, both words will be transformed to the base form 'wacht'. The example document before stemming is shown as input 3 and the result of the stemming is shown as output 3.

Input 3:

“verwisseling patientgegevens waardoor ECG onder verkeerde patientnummer en patient kreeg MED_Circ ipv MED_Circ”

Output 3:

“verwissel patientgegeven waardor ecg onder verkeerd patientnummer kreg med_circ ipv med_circ”

The stemming of the documents is done using the Snowball stemmer for Dutch from NLTK. Not all stems are as readable as their original tokens; this is illustrated by 'kreeg' being substituted by 'kreg' instead of by 'krijg', the base form of the verb 'krijgen'. 'Kregen', the plural past tense of 'krijgen', would also be substituted by 'kreg'. This leads to shorter vectors with, hopefully, fewer low-informative words. This is further elaborated on in section 3.3.

Metadata

Some of the metadata from the adverse events are not structured properly, indicating that they also should be preprocessed before use. This is the case for two metadata types (a small sketch of both conversions follows this list):

• Date is formatted as an Excel date value and should be transformed into a string representing the date.

• Time is represented as a string with the hour at which the adverse event occurred (‘10.00 - 11.00’). This data entry is transformed into a string with the part of day on which the adverse event occurred, i.e. ‘ochtend’ for hours 6.00 up until 11.00, ‘middag’ for hours 12.00 up until 17.00, ‘avond’ for hours 18.00 up until 23.00 or ‘nacht’ for hours 0.00 up until 5.00.
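Both conversions are straightforward to express in code. The following is a minimal sketch assuming the standard Excel serial-date convention and the field formats described above; the function names are hypothetical and are not taken from the project code (which is in Appendix B).

```python
from datetime import datetime, timedelta

def excel_date_to_string(serial: int) -> str:
    """Convert an Excel serial date (a numeric value) to a 'd-m-yyyy' string."""
    # For modern dates, Excel serials map onto a base of 1899-12-30
    # (this accounts for Excel's historical 1900 leap-year quirk).
    date = datetime(1899, 12, 30) + timedelta(days=serial)
    return f"{date.day}-{date.month}-{date.year}"

def hour_slot_to_part_of_day(slot: str) -> str:
    """Map an hour range such as '10.00 - 11.00' to a part-of-day string."""
    hour = int(slot.split(".")[0])   # '10.00 - 11.00' -> 10
    if 6 <= hour <= 11:
        return "ochtend"
    if 12 <= hour <= 17:
        return "middag"
    if 18 <= hour <= 23:
        return "avond"
    return "nacht"                   # hours 0.00 up until 5.00

print(excel_date_to_string(40950))                # '11-2-2012'
print(hour_slot_to_part_of_day("10.00 - 11.00"))  # 'ochtend'
```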

3.2 Resources

Several tools were used during the research, the most prominent ones being a search tool developed for this project, the machine learning library Scikit Learn (http://scikit-learn.org/stable/), the Natural Language Toolkit (NLTK, http://www.nltk.org) and the Orange Data Mining Toolkit. All research has been done using Python version 3.5.1 from Anaconda.

3.2.1 Search tool

[Figure 3.1: Screenshot of the search tool]

For the purposes of this project, and by request of the UMCG, a search tool has been developed to make it easier to search for keywords in the descriptions from adverse event reports. Because the search functionality in iTask only allows a single string as input for the query, the possibility of using Boolean operators was added in this new tool, which allows for more advanced queries. An example query could be 'vallen' AND 'uit bed' NOT 'toilet', if one would search for adverse events where a patient fell from their bed, but not on their way to or from the toilet. In figure 3.1, a screenshot of the search tool is shown. The code for this tool can be found in Appendix B. A minimal illustration of such Boolean filtering is sketched below.
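The core of such a Boolean filter can be expressed in a few lines of Python. This is only an illustrative sketch, not the tool's actual implementation (see Appendix B), and the example descriptions are invented, since the real reports are confidential.

```python
def matches(description: str, required: list, forbidden: list) -> bool:
    """True if all `required` phrases occur and no `forbidden` phrase does."""
    text = description.lower()
    return (all(phrase in text for phrase in required)
            and not any(phrase in text for phrase in forbidden))

reports = [
    "patient wilde naar het toilet en is uit bed gevallen",
    "patient is bij het vallen uit bed op de grond beland",
]

# Query: 'vallen' AND 'uit bed' NOT 'toilet'
hits = [r for r in reports if matches(r, ["vallen", "uit bed"], ["toilet"])]
print(hits)   # only the second report matches
```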

3.2.2 Python libraries

The first Python library that has been used is NLTK, a toolkit that can be used for a broad range of natural language processing tasks. In this project, it is used for three tasks: tokenization, stemming and stop word removal. For tokenizing, the sent_tokenize and word_tokenize functions are used: the documents are first tokenized into sentences to better extract punctuation and subsequently tokenized into words. The stemming is done using the SnowballStemmer for Dutch from the nltk.stem library. The standard stop word list for Dutch is extracted from the nltk.corpus library and is complemented with a few medical words like 'patient'.
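Put together, the three NLTK steps look roughly as follows. This is a hedged sketch: the full stop word list and medication dictionary are in appendix A, so the two shown here are small excerpts, and the exact output may therefore differ slightly from the worked example in section 3.1.1.

```python
import string
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

stemmer = SnowballStemmer("dutch")
stop_words = set(stopwords.words("dutch")) | {"patient"}           # medical additions
med_terms = {"propranolol": "MED_Circ", "metoprolol": "MED_Circ"}  # excerpt

def preprocess(description: str) -> list:
    # Tokenize at sentence level first, then at word level.
    tokens = [tok for sent in sent_tokenize(description, language="dutch")
              for tok in word_tokenize(sent, language="dutch")]
    # Word replacement: map medication names to placeholders.
    tokens = [med_terms.get(tok, tok) for tok in tokens]
    # Punctuation and stop word removal, then stemming.
    tokens = [tok for tok in tokens
              if tok not in string.punctuation and tok.lower() not in stop_words]
    return [stemmer.stem(tok) for tok in tokens]

print(preprocess("verwisseling patientgegevens, waardoor ECG onder "
                 "verkeerde patientnummer en patient kreeg propranolol "
                 "ipv metoprolol."))
```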

The second Python library that has been used is Scikit Learn. Scikit Learn (Buitinck et al., 2013) is the primary source of machine learning tools for this project. It contains many algorithms and preprocessing tools for machine learning in Python. In this project, Scikit Learn's algorithms for text clustering and text classification are used, combined with tools to extract features from the data.

The third and final Python library used in this project is the Orange.associate extension library from Orange (Demšar et al., 2013), which, in contrast to NLTK and Scikit Learn, has to be installed separately from the standard Anaconda installation.

3.3 Feature selection

3.3.1 Text based

Text-based features are based on the frequency of tokens (terms) in the text. The text is viewed as a set of terms, where each term in the document gets a score (weight) based on how often it appears in the document. A higher occurrence leads to a higher score for the term. However, the assumption is that certain terms are more informative for a document than others. To emphasize these terms, term frequency - inverse document frequency (tf-idf) weighting is introduced. Tf-idf weighting assigns weights to all terms in the text based on their term frequency (tf) and on their inverse document frequency (idf), where a higher tf-idf score is better. The tf score for a term is the number of occurrences it has in the text. In order to calculate the idf score for a term, the regular document frequency (df) has to be calculated first. The document frequency for a term t, df_t, is defined as the number of documents in which t occurs. The inverse document frequency for term t, idf_t, is then calculated as idf_t = log(N / df_t), where N is the total number of documents in the dataset. The final tf-idf score for a term is subsequently calculated as tf-idf_t = tf_t × idf_t. Using these tf-idf scores instead of raw frequencies diminishes the impact of potentially less informative terms.

Table 3.3: Bigrams & trigrams from example document

Bigrams | Trigrams
verwissel patientgegeven | verwissel patientgegeven waardor
patientgegeven waardor | patientgegeven waardor ecg
waardor ecg | waardor ecg verkeerd
ecg verkeerd | ecg verkeerd patientnummer
verkeerd patientnummer | verkeerd patientnummer kreg
patientnummer kreg | patientnummer kreg med_circ
kreg med_circ | kreg med_circ ipv
med_circ ipv | med_circ ipv med_circ
ipv med_circ |

For the term ‘med_circ’ from the exam-ple document, the tfmed_circ is 1517 and

the dfmed_circ is 919, which makes the

idfmed_circ 1.579. Combining the tf and

(40)

20 Chapter 3. Method trigrams from the example document. Because bigrams and trigrams are rela-tively more scarce in the texts, they could receive higher tf-idf scores based on their idf scores. Using bigrams and trigrams also provides some context to the words, as opposed to only using them singularly. Based on these tf-idf scores from unigrams, bigrams and trigrams, a classification or cluster algorithm can evaluate the vector corresponding to the document and place them in a cluster or category. Not all terms are equally informative, especially terms with very low tf-idf scores can provide a wrong representation of a document with respect to the information needs.

To ensure that no low-informative features are used, only the 5000 most informative features are kept, that is, the 5000 features with the highest tf-idf scores. These can be a combination of all three n-gram types, or only unigrams, bigrams or trigrams. Table 3.4 shows the distribution of unigrams, bigrams and trigrams for each of the categories. Unless there are fewer than 5000 features, the three numbers add up to the maximum of 5000 features. The final row in the table is the distribution for all data. A sketch of this feature extraction step follows table 3.4.

Table 3.4: Distribution of n-grams in adverse event categories

Adverse event category | Unigrams | Bigrams | Trigrams
Medication | 2507 | 2307 | 186
Other | 3445 | 1501 | 54
Implementation of care | 3365 | 1575 | 60
Coordination of care process | 3074 | 1780 | 146
Medical resources / ICT | 3330 | 1620 | 50
Falling | 1821 | 2502 | 677
Radiotherapy | 1931 | 2611 | 458
Unknown | 1533 | 1167 | 135
Blood products | 773 | 368 | 46
Psychiatry aggression | 1108 | 734 | 219
All categories combined | 3332 | 1606 | 62
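In Scikit Learn, this feature extraction can be sketched as follows. Note one assumption: Scikit Learn's max_features parameter ranks candidate terms by corpus frequency rather than by tf-idf score, so the sketch approximates the selection described above, and the two dummy documents stand in for the confidential reports.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Dummy stemmed descriptions (the real reports are confidential).
docs = [
    "verwissel patientgegeven waardor ecg verkeerd patientnummer kreg med_circ ipv med_circ",
    "kreg verkeerd med_circ toegediend",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 3),   # unigrams, bigrams and trigrams
    max_features=5000,    # cap the vocabulary size
)
X = vectorizer.fit_transform(docs)   # sparse tf-idf matrix, one row per report
print(X.shape)
```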

3.3.2 Metadata based

[...] adverse event. These metadata are added to the feature vectors by one-hot encoding them. Consider the time of an adverse event, which has been preprocessed into a string for morning, afternoon, evening or night. Because there are four different possibilities for the time of an adverse event, morning is represented as [1, 0, 0, 0] and afternoon as [0, 1, 0, 0]. Every metadata feature with m possible values is transformed into m binary features of which only one is active (Pedregosa et al., 2011).
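A minimal sketch of this encoding for the time-of-day feature (Scikit Learn's preprocessing tools, such as OneHotEncoder, perform the same transformation on whole columns):

```python
PARTS_OF_DAY = ["ochtend", "middag", "avond", "nacht"]

def one_hot(value: str, categories: list) -> list:
    """Encode a categorical value as m binary features, one per category."""
    return [1 if value == category else 0 for category in categories]

print(one_hot("ochtend", PARTS_OF_DAY))   # [1, 0, 0, 0]
print(one_hot("middag", PARTS_OF_DAY))    # [0, 1, 0, 0]
```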

3.4 Association rule mining

3.4.1 Concepts

Association rule mining is a method that tries to find relations between variables in a database. An association rule can typically be expressed as X ⇒ Y, where X and Y are sets of items. X ⇒ Y can be explained as follows: given a database of transactions D, where every transaction T ⊆ D is a set of items, if T includes X it is likely that T also includes Y (Hipp, Güntzer, and Nakhaeizadeh, 2000). In this case, X is called the antecedent, while Y is called the consequent. This likelihood can be calculated by the conditional probability p(Y ⊆ T | X ⊆ T), which is called the confidence of an association rule. This approach originated in market-basket analysis, where it was used to find out which goods are complementary to each other, which led to the question: if a customer buys product x and product y, what is the probability that the customer also buys product z (Ullman, Leskovec, and Rajaraman, 2011)? Several measures for evaluating association rules exist, but in this project support and confidence are used. Support is calculated for itemsets, while confidence is calculated for association rules.

• The support supp(X) for an itemset X in a database D can be defined as the number of occurrences of X in D. This support can be a fixed number or a ratio. Only if the minimal support is reached can an itemset be deemed useful for extracting a rule.


3.4.2 Design

The association rule mining approach is applied to the adverse event data gathered by the UMCG in two ways. The first approach applies association rule mining to the text from the adverse event descriptions, where all tokens from a description form an itemset. The second approach uses all other relevant data collected about the adverse events, such as date and time but also the functions of the involved employees, as items in the itemsets. Before actual rule mining can be done, frequent itemsets need to be found first. Frequent itemsets are itemsets that have reached a predefined support within the dataset. This is best illustrated with an example in table 3.5, where the chosen support s is 3.

Table 3.5: Example itemsets

docID | Itemset | A | B | C | D | E
1 | {A, B, C, D, E} | 1 | 1 | 1 | 1 | 1
2 | {B, C, D, E} | 0 | 1 | 1 | 1 | 1
3 | {A, D, E} | 1 | 0 | 0 | 1 | 1
4 | {A, C, D, E} | 1 | 0 | 1 | 1 | 1
5 | {B, C, D} | 0 | 1 | 1 | 1 | 0

Frequent itemsets (support ≥ 3): {A}, {B}, {C}, {D}, {E}, {A,D}, {A,E}, {B,C}, {B,D}, {C,D}, {C,E}, {D,E}, {B,C,D}, {C,D,E}

From the data in this table, fourteen frequent itemsets were found. These frequent itemsets and their supports can be found in table 3.6.

Table 3.6: Example frequent itemsets

Frequent itemset | Support
{A} | 3
{B} | 3
{C} | 4
{D} | 5
{E} | 4
{A,D} | 3
{A,E} | 3
{B,C} | 3
{B,D} | 3
{C,D} | 4
{C,E} | 3
{D,E} | 4
{B,C,D} | 3
{C,D,E} | 3


Table 3.7: Example association rules

Rule number | Antecedent | Consequent | Rule | Confidence
1 | {C,D,E} | {B} | {C, D, E} ⇒ {B} | 66.7%
2 | {C,D} | {E} | {C, D} ⇒ {E} | 75%
3 | {A,E} | {D} | {A, E} ⇒ {D} | 100%
4 | {D,E} | {A,B} | {D, E} ⇒ {A, B} | 25%

Following these rules and their confidence levels, several observations can be made. For instance: if the itemset {C, D} occurs, there is a 75% chance that itemset {E} also occurs. Another example: if itemset {A, E} occurs, itemset {D} will always occur as well. A small sketch that reproduces this worked example is shown below.
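The worked example in tables 3.5-3.7 is small enough to verify with a brute-force enumeration, which the sketch below does. Note this is only an illustration: the project itself uses FP-growth (described next), since plain enumeration does not scale to real data.

```python
from itertools import combinations

# The five example itemsets from table 3.5.
transactions = [
    {"A", "B", "C", "D", "E"},   # doc 1
    {"B", "C", "D", "E"},        # doc 2
    {"A", "D", "E"},             # doc 3
    {"A", "C", "D", "E"},        # doc 4
    {"B", "C", "D"},             # doc 5
]
MIN_SUPPORT = 3

def support(itemset: set) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = [set(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)
            if support(set(c)) >= MIN_SUPPORT]
print(len(frequent), "frequent itemsets")   # 14, as in table 3.6

# Confidence of X => Y is supp(X ∪ Y) / supp(X), e.g. rule 2, {C,D} => {E}:
X, Y = {"C", "D"}, {"E"}
print(support(X | Y) / support(X))           # 0.75, as in table 3.7
```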

The algorithm used for finding association rules is the FP-growth algorithm. This algorithm is better suited for larger datasets than the apriori algorithm mentioned in chapter 2. The runtime and processing power needed increase exponentially when using the apriori approach, as opposed to the FP-growth approach, for which they increase linearly. This is because the FP-growth algorithm only scans the database twice and no candidate generation has to be done (Han, Pei, and Yin, 2000). The FP-growth algorithm does this by creating a frequent pattern tree structure that stores the compressed information from the frequent itemsets. Some attempts have been made using the apriori approach, but they either ran too long or got memory errors because too much data was being stored. When tested on a smaller, less diverse dataset, the apriori algorithm gave no problems. For further information on the apriori algorithm, please refer to chapter 2.

3.4.3 Implementation

[...] possibly be irrelevant. Because there is no previous research using these data, the initial minimum values for support and confidence were set to an arbitrary level of 0,5 and, based on the results, these values were either raised or lowered. The support and confidence levels for each of the rules are mentioned in chapter 4.

Using the text based approach, the itemsets are sets made up of the tokens from the textual descriptions of the adverse event reports. Because of the high diversity in the descriptions, the frequent itemsets are likely to be relatively small, with maximum lengths of antecedents and consequents of two or three. This makes it more difficult to find useful association rules, because to get rules at all, the support and confidence have to be set at a relatively low level, which in turn allows uninformative rules to be made as well.

The second approach used to find association rules is based on all other relevant available data. The relevant data for this approach were chosen by members of the CIM and DIM. Because these data are mostly predefined fields from the database, the possibility of extracting longer (>3) itemsets and rules is higher. The results from the association rule mining can be found in chapter 4.
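The two ways of forming itemsets could be sketched as follows. The metadata field names (department, reporter_function, weekday) are illustrative placeholders, since the actual database fields selected by the CIM and DIM are not listed here.

def text_itemset(description):
    # Approach 1: every token of an adverse event description becomes an item.
    return set(description.lower().split())

def metadata_itemset(record):
    # Approach 2: 'field=value' items built from predefined database fields.
    return {f'{field}={value}' for field, value in record.items()}

print(text_itemset('patient kreeg verkeerde medicatie'))
print(metadata_itemset({'department': 'surgery',
                        'reporter_function': 'nurse',
                        'weekday': 'monday'}))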

3.5 Text clustering

3.5.1 Concepts

Text clustering is an unsupervised learning technique. The most important characteristic of unsupervised learning is the absence of human supervision assigning classes to the various documents. The documents used for this machine learning technique are the adverse event descriptions. The goal of unsupervised learning, in this case text clustering, is to replicate a categorical distinction that a human supervisor would impose on the data (Manning, Raghavan, and Schütze, 2008).

… implies the use of hard clustering. The opposite of hard clustering is soft clustering, where a document has a fractional membership in several clusters. Clustering can also be divided into two other categories: flat clustering and hierarchical clustering. Flat clustering creates a flat set of clusters without any explicit structure relating the clusters to each other, while hierarchical clustering creates a hierarchy, such as a tree, which is structurally more informative compared to flat clustering (Manning, Raghavan, and Schütze, 2008). In this project, the flat hard clustering algorithm K-Means and the hierarchical Ward clustering algorithm will be tested to examine their effectiveness in finding relevant patterns in the dataset. These algorithms were chosen because of the differences between them: one is flat, the other hierarchical. It is possible that one of these two approaches is better suited for this experiment. Both algorithms are explained in more detail in section 3.5.2.
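A minimal sketch of the two algorithms on tf-idf vectors, assuming scikit-learn (the thesis does not name its implementation). Note that scikit-learn's Ward implementation needs a dense matrix, while K-Means accepts sparse input; the documents below are toy examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering

docs = ['verkeerde medicatie gegeven', 'verkeerde dosering voorgeschreven',
        'patient gevallen op de gang', 'patient gevallen uit bed']

X = TfidfVectorizer().fit_transform(docs)  # sparse tf-idf matrix

flat = KMeans(n_clusters=2, random_state=0).fit(X)            # flat clustering
ward = AgglomerativeClustering(n_clusters=2, linkage='ward')  # hierarchical
print(flat.labels_, ward.fit_predict(X.toarray()))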

Normally the quality of clusters can be evaluated using homogeneity, completeness, V-measures or other metrics that use distance measures or known labels to compute scores (Amigó et al., 2009). However, because no categorical distinction is known for possible patterns, these measures cannot be used. Another possibility to evaluate clusters is to look at the terms with the highest tf-idf scores in each cluster. By making use of a vector representation for documents, a vector space model is created. Documents are clustered together based on their similarity, and similarity scores are calculated from the vectors for each document, which are based on tf-idf scores. The assumption is that by listing the 10 terms with the highest tf-idf scores per cluster, some knowledge about the cluster is gained and a potential pattern might be discovered. This is called a non-differential feature selection method, which can also be used for cluster labeling (Manning, Raghavan, and Schütze, 2008).
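This cluster-labeling idea can be sketched as a small helper that prints the 10 highest-weighted tf-idf terms per K-Means cluster. The function below is an illustrative assumption, not the thesis code.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def top_terms_per_cluster(docs, k=10, n_terms=10):
    # Cluster the documents on their tf-idf vectors ...
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    # ... and list the highest-weighted terms of each cluster centroid.
    terms = vec.get_feature_names_out()
    for i, centroid in enumerate(km.cluster_centers_):
        top = np.argsort(centroid)[::-1][:n_terms]
        print(f'cluster {i}:', ', '.join(terms[j] for j in top))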

3.5.2 Implementation

FIGURE 3.2: Feature vector example

After that, for every document the textual features explained in section 3.3 are extracted and vectors are created from these features. The metadata features are left out while clustering because, even after giving them different (lighter) weights, their impact on the formation of the clusters was much larger than the impact of the textual features. An example of a feature vector can be found in figure 3.2.

Because of the length of these feature vectors, they have been turned into sparse feature vectors. A sparse feature vector is better suited for storing long vectors because it uses a compressed representation and therefore saves memory and is not as computationally expensive as its counterpart, the dense feature vector. Dense feature vectors use no compression and are therefore more expensive in both memory and computing power. The K-Means and Ward clustering algorithms can work with both sparse and dense matrices, but for the aforementioned reasons, only sparse matrices will be used.
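A short demonstration of the difference, assuming scikit-learn is used for feature extraction: its TfidfVectorizer already returns a compressed sparse row (CSR) matrix, which stores only the non-zero entries.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['verkeerde medicatie gegeven', 'verkeerde dosering', 'patient gevallen']
X = TfidfVectorizer().fit_transform(docs)  # scipy.sparse CSR matrix

print(type(X), X.shape)
print(X.nnz)              # sparse storage: only the non-zero entries
print(X.toarray().size)   # dense storage: every cell of the matrix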

K-Means

The K-Means clustering algorithm first randomly selects vectors as cluster centers (centroids) in the vector space. From these initial centroids, called 'seeds', K-Means selects new centroids by trying to minimize the average Euclidean distance from the centroid to the surrounding documents in the vector space. The Euclidean distance d between vector x and vector y can be defined as d(x, y) = √((y₁ − x₁)² + (y₂ − x₂)²). The procedure is as follows: K-Means iteratively reassigns each document to its nearest centroid and recomputes each centroid as the mean of the documents assigned to it, until the assignments no longer change.
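This iterative procedure can be made concrete with a small pure-Python sketch of one K-Means iteration (assignment step followed by centroid update). It illustrates the idea only and is not the implementation used in the thesis.

import math

def euclidean(x, y):
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)))

def kmeans_step(points, centroids):
    # Assignment step: every point moves to its nearest centroid.
    assign = [min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
              for p in points]
    # Update step: every centroid becomes the mean of its assigned points.
    new_centroids = []
    for c in range(len(centroids)):
        members = [p for p, a in zip(points, assign) if a == c]
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members))
                             if members else centroids[c])
    return assign, new_centroids

points = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
print(kmeans_step(points, [(0.0, 0.0), (1.0, 1.0)]))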

For one of the medication clusters, the ten terms with the highest tf-idf scores were: 'verkeerde' (wrong), 'medicatie' (medication), 'verkeerde dosering' (wrong dosage), 'dosering' (dosage), 'verkeerde medicatie' (wrong medication), 'sticker' (sticker), 'evs', 'gegeven' (given), 'voorgeschreven' (prescribed) and 'uitgezet' (set out). This could imply that the adverse events in this cluster are related to wrong dosages of medication, possibly due to an error with the 'evs'. The features with the highest tf-idf scores for the other medication clusters and the other categories can be found in chapter 4.

To see the difference between clusters, it is useful to plot the distances between the clusters in a scatter plot. Because a scatter plot is a two-dimensional representation of the document distribution, the multidimensional data has to be transformed before it can be plotted. For this a Principal Component Analysis (PCA) is used, which is explained thoroughly in a tutorial by Bro and Smilde (2014). They apply PCA in the field of chemistry but emphasize that it is not limited to that field and is generally applicable. A PCA reduces the dimensionality of the data by merging several possibly correlated variables into new, uncorrelated variables called principal components. This is done by assigning weights to the different variables and calculating weighted averages from them, creating a so-called linear combination: each variable x_i is assigned a weight w_i and a new vector t of the form t = w₁x₁ + ⋯ + w_j x_j is created, retaining as much information as possible from the original vector. Thus, a PCA is used to preserve as much of the variation in the vectors as possible while reducing the dimensionality, because the variation, or the lack thereof, is what determines to which cluster a document is assigned. The aim is to create clusters with high intracluster similarity and low intercluster similarity, meaning that documents in a cluster should be similar to each other but dissimilar from documents in other clusters. In order to get the best distribution of clusters, K-Means clustering will be used with a starting value of K = 10 and, based on the results, new values for K will be tested.
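A sketch of this projection, assuming scikit-learn's PCA (the tf-idf matrix has to be densified first, as PCA does not accept sparse input); the documents and cluster labels are illustrative.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ['verkeerde medicatie', 'verkeerde dosering', 'patient gevallen',
        'patient gevallen uit bed', 'medicatie niet gegeven', 'dosering onjuist']
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

# Reduce the tf-idf vectors to the two principal components and plot them.
coords = PCA(n_components=2).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.savefig('scatter_example.png')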

FIGURE 3.3: Scatter plot of medication adverse events

A scatter plot uses the two principal components created by the PCA and shows the distance between the documents based on the vector values calculated by the PCA. This gives insight into the intra- and intercluster similarity: if the documents from a certain cluster are spatially close to each other in the scatter plot, the assumption is that there is a high intracluster similarity. The same concept applies to the intercluster similarity: if the clusters are spatially close to each other in the scatter plot, the assumption is that the intercluster similarity is high. These scatter plots can be made for every category of adverse events. The scatter plot for the medication adverse events can be found in figure 3.3. A full explanation of the scatter plot will be given in section 4.2.1. The scatter plots for the other adverse event categories can be found in chapter 4.

Ward’s clustering algorithm

Hierarchical clustering has no pre-specified number of clusters; instead, a certain level of depth in the hierarchy has to be chosen at which there is adequate confidence that the clusters are informative. Hierarchical clustering does, however, offer the choice between a bottom-up and a top-down approach. As their names suggest, the top-down approach starts at the top of the hierarchical tree structure and splits each of the nodes until the individual documents are reached at the bottom, while the bottom-up approach starts with the individual documents and merges those into clusters with the least between-cluster distance. The top-down approach essentially uses a flat clustering algorithm recursively, creating clusters from clusters. A possible approach would be to apply a K-Means algorithm with K = 2 recursively and stop at a level where there is enough confidence in the informativeness of the cluster distribution. An example of such an approach is illustrated in figures 3.4 and 3.5 and sketched in code after them, but it will not be used further in this project.

FIGURE 3.4: Regular K-Means clustering

FIGURE 3.5: Top-down clustering
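A brief sketch of this recursive top-down idea: K-Means with K = 2 splits a set of documents, and each half is split again until a chosen depth is reached. This is only an illustration of the approach described above, which the thesis does not pursue further.

from sklearn.cluster import KMeans

def bisect(X, indices, depth=0, max_depth=2):
    # Stop at the chosen depth or when a cluster can no longer be split.
    if depth == max_depth or len(indices) < 2:
        return [indices]
    halves = KMeans(n_clusters=2, random_state=0).fit_predict(X[indices])
    left = [i for i, h in zip(indices, halves) if h == 0]
    right = [i for i, h in zip(indices, halves) if h == 1]
    return (bisect(X, left, depth + 1, max_depth)
            + bisect(X, right, depth + 1, max_depth))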

The reason for this is that even small clusters that could potentially represent a pattern can be very interesting to look at, and such small clusters are likely to be absorbed into another (larger) cluster when an equal distribution is enforced. Patterns in the small cluster would then no longer be visible within the merged cluster.

Because the documents are clustered based on the (dis)similarity of the tf-idf values in their vector representations, it is important to normalize the documents for length. This is done by using cosine similarity. Cosine similarity is a measure used to calculate the similarity between documents or clusters based on their vector representations, taking into account the difference in length of the clusters or documents. Manning, Raghavan, and Schütze (2008) define cosine similarity as

    cos_sim(d₁, d₂) = (V(d₁) · V(d₂)) / (|V(d₁)| |V(d₂)|)

where the numerator represents the dot product of the vectors V(d₁) and V(d₂), and the denominator contains the product of their Euclidean lengths, which is used to length-normalize the vectors. From the cosine similarity, the distance between two documents can be calculated as dist(d₁, d₂) = 1 − cos_sim(d₁, d₂). From these distance calculations, a linkage matrix is created in order to map the distances.
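Assuming SciPy's hierarchical clustering routines, the distance and linkage computation just described could look like this; a dendrogram such as the one in figure 3.6 is then drawn from the linkage matrix. The documents are toy examples.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['verkeerde medicatie', 'verkeerde dosering', 'patient gevallen',
        'patient gevallen uit bed']
X = TfidfVectorizer().fit_transform(docs)

# dist(d1, d2) = 1 - cos_sim(d1, d2), condensed into the form linkage() expects.
distances = 1 - cosine_similarity(X)
condensed = squareform(distances, checks=False)

Z = linkage(condensed, method='ward')  # the linkage matrix
dendrogram(Z)
plt.savefig('dendrogram_example.png')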

FIGURE 3.6: Dendrogram of medication adverse events created by Ward's clustering algorithm

3.6 Text classification

3.6.1 Concepts

Text classification is a supervised learning technique, which can be done by making use of a predefined set of rules or by making use of machine learning. The rule-based approach defines several rules beforehand and classifies documents into separate classes based on those rules. Using machine learning, however, those rules need not be defined beforehand but can be inferred from the data. This has the added benefit of not having to maintain and alter the rules each time the data changes.

The adverse event descriptions will serve as the documents that need to be classified. Text classification is usually evaluated by comparing precision, recall, F-scores, accuracy scores and confusion matrices. Each of these is briefly summarized below, using "Introduction to Information Retrieval" by Manning, Raghavan, and Schütze and a binary classification example with classes 0 and 1 (a short code sketch of these metrics follows the list):

• Precision (P): For class 1, all documents correctly classified as class 1, divided by all documents classified as class 1. This is the fraction of documents correctly classified as 1 (true positives, tp) divided by all documents correctly (tp) and erroneously (false positives, fp) classified as 1: P = tp / (tp + fp).

• Recall (R): For class 1, all documents correctly classified as class 1, divided by all documents correctly classified as class 1 plus all documents from class 1 that were erroneously classified as class 0 (false negatives, fn): R = tp / (tp + fn).

• F-score: The harmonic mean of precision and recall, defined as F_β = ((1 + β²) · P · R) / (β² · P + R). β can be used to put emphasis on either precision or recall, with values between 0 and 1 emphasizing precision and values larger than 1 emphasizing recall.

• Accuracy score: The fraction of all correctly classified documents divided by all classified documents.

• Confusion matrix: A table layout with the numbers of all correctly and erroneously classified documents for each class.
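A minimal sketch of these metrics with scikit-learn on a toy binary example (1 = 'CIM contacted'); the label values are illustrative only.

from sklearn.metrics import (precision_score, recall_score, fbeta_score,
                             accuracy_score, confusion_matrix)

y_true = [1, 1, 0, 0, 1, 0, 0, 0]   # toy gold labels
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]   # toy system predictions

print(precision_score(y_true, y_pred))        # tp / (tp + fp)
print(recall_score(y_true, y_pred))           # tp / (tp + fn)
print(fbeta_score(y_true, y_pred, beta=2.0))  # beta > 1 emphasizes recall
print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))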

3.6.2 Implementation

TABLE 3.8: Distribution of classes

Label   Number of occurrences
Yes     93
No      8776

The label being used for classification is a binary class that shows whether or not the CIM has been contacted about an adverse event potentially being harmful for the hospital or its patients. Unfortunately, that label is not available for all data. The UMCG started …
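A possible set-up for this classification step, assuming scikit-learn, is sketched below. The choice of a linear SVM with balanced class weights (to counter the 93-versus-8776 skew shown in table 3.8) is an illustrative assumption, not the classifier specified by the thesis, and the documents and labels are toy stand-ins.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the adverse event descriptions and the binary CIM label.
docs = ['patient kreeg verkeerde medicatie', 'cim gecontacteerd na ernstige val',
        'infuus te laat vervangen', 'verkeerde dosering toegediend']
labels = [0, 1, 0, 0]  # 1 = CIM contacted ('Yes'), 0 = not contacted ('No')

clf = make_pipeline(TfidfVectorizer(),
                    LinearSVC(class_weight='balanced'))  # counters the class skew
clf.fit(docs, labels)
print(clf.predict(['patient ernstig gevallen']))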
