Classifying patient specific documents in the clinical domain: Using a Bag-of-Words representation and machine learning


Classifying patient specific documents in the clinical domain

Using a Bag-of-Words representation and machine learning

Casper van der Kerk

11356847

A thesis presented for the degree of

Master of Science

Master Medical Informatics

University of Amsterdam

The Netherlands

25-06-2019

Classifying patient specific documents in the clinical domain

Using a Bag-of-Words representation and machine learning

Student

Name Casper van der Kerk

E-mail c.o.vanderkerk@amsterdamumc.nl

Student No. 11356847

SRP Mentors

Name Matthan W.A. Caan, PhD

E-mail m.w.a.caan@amsterdamumc.nl

Department Biomedical Engineering & Physics

Institution Amsterdam UMC

Name Ricardo R. Lopes

E-mail r.riccilopes@amsterdamumc.nl

Department Biomedical Engineering & Physics

Institution Amsterdam UMC

SRP Address

Amsterdam University Medical Center - Location AMC
Department Biomedical Engineering & Physics

Meibergdreef 9 1105 AZ Amsterdam


Name Dr. Danielle Sent, assistant professor

E-mail d.sent@amsterdamumc.nl

Department Medical Informatics

Institution Amsterdam UMC - UvA

Practice teaching period November 2018 - June 2019

Abstract

Electronic health records (EHRs) in hospitals are known for their large repositories of unstructured data. Adding a layer of structure to this data may increase the productivity of medical professionals, reduce the frequency of errors, and reduce costs. This thesis explored the application of machine learning algorithms to classify medical documents from an EHR using various techniques. Free-text medical documents from the MIMIC-III database were used to show that a random forest classifier or, in some cases, a linear classifier is sufficient to classify the documents into pre-determined classes. Additionally, a short literature survey indicated that new information extraction techniques have been developed that can report information from free-text documents, such as medication dosage. Techniques like this may allow free-text documents to be clustered based on the context of their contents.

Keywords: Information Extraction, Machine Learning, Multi-label classification, Word Embedding

Samenvatting

The electronic health record (EHR) is known for its large collection of unstructured data. Structuring this data can increase the productivity of medical professionals, lower the prevalence of errors, and save costs. This thesis investigates the application of a machine learning algorithm to classify medical documents from the EHR using various techniques. This thesis used free-text documents from the MIMIC-III database to show that a random forest classifier or, in some cases, a linear classifier is sufficient to classify medical documents into pre-determined classes. Additionally, a short literature survey showed that new information extraction techniques have been developed that can report information from free-text fields, such as medication dosage. Techniques like these can be used to cluster free-text documents based on the context of their contents.

Contents

1 Introduction
  1.1 Study Objectives
  1.2 Outline of this thesis
2 Background
  2.1 EHR documents
      2.1.1 Classifying free text
  2.2 Bag-of-Words representation
      2.2.1 Word vectors
      2.2.2 Term Frequency - Inverse Document Frequency
  2.3 Classifiers
      2.3.1 Support Vector Machine
      2.3.2 Random Forests
  2.4 Grid Search
  2.5 Visualization techniques
      2.5.1 Confusion Matrix
      2.5.2 Pairplot
      2.5.3 t-Distributed Stochastic Neighbor Embedding
3 Method
  3.1 MIMIC-III
  3.2 Preprocessing
      3.2.1 Data cleaning
      3.2.2 Data loading
      3.2.3 Word Vectorizing
  3.3 Experiments
      3.3.1 SVM
      3.3.2 Random Forest
  3.4 Visualization
      3.4.1 Confusion Matrix
      3.4.2 t-SNE
      3.4.3 Relative Feature Importance
4 Results
  4.1 TF-IDF Vectorizer
  4.2 SVM
  4.3 Random Forest
5 Discussion
  5.1 Reflection on results
  5.2 Alternative data source
  5.3 Bag-of-Words weaknesses
  5.4 Neural Networks
  5.5 Information Extraction
6 Conclusion
A Appendix
  A.1 Decision Boundary Plots
  A.2 Individual t-SNE plots
  A.3 t-SNE Subclusters

1 Introduction

Electronic health records (EHRs) in hospitals are known for their large repositories of unstructured data [1]. Unstructured data is data that lacks a pre-defined organization, making it difficult to interpret for traditional programs. One example of such a repository of unstructured data are the medical documents which are stored within the EHR. These documents contain correspondence about the patient that has been sent to the hospital. They include a wide variety of document types, such as discharge letters from the hospital, referral letters from general practitioners, nursing reports, and others. Since these documents are unstructured, medical professionals lose much time searching through them [2]. Applying accurate classification tags to these documents may increase the productivity of medical professionals, reduce the frequency of errors, and reduce costs [3, 4].

Medical professionals have complained about the difficulty of finding the documents they are looking for in patient files. Currently, all these documents are presented in one long list, without any descriptive file names. While classification tags are present, they are not always accurate. As a result, medical professionals may need to search through multiple documents with the same classification tag and non-descriptive file names. Furthermore, the majority of these documents are not generated from a template and are written in free text. Documents in free-text form allow the writer to document their findings more naturally than structured data like a template does [5].

To achieve this classification task, good-quality information extraction (IE) is required to extract from these free-text documents the information that defines a document's class. A machine learning algorithm can be used for this goal [6]. The idea of automatically extracting clinical information from free-text documents within an EHR is not novel, and literature discussing automated IE dates back to the early '70s [7]. However, most publications on the matter appeared after the '90s [8]. While there are many new publications on document classification using machine learning algorithms, there has not been a major implementation yet.

Previous research has already shown that machine learning algorithms can provide a good solution for IE in classification tasks. For example, neural networks show promise for automatically assigning correct MeSH terms to biomedical articles and for automatically assigning ICD-9 codes to medical records [9, 10]. It has also been shown that a good application of IE is a natural language processing (NLP) system that can extract predefined types of information from text documents in different clinical domains [5].

Neural networks have shown promise for these types of classification tasks. However, this thesis takes a different approach and investigates the effectiveness of a non-neural-network machine learning technique.

1.1 Study Objectives

This thesis aims to answer the following research questions:

• How accurate is the classification of medical documents in an EHR using a machine learning algorithm?

• What kind of information can be extracted from medical documents?

This thesis provides insight into different approaches for document classification using a machine learning algorithm, and investigates the performance of different implementations of these techniques.

1.2 Outline of this thesis

Chapter 2 provides background on the domain of this project and describes the documents that are classified in this project. Furthermore, this chapter explains the techniques used in this project, such as word vectorizing, support vector machines, and random forests. Chapter 3 goes over the methods and materials used in the experimental phase of this project. Chapter 4 then presents the results obtained from these experiments. Chapter 5 discusses the results of the experiments, compares them to related work in the literature, and postulates some ideas for future research. Finally, Chapter 6 provides a conclusion on the whole project.

2 Background

2.1 EHR documents

Medical documents are omnipresent within the EHR. They are used to record the health of patients and form a means of communication between medical professionals. When a patient is admitted to the hospital, they start accumulating documents in their patient file. For example, a radiologist or an ECG operator writes a report to inform the treating physician about the results of the investigation of their patient. Another example is physicians themselves writing reports in the EHR on their findings for future reference, like discharge summaries or consult reports. General practitioners or physicians from other hospitals may send letters or reports to the hospital a patient is referred to, or a nurse on the nursing ward may want to make a note in the file of a patient. All these documents are placed within the file of the patient. While these documents already have some sort of classification tag in the EHR, the tagging is often done manually and, as a result, is not always accurate.

2.1.1 Classifying free text

Free-text documentation gives the writer more freedom to express their thoughts and to add nuance to their work. For example, a note from a physician may contain mixed positive and negative assessments (e.g. “Normal body temperature, high blood pressure.”). Notes like this contain a lot of information that is not easily extracted by a computer. To classify documents based on the information they contain, an Information Retrieval (IR) method is needed. A traditional IR system builds an index of terms from the provided documents and records the term locations and frequencies [6]. These terms are also called features.

These features are not pre-defined; they are extracted by the IR system based on the occurrence of all the text in the provided documents. The feature extraction representation focused on in this thesis is the Bag-of-Words representation.

2.2 Bag-of-Words representation

The Bag-of-Words (BoW) representation is a frequently used feature extraction technique. In a BoW representation, text is represented in such a way that it describes the occurrence of words within the document. It is described as a “bag” because any information about the order of words or their structure is lost. Therefore, this representation captures only the presence of words, not their context. The BoW representation relies on the assumption that documents that are similar in meaning contain similar content, as well as the assumption that something can be learned from the content of a document without its context [11]. However, algorithms do not work well with raw words; the text must first be converted to numbers. Word vectors are a method of converting words into sequences (vectors) of numbers that describe the meaning of those words. Words that are semantically similar have similar vectors. This is called word embedding [12]. Section 2.2.1 provides more background on word vectors.

2.2.1 Word vectors

The collection of all documents used is called a corpus. Suppose, for example, that the corpus contains the following four documents:

• It was the best of times
• It was the worst of times
• It was the age of wisdom
• It was the age of foolishness

A word vocabulary can now be constructed, based on the corpus, using only the unique words. This yields a vocabulary of ten words, also known as “tokens”:

• It
• was
• the
• best
• of
• times
• worst
• age
• wisdom
• foolishness

The free-text documents in the corpus can now be converted to word vectors. Since the vocabulary is 10 tokens long, each document in the corpus can be described by a vector of 10 positions, each position representing a token in the vocabulary.

The easiest way of converting free text into word vectors is by assigning a boolean value to each position in the word vector: 0 if the token from the vocabulary is absent from the document and 1 if the token is present in the document. For the first document in the corpus the vector would be constructed as follows:

• It = 1
• was = 1
• the = 1
• best = 1
• of = 1
• times = 1
• worst = 0
• age = 0
• wisdom = 0
• foolishness = 0

When all documents in the corpus are scored, the following four vectors are constructed:

“It was the best of times” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
“It was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
“It was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
“It was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

Now each document is represented by a series of numbers that describes the content of the document, while discarding its context. These vectors could now be used for modelling. However, as the number of tokens in the vocabulary grows, so does the length of the vector. As can be seen in the example above, these vectors contain a lot of 0-values, making them “sparse vectors” (see Figure 2.1). Sparse vectors take up a lot of space in computer memory, decreasing the performance of modelling. Therefore, it is important to decrease the size of the vocabulary by decreasing the number of unique tokens. This can be done via stemming or removing stopwords [13, 14].

Sparse vector = [0, 0, 0, 1, 0]
Dense vector = [1, 1, 0, 1, 1]

Figure 2.1: Difference between a sparse and a dense vector. A sparse vector has many 0-values; a dense vector has few 0-values.
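As a minimal sketch, the boolean scoring above can be reproduced in a few lines of Python. The vocabulary is built in order of first appearance so that the positions match the listing above; a library vectorizer such as scikit-learn's CountVectorizer would order its vocabulary alphabetically instead.

corpus = [
    "It was the best of times",
    "It was the worst of times",
    "It was the age of wisdom",
    "It was the age of foolishness",
]

# Build the vocabulary from unique tokens, in order of first appearance.
vocabulary = []
for document in corpus:
    for token in document.split():
        if token not in vocabulary:
            vocabulary.append(token)

# Boolean scoring: 1 if the token occurs in the document, 0 otherwise.
vectors = [[int(token in document.split()) for token in vocabulary]
           for document in corpus]

print(vocabulary)  # ['It', 'was', 'the', 'best', 'of', 'times', 'worst', ...]
print(vectors[0])  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]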

Reducing tokens

An easy and effective way of reducing the size of the vocabulary is character normalization [15]. This technique removes all punctuation, converts the whole vocabulary to lowercase, and removes any non-alphanumeric characters.

Stopwords can be removed to further decrease the number of tokens in the vocabulary [14]. Stopwords are the most common words used in a given language, usually short words such as the, a, or, and which. Since these words are used so often, they are not very useful for distinguishing one type of document from another and can therefore be disregarded.

Another useful way to reduce the number of unique tokens in the vocabulary is stemming. Stemming reduces a token back to its root (or stem). This causes tokens that appear in different forms to be represented as the same token. For example, the tokens connecting, connection, connections, and connected can all be written as connect. Furthermore, stemming also reduces tokens that are in different tenses. For example, am, are, and is can be written as be. In essence, stemming is a crude method that usually removes the end of a token, like -ed, -ing, and -ion. One of the more popular and effective stemming algorithms is Porter's algorithm [13].
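As a small illustration, NLTK's implementation of Porter's algorithm reduces the example tokens above to a single stem:

from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["connecting", "connection", "connections", "connected"]:
    print(word, "->", stemmer.stem(word))  # all four stem to "connect"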

Although it is not used in this thesis, it must be noted that lemmatization is a more sophisticated method than stemming [16]. Lemmatization uses a vocabulary and morphological analysis of a token with the aim of returning that token to its dictionary form, also known as a lemma. Whereas stemming might reduce the token saw to only the letter s, lemmatization may return either saw or see depending on whether it was a noun or a verb.

2.2.2 Term Frequency - Inverse Document Frequency

In the task of classifying documents, using the whole vocabulary is inefficient, especially in corpora with large documents. Therefore, the tokens in the vocabulary should be weighted by their importance to the type of document. When a token is used often within a document, it will have a high term frequency. However, if this same token is used in many of the documents in the corpus, it is not a very important token for that particular type of document (the next section provides an example). In other words, it is a poor token for distinguishing one type of document from another. Therefore, word vector scores are usually weighted by converting them to a term frequency - inverse document frequency (TF-IDF) score [17]. In TF-IDF, the occurrence of a token within a document is counted; this is the term frequency. The term frequency is then adjusted by how often the token occurs in the whole corpus; this is the inverse document frequency. By applying this technique, tokens that occur frequently throughout the corpus are penalized, while rare words get a higher score.

The highest scoring tokens can then be used for classification. These tokens are called features.

A technical example

Let there be a corpus with two documents. The tokens of each document are listed below. Document 1 contains a total of 6 words, document 2 contains a total of 8 words, and both documents have 4 tokens.

Document 1:

Token    Occurrence
This     1
Is       2
A        2

Document 2:

Token    Occurrence
This     1
Is       2
Another  3

Thus, the term frequency for the token “this” is calculated as follows:

tf(“this”, d1) = 1/6 ≈ 0.17
tf(“this”, d2) = 1/8 ≈ 0.13

The inverse document frequency is a constant for each token. It is calculated by taking the logarithm of the number of documents in the corpus divided by the number of documents in which the token occurs. In the case of the token “this”, it occurs in two documents, and the corpus consists of two documents:

idf(“this”, D) = log(2/2) = 0

The TF-IDF score is then calculated by multiplying the term frequency score by the inverse document frequency score:

tfidf(“this”, d1, D) = 0.17 × 0 = 0
tfidf(“this”, d2, D) = 0.13 × 0 = 0

The token “this” has such a low TF-IDF score because it occurs in every document in the corpus. To achieve a high TF-IDF score, a token must occur frequently within a document but be rare within the corpus. Consider, for example, the TF-IDF score for the token “another” in document 2:

tf(“another”, d2) = 3/8 ≈ 0.38
idf(“another”, D) = log(2/1) ≈ 0.30
tfidf(“another”, d2, D) = 0.38 × 0.30 = 0.114

Since the token “another” does not occur in document 1, the term frequency for this token in document 1 is zero (0/6 = 0). Therefore, the TF-IDF score for this token in document 1 is also zero.
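The calculation above can be verified with a short Python sketch. Each document is summarized by its token counts and total word count as given in the tables (the tables list only part of each document's contents, so the totals are taken as stated), and the logarithm is base 10, matching the numbers above:

import math

# Token counts and total word counts, as given in the tables above.
d1 = {"counts": {"this": 1, "is": 2, "a": 2}, "length": 6}
d2 = {"counts": {"this": 1, "is": 2, "another": 3}, "length": 8}
corpus = [d1, d2]

def tf(token, doc):
    return doc["counts"].get(token, 0) / doc["length"]

def idf(token, corpus):
    n_containing = sum(1 for doc in corpus if token in doc["counts"])
    return math.log10(len(corpus) / n_containing)

def tfidf(token, doc, corpus):
    return tf(token, doc) * idf(token, corpus)

print(tfidf("this", d1, corpus))     # 0.0, since idf("this") = log(2/2) = 0
print(tfidf("another", d2, corpus))  # ~0.11, since 3/8 * log10(2/1) ≈ 0.38 * 0.30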

2.3 Classifiers

In machine learning there are two main types of learning: supervised learning and unsupervised learning. In supervised machine learning there is prior knowledge about what the output of the machine learning algorithm should be, such as a pre-defined class or label. Unsupervised machine learning does not have a pre-defined output and tries to learn some underlying structure in the data. In the case of this thesis, the output is a pre-defined document class. Therefore, the machine learning task focused on in this thesis is a supervised one. One such supervised machine learning algorithm is a classifier. A classifier is used to train a model, often treated as a black box, that can predict the output class with a certain accuracy.

Many different classifiers exist. This thesis focuses on two of them: the support vector machine (SVM) and the random forest classifier.

2.3.1 Support Vector Machine

The support vector machine (SVM) is a supervised learning algorithm prominent in text classification tasks [18]. An SVM in its simplest form is a linear SVM. A linear SVM separates data belonging to different classes by forming a hyperplane with maximum margins [19] (see Figure 2.2). Datapoints that define the margins are called support vectors. The space between the two margins is called the decision surface.

Figure 2.2: A linear SVM trained with samples from two classes. The goal is to maximize the margin. Samples on the margin are called the support vectors.

The broadness of the decision surface depends on the degree of tolerance (also known as the C-value). Larger C-values increase the penalty when the SVM makes a misclassification. Therefore, the decision surface gets narrower as the C-value increases, to reduce the margin of error (see Figure 2.3).

(a) C = 1 (b) C = 0.01

Figure 2.3: Two figures demonstrating the effect of different C-values on the margin of a linear SVM. Image from Github [20].

Multiple studies conclude that a linear SVM is a formidable method for text classification [18]. Furthermore, while these studies only focused on text classification with two classes, other research has shown that SVMs are also capable of multiclass text classification [21].

However, not every problem can be solved linearly, as can be seen in Figure 2.4a: no straight hyperplane can be constructed that perfectly separates the two classes. An SVM can therefore be made into a non-linear classifier by performing a “kernel trick”. This allows the SVM to construct a hyperplane that is non-linear in the feature space of the input data. There are multiple types of non-linear SVMs, depending on the kernel used. Some common kernels are the polynomial kernel (Figure 2.4b) and the radial basis function (RBF) kernel (Figure 2.4c).

(a) Non-linear problem (b) Polynomial kernel (c) RBF kernel

Figure 2.4: Non-linear problem solved with different SVM kernels.
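A minimal scikit-learn sketch of these three kernels on a toy non-linear problem (the interleaving half-moons dataset, standing in for Figure 2.4a); the linear kernel cannot separate the classes perfectly, while the polynomial and RBF kernels can fit a curved boundary:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A toy two-class problem that no straight hyperplane separates perfectly.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, clf.score(X, y))  # training accuracy per kernel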

2.3.2 Random Forests

A random forest consists of an ensemble of multiple decision trees aimed at a classification or regression task. A decision tree is a method to model a set of sequential, hierarchical decisions that lead to some final output (see Figure 2.5). The concept of ensemble learning is that a group of weak learners (decision trees) can combine to form a strong learner (random forest) [22].

Figure 2.5: A simple example of a decision tree. Each point where the tree splits is called a node. The numbers at the bottom represent the (binary) class to which the data in that branch of the tree is classified.

A random forest starts by creating many bootstrap samples [23]. These samples are randomly taken from the input data with replacement until each bootstrap sample is as large as the input data. This means certain datapoints are represented in a bootstrap sample more than once, while others are not present at all. The datapoints that are not present in a bootstrap sample are called out-of-bag samples and form a test set for that bootstrap sample. Next, a decision tree is fit to each bootstrap sample. In a decision tree the data is split by binary partitioning into homogeneous regions. These regions are called nodes. The split is made according to some scoring method, usually the Gini index [24]. Furthermore, at each node only a small random subset of the features is available for the partitioning, usually around the square root of the total number of features. The splitting of the data at nodes continues until a division no longer reduces the Gini index, after which the branch terminates in a “leaf node”. Each decision tree is then used to predict its out-of-bag samples. Finally, the outcomes for each out-of-bag sample from all decision trees are averaged to yield the final prediction, with ties split randomly [25].

The decision trees in a random forest can grow considerably large, especially if all branches are allowed to split until they reach a leaf node. Not all of these splits are necessary for the accuracy of the model. Furthermore, a decision tree that fits the training data perfectly has lower accuracy on new data than a less perfect tree, due to overfitting [26]. Such a decision tree can be made smaller, also known as “pruning”, by removing branches that do not have much impact on the final score. Pruning prevents overfitting and achieves higher accuracy on new data.
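A minimal sketch of this procedure with scikit-learn's RandomForestClassifier, using the iris toy dataset as stand-in data; setting oob_score=True makes the forest score each tree on its out-of-bag samples, as described above:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# max_depth=None lets every branch grow until it reaches a leaf node;
# oob_score=True scores each tree on the samples left out of its bootstrap.
forest = RandomForestClassifier(n_estimators=30, max_depth=None,
                                oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)  # out-of-bag accuracy estimate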

2.4 Grid Search

A hyperparameter is a parameter that controls the learning process of a model [27]. Examples of hyperparameters are the kernel of a support vector machine or the maximum tree depth of a random forest. To increase the performance of a model, it is important to optimize these hyperparameters as much as possible. A grid search is a technique used to find the set of hyperparameters that results in the most accurate predictions. A grid search is an exhaustive search through a manually defined subset of hyperparameter values. A model is trained with each set of hyperparameters from the grid, and their performance is evaluated using a dedicated evaluation set that was not used during the training phase. The best performing set of hyperparameters is then retained for further use.
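A minimal sketch with scikit-learn's GridSearchCV, which implements this exhaustive search; here cross-validation stands in for the dedicated evaluation set described above, and the grid values are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination of hyperparameter values in the grid is trained and scored.
param_grid = {"C": [0.01, 0.1, 1.0, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)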

2.5 Visualization techniques

This section will explain how to interpret some of the visualization techniques used in this thesis.

2.5.1 Confusion Matrix

A confusion matrix is a way of describing the performance of a supervised learning algorithm. Each row represents one of the actual class labels, while each column represents a predicted class label. This makes it easy to read which classes had wrong predictions, to which class a sample was wrongly assigned, and how many of the samples of that class were predicted wrongly. Because of this, an algorithm with 100% accuracy would produce a perfectly diagonal matrix. Figure 2.6 shows an example of a confusion matrix using percentages instead of absolute values. In this example, 77% of the samples belonging to class B are predicted correctly; 9% of the class B samples are predicted as belonging to class A, and 14% as belonging to class D.

Figure 2.6: Example of a confusion matrix. The rows represent the actual labels, while the columns represent the predicted labels. Label A is correctly predicted 94% of the time, label B 77%, while labels C, D, and E are correctly predicted 100% of the time.
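As a small sketch, such a matrix can be computed with scikit-learn; normalizing each row turns the counts into the per-class fractions used in Figure 2.6 (the labels and predictions here are made up for illustration):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ["A", "A", "A", "B", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "A", "B", "B", "C", "C"]

cm = confusion_matrix(y_true, y_pred, labels=["A", "B", "C"])
# Divide each row by its total so cells show fractions of the actual class.
print(cm / cm.sum(axis=1, keepdims=True))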

2.5.2 Pairplot

A pairplot visualizes the relation between pairs of variables as well as the distribution of each single variable (see Figure 2.7). Pairplots are useful when the number of variables or features is small, or when insight into certain specific variables is needed.
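A plot like Figure 2.7 can be reproduced with a few lines of seaborn, which ships the iris dataset used in that figure:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")    # the dataset used in Figure 2.7
sns.pairplot(iris, hue="species")  # scatter pairs plus per-variable distributions
plt.show()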

2.5.3 t-Distributed Stochastic Neighbor Embedding

A good way to gain insight into a dataset is by plotting the datapoints. However, if the data contains more than two dimensions, it can be difficult to create a plot that is easy to interpret. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique to transform high dimensional data into lower dimensional data while retaining information. Specifically, similar objects in the high dimensional data are plotted as nearby points, and dissimilar objects are plotted as distant points. In essence, it plots the data in fewer dimensions while maintaining the distribution of the input data. t-SNE is a formidable tool for visualizing clustering in the data.

The t-SNE technique is similar in function to principal component analysis (PCA). However, whereas PCA is a purely mathematical technique, t-SNE is a probabilistic technique [28]. The drawback of t-SNE being probabilistic is that it is a computationally heavy technique. Therefore, when working with very high dimensional data, it is recommended to first reduce the number of dimensions using a more efficient technique like PCA prior to using t-SNE.
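A minimal sketch of this PCA-then-t-SNE recipe in scikit-learn, with the 64-dimensional digits toy dataset standing in for high dimensional input:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional datapoints

# First reduce the dimensionality cheaply with PCA, then embed into 2D.
X_pca = PCA(n_components=50, random_state=0).fit_transform(X)
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_pca)
print(X_2d.shape)  # (n_samples, 2), ready for a scatter plot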


Figure 2.7: Example of a pairplot constructed using the iris test data and the seaborn package for Python. The distribution of a single variable can be seen on the diagonal. The relation between two variables can be seen in the lower triangle. The upper triangle mirrors the lower triangle.

3 Method

3.1 MIMIC-III

The source of the data in this study is the Medical Information Mart for Intensive Care III (MIMIC-III) [29]. MIMIC-III consists of deidentified health-related data associated with 53,423 distinct hospital admissions of patients aged 16 or above to the intensive care unit of the Beth Israel Deaconess Medical Center between 2001 and 2012, and is freely available.

The MIMIC-III database contains over 2 million free-text documents which are labeled into the following fifteen classes:

• Case Management
• Consult
• Discharge Summary
• ECG
• Echo
• General
• Nursing
• Nursing other
• Nutrition
• Pharmacy
• Physician
• Radiology
• Rehab Services
• Respiratory
• Social Work

The distribution of the documents over these classes is represented in Figure 3.1.

Figure 3.1: Histogram representing the distribution of the free-text documents from the MIMIC-III database

3.2 Preprocessing

3.2.1 Data cleaning

The Natural Language Toolkit (NLTK) was used for cleaning the data [30]. As a result of deidentifying the MIMIC data, much information in the documents became meaningless and had to be cleaned. This mainly concerned dates within the documents (see Figure 3.2).

Planned Discharge Date: [**2115-4-1**] Insurance Update Primary insurance / reviewer: Hospital days authorized to: Current Discharge Plan: Home with services [**Month (only) 60**] require VNA services Barrier(s) To Discharge: None Family Meeting: Yes Referrals: Narrative / Plan: Unclear what level of services will be required at discharge. The patient remains intubated and vented in the ICU and will likely require some skilled nursing services when stable to be discharged. Case Management will follow for DC needs. If VNA services needed can use: [**Company 980**] - [**Telephone/Fax (1) 981**] Partners [**Name (NI) **] [**Name2 (NI) **] - [**Telephone/Fax (1) 982**]

Figure 3.2: Example of a document from the Case Management class. Any identifiers like names, phone numbers or dates have been replaced with random numbers and have been marked with two asterisks on either side.

Additionally, all punctuation marks were removed, and each document was fully converted to lowercase and cut up into individual words; the latter is also known as tokenization. To further save on memory, the English NLTK stopwords corpus was used to remove commonly used words that do not hold much information, such as “a”, “an”, and “the”, and each word was reduced to its stem using the Porter stemmer from NLTK.
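A minimal sketch of such a cleaning pipeline with NLTK; the regular expression for the de-identification placeholders is an assumption based on the [** ... **] format shown in Figure 3.2:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # one-time download of the stopword corpus

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean(text):
    # Drop the [** ... **] de-identification placeholders, remove punctuation,
    # lowercase, tokenize on whitespace, remove stopwords, and stem.
    text = re.sub(r"\[\*\*.*?\*\*\]", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [STEMMER.stem(token) for token in text.split()
            if token not in STOPWORDS]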

3.2.2 Data loading

Every document in the corpus received a label between 0 and 14 based on its class. Since the data is heavily skewed towards some classes (see Figure 3.1), and since processing the entire dataset is very time consuming, a cutoff function was constructed. This function allows the user to set a maximum number of samples for each class. To ensure each class contains a homogeneous mix of documents, each class was shuffled before the samples were picked. Experiments were conducted with a cutoff of 10,000 unless mentioned otherwise, meaning each class had a maximum of 10,000 documents.
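A sketch of such a cutoff function, assuming the notes are held in a pandas DataFrame with a CATEGORY column as in MIMIC-III's NOTEEVENTS table:

import pandas as pd

def apply_cutoff(notes: pd.DataFrame, cutoff: int = 10_000) -> pd.DataFrame:
    # Shuffle within each class, then keep at most `cutoff` documents per class.
    return (notes.groupby("CATEGORY", group_keys=False)
                 .apply(lambda g: g.sample(n=min(len(g), cutoff),
                                           random_state=0)))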

3.2.3 Word Vectorizing

Every document was converted into numerical values using the TfidfVectorizer function. This function creates a TF-IDF Bag-of-Words model by assigning a numerical value to every word based on how often it appears in the text, and then rescaling these values based on how often the words appear across all documents. The TfidfVectorizer was configured with three variables. A variable for the maximum number of features determined how many tokens with the highest TF-IDF values would be considered features to pass to the classifier. Two variables determined the minimum and maximum document frequency (min df and max df, respectively); these dictated how frequently a token must occur in the corpus, minimally and maximally, to be considered a feature. The minimum was set as an absolute value, while the maximum was set as a ratio.
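A sketch of this configuration with scikit-learn's TfidfVectorizer; the toy documents are placeholders, and the three values shown are the best-performing ones reported in Chapter 4:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the cleaned note texts from Section 3.2.1.
documents = ["ecg sinus rhythm normal"] * 6 + ["chest radiograph clear lungs"] * 6

vectorizer = TfidfVectorizer(max_features=750,  # keep the 750 highest-scoring tokens
                             min_df=5,          # token must occur in at least 5 documents
                             max_df=0.7)        # and in at most 70% of all documents
X = vectorizer.fit_transform(documents)         # sparse document-term TF-IDF matrix
print(sorted(vectorizer.vocabulary_))           # the retained feature tokens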

3.3 Experiments

Two different classifiers were tested on the Bag-of-Words model: the SVM and the Random Forest. First, the TF-IDF data was shuffled and split into 80% training data and 20% test data. Then, to determine the optimal hyperparameters, a grid search was conducted for each classifier. After the model was trained on the training data, it was tested against the test data and an accuracy score was calculated. Finally, the results were visualized using the techniques described in Section 3.4.

All experiments were run in a Python 3.6.8 environment.

3.3.1 SVM

The SVM classifier function used was the C-Support Vector Classification (SVC) function from scikit-learn [31, 32].

Grid Search

For the grid search of the TF-IDF vectorizer and the SVM classifier, the hyperparameter values presented in Table 3.1 were used.

Max. Features: 100, 250, 500, 750
Min. Document Frequency: 5, 10, 20, 30
Max. Document Frequency: 0.7, 0.8, 0.9
C-value: 0.001, 0.01, 0.1, 1.0, 10
Kernel: Linear, Polynomial, Radial Basis Function

Table 3.1: Values of the hyperparameters used in the SVM grid search. The first three rows are hyperparameters belonging to the TF-IDF vectorizer; the remaining rows belong to the SVM classifier.
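A sketch of how this search can be wired together with a scikit-learn Pipeline, mirroring the grid of Table 3.1; train_texts and train_labels stand for the 80% training split described in Section 3.3:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])

# The grid mirrors Table 3.1; "poly" and "rbf" are scikit-learn's names
# for the polynomial and radial basis function kernels.
param_grid = {
    "tfidf__max_features": [100, 250, 500, 750],
    "tfidf__min_df": [5, 10, 20, 30],
    "tfidf__max_df": [0.7, 0.8, 0.9],
    "svm__C": [0.001, 0.01, 0.1, 1.0, 10],
    "svm__kernel": ["linear", "poly", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, n_jobs=-1)
# search.fit(train_texts, train_labels)  # train_texts/train_labels: 80% split
# print(search.best_params_)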

3.3.2 Random Forest

The Random Forest classifier function used was the RandomForestClassifier function from scikit-learn.

Grid Search

For the grid search of the TF-IDF vectorizer and the Random Forest classifier, the hyperparameter values presented in Table 3.2 were used.

Max. Features: 100, 250, 500, 750
Min. Document Frequency: 5, 10, 20, 30
Max. Document Frequency: 0.7, 0.8, 0.9
No. of Estimators: 10, 20, 30
Tree Depth: 5, 10, 25, None

Table 3.2: Values of the hyperparameters used in the Random Forest grid search. The first three rows are hyperparameters belonging to the TF-IDF vectorizer; the remaining rows belong to the Random Forest classifier.

Different size corpus

An additional experiment was conducted with the Random Forest classifier to investigate the effect of corpora of different sizes. In essence, this means varying the cutoff variable. Furthermore, the other variables were also altered manually to further explore their effect on the accuracy of the model. While this method is more manual than the grid search, it allows for more precise exploration of the effects of the variables on the accuracy. The values used in this manual search are presented in Table 3.3.

Cutoff: 100, 1,000, 10,000, 200,000
Max. Features: 50, 100, 200, 500
Min. Document Frequency: 5, 10, 20
Max. Document Frequency: 0.7, 0.8, 0.9
No. of Estimators: 5, 10, 20
Tree Depth: 5, 10, None

Table 3.3: Values of the variables used in the manual Random Forest search. The first four rows are variables belonging to the TF-IDF vectorizer; the remaining rows belong to the Random Forest classifier. The Cutoff variable dictates the maximum number of documents in a class.

3.4 Visualization

This section touches upon the visualization techniques used. A more in-depth explanation can be found in Section 2.5.

3.4.1 Confusion Matrix

The performance of every classifier was plotted as a confusion matrix. The confusion matrix gives insight into which classes were predicted wrongly and into which classes those samples were misclassified.

3.4.2 t-SNE

A t-SNE plot was created from the TF-IDF Bag-of-Words data. This plot provided insight into the structure of the data by reducing its dimensionality to two dimensions. Using this technique, clusters could be identified in the data.

Additionally, the decision boundaries of each fitted classifier were plotted on top of the t-SNE plot. This plot provided insight into how the classifier learned to classify the different clusters.

3.4.3 Relative Feature Importance

The top twenty most important features of each model were plotted against their relative importance. These were the extracted features that had the most impact in classifying documents. Additionally, the top five most important features were plotted against each other in a pairplot using the seaborn API. The pairplot gives insight into the distribution of each feature on the diagonal, and into how the features are correlated to each other in the other plots.

However, due to time constraints and limitations of the scikit-learn library, these plots could only be generated for the Random Forest classifier.

4 Results

4.1 TF-IDF Vectorizer

The grid search for both the SVM and the Random Forest classifier returned that a TF-IDF vectorizer configured with maximally 750 features, a minimum document frequency of 5, and a maximum document frequency of 70% yielded the best performance. With these settings the t-SNE plot in Figure 4.1 was constructed. t-SNE plots of each individual class can be found in Appendix A.2. It appears that document classes are clustered together automatically, with certain classes containing subclusters. From one of these subclusters a sample was taken to investigate the documents in these clusters. The first thirty tokens of these document samples are presented, along with their approximate position in the t-SNE plot, in Appendix A.3. The distribution of the test and train data can be seen in Figure 4.2.

4.2 SVM

The grid search for the SVM returned that the best performance is obtained with the radial basis function (RBF) kernel and a C-value of 1.0. The confusion matrix for the SVM classifier with these settings is presented in Figure 4.3. The decision boundary of an SVM with these settings, plotted on the t-SNE plot, can be found in the appendix (Figure A.1).

4.3 Random Forest

The grid search for the Random Forest classifier returned that 30 estimators with no maximum tree depth set yields the best performance. These settings yielded the confusion matrix shown in Figure 4.4. The decision boundary of a Random Forest classifier with these settings, plotted on the t-SNE plot, can be found in the appendix (Figure A.2). The twenty most important features in this model are plotted in Figure 4.5; the top five of these features were used to generate a pairplot (Figure 4.6).

The top two accuracy scores of every cutoff value in the manual search are presented in Table 4.1, sorted by highest accuracy.

Figure 4.1: t-Distributed Stochastic Neighbor Embedding (t-SNE) plot of the vectorized data of the highest scoring model. Variables: maximum features=500, minimum document frequency=5, maximum document frequency=70%.


Figure 4.2: A histogram presenting the distribution of the training and test data in the experiments.

Figure 4.3: Confusion matrix of an SVM classifier configured with the radial basis function kernel and a C-value of 1.0. TF-IDF vectorizer settings: maximum features=750, minimum document frequency=5, maximum document frequency=70%.


Figure 4.4: Confusion matrix of a Random Forest classifier configured with 30 estimators and no maximum tree depth. TF-IDF vectorizer settings: maximum features=750, minimum document frequency=5, maximum document frequency=70%.

Figure 4.5: The top twenty most important features in the Random Forest classifier plotted against their relative importance.


Figure 4.6: Pairplot plotting the occurrence of the five most important features in the Random Forest classifier against each other.

Accuracy  n (Cutoff)           Max Features  Min df  Max df  No. of estimators  Tree depth
0.9561    106,988 (10,000)     500           20      0.8     20                 None
0.9552    106,988 (10,000)     500           20      0.7     20                 None
0.9518    13,168 (1,000)       500           5       0.8     20                 None
0.9518    13,168 (1,000)       500           10      0.9     20                 None
0.9433    1,498 (100)          500           5       0.8     20                 None
0.9367    1,498 (100)          500           10      0.7     20                 None
0.9306    1,105,797 (200,000)  100           5       0.9     20                 10
0.9301    1,105,797 (200,000)  100           20      0.7     20                 10

Table 4.1: Accuracy scores and variable values used in the Bag-of-Words feature extraction, sorted by highest accuracy.

5 Discussion

5.1 Reflection on results

The t-SNE plot (Figure 4.1) indicates that the TF-IDF values of every document class are automatically clustered together. Some document class clusters overlap with each other, such as Nursing and Nursing other. This can also be seen in the confusion matrices of both the SVM and the Random Forest classifier (Figures 4.3 and 4.4), as most of the inaccuracies surround these two classes. Furthermore, documents belonging to the Pharmacy class are poorly classified. This is probably due to the fact that the dataset was significantly imbalanced, as can be seen in Figure 4.2, which could also explain why the Consult documents were classified poorly. Finally, the documents in the General class were classified poorly as well; the t-SNE plot shows that documents from this class are spread throughout the plot, which might explain their poor classification.

The most important features plotted in Figure 4.5 do not appear to obviously define a document into one class or another. While these features had the most impact in assigning a document to a class, they cannot show that, for example, the feature action is indicative of the class Radiology. The pairplot in Figure 4.6 shows that the presence of the feature trace is strongly correlated with the presence of the feature technic, primarily in ECG documents. Other features are more obviously correlated, such as the feature radiolog being dominantly present in Radiology documents.

During the training of the SVM classifier with the radial basis function kernel, the variable gamma (γ) was ignored due to time constraints and lack of experience. The γ value determines the influence of features on the decision boundary; in essence, it determines how tightly the decision boundary lies around the datapoints (see Appendix A.4 for an example). Adding the γ variable to the grid search may further improve the performance of the SVM classifier.

The decision boundary plots in Appendix A.1 visualize how the classifiers separate the data into classes. However, these classifiers were trained on the two-dimensional data output by the t-SNE. Therefore, they may not accurately represent the classifiers trained on the high dimensional data.


Subclusters

Subclusters were identified within some of the classes in the t-SNE plot (see Figure 4.1). One such example of subclustering is found in the Rehab Services class; a sample from these clusters can be found in Appendix A.3. This subclustering indicates that documents in certain classes used fundamentally different words. The Bag-of-Words representation only takes the words used into account, not their context or meaning. Therefore, nothing can be said about the structure or meaning of these documents. However, based on word use alone, it appears that certain classes can be further divided into smaller subclasses. Further research into these subclusters is required to investigate where the subclustering originated from. One possible explanation is that different medical workers write free-text reports using different phrasings, making a subcluster unique to a specific medical worker.

Manual Search

At first glance, the results seem to indicate that the model performs best when a cutoff point of 10,000 entries is chosen. However, this result is probably due to the fact that in these experiments the depth of each tree in the Random Forest classifier was uncapped, allowing each tree to grow until every branch reached a leaf node. Not specifying a maximum tree depth causes the trees to grow very large; while this yields higher accuracy, it also decreases the speed and efficiency of the random forest. Additionally, the experiments with a cutoff point of 200,000 did not include a run with unlimited tree depth, since those experiments took too long to run. Furthermore, all the highest scoring experiments had 20 trees in the forest (No. of estimators). This was expected, as generating more trees generally increases accuracy while decreasing efficiency [33].

5.2 Alternative data source

This research was conducted on the MIMIC-III database, whose documents were already in plain-text form. For application of document classification in an EHR, it may be fair to assume that the documents are usually in PDF form. If this model were applied to such an EHR, the documents must first be converted into plain text for the model to be able to process them. One way of achieving this would be to use Optical Character Recognition (OCR). OCR is a technique used to convert printed or handwritten text into plain and editable text [34]. A popular OCR engine is the Tesseract engine [35], an open source engine originally developed by HP and currently maintained by Google. Since Tesseract has a relatively low error rate for the English language (<7%), it may prove to be a good match for converting EHR documents in PDF form [36].

However, since many documents in the EHR have different layouts and may contain different kinds of images and graphs, further experimentation is required to determine whether the Tesseract engine would be suitable. Additionally, the use of this engine for languages other than English should also be explored further.


5.3 Bag-of-Words weaknesses

While Bag-of-Words (BOW) feature extraction is a popular technique in both text extraction and image extraction, it may not be the best fit for medical documents [37]. BOW feature extraction has two major weaknesses [38]. First, the order of words in a sentence is lost. This means that sentences using the same words in a different order have the same BOW representation. For example, a note from a medical professional saying “Severe nut allergy, potential psychosis” would be treated the same as a different note saying “Severe psychosis, potential nut allergy”. Second, a BOW does not retain any further semantic value. For example, the words “HIV”, “AIDS”, and “Cancer” will all have the same distance from each other in feature space, while “HIV” and “AIDS” should be closer to each other than to “Cancer”.

While these weaknesses are not a threat to a classification task that does not take the semantic values of the document into account, they may form a problem when BOW is used to perform more detailed tasks, e.g. classifying patients into treatment classes or generating a summary of a document.

Therefore, it is advisable to look into other feature extraction methods such as Word2Vec. While Word2Vec may underperform on a small dataset, it has less overhead than BOW and is more scalable to tasks beyond binary classification [39]. Furthermore, Word2Vec provides complementary features, such as semantics, that are lost in TF-IDF, making Word2Vec an interesting candidate for further experimentation [40].

5.4 Neural Networks

Both the SVM and Random Forest classifiers performed relatively well in the classification of the MIMIC-III data. However, recent studies showed that both Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks can perform well on a similar classification task in the Brazilian judiciary system [41, 42]. Furthermore, a deep learning framework known as “DeepLabeler” has outperformed SVMs in automated ICD-9 coding on patient records [10].

Neural networks may provide a more sophisticated and flexible way of automating the classification of medical patient documents.

5.5 Information Extraction

A further development, once good document classification has been achieved, would be to apply Information Extraction (IE) to the classified documents. IE is a Natural Language Processing (NLP) task that allows a system to automatically extract semi-structured data from free text [43]. Multiple fields have been developing IE systems for EHRs with promising results [43, 44]. There exist various variations on IE, e.g. the clinical Text Analysis and Knowledge Extraction System (cTAKES), which has also shown positive results and may be applicable to a wider range of domains [45]. Wang et al. published a comprehensive overview listing frequently used IE frameworks, tools, and toolkits [5].


A study conducted by Dietrich et al. demonstrates an ad hoc IE technique that was able to extract information, such as medication dosage, from unstructured data, such as hospital discharge letters [46]. Techniques like this could further reduce the time lost searching for a specific document by classifying documents based on their information. For example, documents could be clustered by high blood pressure based on the actual values reported in the documents, something the support vector machines and random forests used in this thesis cannot do.

Finally, Wang et al. report a discrepancy between clinical studies using EHR data and studies using clinical IE. This is described as being caused by the clinical domain being dominated by rule-based IE, which is considered obsolete, instead of the machine learning approaches that dominate the research domain. Additionally, a lack of collaboration experience between NLP experts and clinicians is reported, further widening the gap between state-of-the-art IE research and applied IE in the clinical domain.

6 Conclusion

This thesis took a high level look at automated text classification in the clinical domain. The data made available to this thesis showed that the task of classifying documents within an EHR may be simple enough to be performed by a random forest classifier or, in some cases, even a linear classifier, making a neural network unnecessary. However, this thesis also explored related work conducted in the medical field; neural networks appear to perform well in the research domain, as well as in some clinical classification tasks.

Furthermore, the current state of information extraction for medical documents was touched upon. There have been recent developments that allow an information extraction tool to report key information from free-text documents, such as medication dosage. Further research on this topic may result in the possibility of clustering free-text documents based on the context of their content.

In conclusion, machine learning is a very suitable tool for classification tasks within the clinical domain. However, due to the lack of clinical data available to researchers, and the absence of collaboration experience between researchers and clinicians, a good application of a machine learning algorithm for a classification task remains to be seen. Further research, development, and collaboration between researchers and clinicians is required in order to establish a widely used classification algorithm in the clinical domain.

References

[1] B. Shin, F. H. Chokshi, T. Lee, and J. D. Choi, “Classification of radiology reports using neural attention models,” in Proceedings of the International Joint Conference on Neural Networks, vol. 2017-May, pp. 4363–4370, IEEE, may 2017.

[2] K. Jensen, C. Soguero-Ruiz, K. Oyvind Mikalsen, R.-O. Lindsetmo, I. Kouskoumvekaki, M. Girolami, S. Olav Skrovseth, K. Magne Augestad, and J. Carlos, “Analysis of free text in electronic health records for identification of cancer patient trajectories,” 2017.

[3] K. J. Biese, C. R. Forbach, R. P. Medlin, T. F. Platts-Mills, M. J. Scholer, B. McCall, F. S. Shofer, M. Lamantia, C. Hobgood, J. S. Kizer, J. Busby-Whitehead, and C. B. Cairns, “Computer-facilitated review of electronic medical records reliably identifies emergency department interventions in older adults,” jun 2013.

[4] G. Burger, A. Abu-Hanna, N. De Keizer, and R. Cornet, “Natural language processing in pathology: A scoping review,” may 2016.

[5] Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, and H. Liu, “Clinical information extraction applications: A literature review,” jan 2018.

[6] S. G. Small and L. Medsker, “Review of information extraction technologies and applications,” 2014.

[7] H. P. Dinwoodie and R. W. Howell, “Automatic disease coding: the ‘fruit-machine’ method in general practice,” tech. rep., 1973.

[8] M. H. Stanfill, M. Williams, S. H. Fenton, R. A. Jenders, and W. R. Hersh, “A systematic literature review of automated clinical coding and classification systems,” Journal of the American Medical Informatics Association, vol. 17, no. 6, pp. 646–651, 2010.

[9] A. Rios and R. Kavuluru, “Convolutional neural networks for biomedical text classification,” in Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics - BCB ’15, vol. 2015, pp. 258–267, NIH Public Access, sep 2015.

[10] M. Li, Z. Fei, M. Zeng, F. Wu, Y. Li, Y. Pan, and J. Wang, “Automated ICD-9 Coding via A Deep Learning Approach,” 2015.


[11] D. Metzler, A Feature-Centric View of Information Retrieval. Springer, Berlin, Heidelberg, 2011.

[12] O. Levy and Y. Goldberg, “Dependency-Based Word Embeddings,” tech. rep., 2014.

[13] M. F. Porter, “An algorithm for suffix stripping,” Program, vol. 14, no. 3, pp. 130–137, 1980.

[14] H. Saif, M. Fernández, Y. He, and H. Alani, “On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation, pp. 810–817, 2014.

[15] N. Grabar, P. Zweigenbaum, L. Soualmia, and S. Darmoni, “Matching controlled vocabulary words,” in Studies in Health Technology and Informatics, vol. 95, pp. 445–450, 2003.

[16] J. Plisson, N. Lavrac, and D. Mladenic, “A rule based approach to word lemmatization,” pp. 83–86, 2004.

[17] J. Ramos, “Using TF-IDF to Determine Word Relevance in Document Queries,” in Department of Computer Science, Rutgers University, 23515 BPO Way, Piscataway, NJ, 08855, vol. 42, pp. 40–51, 2003.

[18] F. Colas, P. Paclík, J. N. Kok, and P. Brazdil, “Does SVM Really Scale Up to Large Bag of Words Feature Spaces?,” Advances in Intelligent Data Analysis VII, pp. 296–307, 2007.

[19] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, “Inductive learning algo-rithms and representations for text categorization,” pp. 148–155, 1998.

[20] L. Chen, “Bite Sized Machine Learning - SVM Basics.” https://github.com/lilly-chen/Bite-sized-Machine-Learning/blob/master/SVM/SVM-Basics.ipynb, 2019. Accessed: 2019-06-19.

[21] J. D. M. Rennie and R. Rifkin, “Improving Multiclass Text Classification with the Support Vector Machine,” October, no. October, pp. 0–14, 2001.

[22] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[23] A. Cutler and J. R. Stevens, “[23] Random Forests for Microarrays,” Methods in Enzymology, vol. 411, pp. 422–432, 2006.

[24] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and regression trees. Wadsworth and Brooks/Cole, 1984.

[25] D. R. Cutler, T. C. Edwards, K. H. Beard, A. Cutler, K. T. Hess, J. Gibson, and J. J. Lawler, “Random forests for classification in ecology.,” Ecology, vol. 88, no. 11, pp. 2783–92, 2007.

[26] R. Rastogi and K. Shim, “PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning,” in VLDB, pp. 24–27, 1998.


[27] J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter Optimization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012.

[28] L. van der Maaten and G. Hinton, “Visualizing Data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.

[29] M. Ghassemi, T. J. Pollard, P. Szolovits, L. Shen, L.-W. H. Lehman, L. Anthony Celi, M. Feng, R. G. Mark, A. E. Johnson, and B. Moody, “MIMIC-III, a freely accessible critical care database,” Scientific Data, vol. 3, p. 160035, 2016.

[30] S. Bird, E. Loper, and E. Klein, Natural Language Processing with Python. O'Reilly Media Inc., 2009.

[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[32] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, “API design for machine learning software: experiences from the scikit-learn project,” in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122, 2013.

[33] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas, “How many trees in a random forest?,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7376 LNAI, pp. 154–168, Springer, Berlin, Heidelberg, 2012.

[34] C. Patel, A. Patel, and D. Patel, “Optical Character Recognition by Open source OCR Tool Tesseract: A Case Study,” International Journal of Computer Applications, vol. 55, no. 10, pp. 50–56, 2012.

[35] R. Smith, “An overview of the tesseract OCR engine,” in Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, vol. 2, pp. 629–633, IEEE, 2007.

[36] R. W. Smith, “History of the Tesseract OCR engine: what worked and what didn’t,” in Document Recognition and Retrieval XX (R. Zanibbi and B. Coüasnon, eds.), vol. 8658, p. 865802, International Society for Optics and Photonics, 2013.

[37] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in European Conference on Computer Vision (ECCV) International Workshop on Statistical Learning in Computer Vision, pp. 59–74, 2004.

[38] Q. V. Le and T. Mikolov, “Distributed Representations of Sentences and Documents,” in Proceedings of the 31st International Conference on Machine Learning, pp. 1188–1196, 2014.


[39] C. A. Turner, A. D. Jacobs, C. K. Marques, J. C. Oates, D. L. Kamen, P. E. Anderson, and J. S. Obeid, “Word2Vec inversion and traditional text classifiers for phenotyping lupus,” BMC Medical Informatics and Decision Making, vol. 17, p. 126, 2017.

[40] J. Lilleberg, Y. Zhu, and Y. Zhang, “Support Vector Machines and Word2vec for Text Classification with Semantic Features,” in Proceedings of the IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing, pp. 136–140, 2015.

[41] N. Silva, F. Braz, and T. Campos, “Document type classification for Brazil’s supreme court using a Convolutional Neural Network,” in Proceedings of The Tenth International Conference on Forensic Computer Science and Cyber Law, pp. 7–11, 2018.

[42] F. A. Braz, N. C. da Silva, T. E. de Campos, F. B. S. Chaves, M. H. S. Ferreira, P. H. Inazawa, V. H. D. Coelho, B. P. Sukiennik, A. P. G. S. de Almeida, F. B. Vidal, D. A. Bezerra, D. B. Gusmao, G. G. Ziegler, R. V. C. Fernandes, R. Zumblick, and F. H. Peixoto, “Document classification using a Bi-LSTM to unclog Brazil’s supreme court,” 2018.

[43] H. Liu, S. J. Bielinski, S. Sohn, S. Murphy, K. B. Wagholikar, S. R. Jonnalagadda, K. E. Ravikumar, S. T. Wu, I. J. Kullo, and C. G. Chute, “An information extraction framework for cohort identification using electronic health records,” AMIA Joint Summits on Translational Science Proceedings, vol. 2013, pp. 149–153, 2013.

[44] S. Datta, E. V. Bernstam, and K. Roberts, “A frame semantic overview of NLP-based information extraction for cancer-related EHR notes,” arXiv preprint arXiv:1904.01655, 2019.

[45] S. Rea, J. Pathak, G. Savova, T. A. Oniki, L. Westberg, C. E. Beebe, C. Tao, C. G. Parker, P. J. Haug, S. M. Huff, and C. G. Chute, “Building a robust, scalable and standards-driven infrastructure for secondary use of EHR data: The SHARPn project,” Journal of Biomedical Informatics, vol. 45, pp. 763–771, 2012.

[46] G. Dietrich, J. Krebs, L. Liman, G. Fette, M. Ertl, M. Kaspar, S. Störk, and F. Puppe, “Replicating medication trend studies using ad hoc information extraction in a clinical data warehouse,” BMC Medical Informatics and Decision Making, vol. 19, no. 1, 2019.


Appendix


Figure A.1: SVM decision boundary plotted on a t-SNE plot of the vectorized data. Variables: Maximum Features = 500, Minimum Document Frequency = 5, Maximum Document Frequency = 70%, Kernel = Radial Basis Function, C-value = 1.0.


Figure A.2: Random Forest decision boundary plotted on a t-SNE plot of the vectorized data. Variables: Maximum Features = 500, Minimum Document Frequency = 5, Maximum Document Frequency = 70%, Number of Estimators = 30, Maximum Tree Depth = None.
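Plots of this kind can be approximated with scikit-learn [31]. The sketch below is a minimal, hypothetical reconstruction rather than the exact thesis code: texts and labels stand in for the real notes and their classes, min_df is lowered because the toy corpus contains only four documents, and the classifier is refitted on the 2-D t-SNE embedding solely so that a decision boundary can be drawn in the plotting plane (the thesis classifiers themselves operate on the full TF-IDF vectors).

# Minimal sketch of the pipeline behind Figures A.1 and A.2: TF-IDF
# vectorization, t-SNE projection to 2-D, and a classifier whose decision
# regions are evaluated on a grid. `texts` and `labels` are hypothetical
# placeholders for the real notes and their categories.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

texts = ["chest x ray shows no acute process",
         "patient resting comfortably vital signs stable",
         "echocardiogram shows mild mitral regurgitation",
         "discharge summary follow up with primary care"]
labels = ["Radiology", "Nursing", "Echo", "Discharge Summary"]

# Vectorizer settings from the captions (min_df lowered here because this
# toy corpus has only four documents; the thesis used min_df = 5).
vec = TfidfVectorizer(max_features=500, min_df=1, max_df=0.7)
X = vec.fit_transform(texts).toarray()

# Project the TF-IDF vectors to two dimensions for plotting
# (perplexity lowered to suit the tiny toy corpus).
X_2d = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)

# To draw a decision boundary on top of the t-SNE plot, a classifier is
# refitted on the 2-D embedding and queried over a dense grid.
clf = RandomForestClassifier(n_estimators=30, max_depth=None, random_state=0)
clf.fit(X_2d, labels)
xx, yy = np.meshgrid(np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 200),
                     np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 200))
regions = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)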


A.2 Individual t-SNE plots

(a) Case Management, Number of datapoints = 967; (b) Consult, Number of datapoints = 98; (c) Discharge Summary, Number of datapoints = 10,000; (d) ECG, Number of datapoints = 10,000.
Figure A.3: t-SNE plots of the individual classes. All plots were generated with the same TF-IDF Vectorizer settings: Cutoff = 10,000, Maximum Features = 750, Minimum Document Frequency = 5, Maximum Document Frequency = 0.7.


(e) Echo, Number of datapoints = 10,000; (f) General, Number of datapoints = 8,301; (g) Nursing, Number of datapoints = 10,000; (h) Nursing/other, Number of datapoints = 10,000.
Figure A.3 (continued).


(i) Nutrition, Number of datapoints = 9,418; (j) Pharmacy, Number of datapoints = 103; (k) Physician, Number of datapoints = 10,000; (l) Radiology, Number of datapoints = 10,000.
Figure A.3 (continued).


(m) Rehab Services, Number of datapoints = 5,431; (n) Respiratory, Number of datapoints = 10,000; (o) Social Work, Number of datapoints = 2,670.
Figure A.3 (continued).


A.3 t-SNE Subclusters

'titl', 'defer', 'bedsid', 'swallow', 'evalu', 'histori', 'thank', 'refer', 'yo', 'man', 'readmit', 'rehab', 'pt', 'origin', 'admit', 'approxim', 'one', 'month', 'ago', 'date', 'rang', 'fever', 'back', 'pain', 'found', 'pansensit', 'enteroccu', 'fecali', 'av', 'endocard'

(a) First 30 tokens from a document in subcluster A

'titl', 'bedsid', 'swallow', 'evalu', 'histori', 'thank', 'consult', 'yo', 'speak', 'male', 'esrd', 'hd', 'mwf', 'bilater', 'bka', 'dm', 'cad', 'admit', 'found', 'hyperkalem', 'hd', 'pt', 'asymptomat', 'without', 'chest', 'pain', 'sob', 'ekg', 'show', 'mild'

(b) First 30 tokens from a document in subcluster A

(c) t-SNE plot indicating the positions of subclusters A and B

'attend', 'physician', 'namei', 'medic', 'diagnosi', 'icd', 'reason', 'referr', 'evalu', 'treat', 'histori', 'present', 'ill', 'subject', 'complaint', 'age', 'adm', 'afib', 'coumadin', 'chf', 'cad', 'present', 'lethargi', 'home', 'found', 'melena', 'hct', 'diagnos', 'gib', 'like'

(d) First 30 tokens from a document in subcluster B

'attend', 'physician', 'namei', 'referr', 'date', 'medic', 'diagnosi', 'icd', 'tkr', 'reason', 'referr', 'eval', 'tx', 'histori', 'present', 'ill', 'subject', 'complaint', 'yo', 'f', 'left', 'knee', 'replac', 'year', 'ago', 'pt', 'fall', 'sustain', 'distal', 'femor'

(e) First 30 tokens from a document in subcluster B

Figure A.4: Example of subclusters within the same class using different words. Textboxes a and b contain the first 30 tokens of documents from one subcluster of the class Rehab Services. Textboxes d and e contain the first 30 tokens of documents from a different subcluster of the same class. The positions of these subclusters are indicated in panel c.
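Stemmed tokens like those in Figure A.4 can be reproduced with a standard NLTK [30] preprocessing chain. The following is a minimal sketch assuming lowercasing, punctuation removal, stopword filtering, and Porter stemming; the exact thesis pipeline may differ in detail, and the input note is invented for illustration.

# Minimal sketch of preprocessing that yields stemmed tokens like those in
# Figure A.4: lowercase, strip non-letter characters, drop stopwords, stem.
import re

from nltk.corpus import stopwords  # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text, n_tokens=30):
    """Return the first n_tokens stemmed, stopword-filtered tokens."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [stemmer.stem(tok) for tok in text.split()
            if tok not in stop_words][:n_tokens]

# Invented note, loosely modeled on the Rehab Services examples above.
note = ("Title: deferred bedside swallowing evaluation. History: thank you "
        "for the referral of this man, readmitted to rehab one month ago.")
print(preprocess(note))
# e.g. ['titl', 'defer', 'bedsid', 'swallow', 'evalu', 'histori', 'thank', ...]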


A.4 Radial Basis Function Kernel Example

(a) Gamma = 1; (b) Gamma = 10; (c) Gamma = 20

Figure A.5: SVM using a Radial Basis Function kernel with varying gamma values. The higher the gamma value, the more influence the individual features have on the decision boundary. A higher gamma value yields a tighter decision boundary. Image from GitHub [20].
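As a concrete illustration of this effect, the sketch below fits scikit-learn SVMs with an RBF kernel at the gamma values from panels (a)-(c) on a synthetic two-moons dataset rather than the thesis data; training accuracy rises as the boundary tightens around individual points.

# Minimal sketch of the gamma effect in Figure A.5, using synthetic data.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

for gamma in (1, 10, 20):  # the values shown in panels (a)-(c)
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    # Higher gamma means each training point influences a smaller region,
    # so the decision boundary wraps the training data more tightly.
    print(f"gamma={gamma:2d}  training accuracy={clf.score(X, y):.3f}")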
