

Master Thesis

Clinical Information Extraction

Lautaro Quiroz

(10849963)

Supervisor:

Dr. E. Kanoulas

Assessor:

MSc. L. Mennes

November 2016


I would like to thank my supervisors, Lydia Mennes and Dr. Evangelos Kanoulas, for their guidance and support. Without them, this work would not have been possible.


Documentation is a fundamental element of any process. Because it is expensive, time-consuming, difficult to maintain, perceived as a dry, repetitive task that does not directly contribute to solving the problem, and of a value that is difficult to assess, it is often underestimated. In some domains it is possible to let people go to the source for information, but this is not always the case. In the medical field, precise and accurate information is invaluable and mandatory, and so huge efforts are undertaken to secure the correctness of the data. Clinical research shows that between 65% and 100% of nurses' shift handover information is lost after only a few cycles. Focusing on clinical shift handovers as a practical example, we investigate the usefulness of constructing an Artificial Intelligence pipeline to automatically fill in handover forms. These methods could potentially be applied to any kind of form filling.

State-of-the-art methods for Clinical Information Extraction rely on heavy hand-crafted feature engineering and make use of the outputs of clinical software to carry out their tasks. We present methods that avoid domain-dependent features and avoid having to manually select them. They integrate syntactic, semantic, and document-structural information and act either as standalone prediction units or as parts of ensemble methods. We demonstrate that these approaches show promising results and are able to reach state-of-the-art scores.


Contents

1 Introduction
  1.1 Research questions
  1.2 Contributions
2 Background
  2.1 Machine Learning models
    2.1.1 Artificial Neural Networks
    2.1.2 Conditional Random Fields
  2.2 Related work
    2.2.1 Information Extraction
    2.2.2 The CLEF e-Health challenge
3 Experimental setup
  3.1 Datasets
    3.1.1 NICTA Synthetic Nursing Handover Data
    3.1.2 WSJ corpus
  3.2 Evaluation metrics
4 Methodology
  4.1 Hand-crafted versus automatic feature based models
    4.1.1 Artificial Neural Networks
    4.1.2 Conditional Random Fields
  4.2 Feature set extension
    4.2.1 Feature set description
    4.2.2 Feature integration
  4.3 Model stacking
    4.3.1 Hierarchical classification
    4.3.2 Layered prediction
    4.3.3 Neural Conditional Random Fields
  4.4 Dataset limitations
    4.4.1 Out of vocabulary words
    4.4.2 Data skewness
5 Results and analysis
  5.1 Hand-crafted versus automatic feature based models
    5.1.1 Automatic feature based models using word embeddings
    5.1.2 Automatic feature based models using multiple features
    5.1.4 Conclusions
  5.2 Model stacking
    5.2.1 Hierarchical classification
    5.2.2 Layered prediction
    5.2.3 Neural Conditional Random Fields
    5.2.4 Conclusions
  5.3 Dataset limitations
    5.3.1 Out of vocabulary words
    5.3.2 Dataset skewness
    5.3.3 Conclusions


1 Introduction

Clinical documentation is at the core of the health industry. Medical institutions and professionals must ensure the availability and correctness of patients' data. Inaccurate information in this field can lead to critical negative effects that are palpable in the very short term, as well as to irreversible, long-lasting consequences.

Artificial Intelligence offers the opportunity to revolutionize the medical field. This could be achieved in different ways: from accurate algorithms for the early detection of diseases [1] [2] [3] [4], systems that aid in better and more effective drug development and testing [5] [6], and medication management [7], to algorithms that help in processing and searching through the vast amounts of medical data institutions hold [8] [9].

Focusing on medical information management, not only laypeople but also clinicians have problems understanding medical documentation created by other professional groups. With this in mind, and considering that large amounts of new information are generated on a daily basis, professionals face a big challenge in the timely and efficient generation and sharing of this data [10].

Even though medical institutions have policies and procedures to ensure the correct documentation of cases, so that data is accurate and available to the relevant professionals, information loss is still high [11]. Take the handover of nurse shift information as an example. While the procedures may vary in different parts of the globe, for instance due to better or worse conditions and to the technologies employed, the basis of this pipeline remains the same: nurses and doctors have a time-fixed shift during which they are responsible for looking after a number of patients, and at the end of the shift they have to transfer to the professional who takes over next both the knowledge of the status of each patient and the authority and responsibility to continue their treatment. This knowledge transfer includes information such as: basic personal data of the patient (e.g. name, surname, age); a description of the problem history and the conditions in which the patient entered the hospital; an explanation of the patient's diagnosis; the suggested treatment, and the medication and dosage he/she is taking; and lastly, information about the current status, highlights of the evolution or observations during the last shift, and the future steps and expected outcome.

In the medical literature these kinds of forms are referred to by different names, such as handover, handoff, sign-out, or shift report; in this work we will consistently call them handover forms.

Handover forms constitute the primary (if not the only) tool professionals have to transfer knowledge about the subjects; these forms contain critical information about the patients' care and have a direct impact on their safety. It is common for this information exchange at the end of a shift to be carried out orally: the person responsible for the last shift comments on each patient's situation and the receiver of this information has to capture as much of it as possible. A study has shown that in this way, 65% to 100% of the shift information is lost after only a few cycles [11]. As healthcare became more specialized, the number of professionals involved in assisting an individual rose, and with that higher decentralization more and more complex handover forms had to be completed. Ineffective information handover can contribute to failures in patient safety, for example medication errors [12] [13], wrong-site surgery [14], and patient deaths [15] [16]. Several studies have addressed the relevance of handover forms in the medical field. One study on incidents reported by surgeons showed that in 43% of the cases communication breakdown was a contributing factor, and two thirds of these communication-related problems were due to handover issues [17]. Even though handover forms are common practice in communication between physicians, a study found errors in 67% of the forms [18]. Another work among nurses showed handover reports were a common concern due to incompleteness or missing information [19].

With the objective of minimizing the aforementioned disadvantages of the current nurse shift information handover, and following the CLEF [20] nurse handover information pipeline, the new procedure is depicted in Figure 1.1. In this manner, the oral information exchange between nurses is first transcribed into written free-text reports using Speech Recognition; then, with Information Extraction techniques, relevant chunks of text are identified in order to fill in the handover form fields; later on, the resulting forms are reviewed by humans and saved as final versions on electronic devices.

Figure 1.1: AI assisted handover pipeline.

This work presents a thorough analysis of existing techniques used to extract information from plain text data, and proposes new methods. The methods presented in this report are going to be used to fill in nurse shift handover forms in a completely automatic fashion, although they have no particular restrictions that would limit their applicability to other tasks and domains.

In Figure 1.2 an example of a written nurse handover is shown. This extract constitutes a full document, and is the result of the speech recognition step in the pipeline discussed above.

Figure 1.2: Written nurse handover.

Figure 1.3 shows a representation of the structure of the form. It consists of a total of 36 fields, one of which is a special label dubbed "NA" used for identifying irrelevant information, that is, tokens which should not be included in the form; the remaining 35 labels thus correspond to relevant information and can, at a higher level of abstraction, be grouped into 5 broader categories: Patient introduction, My shift, Appointment, Medication, and Future care.


1.1 Research questions

The main goal of this thesis is to extract relevant information from clinical plain text with the purpose of completing a nurse shift handover form. In order to achieve this objective, the following research questions naturally arise and will be investigated:

RQ1: Do hand-crafted and automatic feature based models produce comparable results?

Can we use statistical models that automatically extract features from data to complete handover forms? How good is their performance compared to hand-crafted feature based models?

RQ2: What are the most important features for automatically filling in nurse handover reports?

How do automatically extracted features affect the statistical models' behaviour and, thus, which features are important for this task? How can features of different nature be integrated into the information extraction system in order to successfully classify medical plain text? The set of features includes word-level features, such as embeddings, part-of-speech tags, and named-entity-recognition tags, and document-level features, like sentence and section locations.

RQ3: Can we stack several models in a pipeline so as to maximize the correctness of the report fill-in?

Are models good enough when acting alone, or do ensemble methods boost the overall performance?

RQ4: Can we alleviate dataset limitations, such as unseen words and training distribution skewness?

Can we process the data before it is fed to the statistical models and yield better performance?

1.2 Contributions

The main contributions of this work are the following:

• We implement Information Extraction state-of-the-art solutions to this task, tackling it as a multi-class classification problem.

• We apply different Machine Learning algorithms, discuss their performance, and contrast them to state-of-the-art solutions.

• We explore and set up different classification pipelines and model stacking approaches.

• We analyse the dataset, focusing on the limitations it carries, and we introduce alternatives to overcome them.

The remainder of this report is organised as follows. In Chapter 2, solutions oriented towards similar tasks are presented; we analyse their suitability to this problem and how they can be adapted. In Chapter 3 the datasets are introduced with their respective analysis, as well as the evaluation criteria used to assess the performance of the solutions. In Chapter 4 a detailed explanation of our methods is given, concerning the design and implementation aspects of each one of them. Chapter 5 shows the results obtained when running our methods and presents an explanation for their behaviour. Lastly, Chapter 6 gives a summary of the work.


2 Background

2.1 Machine Learning models

In the following sections we give an introduction to how the Machine Learning models used in this thesis work. Two main families of models are implemented in this work: "Artificial Neural Networks" and "Conditional Random Fields".

2.1.1 Artificial Neural Networks

Artificial Neural Networks are mathematical models consisting of neurons as basic processing units. Each neuron has a weight, a bias, and an activation function associated with it; it receives a real-valued number as input and outputs a new value based on the previously mentioned elements. These neurons are usually stacked together and organized into layers. It is common for the neurons of a given layer to be connected to all neurons in their neighbouring layers, in which case the network is called "fully connected", but this structure depends on the neural network type; it can also be the case that connections exist between neurons within a layer. Given this structure, and considering that the neurons of a given layer perform operations by weighting the results of previous neurons, we can think of units at this level as making more complex and abstract decisions than units at previous layers.

Learning in Neural Networks is achieved by iterative training. The model receives a training sample, performs a forward computation to obtain the output, and then compares the predicted value to the desired result and modifies its sets of weights and biases by propagating back a measure of the error. In this way, the model parameters consist of the neurons' weights and biases.

Taking the classification of words as an example: in its basic form, a neural network receives a word as input and outputs a probability distribution over the categories. A widely used extension to this approach is to not only input a single word, but to use a sliding window and also feed in its neighbouring tokens.

Figures 2.1 to 2.3 illustrate this structure and how a neuron computes its output value. Neurons compute their output values as linear combinations of their inputs and weights:

y = \sum_i x_i w_i

Figure 2.1 shows an example of a single neuron computing the value of y from the input values x1, x2, x3 with weights w1, w2, w3. These neurons can be stacked together to construct more complex architectures, such as fully connected feed-forward networks. Figure 2.2 shows an example of a Feed-forward Neural Network.

Figure 2.1: Perceptron example.

Figure 2.2: Feed-forward network example.

Figure 2.3: Artificial Neural Network example.
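A minimal numpy sketch of the single-neuron computation of Figure 2.1, including the bias and activation function mentioned above; the input values, weights, and the choice of tanh are purely illustrative.

```python
import numpy as np

def neuron_output(x, w, b):
    """Single neuron: weighted sum of inputs plus bias, passed through tanh."""
    return np.tanh(np.dot(x, w) + b)

# Example corresponding to Figure 2.1: inputs x1, x2, x3 with weights w1, w2, w3.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.0
print(neuron_output(x, w, b))
```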

2.1.2 Conditional Random Fields

While an Artificial Neural Network receives a word and its contiguous tokens as input and outputs a probability distribution for that single word, Conditional Random Fields (CRFs) [21] try to model the joint probability of a whole sequence of words. Intuitively, there is valuable information in the sequentiality of the data, and CRFs allow us to incorporate it in the model.

In this work we use a particular type of Conditional Random Field known as the "linear-chain CRF". These models receive a set of hand-crafted features, which are fed in as "feature functions"; these are constructed upon: a) token features (including the current and neighbouring tokens) and the label to be predicted; b) the current and previous label. We can take the following binary feature functions as examples:

f(\text{prev\_word}, \text{curr\_label}) =
\begin{cases}
1 & \text{if prev\_word is 'Bed' and curr\_label is 'Current bed'} \\
0 & \text{otherwise}
\end{cases}

f(\text{prev\_label}, \text{curr\_label}) =
\begin{cases}
1 & \text{if prev\_label is 'First name' and curr\_label is 'Last name'} \\
0 & \text{otherwise}
\end{cases}

In the more general case, a feature function is specified by the previous and current label, the token features, and the sequence position to be predicted: f(y_{i-1}, y_i, x, i).

The model has a weight associated with each feature function; these weights compose the model's parameters. With the inclusion of these weights, we can compute the probability of a sequence of length N with J feature functions as:

p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp \left( \sum_{i=1}^{N} \sum_{j=1}^{J} \lambda_j f_j(y_{i-1}, y_i, x, i) \right)

with Z(x) being the partition function that sums over all possible label sequences y.

Learning in CRFs thus consists of adjusting this set of weights so as to maximize the likelihood of the true sequences. Because this optimization is intractable, several iterative algorithms can be applied [22]. Figure 2.4 shows an example of a linear-chain CRF structure in which the sequence of words w1, w2, w3, w4 is associated with tags t1, t2, t3, t4.

Figure 2.4: Linear chain CRF example.
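The two binary feature functions above, written out as plain Python predicates; the third function shows the general form f(y_{i-1}, y_i, x, i), with a condition that is purely illustrative.

```python
def f_word_label(prev_word, curr_label):
    # 1 if the previous word is 'Bed' and the current label is 'Current bed'.
    return 1 if prev_word == "Bed" and curr_label == "Current bed" else 0

def f_label_label(prev_label, curr_label):
    # 1 if the previous label is 'First name' and the current label is 'Last name'.
    return 1 if prev_label == "First name" and curr_label == "Last name" else 0

def f_general(y_prev, y_curr, x, i):
    # A general feature function can inspect the whole token sequence x and the
    # position i being predicted, not just the immediate neighbours.
    return 1 if x[i].lower() == "bed" and y_curr == "PatientIntroduction CurrentBed" else 0
```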

2.2 Related work

2.2.1 Information Extraction

Information Extraction is not a new research topic; having to deal with the understanding of text, it is closely related to, and integrates ideas from, other areas such as Natural Language Processing (NLP). NLP has been used in the past to parse and extract information from medical records. In [23] [24] the authors apply a rule-based approach with the goal of identifying drug and dosage phrases in discharge summaries, while in [25] the authors use a hand-crafted drug dictionary for this purpose. For the same task, the authors in [26] [27] use a combination of rules and lexicon sets to extract drug terms. Machine Learning models have also been used to extract drug terms from medical discharge summaries: the authors in [28] use Hidden Markov Models (HMMs), while in [29] and [30] domain-specific features are integrated into statistical classifiers. In other clinical domains, Information Extraction has been applied to the analysis of chest radiology reports [32], cancer-related radiology [33], mammography [34], and pathology [35] records. Our work differs from the previous ones in that we neither implement rule-based approaches nor make use of features that are created from other medical systems.

In this kind of task, where data from a particular document must be extracted to complete a set of fields, it is common to come across approaches that can be viewed as extended Named-entity-recognition (NER) systems, assigning a tag to each token in the text. This allows us to apply any existing token labeling algorithm. Researchers have made use of Neural Networks with word embeddings to solve Named-entity-recognition tasks before, as in [36]. In [37] the authors apply a Convolutional Neural Network that processes entire sentences to extract relevant features, and show state-of-the-art results in four chosen token labeling tasks: part-of-speech tagging, named-entity recognition, chunking, and semantic role labeling. Based on this original work, others have tried to extend it with additional features. The authors in [38] use a context-window Neural Network (also found in [37]) that integrates and learns tag embeddings to perform the prediction step. In [39], this idea of a Convolutional Neural Network that acts on words as units is enhanced by maintaining a character-level embedding structure, and the reported scores outperform state-of-the-art methods. Inspired by the work in [37], we construct a similar Convolutional Network to extract relevant features from sentences in the clinical dataset.

Moving away from Neural Network techniques, NER systems often implement Conditional Random Fields (CRFs) to perform the task. Authors in [40] integrate word embeddings into a CRF to identify entities, such as genes and proteins, in the Biomedical domain.

The work in [41] proposes the use of Structural Support Vector Machines and shows a better performance when compared to CRFs applied to the Clinical domain.

In our work, we implement a CRF model that extends the set of features used in the previous related work. Our models combine domain-independent features, such as lexical, syntactic, and semantic features, with higher-order features like the clustering of semantic embeddings.

2.2.2 The CLEF e-Health challenge

In recent years many challenges related to the medical domain have been proposed. Some of the most important and interesting competitions aiming at recognizing medical entities in plain text include the i2b2 Workshops (https://www.i2b2.org/NLP) [42] [43] [44] [45] and the CLEF e-Health labs [46] [47] [10] [48].

The CLEF 2016 e-Health competition covered a wide range of topics, from Clinical Handover Information Extraction to multilingual settings and Medical Information Retrieval. The starting point of the challenge was the fact that not only laypeople but also clinicians have problems understanding medical documentation created by other professional groups. With this in mind, and considering that large amounts of new information are generated on a daily basis, professionals face a big challenge in the timely and efficient generation and sharing of this data [49].

The CLEF organisers' approach [20], published in 2015 in a top-tier medical informatics journal, consists of a Conditional Random Field making use of syntactic, semantic, and in-domain features extracted from other medical software to classify tokens among the entire tag set. Additionally, the authors in [50] and [51] focused on the exploration of the feature space using a Conditional Random Field as classifier, in combination with hand-crafted tagging rules. They construct features resulting from syntactic, semantic, and in-domain systems. As in the previous methods, in our work we investigate the usefulness of Conditional Random Fields and hand-crafted features, but we only utilize domain-independent features; moreover, we do not use hand-crafted rules to perform the classification task.


3 Experimental setup

3.1 Datasets

3.1.1 NICTA Synthetic Nursing Handover Data

The primary source of data used in this work is the dataset provided by the CLEF e-Health challenge, named by its creators the NICTA Synthetic Nursing Handover Data [20] [52]. This dataset was originally developed for Clinical Speech Recognition and Clinical Information Extraction related to nurse shift handovers at Australia's National Information and Communications Technology research centre (NICTA). The dataset is a compendium of synthetic medical records created considering the most common chronic conditions in health priority areas in Australia.

The dataset is subdivided into three groups, meant to be used as: a) training; b) validation; and c) testing sets. Table 3.1 presents an overview of the datasets. These numbers account for the tokens found in the data, after punctuation removal.

Table 3.1: Datasets overview

Dataset      # docs   # tokens   # word types   Token overlap w/ stopwords   Token overlap w/o stopwords
Training     101      7451       1347           -                            -
Validation   100      6798       1291           645 (49.96%)                 560 (43.38%)
Testing      100      5741       1213           527 (43.45%)                 453 (37.35%)

As shown in the table, the datasets are roughly the same in terms of size, namely the number of records included, the number of tokens, and the number of unique words (also referred to as "word types"). But nearly half of the word types present in the validation and testing sets are not seen in the training group (50.04% and 56.55%, respectively, when considering stopwords, and 56.62% and 62.65% when excluding them). This is definitely an obstacle to overcome if, for instance, the statistical algorithms are going to use words as basic units of information. This characteristic does not affect all categories evenly; some of them have a higher percentage of unseen tokens than others. Figure 3.1 shows the ratio of unseen tokens to total tokens per tag.


From this analysis we can hypothesise that some categories are more difficult to predict than others.

Figure 3.1: Testing set unseen tokens per tag (on top of the bars, the number of unseen tokens is plotted).

Looking at the training, validation, and testing sets, we can observe that there is a mismatch between the tags included in each of them. The training dataset consists of 36 tags, 3 of which are not found in the validation set ('Appointment/Procedure ClinicianGivenNames/Initials', 'Appointment/Procedure Ward', 'Appointment/Procedure City') and 1 of which is not included in the testing dataset ('Appointment/Procedure City'). The validation set includes 36 tags, 3 of which are not seen in the aforementioned group ('PatientIntroduction Title', 'Appointment/Procedure ClinicianTitle', 'Appointment/Procedure Hospital'). Finally, the testing set has 37 tags, with 2 novel inclusions with respect to the training tag group ('Appointment/Procedure ClinicianTitle', 'Appointment/Procedure Hospital'). When computing the classification scores, we do so considering only the tags that are included in the respective dataset.

The "NA" tag accounts for the majority of the tokens in each of the datasets. Because we are dealing with a multiclass classification task, this difference in label group sizes will play a significant role at prediction time. Figure 3.2 plots the training empirical distribution, depicting the token count and the associated probability per tag. The "Future Goal/TaskToBeCompleted/ExpectedOutcome" tag, for instance, has 496 tokens, while the smallest categories have 2 or 3 tokens (e.g. "Appointment/Procedure Ward" and "Appointment/Procedure City"). As can be seen, the vast majority of the tags have a very low probability mass associated with them.

Figure 3.2: Training set empirical distribution (on top of the bars, the number of tokens included in each category is shown).

Another important aspect that determines the complexity of the classification task is the tag overlap that each label carries; that is, how the same word can be assigned to different labels. Figure 3.3 shows this analysis for the testing dataset. The majority of the labels intersect primarily with the "NA" tag. But this is not always the case; for instance, the tag "PatientIntroduction GivenNames/Initials" overlaps most of the time with "Medication Medicine". Although this sounds unlikely, the reason for it is a particular token ("per") which is one of the most repeated tokens in "Medication Medicine" and also occurs as "PatientIntroduction GivenNames/Initials". Categories that have a very high overlap ratio also consist of few word types. If we contrast the number of training label samples with the number of training documents, it is clear that not all labels appear in all documents. Figure 3.4 makes this idea more explicit and shows the probability distribution of a tag being included in a document.

There is only one tag which we can be completely sure will be included in a document, and that is the "NA" tag. Not even the basic information that identifies a patient appears in all 101 training documents. 7 tags are above the 0.9 probability threshold, while only 10 tags are above the 0.5 probability threshold. Given the fixed set of categories in the handover form, several of them will be left blank due to missing information; also, considering the "NA" tag is the only tag present in all training documents, it is more likely to be predicted than the rest of the tags. This imposes a higher level of complexity on the classification task.

Figure 3.3: Testing set tag overlap (for each tag, the overlap ratio and total number of word types is specified).

As previously stated, looking at the categories of the handover form we hypothesize that some categories are easier to predict than others. With the aid of the empirical distribution, the ratio of unseen words, and the tag overlap, we can conclude that the dataset consists of easier tags, such as "PatientIntroduction Ageinyears" or "PatientIntroduction Gender", and much more difficult ones, like "Medication Status" and "Future Alert/Warning/AbnormalResult".

The NICTA dataset comes with a defined set of features at the word level. These features cover the syntactic, semantic, and statistics-related space. Table 3.2 gives a description of the features. They were extracted using Stanford CoreNLP by the Stanford Natural Language Processing Group [53], MetaMap 2012 by the US National Library of Medicine [54], and Ontoserver by the Australian Commonwealth Scientific and Industrial Research Organisation [55].


Table 3.2: NICTA dataset features.

Syntactic features:
- Word: the word itself (origin: none)
- Lemma: word lemma (CoreNLP)
- NER: word NER label (CoreNLP)
- POS: word POS tag (CoreNLP)
- Parse tree: parse tree of the sentence from the root to the current word (CoreNLP)
- Dependents: dependents of the word (CoreNLP)
- Governors: governors of the word (CoreNLP)
- Phrase: phrase that contains this word (MetaMap)

Semantic features:
- Top 5 candidates: candidates retrieved from the Unified Medical Language System (UMLS) (MetaMap)
- Top mapping: UMLS mapping for the concept that best matches a given text snippet (MetaMap)
- Medication score: 1 if the word is a full term in the Anatomical Therapeutic Chemical List (ATCL); 0.5 if it can be found in the ATCL; 0 otherwise (NICTA)

Statistical features:
- Location: location of the word in the document (NICTA)
- Normalized term frequency: token occurrences / tokens in the document (NICTA)
- Top 5 candidates: candidates retrieved from the Systematized Nomenclature of Medicine (SNOMED) (Ontoserver)
- Top mapping: SNOMED mapping for the concept that best matches a given text snippet (Ontoserver)
- Top 5 candidates: candidates retrieved from the Australian Medicines Terminology (AMT) (Ontoserver)
- Top mapping: AMT mapping for the concept that best matches a given text snippet (Ontoserver)

Figure 3.4: Label document probability

3.1.2 WSJ corpus

In some cases, the statistical models implemented in this work make use of part-of-speech tags as features, with the intuition that POS labels add important syntactic information and might aid learning. We use POS tags in two ways: in the simplest version, we take the POS name as an additional feature; in the second, we learn mathematical representations for POS labels and words and use those entities as features in the model. In order to learn such representations, we utilize the Penn Treebank, a collection of syntactic trees of sentences taken from the Wall Street Journal corpus. The original treebank was pre-processed by removing punctuation and eliminating sentences longer than 40 words. Table 3.3 shows statistics about the corpus.

Table 3.3: Penn treebank (WSJ sample) stats.

# Sentences   # Tokens   # Word types

3.2 Evaluation metrics

The methods presented in this work are evaluated in terms of the well-known classification metrics Precision, Recall, and F1-score. In this task we have to classify tokens as members of 36 different groups (39 if we consider the tag mismatch mentioned in the previous section); therefore, it is important to have a single number representative of the overall performance of an approach. For this reason, precision, recall, and F1-scores are computed and reported individually per tag, as well as aggregated using micro- and macro-averaged scores. Although these measures are popular and widely used in classification tasks, it is important to briefly explain how the label aggregation is computed.

Given the values of Table 3.4 we can compute the evaluation scores as follows:

\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}}

\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}

F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Table 3.4: Example performance table for a given class A.

                            True values
                            A                   Not A
Predicted values    A       True positives      False positives
                    Not A   False negatives     True negatives

The macro-averaged scores are computed by taking the average of the results obtained for each class. Let M(tp_i, fp_i, fn_i) be the metric computed for tag i, dependent on the True positive, False positive, and False negative counts for that tag; the formula can then be expressed as:

M_{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} M(tp_i, fp_i, fn_i)

The micro-averaged scores are calculated as:

M_{\text{micro}} = \sum_{i=1}^{N} \lambda_i M(tp_i, fp_i, fn_i)

where \lambda_i is the relative frequency of the class in the dataset, computed as:

\lambda_i = \frac{\text{Number of instances of class } i}{\text{Number of instances in the dataset}}

Note that while the macro-averaged calculation assigns an equal weight to every label, the micro-averaged one weights each class by its relative size. Because of this distinction, it is to be expected that the macro-averaged results give an overall estimate of the performance, while the micro-averaged ones express how well we do on the bigger classes.
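A minimal sketch of the per-tag scores and the two aggregations exactly as defined above; note that the micro average follows the thesis's frequency-weighted definition. The tag names and counts are toy values.

```python
def prf(tp, fp, fn):
    """Per-class precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_micro_f1(counts, class_sizes):
    """counts: {tag: (tp, fp, fn)}; class_sizes: {tag: number of instances}."""
    total = sum(class_sizes.values())
    f1s = {tag: prf(*c)[2] for tag, c in counts.items()}
    macro = sum(f1s.values()) / len(f1s)                              # equal weight per tag
    micro = sum(class_sizes[t] / total * f1s[t] for t in f1s)         # weighted by class frequency
    return macro, micro

counts = {"NA": (900, 50, 40), "Medication Medicine": (30, 10, 20)}
sizes = {"NA": 940, "Medication Medicine": 50}
print(macro_micro_f1(counts, sizes))
```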

Lastly, to get a better understanding of what is going on in the classification procedure, it is beneficial to separate the score of the "NA" tag from the rest of the categories.


4 Methodology

4.1 Hand-crafted versus automatic feature based models

We begin this chapter by investigating the performance achieved by Conditional Random Fields (CRFs) [21] [22] and Artificial Neural Networks (ANNs), focusing on RQ1: Do hand-crafted and automatic feature based models produce comparable results? These two models are interesting to compare because they either receive a set of manually engineered features, as in the case of CRFs, or they automatically learn them, as in ANNs, lowering the effort of the arduous task of designing features.

4.1.1 Artificial Neural Networks

ANNs come in several flavours, and throughout the years many architectures have been developed. In this work we compare the usefulness of Neural Networks in their most basic form ("Feed-forward Neural Networks"); "Recurrent Neural Networks", due to their natural application to sequence processing; an architecture inspired by the related literature, dubbed "Last tag Neural Networks"; and "Convolutional Neural Networks", because of their success at extracting features.

Regarding implementation details, the neural architectures are implemented in Tensorflow [56] and Theano [57].

Feed forward Neural Networks

In this first approach, a context-window, one-hidden-layer feed-forward neural network is implemented. The neural net receives n-dimensional vector representations corresponding to the words in the sliding window, and outputs a probability distribution over the tag set. The output tag set consists of 39 tags: the 36 validation tags plus the 3 additional tags stemming from the tag mismatch between the training, validation, and testing sets.

Figure 4.1 shows a graphical representation of the neural network. It has an input layer of the size of the vocabulary, through which it receives a one-hot encoding of the word (a series of them if we use a context window greater than 1). It has a word embeddings weight matrix W1 and a bias vector b1. After the word embedding vectors are concatenated and the bias added, a tanh function is applied for non-linearity. Finally, a softmax layer is added with weights W2 and bias b2. Although in Figure 4.1, for the sake of simplicity and readability, only some neuron connections are shown, in reality all layers are fully connected.

Figure 4.1: Context window feed-forward neural network.

The neural network is trained using Backpropagation [58] and Adagrad as the gradient optimizer [59]. We run different experiments to set the optimal initial training and fine-tuning (if applicable) learning rates, as well as the minibatch and context window sizes.

To prevent the model from fitting the noise in the training data, a Dropout layer [60] is used.

Under the assumption that the word embeddings present a highly non-linear structure, we add an extra hyperbolic tangent hidden layer to the above architecture to properly separate the datapoints. The size of this new hidden layer is a hyper-parameter to adjust. In this latter configuration, the initialization of the weights, the regularization, and the training process are consistent with the single-hidden-layer architecture.
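A minimal numpy sketch of the forward pass of the context-window network of Figure 4.1: embedding lookup into W1, concatenation, tanh, and a softmax over W2. All sizes are illustrative, and training (Backpropagation, Adagrad, Dropout) is omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

vocab_size, emb_dim, window, n_tags = 1500, 50, 3, 39
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(vocab_size, emb_dim))        # word embedding matrix
b1 = np.zeros(window * emb_dim)
W2 = rng.normal(scale=0.1, size=(window * emb_dim, n_tags))   # softmax layer weights
b2 = np.zeros(n_tags)

def forward(window_word_ids):
    """Predict a tag distribution for the centre word of a context window."""
    h = np.tanh(np.concatenate([W1[i] for i in window_word_ids]) + b1)
    return softmax(h @ W2 + b2)

probs = forward([12, 845, 7])   # word ids of the context window (illustrative)
print(probs.argmax(), probs.max())
```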

Recurrent Neural Networks

The feed-forward networks classify a token by only making use of its word representation and the representations of its context words. While feed-forward Neural Networks are constrained to a fixed-size input, Recurrent Neural Networks process a sequence token by token while maintaining an internal state over all the units in the sequence. This property makes them attractive for text processing.

We start with a simple RNN, but because the maximum length of the sentences in the training data reaches up to 65 tokens, in order to prevent vanishing gradients [61] [62] we move up in complexity to a Long Short-Term Memory unit [63] following [64], and a Gated Recurrent Unit network based on [65]. We explore the impact of using unidirectional and bidirectional networks [66], as well as gradient clipping on these architectures to prevent exploding gradients [61]. Overfitting is prevented by applying Dropout [60] on the hidden layer activations.

Last tag Neural Network

With the idea of appending a tag embedding to each token representation in the context window, as proposed by [38], we implement a feed-forward neural network that appends a representation of the last predicted tag in the sentence. In this way, we add information about previously predicted tags and, hopefully, allow the Neural Net to learn relations among them. At the start of each sentence we set the last predicted tag value to the <PAD> tag, as depicted in Figure 4.2. Overfitting is prevented by applying L2 regularization on the word embedding matrix, the tag embedding weights, and the softmax layer weights (in case a second hidden layer is used, the L2 penalty is also applied to that layer's weight matrix).


Convolutional Neural Networks

In the convolutional setting, the input to the neural network consists of a sentence matrix, namely a stack of vector representations of the words in a context window, to which a set of convolutional filters of different region sizes is applied, followed by a max-pooling and a hyperbolic tangent non-linear operation. Next, the vectors resulting from the max-pooling operations are concatenated to form a hidden layer, to which a softmax classifier is attached. Figure 4.3 shows the architecture of the convolutional net. In the graph, a sentence matrix of 4 words is presented, to which convolution filters of sizes 2 and 3 are applied (in this case, there are 2 filters per region size). As a result of convolving the sentence matrix with a filter of size 2, a 3-dimensional vector is produced as output. At this point the non-linearity is applied, and the max-pooling operation acts on each of the filters' results. At the end, the softmax layer is shown (for simplicity not all connections are drawn, but it corresponds to a fully connected layer).

Figure 4.3: Convolutional neural network.
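A minimal numpy sketch of the convolution and max-pooling step of Figure 4.3: filters of region sizes 2 and 3 slide over a 4-word sentence matrix, a tanh non-linearity is applied, and each filter's outputs are max-pooled before concatenation. The random filters and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, sent_len = 50, 4
sentence = rng.normal(size=(sent_len, emb_dim))               # one vector per word

def conv_max_pool(sentence, region_size, n_filters):
    """Convolve filters spanning `region_size` words, apply tanh, max-pool over positions."""
    filters = rng.normal(scale=0.1, size=(n_filters, region_size * emb_dim))
    windows = np.stack([sentence[i:i + region_size].ravel()
                        for i in range(sent_len - region_size + 1)])
    feature_maps = np.tanh(windows @ filters.T)               # (positions, n_filters)
    return feature_maps.max(axis=0)                           # one pooled value per filter

hidden = np.concatenate([conv_max_pool(sentence, 2, 2),
                         conv_max_pool(sentence, 3, 2)])
print(hidden.shape)   # 4 pooled features that would feed the softmax layer
```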

In order to prevent overfitting, Dropout [60] is applied to the hidden layer at training time. We run experiments to set the optimal Dropout hyperparameter value. The authors in [67] show that enforcing an additional L2 regularization on the weights has little effect on the end results, hence we do not apply L2 regularization.

As is the case with the hidden-layer feed-forward networks described above, we investigate the usefulness of training word embeddings from scratch, intersecting them with pre-trained representations, and switching between fine-tuning them or only learning the filters.

Table 4.1 summarizes the neural network architectures described before, the abbreviations used in the following sections, and their regularization methods.

Table 4.1: Neural Network architectures summary.

Full name                  Abbreviation     Regularization
Feed-forward               Feed-forward     Dropout
Simple Recurrent           Simple RNN       Dropout
Long Short-Term Memory     LSTM             Dropout
Gated Recurrent Unit       GRU              Dropout
Last Tag                   Last Tag         L2
Convolutional              Convolutional    Dropout

4.1.2 Conditional Random Fields

The advantage of Conditional Random Fields [21] [22] with respect to other, non-sequential Machine Learning models, such as Neural Networks, is that while the latter compute the probability of a given sequence by assuming independence between tags, as shown in Equation 4.1, CRFs model the probability of the entire sequence by considering how likely a tag is given a certain clique of the graphical model, as well as how likely the precedence of tag pairs is, as shown in Equation 4.2.

p(y \mid X) = \prod_{t=1}^{T} p(y_t \mid X)   (4.1)

p(y \mid X) = \frac{1}{Z(X)} \exp \left\{ \sum_{t=1}^{T} a_{\text{unary}}(y_t) + \sum_{t=1}^{T-1} a_{\text{pairwise}}(y_t, y_{t+1}) \right\}   (4.2)

The CRFs used in this work are linear-chain Conditional Random Fields, trained using the Limited-memory Broyden-Fletcher-Goldfarb-Shanno method [68] [69] with L2 regularization in order to avoid overfitting the training data.

The Conditional Random Field models are implemented using CRFSuite [70] (following [71] when real-valued features are used).
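A minimal sketch of training such a linear-chain CRF through the sklearn-crfsuite wrapper around the same CRFSuite library, using L-BFGS and L2 regularisation as described above; the feature keys, tag names, and the toy sequence are all illustrative, not the thesis's actual feature set.

```python
import sklearn_crfsuite

# One training sequence: a list of per-token feature dicts and the aligned tag list.
X_train = [[{"word": "mrs", "is_capitalised": True},
            {"word": "smith", "is_capitalised": True},
            {"word": "bed", "is_capitalised": False}]]
y_train = [["NA", "PatientIntroduction Lastname", "NA"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",     # Limited-memory BFGS, as in the thesis
    c2=1.0,                # L2 regularisation strength
    max_iterations=100,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```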

4.2 Feature set extension

Generally speaking, Machine Learning models for classification work by finding relations between the features of their input entities. While they differ in the objective function they optimize and the optimization algorithm they use, much of their performance depends on the quality of the features they consume.

The original dataset comes with a set of features at the word level, e.g. the word lemma, the word's part-of-speech tag, its named-entity-recognition label, etc. We use a subset of these features, composed only of domain-independent features, extend them, and investigate ways in which all this related information can be brought together so as to maximize the correctness of the classification prediction. Additionally, while searching for a suitable model, we analyse the impact of each feature and the reason for that behaviour. In this section, we aim at answering RQ2: What are the most important features for automatically filling in nurse handover reports?

4.2.1 Feature set description

In this section we describe the set of features used. They can be categorized into the following groups, each one adding a new level of information to the feature space, which makes them attractive to investigate:

• Lexical features: word surface forms, tokens' prefixes and suffixes, word capitalization and digit indicators.

• Positional features: the document section number in which the token lies.

• Syntactic features: word tenses and part-of-speech tag embeddings.

• Semantic features: Word2Vec similar words, Word2Vec vector clustering, and Latent Dirichlet Allocation for topic modelling.

• Knowledge-based features: information taken from the Google Knowledge Graph.

Lexical features

Word surface form. The word surface form corresponds to the form of a word as it appears written in the text. While in CRFs we can use this feature as a string value, Neural Networks need a real-valued vector form. We experiment with different representations and compare their performance. Given the size of the dataset (approximately 8000 tokens), we hypothesize that it does not contain enough tokens to properly train word embeddings from scratch (that is, starting from an arbitrary initialisation that does not include prior linguistic knowledge) and that pre-trained vectors add expressive power to the models because they were already trained to incorporate semantic relations between words. For this reason, the word embeddings matrix is randomly initialized and intersected with pre-trained vector representations. We experiment with the Google newswire [72] and the GloVe [73] embeddings. The former word embeddings were trained following the Word2Vec implementation [74] and contain 300-dimensional vectors; the GloVe embeddings were likewise trained on large text corpora and also have 300 dimensions.

Table 4.2: Word embeddings initialisation.

Initialization                Description
Google newswire embeddings    Random initialization intersected with the Google newswire pre-trained vectors
GloVe embeddings              Random initialization intersected with the GloVe pre-trained vectors
Random                        Random initialization with no intersection
Probabilistic                 Represent the word as the probability of it being assigned to each tag
One-hot encoding              One-hot vector representation

We experiment with the following word embedding settings: a) training word embeddings from scratch; b) initializing them with pre-trained representations and fine-tuning them on the clinical dataset. Table 4.2 shows the methods used to initialise the word representations.

Tokens' prefixes and suffixes. We utilize the first and last three characters of every token as additional features.

Is the word capitalized? This feature looks at a token and indicates whether it contains an upper-cased character. Figure 4.4 depicts the prior and posterior probability of a label given whether a token is capitalized or not.

As can be seen in the graph, if we know that the word is capitalized, some probability mass is concentrated in the name and last name categories ("PatientIntroduction GivenNames/Initials", "PatientIntroduction Lastname", "PatientIntroduction UnderDr Lastname"), but the label that accounts for most of the mass is "PatientIntroduction Gender"; this is because sentences often start with capitalized personal and possessive pronouns referencing the patient (e.g. "He came in with headache and vertigo..."). It is also noticeable that while "PatientIntroduction UnderDr Lastname" has a relatively high probability, "PatientIntroduction UnderDr GivenNames/Initials" has almost none; this can be explained by the few samples this category has (namely, 15 samples).

If the given sample is not capitalized, then the majority of the probability is assigned to the "NA" tag.

Figure 4.4: Feature: is the word capitalized?

Does the token contain a digit? This feature looks at a token and indicates whether it contains a digit. Figure 4.5 illustrates the prior and posterior distributions of the labels given this feature. There are no surprises in this case: as could be expected, the categories that are naturally related to numbers account for most of the probability mass: "PatientIntroduction Ageinyears", "PatientIntroduction CurrentRoom", and "PatientIntroduction CurrentBed".

Figure 4.5: Feature: does the token contain a digit?
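A minimal sketch of the lexical features described above (surface form, prefix/suffix, capitalization, and digit indicators) expressed as the key-value pairs a CRF toolkit can consume; the key names are illustrative.

```python
def lexical_features(token):
    """String/boolean lexical features for one token."""
    return {
        "word": token.lower(),
        "prefix3": token[:3],
        "suffix3": token[-3:],
        "is_capitalised": any(ch.isupper() for ch in token),
        "has_digit": any(ch.isdigit() for ch in token),
    }

print(lexical_features("Bed7"))
```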

Positional features

Section number. When telling a story, writing a document, or narrating the status of a patient, there is a latent, inherent structure that we use to present all details in a clear manner. We can expect these written handover reports to also have a section structure, such as: introduction of the patient, current status, and future steps.

Figure 4.6 shows how each tag is distributed across document sections, considering that each document can be divided into 6 sections. It is interesting to notice that some categories are almost exclusively used at the very beginning of the descriptions (these include introductory information such as names, rooms, and beds), while others are used evenly throughout the document (like the "NA" tag). The rest of the labels share the characteristic of not appearing at the very beginning, and have a slight preference for being used in the middle or at the end of the document. This graph also clarifies the use of the "PatientIntroduction Gender" tokens; as mentioned before, patients are first introduced and then referenced using personal and possessive pronouns.

Figure 4.6: Feature: section number distribution.

Syntactic features

Word tense. If we consider the way in which the handover knowledge is communicated from a temporal point of view, it seems natural to think that there is a clear distinction between current information, such as the patient's data and current status; past information, like chronic conditions and the patient's history; and actions which are placed in the future, such as expected outcomes and future care. With this in mind, and using the part-of-speech tags and the dependency parses, we try to associate temporal information with each token in the dataset. The procedure is shown in Algorithm 1.

Function extract_tenses(sentence):
    Determine the positions of the verbs in the sentence
    switch (number of verbs):
        case 0:
            Assign all tokens to present tense            // default value
        case 1:
            Assign all tokens to that verb's tense
        case > 1:
            foreach token in sentence:
                determine_token_tense(token)

Function determine_token_tense(token):
    if token is a verb:
        Assign token to that verb's tense
    else if token has no governor:
        Assign token to present tense                     // default value
    else:
        /* Backtrack to find an associated verb. */
        determine_token_tense(governor(token))

Algorithm 1: Tense extraction algorithm.
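A Python rendering of Algorithm 1, assuming each token carries its POS tag and the index of its governor; the Token structure and the example sentence are hypothetical, and the special cases discussed next are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional

POS_TENSE = {"VB": "Present", "VBD": "Past", "VBG": "Present",
             "VBN": "Past", "VBP": "Present", "VBZ": "Present", "MD": "Modal"}

@dataclass
class Token:
    pos: str
    governor: Optional[int]            # index of the governing token, None for the root

def token_tense(tokens: List[Token], i: int) -> str:
    token = tokens[i]
    if token.pos in POS_TENSE:                        # the token itself is a verb
        return POS_TENSE[token.pos]
    if token.governor is None:                        # no governor: default value
        return "Present"
    return token_tense(tokens, token.governor)        # backtrack to an associated verb

def extract_tenses(tokens: List[Token]) -> List[str]:
    verb_tenses = [POS_TENSE[t.pos] for t in tokens if t.pos in POS_TENSE]
    if not verb_tenses:                               # no verb: default value
        return ["Present"] * len(tokens)
    if len(verb_tenses) == 1:                         # a single verb sets the whole sentence
        return [verb_tenses[0]] * len(tokens)
    return [token_tense(tokens, i) for i in range(len(tokens))]

sentence = [Token("PRP", 1), Token("VBD", None), Token("NN", 1)]   # "He reported pain"
print(extract_tenses(sentence))
```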

In addition to the algorithmic procedure presented above, we take care of the following cases: a) punctuation marks have no tense, and we assign an "NA" value to them; b) a POS tag "MD" together with the word surface "will" gets a "Future" tense assigned to it; c) a POS tag …

The part-of-speech tag to verb tense mapping was taken from the Penn Treebank description, and is listed in Table 4.3.

Table 4.3: POS tag tense mapping.

POS tag    Tense
VB         Present
VBD        Past
VBG        Present
VBN        Past
VBP        Present
VBZ        Present
MD         Modal

Because this algorithm relies on prior POS tagging and dependency parsing steps, the quality of its outcome depends on the quality of those parsers; if a sentence is wrongly parsed, the tense extraction will not be reliable. Figure 4.7 shows an example of a wrongly tagged sentence, which causes the whole sentence to be associated with a "Past" tense when it should actually be attached to a "Present" one.

Figure 4.7: POS incorrect parsing example.

Another drawback of the algorithm is that discontinuous present tense forms like ”He’s also got” and complex formations like the one shown in Figure 4.8 will not be correctly assigned.

Figure 4.8: Incorrect tense extraction example.

Part-of-speech tag. As mentioned before, the training, validation, and test datasets include part-of-speech labels for the tokens in the text. These are the result of parsing the written documents using the Stanford CoreNLP POS tagger [53] [75] [76].

In addition to the given parsing, we train vector representations for the word tokens and POS tags and use these mathematical objects as supplementary features. We hypothesize that vectors pre-trained in this way carry more information than simple one-hot encodings or random initialisations, and thus improve prediction performance. We denominate this embedding initialisation the "Skipgram initialisation". Implementing the well-known Word2Vec Skipgram model [74], we use the WSJ corpus, which is annotated with part-of-speech labels, and train a neural network model to predict the POS labels that form the context of a given POS tag. In the example of Figure 4.9, the label "DT" would be used to predict "IN" and "NNP", and the label "NNP" would be used to predict "DT" and "CD".

Figure 4.9: POS embeddings example.

As a pre-processing step, we set a minimum count threshold and replace words that appear fewer than that number of times (rare words) with a special token "<UNK>".

The neural network is implemented in Tensorflow [56]. It is trained with Backpropagation [58] and minibatches. The loss function to optimize is based on the noise-contrastive estimation function [77].
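A minimal sketch of learning such POS-tag embeddings with gensim's skip-gram implementation, as an alternative to the TensorFlow model with NCE loss used in the thesis; the toy tag sequences and all hyperparameters are illustrative.

```python
from gensim.models import Word2Vec

# Sequences of POS tags, one list per sentence (toy examples; in the thesis
# these come from the WSJ portion of the Penn Treebank).
pos_sentences = [
    ["DT", "NN", "VBD", "IN", "DT", "NN"],
    ["PRP", "VBZ", "DT", "JJ", "NN"],
]

model = Word2Vec(
    pos_sentences,
    vector_size=25,    # embedding dimensionality (gensim >= 4; `size` in older versions)
    window=2,          # context window of neighbouring tags
    sg=1,              # skip-gram, as in the Word2Vec model the thesis follows
    min_count=1,
)
print(model.wv["DT"][:5])
```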

Semantic features

The following features aim at achieving better generalisation performance by finding relations in the tokens' semantic space. W2V vectors directly use the pre-trained embedding values, while W2V similar words looks for words shared between tokens, and K-means and Latent Dirichlet Allocation act at a higher level of abstraction by first clustering tokens and later using their associated centroids and topic numbers.

Because we are using an external data source, it is possible that we cannot retrieve a value for a word in the training set. In this case, we complete the feature values with a default symbol.

Word2Vec vectors. We use a pre-trained Word2Vec model and feed the vector dimension values as numeric features.

Word2Vec similar words. We use a pre-trained Word2Vec model to extract the five most similar words to a given token, by comparing the cosine similarity of their vectors, and use these related words as additional string features in the classification model.

K-means clusters. Taking the vector representations of the words in the dataset from a pre-trained Word2Vec model, we group them into 100 clusters using a K-means model [78] [79] and utilize the associated cluster centroids as string features.
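A sketch of the two Word2Vec-based semantic features above (most similar words and cluster membership) using gensim and scikit-learn; the pre-trained vector file path, the example vocabulary, and the small cluster count are assumptions (the thesis uses 100 clusters).

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Pre-trained vectors, e.g. the Google newswire binary file (path is an assumption).
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Feature 1: the five most similar words of a token.
similar = [w for w, _ in wv.most_similar("headache", topn=5)]

# Feature 2: cluster the dataset vocabulary and use the cluster id as a string feature.
vocab = ["headache", "paracetamol", "bed", "surgery", "nausea"]
present = [w for w in vocab if w in wv]
vectors = np.stack([wv[w] for w in present])
kmeans = KMeans(n_clusters=2, n_init=10).fit(vectors)   # 100 clusters in the thesis
cluster_of = dict(zip(present, kmeans.labels_))
print(similar, cluster_of)
```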

Latent Dirichlet Allocation. Using the English Wikipedia articles (https://dumps.wikimedia.org/enwiki/) and Gensim's Topic Modelling tool [80], we train a Latent Dirichlet Allocation (LDA) model [81]. Next, we use the five most probable topic numbers assigned to each token in the dataset as string features.

Knowledge-based features

Google Knowledge Search Graph. Google provides a search API that allows us to query the Knowledge Graph, a network with millions of entries that describe real-world entities such as people, places, and things. We access this knowledge database and expand the feature set with the entities related to each token in the dataset.

4.2.2 Feature integration

Different Machine Learning models require different feature representations. Conditional Random Fields, for example, allow us to integrate feature values either as strings, integers, or real-valued vector representations of those strings. When experimenting with Artificial Neural Networks, we are obliged to provide a real-valued object as input to the model; in those cases, we transform the string feature value into a vector representation.

When using CRFs, we implement combinations of the features previously described and compare their performance. On the neural network side, because we have a small training dataset, we hypothesize that neural networks will not be able to effectively train parameters for all the aforementioned features. For this reason, we select a subset of the feature list. Moreover, we hypothesize that all neural network models are expressive enough to learn complex datapoint relations, but that, due to the limitations of the dataset, they will all perform at a comparable level. Consequently, we focus on two neural architectures, one that represents a simple model and another that is considerably more complex. A hidden-layer feed-forward and a convolutional network with the previously described architectures are used, where embeddings for the features are either learnt from scratch (using a random initialisation) or fine-tuned while training for classification (that is, embeddings are initialised using pre-trained representations). Table 4.4 outlines which features are implemented in which models and how they are used.

4.3 Model stacking

Previously, statistical models were trained to receive a sample (either a single word or an entire sequence) and discriminate among the entire category set. Given the training data size, the number of output labels, and thus the number of samples per label, we argue that breaking down the classification task into different steps can yield better results.

Table 4.4: Feature integration.

CRF:
- Word surface: as string value
- Word lemma: as string value
- Part-of-speech tag: as string value
- Named entity recognition: as string value
- Parse tree: as string value
- Word governors: as string value
- Word dependents: as string value
- Prefix & suffix: as string values
- Word capitalised?: as boolean value
- Word has digit?: as boolean value
- W2V vectors: vector dimensions as numeric values
- W2V similar words: top 5 most similar words as string values
- K-means: centroids as string values
- LDA: top 5 most probable topic numbers as string values
- Section number: as string value
- Word tense: as string value

Feed-forward net & Convolutional net:
- Word surface: real-valued vector with random initialisation and pre-trained embeddings intersection
- Part-of-speech tag: real-valued vector with random initialisation; real-valued vector with skipgram initialisation
- Named entity recognition: real-valued vector with random initialisation
- Section number: real-valued vector with random initialisation
- Word tense: real-valued vector with random initialisation

In this section we investigate RQ3: Can we stack several models in a pipeline so as to maximize the correctness of the report fill-in?

For the purposes of these experiments, the models to be trained consist of Neural Network and Conditional Random Field architectures like the ones detailed above, as well as Random Forests for classification [82].

4.3.1 Hierarchical classification

If we look at the handover form presented in Figure 1.3, we can see that the 36 tags can be aggregated into bigger groups (the meta-tags introduced earlier). We hypothesize that this document structure can be exploited to achieve better performance. The proposed pipeline consists of training a model to predict the meta-tags and then using separate models to further discriminate at the detailed-tag level, as sketched below.
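A minimal sketch of this two-stage pipeline, assuming pre-computed token feature matrices and numpy label arrays (all variable names are illustrative, and Random Forests stand in for any of the classifiers mentioned above):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def train_hierarchical(X_train, y_meta, y_detail):
        # Stage 1: one classifier that predicts the meta-tag of each token.
        meta_clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_meta)
        # Stage 2: one classifier per meta-tag, trained only on its tokens,
        # that discriminates among the detailed tags of that group.
        detail_clfs = {}
        for meta in np.unique(y_meta):
            mask = (y_meta == meta)
            detail_clfs[meta] = RandomForestClassifier(n_estimators=200).fit(
                X_train[mask], y_detail[mask])
        return meta_clf, detail_clfs

    def predict_hierarchical(meta_clf, detail_clfs, X):
        metas = meta_clf.predict(X)
        return np.array([detail_clfs[m].predict(x.reshape(1, -1))[0]
                         for m, x in zip(metas, X)])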

4.3.2 Layered prediction

Another way to structure the prediction task into steps is to select a subset of the 36 tags and perform a first classification on them: for example, first predict whether the data points belong to the category ”NA” or not, and later take the samples marked ”not NA” and assign them to the remaining tags, as presented in Figure 4.10.

Figure 4.10: Layered prediction pipeline.
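A sketch of this variant, again with illustrative variable names: a binary ”NA”/”not NA” classifier is trained on all tokens, and a second classifier over the remaining 35 tags is trained only on the ”not NA” portion of the training data.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def train_layered(X_train, y_train):
        is_na = (y_train == "NA")
        stage1 = RandomForestClassifier().fit(X_train, is_na)           # NA vs not NA
        stage2 = RandomForestClassifier().fit(X_train[~is_na], y_train[~is_na])
        return stage1, stage2

    def predict_layered(stage1, stage2, X):
        labels = np.full(len(X), "NA", dtype=object)
        not_na = ~stage1.predict(X).astype(bool)   # tokens predicted as "not NA"
        if not_na.any():
            labels[not_na] = stage2.predict(X[not_na])
        return labels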

4.3.3 Neural Conditional Random Fields

If we take into account the independence assumption established when computing the joint probability of a sequence of labels using neural networks (Equation 4.1), we can observe that there is information in the label precedence that these models are not able to capture. We would expect that a combination of labels such as ”PatientIntroduction CurrentBed” followed by ”Future Goal/TaskToBeCompleted/ExpectedOutcome” would not be predicted by the neural network models if it has not been seen in the training data, but we cannot guarantee this is the case when independently predicting one label at a time. One way in which we can enhance the inference step is by taking the pre-activations of the neural network and feeding them as real-valued unary features into a Conditional Random Field, which also incorporates the relevant knowledge of label combinations in its pairwise feature functions. The inference step of the CRF will later make use of all this information to perform the classification.

(40)

Following the work in [83], we interface these two models by scaling the neural layer values by a real number, which is a new hyperparameter to be set.
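Under the simplifying assumption that the pairwise feature functions reduce to a single transition score matrix A (an illustrative notation, not necessarily the exact parameterisation used), the combined scoring function can be written as

    P(y_{1:T} \mid x_{1:T}) \;\propto\; \exp\Big( \sum_{t=1}^{T} \big( \gamma \, z_t[y_t] + A_{y_{t-1},\, y_t} \big) \Big),

where z_t is the vector of neural network pre-activations for token t, z_t[y_t] its entry for label y_t, and \gamma the scaling hyperparameter mentioned above; inference then maximises this score over whole label sequences, for instance with the Viterbi algorithm.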

4.4 Dataset limitations

In a previous chapter we presented an analysis of the dataset, and we identified two important characteristics: a) the number of unseen words between datasets; and b) the skewness of the empirical distribution towards the ”NA” tag and the lack of data, due to some tags having only a few training samples. In the following subsections we will describe a group of approaches that try to overcome this situation and answer RQ4: Can we alleviate dataset limitations, such as unseen words and training distribution skewness?

4.4.1 Out of vocabulary words

We use an English thesaurus that is part of the MyThes project [84], and we replace unseen words in the validation and testing sets with synonyms that are found in the training set. We argue this method reduces the number of out-of-vocabulary words, but it also introduces noise into the datasets, because there is no guarantee that the replacement synonyms preserve the meaning of the sentences.
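A sketch of this replacement step, using NLTK's WordNet interface as a stand-in for the MyThes thesaurus (the choice of thesaurus and the tie-breaking rule are illustrative assumptions):

    from nltk.corpus import wordnet  # requires the NLTK 'wordnet' corpus

    def replace_oov(tokens, train_vocab):
        out = []
        for tok in tokens:
            if tok in train_vocab:
                out.append(tok)
                continue
            # Candidate synonyms of the unseen token.
            candidates = {lemma.name().replace("_", " ")
                          for synset in wordnet.synsets(tok)
                          for lemma in synset.lemmas()}
            in_vocab = sorted(candidates & set(train_vocab))
            # Keep the token unchanged if no synonym occurs in the training set.
            out.append(in_vocab[0] if in_vocab else tok)
        return out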

4.4.2 Data skewness

Data augmentation

The main idea behind data augmentation is to increase the dataset size by generating more training samples and thus increasing the generalization property of the models. One approach to achieve this goal could be to gather clinical shift records and manually annotate them following the procedure applied to generate the NICTA dataset. This strategy would result in high quality training data, given that the proper labeling procedure is applied, at the expense of a very time consuming task. Instead, we opted for a more automated, though noisier, technique. As described in [85], text has an inherent syntactic and semantic structure that narrows the possibilities of generating more data samples, and the most natural choice ends up being replacing words by their synonyms using a dictionary. For this purpose, we use an English dictionary included in the MyThes project [84], and replace words in a given clinical label with synonyms. We repeat this process making sure not to exceed the number of instances of the biggest category (i.e. the ”NA” category), because we do not want to shift the skewness from the ”NA” tag to another one.
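A sketch of one augmentation pass, where synonyms is a word-to-synonyms lookup built from the same thesaurus and cap is the size of the largest (”NA”) category (all names are illustrative):

    import random
    from collections import Counter

    def augment_once(samples, synonyms, cap):
        # samples: list of (tokens, label) pairs for the labelled text spans.
        counts = Counter(label for _, label in samples)
        augmented = list(samples)
        for tokens, label in samples:
            if counts[label] >= cap:
                continue  # never exceed the size of the biggest category
            replaceable = [i for i, t in enumerate(tokens) if synonyms.get(t)]
            if replaceable:
                i = random.choice(replaceable)
                copy = list(tokens)
                copy[i] = random.choice(synonyms[copy[i]])
                augmented.append((copy, label))
                counts[label] += 1
        return augmented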

(41)

Samples normalization

In this case, similarly to the above procedure, we generate more data points from the smaller categories, but this time we do so by simply duplicating the existing samples. No word replacements take place during this procedure.

Objective function modification

We change the Neural Network cross-entropy loss function and add a weighted term that penalises making ”NA” predictions. This approach aims at reducing ”NA” misclassifications, but it is also expected to hinder the classification of true ”NA” samples.
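One plausible formulation of this modified objective, sketched in numpy for a batch of softmax outputs (the exact weighting used in the experiments may differ):

    import numpy as np

    def na_penalised_loss(probs, targets, na_index, lam=0.5):
        # probs: (batch, n_classes) softmax outputs; targets: (batch,) gold label ids.
        batch = np.arange(len(targets))
        cross_entropy = -np.mean(np.log(probs[batch, targets] + 1e-12))
        # Extra term: average probability mass assigned to the "NA" class.
        na_penalty = np.mean(probs[:, na_index])
        return cross_entropy + lam * na_penalty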


Results and analysis

Baseline results

NICTA Conditional Random Field approach

We begin by presenting the results obtained using the NICTA team's Conditional Random Field classifier [20], which constitutes one of the baseline methods. Table 5.1 shows the results for the training, validation, and test sets. Figure 5.1 plots the confusion matrix of the classification outcomes.

Table 5.1: NICTA baseline results.

Dataset      Micro P/R/F1             Macro P/R/F1             NA P/R/F1
Training     1.000 / 1.000 / 1.000    1.000 / 1.000 / 1.000    1.000 / 1.000 / 1.000
Validation   0.651 / 0.458 / 0.538    0.478 / 0.319 / 0.340    0.654 / 0.936 / 0.770
Testing      0.524 / 0.398 / 0.452    0.411 / 0.247 / 0.263    0.644 / 0.873 / 0.741

If we look at the numeric results, we can see that the CRF is greatly overfitting the training data, achieving the maximum scores for that dataset. Furthermore, examining the confusion matrix, it is clear that almost every category has some degree of misclassification with the ”NA” label. The most problematic labels appear to be ”PatientIntroduction AdmissionReason/Diagnosis”, ”PatientIntroduction Disease/ProblemHistory”, ”MyShift Status”, ”Appointment/Procedure Description”, and ”NA”. One characteristic that identifies these five categories is that they are free-text, narrative categories; they involve a somewhat short description, instead of a specified/predictable value that could be drawn from a fixed set. Categories that can be associated with this idea of being picked from a fixed-size set seem to be correctly classified (take, for instance, ”PatientIntroduction GivenNames/Initials” and ”PatientIntroduction Gender”).

CLEF e-Health 2016 participating teams’ results


Figure 5.1: NICTA baseline confusion matrix.

Table 5.2: CLEF e-Health 2016 Task 1 participants results.

Team run     Dataset      Micro P/R/F1             Macro P/R/F1             NA P/R/F1
TUC-MI-A     Training     1.000 / 1.000 / 1.000    1.000 / 1.000 / 1.000    1.000 / 1.000 / 1.000
             Validation   0.461 / 0.322 / 0.330    0.542 / 0.463 / 0.500    0.721 / 0.872 / 0.789
             Testing      0.423 / 0.300 / 0.311    0.503 / 0.443 / 0.471    0.726 / 0.850 / 0.783
TUC-MI-B     Training     0.998 / 0.995 / 0.996    0.997 / 0.998 / 0.997    0.998 / 0.997 / 0.998
             Validation   0.511 / 0.382 / 0.386    0.577 / 0.509 / 0.541    0.737 / 0.862 / 0.794
             Testing      0.493 / 0.369 / 0.382    0.500 / 0.505 / 0.503    0.812 / 0.802 / 0.807
ECNU ICA-A   Training     0.995 / 0.992 / 0.994    0.995 / 0.991 / 0.993    0.993 / 0.998 / 0.995
             Validation   0.467 / 0.329 / 0.345    0.655 / 0.478 / 0.553    0.667 / 0.927 / 0.775
             Testing      0.493 / 0.406 / 0.374    0.510 / 0.522 / 0.516    0.816 / 0.788 / 0.802
ECNU ICA-B   Training     0.454 / 0.328 / 0.344    0.461 / 0.528 / 0.492    0.864 / 0.706 / 0.777
             Validation   0.483 / 0.313 / 0.331    0.603 / 0.454 / 0.518    0.677 / 0.920 / 0.780
             Testing      0.428 / 0.292 / 0.297    0.581 / 0.459 / 0.513    0.675 / 0.881 / 0.764

Runs TUC-MI-A and TUC-MI-B correspond to two different CRF hyperparameter settings over a set of 41 features derived from Stanford CoreNLP, Latent Dirichlet Allocation, regular expressions, and the WordNet and UMLS ontologies. Runs ECNU ICA-A and ECNU ICA-B, on the other hand, use a combination of NICTA's baseline CRF with a set of hand-crafted rules.

Lookup prediction

A simple but very informative approach to classifying text tokens consists of selecting the label to which a given input token is most often assigned in the training dataset. Because, as we have already seen, there are many tokens in the validation and testing sets that are not part of the training set, we could opt to: a) sample a label from the empirical distribution; or b) assign a special label to those novel tokens (e.g. an ”#IDK#” label).
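A compact sketch of this lookup baseline, where train_pairs is an assumed list of (token, label) pairs from the training set:

    import random
    from collections import Counter, defaultdict

    per_token = defaultdict(Counter)
    for token, label in train_pairs:
        per_token[token][label] += 1
    lookup = {tok: cnt.most_common(1)[0][0] for tok, cnt in per_token.items()}

    # Empirical label distribution, used when sampling for unseen tokens.
    label_freq = Counter(label for _, label in train_pairs)
    labels, weights = zip(*label_freq.items())

    def predict(token, sample_unknown=True):
        if token in lookup:
            return lookup[token]
        if sample_unknown:                       # option (a): sample a label
            return random.choices(labels, weights=weights, k=1)[0]
        return "#IDK#"                           # option (b): special label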


Table 5.3: Lookup baseline results.

Method        Dataset    Micro P/R/F1             Macro P/R/F1             NA P/R/F1
Sampling      Testing    0.463 / 0.377 / 0.416    0.355 / 0.256 / 0.271    0.700 / 0.890 / 0.783
No sampling   Testing    0.550 / 0.373 / 0.445    0.469 / 0.253 / 0.289    0.772 / 0.859 / 0.813

Despite its simplicity, this method achieves a high performance, matching (and even slightly exceeding) that of the NICTA approach, with a much lower complexity and using much less information.

The values obtained for both variants are roughly the same, with only some small differences in the micro- and macro-averaged precision scores. When we do not sample from the empirical distribution, all the novel tokens end up in the general-purpose category termed ”#IDK#”. If we instead sample a label for those tokens, then the majority of these assignments are translated into misclassifications with the ”NA” label.

Random prediction

The random prediction baseline consists of simply picking a label from the empirical distribution for each token, independently of one another. Table 5.4 shows the scores obtained for this baseline.

Table 5.4: Random baseline results.

Dataset    Micro P/R/F1             Macro P/R/F1             NA P/R/F1
Testing    0.027 / 0.025 / 0.026    0.017 / 0.016 / 0.015    0.412 / 0.445 / 0.428

Naturally, when randomly picking labels under the empirical distribution, the majority of the samples are going to be labeled ”NA”, as it accounts for the vast majority of the probability distribution mass. Consequently, with a high number of false positives for the ”NA” label and a few true positives for the remaining labels, these values can be taken as the lowest baseline scores.

Collobert’s tagger

Reproducing the work done in [37] for the convolutional window approach yields the results shown in Table 5.5 and Figure 5.2. We implement the neural network using one hidden layer with a tanh activation function. We apply Dropout [60] to prevent overfitting. We experiment with different numbers of convolution filters and filter sizes, training and fine-tuning values, and position embedding sizes.

Training is done using Backpropagation [58] with minibatches and Adagrad [59] as the optimizer. The optimal parameters correspond to minibatches of 256 samples, a single learning rate of 0.1, position embeddings of 100 dimensions, and 400 filters of sizes 2, 3, and 5.


Table 5.5: Collobert’s neural network baseline results.

Dataset    Micro P/R/F1             Macro P/R/F1             NA P/R/F1
Testing    0.563 / 0.480 / 0.518    0.482 / 0.356 / 0.359    0.725 / 0.882 / 0.796

Figure 5.2: Collobert’s baseline confusion matrix.

Most of the confusion occurs between the labels ”Future Goal/TaskToBeCompleted/ExpectedOutcome” and ”NA”.

Bonadiman’s tagger

Following [38], we set up an equivalent neural network with a one-hidden-layer architecture and a tanh nonlinearity, applying L2 regularization to the word embeddings, the label embeddings, and the softmax layer weights. Training is done using Stochastic Gradient Descent with Backpropagation [58] and Adagrad [59] as the optimizer. The initial learning rate is set to 0.01, and we implement a learning rate decay approach by halving the learning rate whenever the current epoch's validation error is larger than the previous one. We initialize the word embeddings randomly and intersect them with the pre-trained Google newswire vector representations.
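The decay schedule amounts to a few lines of training-loop logic; in the sketch below, train_one_epoch, validation_error, and the surrounding variables are assumed helpers rather than the thesis code:

    lr, prev_error = 0.01, float("inf")
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data, learning_rate=lr)
        error = validation_error(model, validation_data)
        if error > prev_error:   # validation got worse: halve the learning rate
            lr /= 2.0
        prev_error = error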

We run experiments using one- and two-hidden-layer architectures; in the latter case, the second hidden layer has 50, 500, 1500, or 3000 neurons. We vary the context window size between 1 (no context window), 3 (that is, 1 word to each side), 5, and 7, and the label embedding dimensions between 1 and 250. Lastly, we also study the influence of accumulating minibatch gradient updates and performing a single update with the mean of those gradients.


Best results were obtained when using a window size of between 5 and 7 tokens and a one hidden layer architecture with a label embedding size of 5. Summing or averaging gradients does not produce a relevant difference.

Table 5.6 and Figure 5.3 show the obtained results and the corresponding confusion matrix. Looking at the figure, we can conclude that the model is doing a good job on easy labels like ”PatientIntroduction Ageinyears”, ”PatientIntroduction CurrentRoom”, and ”PatientIntroduction CurrentBed”, but it misses other easy ones like ”PatientIntroduction GivenNames/Initials” and confuses most of the labels with ”NA”, which is also visible in the ”NA” recall values.

Table 5.6: Bonadiman’s neural network baseline results.

Dataset    Micro P/R/F1             Macro P/R/F1             NA P/R/F1
Testing    0.649 / 0.377 / 0.477    0.311 / 0.216 / 0.228    0.590 / 0.954 / 0.729

Figure 5.3: Bonadiman’s baseline confusion matrix.

5.1 Hand-crafted versus automatic feature based models

In this section we aim to compare the usefulness of hand-crafted versus automatic features, and to observe the classification behaviour when using different feature combinations, hence answering RQ1: Do hand-crafted and automatic feature based models produce
