
Annotating Dutch pathology reports with machine learning

Scientific Research Project

T.J. Henrich

10507280

July 2019

Location
Amsterdam UMC - Location AMC, Meibergdreef 9, 1105 AZ Amsterdam
Department of Medical Oncology

Mentor
Martijn G.H. van Oijen, PhD, Associate Professor

Tutor
Ronald Cornet, Dr. Ir., Associate Professor

Period

Contents

1 Introduction
2 Background
2.1 The data - pathology excerpts
2.2 Machine learning
2.3 Random Forest
2.4 Support Vector Machines
2.5 Neural networks
2.5.1 Convolutional neural networks
2.5.2 Recurrent neural networks
2.6 Embeddings
2.7 Conditional Random Fields
3 Methods
4 Results
4.1 Machine learning in other research
4.2 Performance in autocoding
4.3 Performance in named entity recognition
5 Discussion and conclusion

Abstract

Introduction

The nationwide network and registry of histo- and cytopathology in the Netherlands (PALGA) stores excerpts of pathology exams in the Netherlands. These pathology excerpts contain conclusion reports with the most important findings, which are written in free-text. Conclusion reports contain medical information that is indispensable for the evaluation and monitoring of population screening. Finding reports on a specific subject is difficult when the reports are written in free-text. Annotating free-text highlights key elements that can be used in search queries. Conclusion reports are annotated with codes, giving researchers a way to use the excerpts for population screening. A pathologist annotates a conclusion report by selecting codes from the PALGA thesaurus, a list of terms with corresponding codes. Manual coding is time consuming and has been shown to be prone to error; automatic coding can potentially solve both problems. In the past, a rule-based system called the "Vertaalmodule" was created to perform automatic coding of conclusion reports. Machine learning can also be used to annotate free-text, often with good results. While there is a rule-based system that can code conclusion reports, there is no machine learning solution yet. The main question answered in this thesis is derived from the potential of machine learning: how can machine learning be used to annotate pathology reports? To find an answer, several research questions have been formulated:

- What applications of machine learning are used to annotate clinical text in recent literature?
- What is the performance of machine learning on pathology excerpts in coding and NER?
- How does weak supervision affect the performance of machine learning?

Background

Clinical language processing is a field of study that focuses on (semi-)automatically processing clinical language to create predictive models or to structure data. The use of clinical language processing has increased substantially in recent years, due to the large amounts of unstructured data gathered by medical professionals and progress in the development of machine learning. The literature shows that machine learning models have successfully been used to annotate data before. The main advantage of machine learning is the large number of available models and the ability to choose the one that fits the data best. The main disadvantage is the large amount of annotated data needed to train them.

Methods

Two data sets provided by PALGA were used. The first dataset consists of 25,000 excerpts covering different types of pathology examinations; the second dataset consists of 25,000 excerpts of colorectal pathology examinations. A training set was created by annotating both data sets with the existing rule-based system, the Vertaalmodule. Because the terms selected by the Vertaalmodule are not perfect, using the generated training set for machine learning is called weak supervision. A third dataset, the validation set, was provided by pathologist G. Burger and consists of 1,000 conclusion reports covering different types of pathology examinations. The validation set has been labeled separately by two pathologists.

A CRF and a bidirectional long short-term memory model (BLSTM) were developed. For the BLSTM, combinations of word embedding, character embedding and conditional random fields were used and compared. The trained models were first tested on a ten percent sub-sample of the training sets and then on the validation set. To evaluate the performance of the created machine learning models, an F1-score was calculated for the automatic coding and named entity recognition (NER) tasks. The F1-score was calculated for every class (codes or NER tags) and as an average over all classes combined.


Results

The literature search resulted in 24 articles after applying the exclusion criteria. The best performing techniques in these articles had F1-scores between 0.59 and 0.98. A BLSTM with word and character embedding had the best performance. For autocoding, the highest average F1-score was 0.80 in the colorectal set and 0.74 in the mixed set. Compared to two pathologists, the highest average F1-score was 0.63. For NER, the highest average F1-score was 0.89 in the colorectal set and 0.87 in the mixed set.

Discussion and conclusion

The literature shows that NER and autocoding can successfully be implemented using machine learning. In this thesis, successful is defined as an F1-score higher than 0.8. The developed models can successfully be used to perform NER in pathology reports, with F1-scores over 0.8. In autocoding, the best performing model reached an F1-score of 0.79, very close to 0.80. The developed models perform similarly to the models found in other literature. Average F1-scores for autocoding compared to pathologists are much lower, but very close to those of the Vertaalmodule. The validity of the results is limited because the dataset used to compare coding to pathologists is not a gold standard. Because weak labels are used, annotating a large number of conclusion reports with the Vertaalmodule saves time while still giving a clear understanding of the required sample sizes and of model performance on pathology reports. A limitation of weak supervision is that the performance of the developed models is very similar to that of the rule-based system used to generate the weak labels. In future research, manual improvement of the generated training sets can increase the value of the weak labels and thereby the performance of the developed models. The good performance on NER tasks also enables the use of new tags in future research. In the medical and research field, this research can be used as a guideline to create and train machine learning models that annotate pathology reports. Machine learning can code reports, but the current models are not yet production ready because improvements to the training sets need to be made. Machine learning can also add new types of classification to the data, increasing the use cases for other research.

Machine learning can be used to annotate pathology reports using the BLSTM model with word and character embeddings. It is also possible to use labels generated by the Vertaalmodule to train machine learning models, but the quality of the weak labels needs to be improved manually before the performance of the trained models can surpass the Vertaalmodule.

1 Introduction

Since its founding, the nationwide network and registry of histo- and cytopathology in the Netherlands (PALGA) has gathered over 73 million pathology excerpts, with more than 2 million being added each year.[1] The purpose of PALGA is to promote communication and the provision of information within and between pathology laboratories, and to make the resultant knowledge available to health care. The excerpts contain important medical information that is indispensable for the evaluation and monitoring of population screening. The conclusion report, an important part of the pathology excerpt, is written in free-text. As described by Wilkinson et al.[2], data needs to be structured, annotated, or classified to allow computers and researchers alike to find the data they need. Excerpts are currently annotated manually by pathologists, who assign codes from the PALGA thesaurus. The PALGA thesaurus consists of 15,168 unique terms with 6,213 unique codes. With over 73 million pathology excerpts, manually coding all the data is time consuming and prone to error[3]. Fortunately, with the development of clinical language processing, a field of study that focuses on processing clinical text data, automatic structuring of data becomes possible. Automatic annotation of data saves time: instead of asking highly trained pathologists to code their excerpts, a machine can annotate parts of them.

In an effort to replace manual coding, a rule-based system was developed by a pathologist to automatically assign codes to pathology excerpts by processing the conclusion reports. As described by Burger et al.[4], who is also the creator of the rule-based system, similar state-of-the-art rule-based systems achieve F1-scores of up to 0.9. Rule-based systems are deterministic in nature and try to fully emulate the decision making of a pathologist. Although high F-scores can be achieved, downsides of rule-based systems are the amount of expert knowledge needed to modify them and the number of rules that can accumulate over time. While there is a rule-based system to annotate pathology excerpts, this thesis proposes an alternative approach using machine learning. Machine learning does not rely on the expert knowledge of the developer like a rule-based system does, but on data. Instead of following handcrafted rules, machine learning makes decisions based on the data it is given. According to Jovanović et al.[5], machine learning often outperforms rule-based systems, but requires a large expert-annotated corpus, which is expensive to develop. When going beyond simple machine learning algorithms and into deep learning neural networks, even more training samples are needed. To reduce this problem, Wang et al.[6] introduced weak supervision for clinical data, a method to train machine learning models on automatically generated, imperfect labels for a dataset. Models trained with weak supervision were able to achieve performance similar to models trained with strong supervision. Using machine learning and weak supervision provides a new way to annotate pathology excerpts. Where rule-based systems rely on a pathologist and a developer to create and maintain the rules, machine learning offers the flexibility of changing the training data to change the way it structures new data. Machine learning can potentially annotate reports even further over time as more labels are added to the training data. Where the developed rule-based system focuses on coding pathology excerpts, machine learning can also perform named entity recognition (NER). The main question we answer in this thesis is derived from the potential of machine learning: how can machine learning be used to annotate pathology reports? To answer this question, several research questions have been formulated:

- What applications of machine learning are used to annotate clinical text in recent literature?
- What is the performance of machine learning on pathology excerpts in coding and NER?
- How does weak supervision affect the performance of machine learning?

The main objective of this thesis is to use machine learning to annotate pathology reports, to measure the performance and to compare the performance with other literature. In this thesis, performance is measured with F1-scores. We train machine learning models to recognize named entities in a conclusion report and to assign the right codes to an excerpt. A good machine learning model can provide

2 Background

2.1 The data - pathology excerpts

In this thesis we focus on pathology excerpts containing a conclusion report that summarizes the most important findings of the pathologist. When a pathologist performs an examination of tissue, a report with their findings and an excerpt of this report are written. All excerpts are stored in a central database at PALGA. The conclusion report is written in free-text, in Dutch, and is kept as concise as possible. To be able to use and find the conclusion reports, they are coded using the PALGA thesaurus. The thesaurus contains a list of terms with their corresponding PALGA codes. PALGA codes are linked to SNOMED CT codes, but we focus on the PALGA codes in this document. A conclusion report can be seen as a list of words and the entirety of the PALGA codes as their potential labels. Because the conclusion reports are written in clinical language that is supposed to be readable by other pathologists, the reports contain many terms that do not have a code. It is also possible for a single word to have multiple codes; usually this occurs because the word is a concatenation of two words, something very common in Dutch. An example of a frequent concatenation is seen with "resectie" (resection), often used alone but also combined with another word, creating "colonresectie" (colon resection). Another frequent occurrence is a single code that consists of multiple words. In current practice, coding is done manually by the pathologist who submits the excerpt.

For this thesis, conclusion reports are taken as raw input data and the labels of a report as output data. For manual coding, this is how a pathologist can easily code the document. For use in machine learning and deep learning models, the conclusion reports are first converted to words, keeping their original order. Every word gets a tag if it corresponds with a term in the thesaurus, or an "O", which stands for outside of thesaurus, if it does not. The "O" tag ensures that machine learning models can distinguish between terms in the thesaurus and terms not in the thesaurus. A full conclusion report can be seen as sequential data consisting of a string of text with a fixed order. The problem machine learning has to solve is determining the right annotation for every named entity in a conclusion report. Because PALGA codes follow a structure that allows terms to be grouped together into named entities, machine learning can be trained to perform both named entity recognition (NER) and automatic coding (autocoding).
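As a concrete illustration, the sketch below converts a made-up conclusion report into word/tag pairs. The toy thesaurus, the codes and the word-by-word matching are simplified assumptions; the real PALGA thesaurus also contains multi-word terms and words with multiple codes.

```python
# Illustrative sketch only: a toy thesaurus and report stand in for the real
# PALGA data, and matching is done word by word.
toy_thesaurus = {
    "colonresectie": "P11400",     # hypothetical code values
    "adenocarcinoom": "M81403",
}

def tag_report(report):
    """Return (word, tag) pairs; words outside the thesaurus get the 'O' tag."""
    pairs = []
    for raw_word in report.lower().split():
        word = raw_word.strip(".,;:")
        pairs.append((word, toy_thesaurus.get(word, "O")))
    return pairs

print(tag_report("Colonresectie met adenocarcinoom."))
# [('colonresectie', 'P11400'), ('met', 'O'), ('adenocarcinoom', 'M81403')]
```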

Named entity recognition and automatic coding

Autocoding is adding codes to a conclusion report without manual effort. In autocoding, the input of a machine learning algorithm is a conclusion report and the result is one or more PALGA codes. NER is meant to recognize and classify named entities. In a conclusion report, named entities are created by grouping PALGA codes together. Examples of named entities in conclusion reports are Procedure (P), Topography (T) and Morphology (M). More information on PALGA codes and NER can be found in section 3.

2.2 Machine learning

Despite machine learning being a popular term in recent research, the term was already used by Arthur Samuel[7] in 1959. Machine learning is a collection of algorithms and statistical models. The purpose of machine learning is to allow a machine to learn to perform a task without using explicit instructions[7]; instead, a dataset of examples with an outcome is given. A dataset with input and known output is also known as a training set. Machine learning algorithms use a training set to create a decision structure that enables them to take a pathology report and provide a prediction for the desired outcome. The techniques we explore in this document are word and character embedding, conditional random fields, random forest, support vector machines and neural networks.[8] Before explaining the different machine learning algorithms and statistical models, we first explain the different types of machine learning.

Machine learning types

There are multiple ways of teaching a machine learning algorithm to perform a task. The two main types of learning are supervised learning and unsupervised learning. In supervised learning, the machine learning algorithm learns by receiving a set of labeled data, where conclusion report X has outcome Y. The benefit of supervised learning is that you can teach the model what is correct and what is not. The downside is that supervised learning often requires a large amount of manually labeled data, and manual annotation of data is time consuming and requires expert knowledge[3]. Unsupervised learning only requires a set of data and does not rely on experts to annotate it; the machine learning algorithm learns structures and relations between structures by itself. The downside of unsupervised learning is the inability to reproduce specific tasks such as autocoding or NER. In NER and autocoding, we want the machine learning model to generate a specific outcome, a NER tag or code. Because of this specific outcome, supervised learning is required. A third type of learning is transfer learning. In transfer learning, machine learning models trained for a specific task are reused for another, similar task. Transfer learning can be used on top of unsupervised and supervised learning. Models used in transfer learning are often trained on a large publicly available corpus and can be unsupervised or supervised.[9]

2.3 Random Forest

A random forest is a supervised learning algorithm consisting of multiple decision trees. A decision tree is a list of yes/no questions that lead to a decision, as illustrated by the separate trees in figure 1. All decisions in a tree are created by training the model with labeled data. In a random forest, multiple decision trees are created that each produce their own outcome. Each decision tree in the forest considers a random subset of features when forming questions and only has access to a random subset of the training data points. This aims to increase diversity and prevent over-fitting. The outcome of each tree is gathered and the final class decision is made by majority vote. The result of a random forest classification is the final class together with the voting results.

Figure 1: A random forest classifier visualized [10]
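A minimal sketch of a random forest classifier using scikit-learn is given below; the feature matrix and labels are invented toy data, whereas in practice each row would be a feature vector derived from a conclusion report.

```python
# Toy random forest example with scikit-learn; data and sizes are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 10)          # 200 toy samples with 10 features each
y = rng.randint(0, 2, 200)     # toy binary labels

forest = RandomForestClassifier(
    n_estimators=100,        # number of decision trees in the forest
    max_features="sqrt",     # random subset of features considered at each split
    bootstrap=True,          # each tree sees a random sample of the training data
    random_state=0,
)
forest.fit(X, y)

# predict() returns the majority vote, predict_proba() exposes the vote shares
print(forest.predict(X[:3]))
print(forest.predict_proba(X[:3]))
```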

Advantages and disadvantages

Random forests are widely used to reduce over-fitting, something that can occur when there is not enough diversity in the training samples or not enough training samples.[11] Another appealing aspect of a random forest, as described by Breiman[12], is that it does not require as much fine-tuning of parameters; tuning the default parameters often only leads to small performance gains. A disadvantage of a random forest compared to other machine learning algorithms is that it is not considered state of the art. Another disadvantage is the need for feature engineering when performance is lacking. Features are attributes shared by individual pieces of the data; feature engineering is the creation of features using domain knowledge. Feature engineering is seen as time consuming and thus expensive.

2.4 Support Vector Machines

A Support Vector Machine (SVM) is a supervised learning technique extensively used in text classification, image classification and bioinformatics. In a linear SVM, the problem space must be linearly separable. In a simple 2D feature space with linearly separable data, a support vector machine classifies new data depending on its position relative to the separating line, as shown in figure 2.

Figure 2: Simple 2d support vector machines by Chen[13]

The hyperplane has one dimension less than the dimensionality of the data. Therefore the hyperplane looks like a line in a 2D space, a plane in a 3D space, and so on. The support vectors are the data points in the feature space that lie on the boundary, and based on their position a hyperplane is drawn. The hyperplane is used to classify other data points. SVMs can also be used for non-linear problems by changing the kernel to a polynomial or radial basis function. The difference between the three kernels is shown in figure 2.
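A minimal sketch of the kernel choice with scikit-learn, on an invented toy dataset that is not linearly separable:

```python
# Compare SVM kernels on a toy non-linear problem; data is synthetic.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale")
    clf.fit(X, y)
    # training accuracy only, to show the effect of the kernel choice
    print(kernel, round(clf.score(X, y), 3))
```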

Advantages and disadvantages

The main benefit of an SVM is that it offers a range of kernels to fit linear and non-linear classification problems. This also means that creating an SVM model requires selecting the correct kernel and features, combined with parameter tuning for each viable option. SVMs also take longer to train than other machine learning techniques, which can cause issues when training on large sets of pathology reports. SVMs are also less effective when classes overlap or are very similar, something that is present in the PALGA thesaurus.

2.5 Neural networks

Artificial neural networks are a framework in which many different machine learning algorithms work together to process complex data inputs. An artificial neural network consists of neurons that make decisions depending on what their layer is meant to do. Figure 3 shows a simple artificial neural network with hidden layers, an input layer consisting of the words in a conclusion report, and an output layer with the corresponding tag. The hidden layers contain weights and gradients that establish the optimal path for a given input. Depending on the input, the optimal path points towards an output neuron; in autocoding, the output layer contains a neuron for every possible PALGA code, and in NER it contains a neuron for every possible entity. The main types of neural networks currently used in clinical language processing, convolutional neural networks (CNN) and recurrent neural networks (RNN), are discussed below.

Figure 3: A simple representation of an Artificial Neural Network

2.5.1 Convolutional neural networks

A CNN is a neural network that uses mathematical convolutions to compute the output. A convolution is a specialized kind of linear operation, and one or more convolutions form a convolution layer. A convolutional layer consists of filters and feature maps. CNNs are often used for image classification with great success, but are also used for text classification, as described by Kim[14]. A basic CNN model for sentence classification can be found in figure 4.

In the neural network seen in figure 3, every output unit interacts with every input unit. Convolutional networks typically have sparse interactions, meaning not all output units interact with every input unit. Another common layer in a CNN is the pooling layer; figure 4 uses max pooling. A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. The max pooling operation reports the maximum output within a rectangular neighborhood.[15]


Figure 4: A simple representation of a Convolutional Neural Network by Kim Y.[14]

Advantages and disadvantages

Sparse interactions reduce the memory requirements of the model and improve its statistical efficiency. They also mean that computing the output requires fewer operations. For example, when processing pathology reports, the input report can have 80 words, but meaningful features such as specific word combinations can be detected with kernels that span three to five words. CNNs require large amounts of data to reach good performance and, as explained by Goodfellow et al.[15], are prone to overfitting when data sets are too small or when classes do not have enough support.
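A minimal Keras sketch in the spirit of the text CNN described above; the vocabulary size, the 80-word sequence length and the number of output classes are illustrative placeholders, not the settings used in this thesis.

```python
# Toy 1D CNN for text classification; all sizes are illustrative.
from tensorflow.keras import layers, models

vocab_size, seq_len, n_classes = 10000, 80, 5

model = models.Sequential([
    layers.Input(shape=(seq_len,), dtype="int32"),                 # a report as word indices
    layers.Embedding(vocab_size, 100),                             # word vectors
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),  # filters over 5-word windows
    layers.GlobalMaxPooling1D(),                                   # max pooling over the whole report
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```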

2.5.2 Recurrent neural networks

RNNs are sequential architectures able to process arbitrary sequences of inputs, i.e. they have been designed to recognize patterns in sequences of data, such as text, handwriting, the spoken word, etc. Long short-term memory (LSTM) was introduced in 1997 by Hochreiter and Schmidhuber as a promising recurrent architecture. LSTM cells have the capability of "remembering" values over arbitrary time intervals, and therefore they are appropriate for processing and predicting time series given sequences of labels of unknown size. An LSTM can read the sequence forwards or backwards; combining both directions creates an even stronger bidirectional LSTM (BLSTM). An example of a simple BLSTM network is shown in figure 5. In figure 5 the sentence "EU rejects German call" is split into words and fed into a BLSTM. The layers consist of a forward and a backward LSTM layer with weights and gradients that decide the output for each word. "EU" and "German" are given a tag, whereas "rejects" and "call" are not, by tagging them as "O".

Advantages and disadvantages

The main benefit is the ability to include long-range context, because the LSTM is able to bridge long time lags between input and output. In pathology reports, context is very important for classification, with negations and speculations being present.


Figure 5: A bidirectional recurrent neural network by Huang et al.[16]

2.6 Embeddings

Some machine learning algorithms are incapable of working directly with strings and plain text. Word and character embeddings provide a way to represent the words found in the conclusion reports as numbers. With a training set, word and character embeddings can be trained using neural networks. The resulting embedding vectors are representations of categories in which similar categories lie closer to one another. A simple example of a word embedding can be found in figure 6. Embeddings can be pre-trained on a large corpus and then incorporated in a model. Another method to implement embeddings is to add an embedding layer that is not pre-trained on a large external corpus but trained on the actual training data used in the model.
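A minimal sketch of an embedding layer in Keras that is trained together with the rest of a model rather than pre-trained; the vocabulary size, vector dimension and word indices are illustrative.

```python
# Toy embedding layer: maps integer word indices to dense vectors.
import numpy as np
from tensorflow.keras import layers, models

vocab_size, embedding_dim = 5000, 50   # illustrative sizes

# The embedding weights are learned together with the rest of the model.
embedder = models.Sequential([layers.Embedding(vocab_size, embedding_dim)])

word_indices = np.array([[4, 17, 256, 0]])   # one toy report as word indices
vectors = embedder(word_indices)
print(vectors.shape)   # (1, 4, 50): one 50-dimensional vector per word
```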


Figure 6: Some example word vectors mapped to two dimensions, by Suriyadeepan Ram[17].

Advantages and disadvantages

Embeddings allow the use of words and tags in models that cannot process strings of text, such as support vector machines or recurrent neural networks. Relationships between words can be inferred from the word and character vectors, giving a word or character more information to process and compare. One of the biggest advantages of embeddings is that they can be incorporated as a layer in convolutional or recurrent neural networks, which means the embedding layer is trained along with the rest of the model on the training set. A disadvantage is that issues with the embeddings are hard to resolve, since the relations are inferred from the word and character vectors and there are too many relations to check manually.

2.7 Conditional Random Fields

Proposed by Lafferty et al.[18] in 2001, conditional random fields (CRFs) are described as a framework for building probabilistic models to segment and label sequence data. Simply put, CRFs are a discriminative model used for predicting sequences, able to predict labels by taking context information into account. As a discriminative model, a CRF learns boundaries to classify different words.
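A minimal sketch of a CRF sequence labeller using the sklearn-crfsuite package (one possible CRF implementation, not necessarily the one used in this thesis); the sentences, features and tags are invented toy examples.

```python
# Toy CRF sequence labelling with sklearn-crfsuite; data and tags are invented.
import sklearn_crfsuite

def word_features(sentence, i):
    """Very simple per-word features: the word itself and its direct neighbours."""
    return {
        "word": sentence[i].lower(),
        "prev": sentence[i - 1].lower() if i > 0 else "<s>",
        "next": sentence[i + 1].lower() if i < len(sentence) - 1 else "</s>",
    }

sentences = [["coloncarcinoom", "in", "biopt"], ["geen", "dysplasie"]]
tags = [["M", "O", "P"], ["O", "M"]]   # hypothetical entity tags

X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, tags)
print(crf.predict(X))
```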

Advantages and disadvantages

Conditional random fields can be used alone or as a layer in convolutional and recurrent neural networks. CRF models aim to increase the use of context information. Furthermore, CRFs overcome the label bias often seen in a similar technique called maximum-entropy Markov models.

3 Methods

Machine learning in other research

To compare the performance of machine learning reported in other research, articles published on PubMed involving machine learning and natural language processing were retrieved using the following search query.

("Machine Learning"[Mesh]) AND

"Natural Language Processing"[Mesh] AND "loattrfull text"[sb](full text available)

The results were screened by checking the title and abstract for machine learning techniques used on clinical documents. Examples of clinical documents are clinical notes in an electronic health record or radiology reports. Articles applying machine learning to medical research articles or other non-medical documents were excluded. Articles that did not report an F1-score to show the performance were also excluded. The remaining articles were assessed for eligibility by reading the full article using the same exclusion criteria. For each resulting article the best performing model was selected and the F1-scores of those models were compared.

Data

For this thesis, two data sets used for training were provided by PALGA. A validation set was provided by pathologist G. Burger. All data sets were annotated with codes from the PALGA thesaurus [19]. The PALGA thesaurus consists of 15,168 unique terms and 6,213 unique codes. Because multiple terms can be used to describe the same thing, some codes are connected to more than one term. When multiple terms have the same code, some of them are marked as V terms. V terms are the preferred terms, whereas non-V terms are synonyms. The terms also have SNOMED CT codes, providing a way to use the annotations across all languages supported by the SNOMED CT terminology. This thesis focuses on PALGA codes as they are already available in the data sets. PALGA codes consist of a letter followed by 5 digits. The letter indicates what type of term is coded; the six categories are Topography (T), Procedure (P), Morphology (M), Diseases (D), Etiology (E) and Functional disorders (F). The first 4 digits refer to the hierarchy of the term. For instance, M40000 stands for inflammation and M41000 means acute inflammation. The last digit of the PALGA code categorizes the concept: 0: benign, 1: borderline, 2: in situ, 3: malign, 6: metastasis.
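As an illustration of this structure, the sketch below decomposes a PALGA code into the parts described above; the mapping tables simply restate the categories listed in this section, and the behaviour digit may not be meaningful for every code.

```python
# Illustrative decomposition of a PALGA code into its documented components.
CATEGORIES = {"T": "Topography", "P": "Procedure", "M": "Morphology",
              "D": "Diseases", "E": "Etiology", "F": "Functional disorders"}
BEHAVIOUR = {"0": "benign", "1": "borderline", "2": "in situ",
             "3": "malign", "6": "metastasis"}

def parse_palga_code(code):
    """Split a code such as 'M41000' into letter, hierarchy digits and last digit."""
    letter, digits = code[0], code[1:]
    return {
        "category": CATEGORIES.get(letter, "unknown"),
        "hierarchy": digits[:4],                          # e.g. 4100 for acute inflammation
        "behaviour": BEHAVIOUR.get(digits[4], "unspecified"),
    }

print(parse_palga_code("M41000"))
# {'category': 'Morphology', 'hierarchy': '4100', 'behaviour': 'benign'}
```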

Data preparation

All data sets were inspected and prepared for machine learning. In both training sets, samples with a high word count were excluded to prevent excessive padding of smaller conclusion reports. The amount of padding needed for the models was decided by inspecting the number of words in each conclusion report; padding was performed to match the report with the most words. In all training sets, reports that could not be coded by the Vertaalmodule were also excluded. A single conclusion report can include multiple lines describing several excisions or polyps, each line with its own subset of codes. These codes were grouped together as a single list of codes, the result for the entire conclusion report.
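A minimal sketch of the padding step with the Keras preprocessing utilities; the word-index sequences are invented, and the maximum length of 250 words follows the choice reported in the Results section.

```python
# Pad toy word-index sequences so every report has the same length.
from tensorflow.keras.preprocessing.sequence import pad_sequences

reports_as_indices = [
    [12, 4, 87, 3],             # a short conclusion report as word indices
    [5, 44, 2, 9, 10, 61, 7],   # a longer report
]
padded = pad_sequences(reports_as_indices, maxlen=250, padding="post", value=0)
print(padded.shape)   # (2, 250): every report now has the same length
```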

Weak Supervision

According to Jovanović et al.[5], machine learning often outperforms rule-based systems, but requires a large expert-annotated corpus, which is expensive to develop. A high quality annotated corpus is often called a gold standard and its labels are seen as ground truth. A gold standard for coding pathology reports can be reached by having at least two pathologists code a conclusion report and reach a consensus together. Using data from a gold standard to train machine learning is considered "strong" supervision. Another way to label data is to use weak supervision: imperfect labels, often containing a degree of error. Wang et al.[6] proposed a method of clinical text classification using weak supervision, by labelling a training set with a rule-based system. Even though the rule-based system was used to generate the labels, when compared to a gold standard the model trained with weak supervision outperformed the rule-based system in two out of three test cases. Most importantly, the study shows that training with weak labels can achieve results similar to training on true labels. The labels of the training set are generated by a machine and not by experts. To create a training set for our models, the existing rule-based system was used to generate labels. The rule-based system creates a string of codes from a conclusion report. This string of codes is mapped to the list of words in the conclusion report. The final training set consists of separate words with their tags. The assigned codes are then split into two outputs: PALGA codes and NER tags. NER tags are based on the first letter of the PALGA code. One of the difficulties with this type of data are multi-word terms, tags that span multiple words, and the opposite, multi-tag words. To accurately tokenize multi-word terms, the first word that generates a tag is given a "b-" prefix in front of the letter outcome, the words in the middle get an "i-" prefix and the word at the end gets an "e-" prefix. Multi-tag words get multiple tags; some of them are separated by a "*" sign, but others are combinations without a "*" sign.
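A small sketch of this tagging scheme is given below; the example term and code are invented, and the handling of single-word terms (tagged with just "b-") is an assumption about one possible convention.

```python
# Derive word-level NER tags for a term matched to one PALGA code.
def span_to_ner_tags(words, palga_code):
    entity = palga_code[0]                 # first letter, e.g. 'T', 'P' or 'M'
    if len(words) == 1:
        return ["b-" + entity]             # assumption: single words only get 'b-'
    return (["b-" + entity]
            + ["i-" + entity] * (len(words) - 2)
            + ["e-" + entity])

# Invented example: a two-word term matched to a hypothetical morphology code
print(span_to_ner_tags(["laaggradige", "dysplasie"], "M74006"))
# ['b-M', 'e-M']
```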

Figure 7 shows how the Vertaalmodule was used to label the training data.

Figure 7: Weak supervision to label training data


Model selection

After comparing the machine learning techniques described in the background section, a classic machine learning model and a neural network were selected. As the classic machine learning model, a CRF was selected because CRFs are designed for sequential data, allowing the use of context information. Because sequences are very important in conclusion reports, a recurrent neural network suited for sequences, a BLSTM, was selected. Embeddings and a CRF were used as layers in the BLSTM model. In total four models were created:

- A BLSTM with word embedding

- A BLSTM with word and character embedding
- A BLSTM with word embedding and a CRF layer
- A CRF model

Details of the created models, including layer connections and shapes, can be found in the appendix.
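As an indication of what the first model in the list looks like in Keras, a minimal sketch is given below; the vocabulary size, number of tags, sequence length and layer sizes are illustrative placeholders, and the exact shapes of the developed models are in the appendix.

```python
# Toy BLSTM sequence tagger with a trainable word embedding; sizes are illustrative.
from tensorflow.keras import layers, models

vocab_size, n_tags, seq_len = 20000, 300, 250

model = models.Sequential([
    layers.Input(shape=(seq_len,), dtype="int32"),                   # padded word indices
    layers.Embedding(vocab_size, 100, mask_zero=True),               # trainable word embedding
    layers.Bidirectional(layers.LSTM(100, return_sequences=True)),   # one output per word
    layers.TimeDistributed(layers.Dense(n_tags, activation="softmax")),  # a tag per word
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```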

Performance measure

The performance is measured by letting the created models select labels for conclusion reports that already have labels. Every correctly selected label is counted as a true positive (TP), every missed label as a false negative (FN), and every predicted label that was not in the true labels as a false positive (FP). We compared performance with our primary outcome measure, the F1-score. The F1-score is calculated from the precision and recall. The formulas for precision, recall and F1-score are:

precision = TP / (TP + FP)    (1)

recall = TP / (TP + FN)    (2)

F1-score = 2 · (precision · recall) / (precision + recall)    (3)

F1-scores are reported as averages, calculated from the TP, FP and FN of all codes or named entities combined. Average F1-scores are reported as micro and macro F1-scores. The micro average is calculated globally by counting the total true positives, false negatives and false positives. The macro average calculates the metric for each label and takes their unweighted mean; this does not take code or entity imbalance into account. The F1-scores are also measured for the individual codes and named entities. As an example, the F1-score of all named entities combined can be 0.8, while P terms have an F1-score of 0.9 and T terms an F1-score of 0.7. Calculating the F1-score for every code or entity is important to detect changes in rare codes or entities.
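A small sketch of the difference between micro and macro averaging, computed with scikit-learn on invented labels:

```python
# Micro averaging pools all TP/FP/FN counts; macro averaging takes the
# unweighted mean of the per-class F1-scores. Labels are invented.
from sklearn.metrics import f1_score

y_true = ["T", "T", "T", "T", "M", "M", "P", "O", "O", "O"]
y_pred = ["T", "T", "T", "O", "M", "P", "P", "O", "O", "O"]

print(f1_score(y_true, y_pred, average="micro"))   # dominated by frequent classes
print(f1_score(y_true, y_pred, average="macro"))   # every class counts equally
print(f1_score(y_true, y_pred, average=None, labels=["T", "M", "P", "O"]))  # per class
```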

Baseline

The baseline performance is the performance of the Vertaalmodule. To measure the performance of the Vertaalmodule, we compared the outcomes of autocoding by the Vertaalmodule to the coding of the two individual pathologists in the validation set.

Autocoding

Two of the developed models were trained on the colorectal training set: the BLSTM with word and character embedding and the BLSTM with word embedding. Performance of autocoding in the colorectal models was measured by leaving out 10 percent of the samples. To measure the effect of sample size, the colorectal models were trained and tested with 5,000, 10,000, 15,000 and the full set of conclusion reports. Both models were also trained on the complete mixed training set. For the mixed training set, performance of autocoding was measured by leaving out 10 percent of the training samples and by annotating the conclusion reports in the validation set. In the validation set, F1-scores were calculated by comparing the codes generated by machine learning to the codes of the individual pathologists. A representation of the split between test, training and validation data can be found in figure 8.

Figure 8: The method used to split train and test data, resulting in a test set and a training set

Named entity recognition

In the pathology reports, named entities are coded with PALGA codes. The first letter of the PALGA code can be seen as a named entity tag. In this thesis, no gold standard for NER in conclusion reports is available. Therefore performance can only be tested by leaving out training samples, as shown in figure 8. All four models were trained on 5,000, 10,000, 15,000 and the full set of colorectal reports. All four models were also trained on the full set of mixed reports. The testing procedure for each model can be found in figure 9.


Materials

To pre-process the data, create the models and calculate the performance, Python 3 was used with Keras and TensorFlow. The rule-based system used to annotate the training set is the "Vertaalmodule" created by G. Burger.

4 Results

4.1 Machine learning in other research

The performance of 24 articles was compared. A PRISMA flow diagram of the search can be found in figure 10.


Compared articles

The resulting articles were compared on data type and machine learning technique. For each article, the best performing technique is shown in table 1. The best performing techniques found in the literature search are bag of words, word embedding, decision tree, CRF, RNN, CNN and SVM models. All of the data is medical data; only one of the articles covers pathology reports, two articles annotate Chinese texts, and data from the i2b2 NLP challenges[21] is used in multiple articles. After comparing all the articles on data and machine learning type, we compared the results of the articles in figure 11.


Table 1: Articles compared on data and machine learning techniques

Title: Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence.[22]
Data: Pediatric patient visits split into training (n=3,564) and validation (n=2,619), labeled with International Classification of Disease (ICD)-10 coding.
Technique: LSTM model using word embeddings.

Title: Improving Clinical Named-Entity Recognition with Transfer Learning.[9]
Data: Clinical notes from the MIMIC II dataset split into training (n=199) and test (n=99), labeled with named entities.
Technique: Bi-directional LSTM with three transfer learning approaches.

Title: Comparative Analysis of Algorithmic Approaches for Auto-Coding with ICD-10-AM and ACHI.[23]
Data: 190 anonymised clinical records associated with respiratory and gastrointestinal diseases and interventions (D190); an additional 45 similar clinical records were created by mixing and matching certain diagnoses and interventions.
Technique: A decision tree.

Title: MfeCNN: Mixture Feature Embedding Convolutional Neural Network for Data Mapping.[24]
Data: 25,000 HL7 documents, each of which included only 1 HL7 message, with ICD-10-AM codes. Training (50%), validation (20%), and testing (30%).
Technique: Multimodal learning and multiview embedding into a CNN.

Title: A machine learning based approach to identify protected health information in Chinese clinical text.[25]
Data: 4,719 discharge summaries from regional health centers in Ya'an City, Sichuan province, China, with all privacy related information labeled.
Technique: A CRF model.

Title: Clinical Named Entity Recognition Using Deep Learning Models.[26]
Data: i2b2 2010 clinical concept extraction corpus.[21]
Technique: A Recurrent Neural Network (RNN).

Title: Intelligent Word Embeddings of Free-Text Radiology Reports.[27]
Data: 10,000 radiology reports associated with computed tomography (CT) head imaging studies, labeled with a diagnosis of intracranial hemorrhage and trained on 1,188 reports.

Title: Prediction of cause of death from forensic autopsy reports using text classification techniques: A comparative study.[28]
Data: 400 autopsy reports involving eight causes of death (S06, T07, X99, Y00, T71, X80, I24, and I25) and four MoDs, collected and annotated with ICD-10.
Technique: Support vector machine classifier.

Title: Natural Language-based Machine Learning Models for the Annotation of Clinical Radiology Reports.[29]
Data: Head CT reports (n=1,004), of which 60% training and 40% validation, tagged with critical findings.
Technique: Bag of words with unigrams, bigrams, and trigrams plus an average word embeddings vector.

Title: Machine learning to parse breast pathology reports in Chinese.[30]
Data: 2,104 de-identified Chinese benign and malignant breast pathology reports; 1,216 cases were used as a training set for the algorithm.
Technique: Boosted classifier for binary entities and CRF for numerical entities.

Title: General Symptom Extraction from VA Electronic Medical Notes.[31]
Data: Clinical notes extracted from 948 patient records, used to create 16,353 training rows with labeled general symptoms.
Technique: SVM coupled with stochastic gradient descent.

Title: Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach.[32]
Data: Two data sets: clinical notes from the Integrating Data for Analysis, Anonymization, and Sharing (iDASH) data repository (n=431) and Massachusetts General Hospital (MGH) (n=91,237).
Technique: CNN and RNN with neural word embeddings.

Title: Deep Learning to Classify Radiology Free-Text Reports.[33]
Data: 2,500 annotated thoracic computed tomography (CT) reports with pulmonary embolism (PE) findings.
Technique: CNN model with an unsupervised learning algorithm for obtaining vector representations of words.

Title: Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes.[34]
Data: 103,390 discharge notes covering patients hospitalized from June 1, 2015 to January 31, 2017, labeled with ICD-10-CM codes.
Technique: Word embedding combined with a CNN.

Title: Entity recognition from clinical texts via recurrent neural network.[35]
Data: Experiments conducted on the corpora of the 2010, 2012 and 2014 i2b2 NLP challenges.[21]
Technique: LSTM-RNN.

Title: Prescription extraction using CRFs and word embeddings.[36]
Data: Data from the 2009 i2b2 Challenge.[21]
Technique: CRF and word embeddings.

Title: Automated Classification of Selected Data Elements from Free-text Diagnostic Reports for Clinical Research.[37]
Data: 737 different diagnosis paragraphs with a total of 865 coded diagnoses.
Technique: SVM.

Title: Information extraction from multi-institutional radiology reports.[38]
Data: 150 radiology reports with five named entities: anatomy, anatomy modifier, observation, observation modifier, and uncertainty.

Title: Automatic ICD-10 classification of cancers from free-text death certificates.[39]
Data: 447,336 death certificates with detailed features, including terms, n-grams and SNOMED CT concepts.
Technique: Support Vector Machine classifiers (one classifier for each cancer type).

Title: Recognizing Disjoint Clinical Concepts in Clinical Text Using Machine Learning-based Methods.[40]
Data: Training set of 199 clinical notes with consecutive and disjoint clinical concepts.
Technique: Structured support vector machine.

Title: Classification of Medication Status Change in Clinical Narratives.[41]
Data: 160 Mayo clinical notes with annotated medication status.
Technique: SVM.

Title: Discovering body site and severity modifiers in clinical texts.[42]
Data: 140 notes from the SHARP corpus and 80 MIMIC ICU notes and discharge summaries with annotated body site modifiers.
Technique: SVM.

Title: Recognition of medication information from discharge summaries using ensembles of classifiers.[43]
Data: 268 manually annotated discharge summaries from the i2b2 challenge.[21]
Technique: Local CRF-based voting method.

Title: A clinical text classification paradigm using weak supervision and deep representation.[6]
Data: 498 radiology reports annotated with femur (hip) fractures.


Data

The first training dataset consists of 25,000 colorectal excerpts. The second training dataset consists of 25,000 mixed excerpts of skin biopsies, skin excisions, colon carcinoma, prostate, and mamma carcinoma. The test set consists of 1,000 mixed conclusion reports, similar to the second training set. After cleaning up the training sets by removing conclusion reports that could not be coded by the Vertaalmodule, the colorectal training set has 23,471 conclusion reports and the mixed training set has 22,442 conclusion reports. The training sets were split into test and training data; see figure 9 for the number of reports in each set. Histograms of the number of words and tags in the included conclusion reports can be found in figures 12 and 13. As seen in both histograms, most of the reports are 250 words or shorter. Therefore all models that require padding are padded to 250 words. The colorectal dataset contains 2,582 unique classes for autocoding and 261 unique NER classes. The mixed dataset contains 4,170 unique classes for autocoding and 319 unique NER classes.

Figure 12: Histogram displaying the amount of words per sample for the mixed dataset


4.2 Performance in autocoding

Colorectal reports

The average F1-scores over all codes in the left-out part of the colorectal training set can be found in figures 14 and 15. F1-scores for each of the individual labels can be found in figures 16 and 17.

Table 2: Highest average autocoding performance of the developed models trained on the complete training set

BLSTM with word embedding: micro-average F1-score 0.80, macro-average F1-score 0.77
BLSTM with word and character embedding: micro-average F1-score 0.79, macro-average F1-score 0.76

Figure 14: Average autocoding F1-scores at multiple sample sizes - BLSTM with word and character embedding


Figure 16: Performance of BLSTM with word and character embedding tested on the left-out colon test set.


Mixed reports

The baseline performance of the Vertaalmodule, together with the performance of the two developed models, can be found in table 3. The average autocoding F1-scores of the models tested on the left-out set of mixed reports can be found in table 4. The F1-score for every code can be found in figure 18.

Table 3: Autocoding performance of the developed models compared to pathologists

Vertaalmodule: F1-score 0.67 (pathologist 1), 0.65 (pathologist 2)
BLSTM with word embedding: F1-score 0.62 (pathologist 1), 0.58 (pathologist 2)
BLSTM with word and character embedding: F1-score 0.63 (pathologist 1), 0.60 (pathologist 2)

Table 4: Autocoding performance of the developed models in the left-out test set

BLSTM with word embedding: micro F1-score 0.71, macro F1-score 0.67
BLSTM with word and character embedding: micro F1-score 0.74, macro F1-score 0.78


4.3 Performance in named entity recognition

Colorectal reports

The average performance of named entity recognition on colorectal reports for different training sample sizes can be found in figure 19; the highest average F1-score of each model can be found in table 5.

Table 5: NER performance of the developed models in the left-out test set

BLSTM with word embedding: micro F1-score 0.89, macro F1-score 0.89
BLSTM with word and character embedding: micro F1-score 0.89, macro F1-score 0.89
BLSTM with word embedding and CRF: micro F1-score 0.87, macro F1-score 0.87
CRF: micro F1-score 0.89, macro F1-score 0.47


Figure 20: Performance of NER for BLSTM with word and character embedding tested on the left-out colon test set.


Figure 21: Performance of NER for BLSTM with word embedding tested on the left-out colon test set.


Figure 22: Performance of NER for BLSTM with word embedding and CRF tested on the left-out colon test set.


Mixed reports

The average NER F1-scores of the models tested on the left-out set of mixed reports can be found in table 6. The F1-score for every tag can be found in figure 24.

Table 6: NER performance of the developed models in the left-out test set

BLSTM with word embedding: micro F1-score 0.86, macro F1-score 0.85
BLSTM with word and character embedding: micro F1-score 0.87, macro F1-score 0.86
BLSTM with word embedding and CRF: micro F1-score 0.83, macro F1-score 0.82
CRF: micro F1-score 0.87, macro F1-score 0.54

5 Discussion and conclusion

Based on our literature search, F1-scores of 0.8 or above are considered good. In the literature, F1-scores of 0.8 or higher have been reached with small training sets [6][38][31] and with large training sets [32][34]. All four of the models developed for this thesis also achieved average F1-scores of 0.80 or above on NER. Only the model with the highest F1-score, a BLSTM with word and character embedding, reaches an F1-score of 0.79 on autocoding. The PALGA data sets can be considered somewhere between average and large. Koopman et al.[39] show that even a large training set can produce both high and low F1-scores; the main reason for the low F1-scores with a large training set is the small support for rare terms. In conclusion reports, the overall performance of NER is higher than the performance of autocoding for all models, most likely due to the much smaller number of unique classes in NER. Where multiple techniques are compared, CNN and LSTM often outperform classic machine learning such as SVM, random forest and k-nearest neighbors[6][22][9][24][26][35]. However, as seen in figure 11, the F1-scores of all compared machine learning techniques are comparable. The highest scoring is Du L. et al.[25] with a near perfect F1-score of 0.98 using a CRF model.

Varying the training sample size for pathology reports from 5,000 to 21,123 shows that the number of reports used in training is directly correlated with performance. For colorectal reports, the performance of autocoding is best when the full dataset is used for training. The performance of NER increases only slightly as the training sample size grows, with F1-scores of 0.87 to 0.89 for 10,000 and 21,123 training samples for the model with the highest F1-score. This is because the F1-score in NER is determined by the main classes, which have much higher support in both the training and test set. As can be seen in the individual performance, increasing the training sample size does increase the F1-score of some individual tags, but the impact on the average is very low due to the small support of these tags.

On average, the performance of machine learning on pathology reports is considered good, since the F1-score is greater than or almost equal to 0.8. The performance on individual codes and NER tags shows that even though the average performance is good, the performance for some codes and tags is not. Some codes occur only once in the test set and zero to a few times in the training set. These rare codes do not provide enough training examples for machine learning to learn how to annotate them accurately.

Codes generated by the Vertaalmodule achieve F1-scores of 0.67 and 0.65 compared to the reports coded by the two pathologists. The performance of machine learning is comparable to the Vertaalmodule: even though none of the machine learning models achieves a higher F1-score, the best performing model is very close, with F1-scores of 0.63 and 0.60 respectively. This shows that the performance of the model is highly dependent on the quality of the training data; even though there are differences between the codes generated by the Vertaalmodule and by the machine learning models, the average performance does not improve.

Strengths and limitations

Generating weak labels with the Vertaalmodule makes it easy to create large labeled training sets. These large training sets can be used to evaluate whether the data is suitable for machine learning, and provide a clear indication of performance at varying sample sizes. The created training set can easily be improved manually, and knowing the minimum required sample size can prevent unnecessary labour. The limitation of using weak labels generated by the Vertaalmodule is the low performance compared to the coding of a pathologist. One factor that can cause the low performance is that the validation set is not perfect, as it contains the codes of two pathologists, which is not a gold standard. Another limitation, found in all the used machine learning and especially in autocoding, is that the occurrence of some codes and NER tags is too low to accurately train a model.

Grouping codes together using NER tags has not been done before, and these tags can be added to existing data. By grouping codes together, fewer training samples are needed to achieve a high F1-score. The current NER tags do not add value to the data, as they are based on existing codes that provide more detailed information. However, new NER classes can be created for specific purposes, for instance "rare cancer type" or "unknown tissue". These new NER classes can add a lot of value for specific use cases.

Another limitation encountered in this thesis is the computational requirements of CRF models. The more codes or NER tags there are, the higher the computational requirements become. As a result, we only tested the CRF models on NER tasks, since for the number of unique codes training CRF models would take over 56 hours and more memory than was available for this thesis. Training time seems to be a strength of the models that only use embeddings: training the slowest embedding model took approximately 4 to 5 hours, considerably shorter than the CRF models.

Implications and Future research

In NER, the performance of the CRF models was lower than the performance of a BLSTM with word and character embedding. Weak supervision reduces the time it takes to annotate large sets of data, but this data causes the CRF models to require a lot of time to train. Therefore a BLSTM with word and character embedding is recommended. In future research, autocoding with machine learning can be improved in several ways. The largest performance gain can be realized by improving the quality of the annotated training set. The quality of the annotation can be increased by manually improving the weak annotation. Another way autocoding can be improved is by hyperparameter tuning of the best performing model, as no hyperparameter tuning was performed in this thesis. Tuning the parameters of a model to achieve the best performance will increase performance when comparing to the Vertaalmodule. However, it will likely not increase performance when comparing to pathologists, since the quality of the training data does not improve. To accurately measure the performance, a dataset with a ground truth needs to be created. The current validation set can be used to create this dataset if two pathologists reach a consensus on the true codes of the reports.

NER tags can be used to annotate the data with more than the standard PALGA codes. Because the performance on NER tasks is high, adding more NER tags could increase the value of the data. The use of new NER tags needs to be explored in future research and requires input from the medical field. Future research into new NER tags can use the existing models and expand them by training on a new training set. New NER tags can include things that go beyond the scope of the PALGA codes.

Conclusion

Machine learning can be used to annotate pathology reports using the BLSTM model with word and character embeddings. It is also possible to use labels generated by the Vertaalmodule to train machine learning models, but the quality of the weak labels needs to be improved manually before the performance of the trained models can surpass the Vertaalmodule.

6 References

[1] PALGA - The nationwide network and registry of histo- and cytopathology in the Netherlands. PALGA about. Last accessed June 2019. URL: https://www.palga.nl.

[2] Wilkinson M.D. et al. "The FAIR Guiding Principles for scientific data management and stewardship". In: Scientific Data (2016).

[3] Liu B. et al. "Text Classification by Labeling Words". In: Proceedings of the Nineteenth National Conference on Artificial Intelligence (2004).

[4] Burger G. et al. "Natural language processing in pathology: a scoping review". In: Journal of Clinical Pathology (2016).

[5] Jovanović J. and Bagheri B. "Semantic annotation in biomedicine: the current landscape". In: J Biomed Semantics (2017).

[6] Wang Y. et al. "A clinical text classification paradigm using weak supervision and deep representation". In: BMC Medical Informatics and Decision Making (2019).

[7] Samuel A.L. "Some Studies in Machine Learning Using the Game of Checkers". In: IBM Journal of Research and Development (1959).

[8] Bishop C.M. Pattern Recognition and Machine Learning. Springer, 2006.

[9] Zhang E. et al. "Improving Clinical Named-Entity Recognition with Transfer Learning". In: Studies in Health Technology and Informatics (2018).

[10] Koehrsen W. Random Forest Simple Explanation. 2017. Last accessed May 2019. URL: https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d.

[11] Hawkins D.M. "The Problem of Overfitting". In: Journal of Chemical Information and Computer Sciences (2004).

[12] Breiman L. "Random Forests". In: Machine Learning (2001).

[13] Chen L. Support Vector Machine - Simply Explained: The simplistic illustration of basic concepts in Support Vector Machine. 2019. Last accessed May 2019. URL: https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496.

[14] Kim Y. "Convolutional Neural Networks for Sentence Classification". In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014).

[15] Goodfellow I., Bengio Y., and Courville A. Deep Learning. MIT Press, 2016. URL: http://www.deeplearningbook.org.

[16] Huang Z., Xu W., and Yu K. "Bidirectional LSTM-CRF Models for Sequence Tagging". In: CoRR (2015).

[17] Suriyadeepan Ram. A word vector image. 2018. Last accessed May 2019. URL: complx.me/img/seq2seq/we1.png.

[18] Lafferty J., McCallum A., and Pereira F. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data". In: Departmental Papers (CIS), Department of Computer and Information Science (2001).

[19] PALGA - The nationwide network and registry of histo- and cytopathology in the Netherlands. PALGA Thesaurus. Last accessed May 2019. URL: https://www.palga.nl/palga-on-line-thesaurus.html.

[22] Liang H. et al. "Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence". In: Nature Medicine (2019).

[23] Kaur R. and Ginige J.A. "Comparative Analysis of Algorithmic Approaches for Auto-Coding with ICD-10-AM and ACHI". In: Studies in Health Technology and Informatics (2018).

[24] Li D. et al. "MfeCNN: Mixture Feature Embedding Convolutional Neural Network for Data Mapping". In: IEEE Transactions on NanoBioscience (2018).

[25] Du L. et al. "A machine learning based approach to identify protected health information in Chinese clinical text". In: International Journal of Medical Informatics (2018).

[26] Wu Y. et al. "Clinical Named Entity Recognition Using Deep Learning Models". In: AMIA Annual Symposium Proceedings (2018).

[27] Banerjee I. et al. "Intelligent Word Embeddings of Free-Text Radiology Reports". In: AMIA Annual Symposium Proceedings (2018).

[28] Mujtaba G. et al. "Prediction of cause of death from forensic autopsy reports using text classification techniques: A comparative study". In: Journal of Forensic and Legal Medicine (2018).

[29] Zech J. et al. "Natural Language-based Machine Learning Models for the Annotation of Clinical Radiology Reports". In: Radiology (2018).

[30] Tang R. et al. "Machine learning to parse breast pathology reports in Chinese". In: Studies in Health Technology and Informatics (2018).

[31] Divita G. et al. "General Symptom Extraction from VA Electronic Medical Notes". In: Breast Cancer Research and Treatment (2017).

[32] Weng W.H. et al. "Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach". In: BMC Medical Informatics and Decision Making (2017).

[33] Chen M.C. et al. "Deep Learning to Classify Radiology Free-Text Reports". In: Radiology (2018).

[34] Lin C. et al. "Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes". In: Journal of Medical Internet Research (2017).

[35] Liu Z. et al. "Entity recognition from clinical texts via recurrent neural network". In: BMC Medical Informatics and Decision Making (2017).

[36] Tao C. et al. "Prescription extraction using CRFs and word embeddings". In: Journal of Biomedical Informatics (2017).

[37] Löpprich M. et al. "Automated Classification of Selected Data Elements from Free-text Diagnostic Reports for Clinical Research". In: Journal of Biomedical Informatics (2016).

[38] Hassanpour S. and Langlotz C.P. "Information extraction from multi-institutional radiology reports". In: Artificial Intelligence in Medicine (2016).

[39] Koopman B. et al. "Automatic ICD-10 classification of cancers from free-text death certificates". In: Artificial Intelligence in Medicine (2016).

[40] Tang B. et al. “Recognizing Disjoint Clinical Concepts in Clinical Text Using Machine Learning-based Methods”. In: AMIA Annual Symposium proceedings. AMIA Symposium (2015).

[41] Sohn S. et al. “Classification of Medication Status Change in Clinical Narratives”. In: AMIA Annual Symposium proceedings. AMIA Symposium (2010).

[42] Dligach D. et al. “Discovering body site and severity modifiers in clinical texts”. In: Journal of the American Medical Informatics Association : JAMIA (2013).

[43] Doan S. et al. “Classification of Medication Status Change in Clinical Narratives”. In: BMC Medical Informatics and Decision Making (2012).

(39)

Appendix: Models and Learning curves

Model summary and learning curves

Model summaries with the different layers, and learning curves with the epochs on the x-axis and accuracy on the y-axis. The blue line represents the accuracy on the training data and the red line the accuracy on the validation data.
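For reference, a learning curve like the ones summarised here can be produced from the Keras training history with a few lines of matplotlib; this is a minimal sketch, and the history key names depend on the Keras version.

# Sketch: plotting a learning curve (training in blue, validation in red) from a
# Keras History object returned by model.fit(..., validation_data=...).
import matplotlib.pyplot as plt


def plot_learning_curve(history):
    # Keras uses "acc"/"val_acc" in older versions and "accuracy"/"val_accuracy" in newer ones.
    key = "accuracy" if "accuracy" in history.history else "acc"
    epochs = range(1, len(history.history[key]) + 1)
    plt.plot(epochs, history.history[key], "b-", label="training accuracy")
    plt.plot(epochs, history.history["val_" + key], "r-", label="validation accuracy")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()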

BLSTM with word embedding - autocoding colon 5.000

Layer (type)                  Output Shape        Param #
=========================================================
input_1 (InputLayer)          (None, 223)         0
embedding_1 (Embedding)       (None, 223, 20)     400160
spatial_dropout1d_1 (Spatial  (None, 223, 20)     0
bidirectional_1 (Bidirection  (None, 223, 100)    28400
time_distributed_1 (TimeDist  (None, 223, 2583)   260883
=========================================================
Total params: 689,443
Trainable params: 689,443
Non-trainable params: 0
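As a check on the parameter counts in the summary above, they can be recomputed from the layer dimensions; the sketch below assumes a vocabulary of 20,008 tokens (400160 / 20) and 50 LSTM units per direction, which follow from the listed shapes.

# Recomputing the parameter counts of the BLSTM summary above (dimensions assumed
# from the listed shapes: vocabulary 20,008, embedding 20, 50 LSTM units, 2,583 codes).
vocab_size, embed_dim = 20008, 20
lstm_units, n_tags = 50, 2583

embedding = vocab_size * embed_dim                          # 400,160
# One LSTM direction: 4 gates of (input_dim + units + 1) * units weights; x2 for bidirectional.
blstm = 2 * 4 * (embed_dim + lstm_units + 1) * lstm_units   # 28,400
dense = (2 * lstm_units + 1) * n_tags                       # 260,883 (per-token softmax)
print(embedding + blstm + dense)                            # 689,443 total parameters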


BLSTM with word embedding - autocoding colon 10.000

Layer (type)                  Output Shape        Param #
=========================================================
input_1 (InputLayer)          (None, 223)         0
embedding_1 (Embedding)       (None, 223, 20)     400160
spatial_dropout1d_1 (Spatial  (None, 223, 20)     0
bidirectional_1 (Bidirection  (None, 223, 100)    28400
time_distributed_1 (TimeDist  (None, 223, 2583)   260883
=========================================================
Total params: 689,443
Trainable params: 689,443
Non-trainable params: 0


BLSTM with word embedding - autocoding colon 15.000

Layer (type)                  Output Shape        Param #
=========================================================
input_1 (InputLayer)          (None, 223)         0
embedding_1 (Embedding)       (None, 223, 20)     400160
spatial_dropout1d_1 (Spatial  (None, 223, 20)     0
bidirectional_1 (Bidirection  (None, 223, 100)    28400
time_distributed_1 (TimeDist  (None, 223, 2583)   260883
=========================================================
Total params: 689,443
Trainable params: 689,443
Non-trainable params: 0


BLSTM with word embedding - autocoding colon full

Layer (type)                  Output Shape        Param #
=========================================================
input_1 (InputLayer)          (None, 223)         0
embedding_1 (Embedding)       (None, 223, 20)     400160
spatial_dropout1d_1 (Spatial  (None, 223, 20)     0
bidirectional_1 (Bidirection  (None, 223, 100)    28400
time_distributed_1 (TimeDist  (None, 223, 2583)   260883
=========================================================
Total params: 689,443
Trainable params: 689,443
Non-trainable params: 0


BLSTM with word embedding - autocoding mixed full set

Layer (type)                  Output Shape        Param #
=========================================================
input_1 (InputLayer)          (None, 223)         0
embedding_1 (Embedding)       (None, 223, 20)     477020
spatial_dropout1d_1 (Spatial  (None, 223, 20)     0
bidirectional_1 (Bidirection  (None, 223, 100)    28400
time_distributed_1 (TimeDist  (None, 223, 4171)   421271
=========================================================
Total params: 926,691
Trainable params: 926,691
Non-trainable params: 0


BLSTM with word embedding - NER colon 5.000

Layer (type)                  Output Shape        Param #
=========================================================
input_1 (InputLayer)          (None, 223)         0
embedding_1 (Embedding)       (None, 223, 20)     400160
spatial_dropout1d_1 (Spatial  (None, 223, 20)     0
bidirectional_1 (Bidirection  (None, 223, 100)    28400
time_distributed_1 (TimeDist  (None, 223, 262)    26462
=========================================================
Total params: 455,022
Trainable params: 455,022
Non-trainable params: 0


BLSTM with word embedding - NER colon 10.000

Layer (type)                  Output Shape        Param #
=========================================================
input_1 (InputLayer)          (None, 223)         0
embedding_1 (Embedding)       (None, 223, 20)     400160
spatial_dropout1d_1 (Spatial  (None, 223, 20)     0
bidirectional_1 (Bidirection  (None, 223, 100)    28400
time_distributed_1 (TimeDist  (None, 223, 262)    26462
=========================================================
Total params: 455,022
Trainable params: 455,022
Non-trainable params: 0


BLSTM with word embedding - NER colon 15.000

Layer (type)                  Output Shape        Param #
=========================================================
input_1 (InputLayer)          (None, 223)         0
embedding_1 (Embedding)       (None, 223, 20)     400160
spatial_dropout1d_1 (Spatial  (None, 223, 20)     0
bidirectional_1 (Bidirection  (None, 223, 100)    28400
time_distributed_1 (TimeDist  (None, 223, 262)    26462
=========================================================
Total params: 455,022
Trainable params: 455,022
Non-trainable params: 0


BLSTM with word embedding - NER colon full set

Layer (type)                  Output Shape        Param #
=========================================================
input_1 (InputLayer)          (None, 223)         0
embedding_1 (Embedding)       (None, 223, 20)     400160
spatial_dropout1d_1 (Spatial  (None, 223, 20)     0
bidirectional_1 (Bidirection  (None, 223, 100)    28400
time_distributed_1 (TimeDist  (None, 223, 262)    26462
=========================================================
Total params: 455,022
Trainable params: 455,022
Non-trainable params: 0


BLSTM with word embedding - NER mixed full set

Layer (type)                  Output Shape        Param #
=========================================================
input_1 (InputLayer)          (None, 223)         0
embedding_1 (Embedding)       (None, 223, 20)     477020
spatial_dropout1d_1 (Spatial  (None, 223, 20)     0
bidirectional_1 (Bidirection  (None, 223, 100)    28400
time_distributed_1 (TimeDist  (None, 223, 320)    32320
=========================================================
Total params: 537,740
Trainable params: 537,740
Non-trainable params: 0
