
Dokter BERTje: Medical Concept Normalization of Dutch Medical Descriptions onto SNOMED CT



Jeannine Lie, BSc.¹

¹ Master Medical Informatics, Amsterdam University Medical Center, location AMC, University of Amsterdam


Master Thesis

Dokter BERTje: Medical Concept Normalization of Dutch Medical Descriptions onto SNOMED CT

Student
J.K.N. Lie (Jeannine), BSc. Email: j.k.lie@amsterdamumc.nl
Student number: 12579076

Supervision
Dr. M.C. Schut (Martijn), Assistant professor (tutor)
Dr. Ir. R. Cornet (Ronald), Associate professor (mentor)
E.S. Klappe (Eva), MSc, PhD student (mentor)

Location
Department of Medical Informatics, Amsterdam UMC location AMC
Meibergdreef 9, 1105 AZ Amsterdam

SRP Duration:


Acknowledgement

I would like to take this time to sincerely thank my supervisors Ronald Cornet, Eva Klappe and Martijn Schut. I challenged myself by choosing a scientific research project in Natural Language Processing, a field about which I knew little to nothing. I am grateful for their patience while I was self-studying, their expert knowledge that boosted my learning, their guidance that kept me on track and their support when I felt insecure.

Furthermore, I would like to thank Miguel for letting me pick his brain during all the brainstorming sessions. I also want to thank Iacopo for setting up the technical equipment for parts of the research. Lastly, I would like to thank my family and friends for their support, unconditional love and interest in my studies.


Table of Contents

Acknowledgement
Summary
Samenvatting
List of Abbreviations
1. Introduction
 1.1. Short description of the rest of the chapters
2. Related work
3. Preliminaries/Background description of key concepts
 3.1. Language model training: Pre-training vs Fine-tuning
 3.2. Transformer
 3.3. Cosine similarity
4. Method
 4.1. Datasets
 4.2. Study setting
  4.2.1. Dataset preprocessing
  4.2.2. Medical concept normalization – Pipeline
 4.3. Data analysis
5. Results
6. Discussion
7. Conclusion
8. References
Appendix

Summary

Introduction

In many Electronic Health Records (EHRs), clinical findings are captured in a problem list in a structured way with the use of a chosen controlled terminology. However, these EHR systems generally allow for free-text entry next to the selected structured clinical finding in case the correct code cannot readily be found. As a consequence, relevant information in EHRs is still largely stored as unstructured free text. Medical Concept Normalization (MCN) aims to map unstructured text to a medical concept in a knowledge base. Recent studies showed promising work on MCN using BERT-based architecture models on English medical text, but no such work has been done in Dutch. We used a Dutch BERT-based model, BERTje, and fine-tuned it on Dutch medical language. We called this fine-tuned model Dokter BERTje. This study aims to determine whether Dokter BERTje can be successfully used as a performant component for Dutch MCN and to determine the usefulness of Dokter BERTje.

Method

In this experimental study, we applied a deep learning approach to MCN as an unsupervised classification task using the cosine similarity method. We used Dokter BERTje to convert Dutch medical texts and SNOMED CT terms into semantic embeddings and used cosine similarity to classify the Dutch medical texts to the closest SNOMED CT terms. To assess the performance of Dokter BERTje, annotated datasets were extracted from an Amsterdam UMC dataset, called the exact-match dataset and the comparable-match dataset. The exact-match dataset contained diagnosis descriptions that were identical to the SNOMED CT terms. The comparable-match dataset contained diagnosis descriptions that matched the SNOMED CT terms on similar meaning (e.g. the diagnosis description "not fusing of fracture" with the SNOMED CT term "pseudoarthrosis"). The performance was assessed on three types of sentence embeddings: the classification token (CLS), the average (AVG) of the word embeddings in a sentence, and the maximum (MAX) of the word embeddings in a sentence. The usefulness was assessed by the percentage of total and unique out-of-vocabulary (OOV) words and the frequency of these OOV words. A cutoff of <10% OOV words was defined as useful. The usefulness was assessed on SNOMED CT, the exact-match dataset, the comparable-match dataset and an extra subset extracted from an Amsterdam UMC dataset containing modified free text, called the free-text dataset.

Results

For the exact-match dataset, the AUROCs were 0.895, 0.895 and 0.896 for the CLS, AVG and MAX sentence embeddings respectively. The F1-scores were 0.857, 0.920 and 0.890 respectively. For the comparable-match dataset, the AUROCs were 0.309 for CLS, 0.296 for AVG and 0.285 for MAX, with F1-scores of 0.500, 0.444 and 0.467 respectively. The percentages of total and unique OOV words in all datasets were higher than 10%, except the total percentage of OOV words in SNOMED CT (9.11%). In SNOMED CT, the exact-match dataset and the comparable-match dataset, the top 20 most frequent OOV words were medical words; in the free-text dataset they were abbreviations.

Discussion and Conclusion

Dokter BERTje can link SNOMED CT terms to diagnosis descriptions that exactly match, with the use of cosine similarity on sentence embeddings. However, it performs poorly in a practical setting where diagnosis descriptions are not identical to SNOMED CT terms. Therefore, Dokter BERTje cannot yet be successfully employed as a performant component for Dutch MCN. The high percentage of OOV words indicates that Dokter BERTje contains limited variation of (sub)words in its vocabulary. Therefore, Dokter BERTje is not yet useful. Dokter BERTje needs to be further pre-trained instead of fine-tuned on medical corpora, as fine-tuning does not expand the model's vocabulary.

Keywords: Natural Language Processing; Medical Concept Normalization; BERTje; cosine similarity; out-of-vocabulary


Samenvatting

Introduction

In most Electronic Health Records (EHRs), clinical findings are captured in a structured way in a problem list with the use of controlled terminology. However, in case the correct code cannot readily be found, these EHRs also allow free text next to the already selected structured clinical finding. As a consequence, relevant information in EHRs is still largely stored as unstructured free text. Medical Concept Normalization (MCN) aims to convert unstructured texts into a medical concept from a terminology system. Recent studies showed promising work on MCN with BERT-based architecture models on English medical texts, but such work has not yet been done in Dutch. We used a Dutch-based BERT model, called BERTje, for fine-tuning on medical Dutch. We called this fine-tuned model Dokter BERTje. The study aims to determine whether Dokter BERTje can be successfully used as a performant component for Dutch MCN and to determine the usefulness of Dokter BERTje.

Method

In this experimental study, we applied a deep learning approach to MCN. We approached MCN as an unsupervised classification task with the use of the cosine similarity method. We used Dokter BERTje to convert Dutch medical texts and SNOMED CT terms into semantic embeddings and used cosine similarity to classify the Dutch medical texts to the most similar SNOMED CT term. To assess the performance of Dokter BERTje, annotated datasets were created from an Amsterdam UMC dataset, called the exact-match dataset and the comparable-match dataset. The exact-match dataset contained diagnosis descriptions that are identical to the SNOMED CT terms. The comparable-match dataset contained diagnosis descriptions that match the SNOMED CT terms on the basis of the same meaning (e.g. the diagnosis description 'niet fuseren van breuk' with the SNOMED CT term 'pseudo artrose'). The performance was assessed on three types of sentence embeddings: the classification token (CLS), the average (AVG) of the word embeddings in a sentence, and the maximum (MAX) of the word embeddings in a sentence. The usefulness was assessed by the percentage of total and unique out-of-vocabulary (OOV) words and the frequency of these OOV words. A cutoff percentage of <10% was defined as useful. The usefulness was assessed on SNOMED CT, the exact-match dataset, the comparable-match dataset and an extra subset extracted from an Amsterdam UMC dataset with modified free-text diagnosis descriptions, called the free-text dataset.

Results

For the exact-match dataset, the AUROCs were 0.895, 0.895 and 0.896 for the CLS, AVG and MAX sentence embeddings respectively. The F1-scores were 0.857, 0.920 and 0.890 respectively for the CLS, AVG and MAX sentence embeddings. For the comparable-match dataset, the AUROCs were 0.309 for CLS, 0.296 for AVG and 0.285 for MAX, with F1-scores of 0.500, 0.444 and 0.467 respectively. The total and unique OOV words in all datasets were higher than 10%, except the total percentage of OOV words in SNOMED CT (9.11%). The top 20 most frequent OOV words in SNOMED CT, the exact-match dataset and the comparable-match dataset were medical words. In the free-text dataset, the 20 most frequent OOV words were abbreviations.

Discussion and conclusion

Dokter BERTje can link SNOMED CT terms to diagnosis descriptions that exactly match, with the use of cosine similarity on sentence embeddings. However, it performs poorly in a practical setting where the diagnosis descriptions are not identical to SNOMED CT terms. Therefore, Dokter BERTje cannot yet be successfully employed as a performant component for Dutch MCN. The high percentage of OOV words indicates that Dokter BERTje has a limited variation of (sub)words in its vocabulary. Therefore, Dokter BERTje is not yet useful. Dokter BERTje needs to be further pre-trained instead of fine-tuned on medical corpora, since fine-tuning does not expand a model's vocabulary.


List of Abbreviations

In this chapter, we present a list of frequently used abbreviations in this thesis.

AUROC Area under the receiver operating characteristic curve
AVG Average of the word embeddings in a sentence
BERT Bidirectional Encoder Representations from Transformers
CLS Classification token
EHR Electronic Health Record
MAX Maximum of the word embeddings in a sentence
MCN Medical Concept Normalization
NLP Natural Language Processing
OOV Out-of-vocabulary
SEP Separator token
SNOMED CT SNOMED Clinical Terms


1. Introduction

As the healthcare industry increasingly embraces the promise of new data-driven approaches, the challenges of organizing and managing patient data become more prominent.(1) Even before the first Electronic Health Record (EHR), establishing a structured, organized and standard representation of patient data was already a struggle in healthcare organizations.(2) In 1968, Lawrence Weed introduced a structured way of organizing patient information, presented per medical problem, called the problem-oriented medical record (POMR).(3) The structure that the POMR provides has proven to support healthcare providers in recording patient notes and swiftly gives them an overview and understanding of a patient's history.(4) The problem list is the core element of the POMR, which presents a list of active and inactive diagnoses relevant to the current care of the patient.(5) Besides the primary use for patient care, the problem list enables the reuse of data for creating patient registries, identifying patient populations for quality improvement activities, or conducting research.(6-8)

In many EHRs, the problem list items are captured in a structured way, with a problem represented as codes chosen from a controlled terminology. This limits the capability of capturing the clinical state directly as intended by the healthcare provider. Even if having codified data is the goal, EHR systems still allow for free-text entry as a backup next to the selected structured problem description in case the correct code cannot readily be found.(9) These free-text entries might change the context of a structured problem description represented by its code. For example, the structured problem description "hepatitis A" with the free-text entry "vaccinated against hepatitis A" changes the context of a patient from having a disease to not having one. As a consequence, relevant information in EHRs is still largely stored as unstructured free text. Structured and codified data are more amenable to data analytics,(10) standardization activities (11) and EHR secondary use,(12) but if the intended context is represented only in unstructured free text, this will lead to false statistics.

Natural Language Processing (NLP) can help solve the problem of information stored in free text by means of information extraction. NLP is a sub-field of linguistics, computer science, information science and artificial intelligence that is concerned with using computers to process human language data and turn it into information usable by computers. Medical Concept Normalization (MCN) is the NLP task that aims to map unstructured text to a medical concept in a knowledge base. An example of such a knowledge base is SNOMED Clinical Terms (SNOMED CT): a structured representation capable of capturing the semantics of summary-level problem descriptions in a computable way. Deep learning-based methods, such as Word2Vec,(13) have been successfully applied to MCN.(14) Word2Vec maps distinct words onto vectors of real numbers called embeddings, where words with similar meaning have embeddings close to each other. Recently, Bidirectional Encoder Representations from Transformers (BERT) introduced masked language models to create word embeddings based on the context of a sentence, and improved NLP tasks such as named entity recognition and sentiment analysis.(15)

Recent studies showed promising work on MCN using BERT-based architecture models trained on medical texts,(16-18) but no such work had been done in Dutch. Based on the BERT architecture, BERTje was developed: a monolingual Dutch version pre-trained on a large and diverse Dutch text corpus.(19) By fine-tuning BERTje on Dutch medical text and using it for linking free-text entries onto SNOMED CT, information from contextually changed descriptions could be extracted from the text, which could eventually lead to updating the problem list. For example, the structured problem description "hepatitis A" with the free-text entry "is vaccinated for hepatitis A" could be updated to the SNOMED CT concept "hepatitis A immune". This increases the amount of structured data that could be reused. As a preliminary study, we conducted an experiment on MCN with the use of context-based word embeddings. For this, we fine-tuned BERTje on unannotated medical texts, yielding a model we called Dokter BERTje. The aim of this study is to determine whether Dokter BERTje can be successfully used as a performant component for Dutch MCN. A secondary aim is to determine the usefulness of Dokter BERTje. Success was defined by F1-scores and usefulness by the percentage of out-of-vocabulary (OOV) words.


1.1. Short description of the rest of the chapters

The rest of the paper is organized as follows: Section 2 presents the work related to this research. Section 3 describes the key concepts used in this research. The research methods, results and discussion are provided in sections 4, 5 and 6 respectively. Finally, section 7 concludes the work with some future improvements.

2. Related work

MCN has a long history in clinical NLP. Traditional methods involved dictionary-based approaches (20, 21) and rule-based approaches.(22, 23) Dictionary-based approaches, such as MetaMap and cTAKES, relied on constructed dictionaries to accomplish the term-to-concept lookup. A drawback of dictionary-based approaches was that slightly different descriptions were not mapped.(24) For example, the medical concept mention "HIV exanthema" failed to be mapped onto the concept "HIV-exanthema", even though only one character differed ("-" versus " " between "HIV" and "exanthema"). Since real-world clinical texts are highly heterogeneous and full of abbreviations, grammar mistakes and clinical scores, building a comprehensive dictionary is not feasible. Rule-based approaches depended on manually designed rules, but were not able to cover all situations of term-to-concept mapping. Furthermore, previous studies that applied machine learning methods to MCN reported limitations due to the sparsity of available annotated clinical datasets.(25)

The first applications of deep learning techniques in MCN used word-level methods along with Convolutional Neural Networks and Recurrent Neural Networks.(26, 27) These methods failed to learn character structure features inside words and ignored OOV words. In order to overcome this problem, recent models were based on the multilayer BERT to extract vector representations of the texts and the medical concepts, and trained a softmax-based classifier using these representations.(16) This deep learning model used character-level models to alleviate OOV issues. Recently, Ji et al. fine-tuned three BERT-based models for the normalization task in biomedical texts.(17) They showed that fine-tuning a pre-trained language model improved MCN. The three models they used were the original BERT, BioBERT and ClinicalBERT. Both BioBERT and ClinicalBERT outperformed the original BERT model, indicating that domain-specific models were more appropriate for MCN. These models were applied to English medical texts. As our work focuses on Dutch medical texts, we used a Dutch-based BERT model (BERTje) and fine-tuned it on domain-specific texts to extract vector representations of our texts and medical concepts.

The previously mentioned approaches framed MCN as a supervised text classification task. Here, the task of assigning text to a class (in this case the medical concept) was trained on annotated datasets where the text and the designated medical concept were already known. The drawback of this is that creating training data requires one to manually map medical concept mentions to entries in a target knowledge base such as SNOMED CT, which is time- and effort-intensive. Instead, Tahmasebi et al. used an unsupervised learning method for the normalization of anatomical phrases in radiology reports.(28) They achieved concept normalization via a naive classifier that determined the label by calculating the cosine similarity between SNOMED CT representative vectors and the mention vector, and selecting the label yielding maximum similarity. In our work, we based our classifier on this cosine similarity method to classify in an unsupervised way. The difference between our work and that of Tahmasebi et al. was that they only extracted 47 anatomical concepts from SNOMED CT, while we used all SNOMED CT concepts for which Dutch SNOMED CT terms were available. Furthermore, instead of using Word2Vec, we used a pre-trained BERTje.


3. Preliminaries/Background description of key concepts

In this section, three key concepts are described that are needed to understand the methodology of our MCN pipeline. These three key concepts are language model training, the Transformer and cosine similarity.

3.1. Language model training: Pre-training vs Fine-tuning

Training a language model consists of two types of training: pre-training and fine-tuning. Pre-training a language model means learning the words, grammar, structure and other linguistic features of a language from a corpus and creating a vocabulary with all the text represented as tokens. The outcome of pre-training is a model that is capable of accurately modelling a language. Pre-training a meaningful language model from scratch is expensive (days on a dozen CPUs). With the ability to download a trained model, called transfer learning, this substantial overhead is avoided. BERTje is such a pre-trained language model for the Dutch language. Pre-trained language models can be used for a variety of tasks, such as named entity recognition, question answering and classification. However, pre-trained language models can be less effective on domain-specific data.(29) Fine-tuning a language model means teaching the model to better represent the domain-specific language features with which the pre-trained model is not familiar. Fine-tuning is much less expensive than pre-training (hours on a single CPU). As BERTje is based on natural Dutch language, we fine-tuned it further on Dutch medical corpora for our research.
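As an illustration, the following is a minimal sketch of such domain-adaptive fine-tuning with the masked-language-modeling objective of the HuggingFace Transformers library (a recent API, not necessarily the Transformers 2.6.0 reported in section 4.2.2). The model identifier GroNLP/bert-base-dutch-cased, the corpus file name, the output path and the batch size are illustrative assumptions; the 24 epochs and the 512-token limit follow section 4.2.2.

```python
# Minimal sketch: fine-tuning BERTje on a Dutch medical corpus with the
# masked-language-modeling (MLM) objective. Model id and file paths are
# illustrative assumptions; 24 epochs and the 512-token limit follow the
# parameters reported in section 4.2.2.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

model_name = "GroNLP/bert-base-dutch-cased"  # BERTje checkpoint on the HF hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One sentence per line, e.g. NHG standards text or SNOMED CT terms.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="medical_corpus_train.txt",
                                block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dokter-bertje",
                           num_train_epochs=24,
                           per_device_train_batch_size=8),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("dokter-bertje")          # the fine-tuned "Dokter BERTje"
tokenizer.save_pretrained("dokter-bertje")   # vocabulary is copied, not expanded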

3.2. Transformer

The Transformer is a model architecture that uses an attention mechanism to boost the speed with which models can be trained.(30) The attention mechanism pays greater attention to certain factors when processing data. This mechanism helped improve the performance of machine translation applications, such as Google Translate, by enabling the model to "remember" all the words in the input and focus on specific words when formulating a response.(31) It works by arranging two sentences in a matrix where the words of one sentence form the columns and the words of the other sentence the rows. By making matches, it identifies context and relevant relationships between words. Attention has since been adapted into self-attention, where the same sentence is put along both the columns and the rows. Here, the relationships between the words of one sentence are identified. Self-attention is the method used by the Transformer to reduce training time significantly, as it makes the computation parallelizable. We used the Transformer architecture to help speed up the fine-tuning of BERTje on the medical domain.
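To make this concrete, below is a minimal sketch of the scaled dot-product self-attention of Vaswani et al.(30) The dimensions and the single attention head are illustrative simplifications (BERT-base uses 12 heads of size 64 over a model dimension of 768).

```python
# Minimal sketch of scaled dot-product self-attention (Vaswani et al., 2017).
# Every token attends to every other token of the same sentence, which is
# what makes the computation parallelizable across positions.
import math
import torch

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """x: (sequence_length, d_model); w_q/w_k/w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v         # queries, keys, values
    scores = q @ k.T / math.sqrt(k.shape[-1])   # token-to-token relevance
    weights = torch.softmax(scores, dim=-1)     # attention distribution per token
    return weights @ v                          # context-aware representations

d_model, d_k, seq_len = 768, 64, 5              # BERT-base uses d_model = 768
x = torch.randn(seq_len, d_model)               # toy token embeddings
w = [torch.randn(d_model, d_k) for _ in range(3)]
print(self_attention(x, *w).shape)              # torch.Size([5, 64])
```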

3.3. Cosine similarity

Cosine similarity is a metric used to measure how similar words, sentences and documents are, irrespective of their size. Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. The vectors are the computer representation of a certain text (a word, a sentence, a document), where semantically similar texts have similar vectors. The similarity can be calculated with the following formula, where A and B are two vectors in a multi-dimensional space:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Here θ is the angle between the vectors A and B measured in radians, A · B is the dot product of A and B, and ‖A‖ and ‖B‖ represent the lengths of the vectors A and B respectively. As cosine similarity is used here in a positive space with vectors consisting of only positive numbers, the maximum angle between two vectors is 90 degrees. Hence the outcome of cosine similarity is a value between 0 and 1, with values closer to 1 meaning the texts are more semantically similar. See Appendix A for further explanation of the mathematics and an example calculation of cosine similarity.
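As a minimal sketch, the example calculation from Appendix A (vectors X = [2, 4, 4] and Y = [3, 2, 6]) can be reproduced in a few lines of Python:

```python
# Cosine similarity of the Appendix A example vectors X = [2, 4, 4] and
# Y = [3, 2, 6]: dot product 38, lengths 6 and 7, similarity 38/42 ≈ 0.905.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine_similarity(np.array([2, 4, 4]), np.array([3, 2, 6])), 3))  # 0.905
```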

4. Method

4.1. Datasets

We used four medical datasets in this study, namely two datasets of healthcare providers' notes (one from 2017 and one from 2019), the Nederlands Huisartsen Genootschap standards (NHG standards) dataset and the SNOMED CT dataset. The healthcare providers' notes 2019, the NHG standards and SNOMED CT were used for fine-tuning BERTje, while the healthcare providers' notes 2017 and SNOMED CT were used for analyzing the performance of the model.

Healthcare providers’ notes 2017: This is an anonymized dataset extracted from the EHR system (Epic) of Amsterdam UMC, including both locations AMC and VUMC. The dataset included structured diagnosis descriptions, their corresponding SNOMED CT codes and the modified free-text diagnosis descriptions by healthcare providers of 37 different specialties of the inpatient clinics from 1-1-2017 to 31-12-2017. Structured diagnosis descriptions come from the diagnosis thesaurus, which is a list of clinically relevant diagnostic terms based on SNOMED CT. This dataset consisted of 290,275 records.

Healthcare providers’ notes 2019: This is an anonymized dataset extracted from the EHR system of Amsterdam UMC. This dataset also included the healthcare providers' comments next to the structured diagnosis descriptions, the corresponding SNOMED CT codes and the modified free-text diagnosis descriptions. The dataset contained records selected on non-empty modified free-text diagnosis descriptions by 39 different specialties of the inpatient clinics from 1-1-2019 to 31-12-2019. This dataset consisted of 8,998 records.

NHG standards: NHG standards are guidelines intended to support medical policy in the daily practice of the general practitioner. The NHG standards dataset contained the text written in these guidelines and consisted of 43,477 records.

SNOMED CT: We used the 2017 Dutch version of SNOMED CT. The 2017 Dutch SNOMED CT dataset consisted of 2,411,303 terms in total, of which 44,107 (1.83%) were Dutch and 2,367,196 (98.17%) were English.

4.2. Study setting

We applied a deep learning approach for mapping concept mentions in Dutch medical text to SNOMED CT terms as an unsupervised classification task. We first converted the Dutch medical texts and SNOMED CT terms into semantic representative vectors using Dokter BERTje. Then, a concept mention in Dutch medical text was classified to a SNOMED CT concept by finding the closest SNOMED CT term vector to the mention's vector with the use of cosine similarity. As this study was performed on Dutch medical text, we only used SNOMED CT concepts for which Dutch SNOMED CT terms were available. As there were no annotated datasets of Dutch medical text available to measure the performance of our approach, we created one using the healthcare providers' notes 2017. Lastly, we also created a free-text dataset from the healthcare providers' notes 2017 to evaluate the usefulness in a practical setting. Table 1 shows the variables used to create the sub-datasets out of the healthcare providers' notes 2017. Figure 1 shows our study design. Section 4.2.1 describes the creation of these sub-datasets.


Table 1 Variables of the healthcare provider’s notes 2017 used in creating subsets for analyzing performance and usefulness.

Figure 1 Study design of this MCN approach. CLS = classification token, AVG = average of the word embeddings in a sentence, MAX = maximum of the word embeddings in a sentence, AUROC = area under the receiver operating characteristic, OOV = out-of-vocabulary.

4.2.1. Dataset preprocessing

SNOMED CT dataset

We preprocessed the SNOMED CT dataset by excluding English terms, duplicated terms with the same SNOMED CT code and duplicated terms with different SNOMED CT codes. This resulted in a SNOMED CT dataset of 43,003 terms. Figure 2 shows the flowchart of excluding SNOMED CT terms.
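A minimal sketch of this filtering step with pandas is given below; the file name and the column names (term, code, language) are assumptions about the release format, and the handling of terms appearing under multiple codes is one plausible reading of the exclusion.

```python
# Sketch of the SNOMED CT term filtering with pandas. The file name and the
# column names ("term", "code", "language") are assumptions, not the actual
# SNOMED CT release format.
import pandas as pd

snomed = pd.read_csv("snomed_ct_2017_nl.csv")             # hypothetical export
snomed = snomed[snomed["language"] == "nl"]               # drop English terms
snomed = snomed.drop_duplicates(subset=["term", "code"])  # same term, same code
# Drop terms appearing under more than one code (one reading of the exclusion).
snomed = snomed[~snomed.duplicated(subset=["term"], keep=False)]
print(len(snomed))  # the thesis reports 43,003 remaining terms
```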

Annotated datasets: Exact-match and comparable-match datasets

From the healthcare providers' notes 2017, we used the diagnosis descriptions to create the annotated datasets. Only the unique diagnosis descriptions were used. The diagnosis descriptions were divided into diagnosis descriptions with a corresponding SNOMED CT code and diagnosis descriptions not linked to a SNOMED CT code. From the diagnosis descriptions with a corresponding SNOMED CT code, records were excluded if:

‐ The corresponding code could not be found in the SNOMED CT dataset
‐ The corresponding SNOMED CT code did not match the SNOMED CT code in the SNOMED CT dataset

From the diagnosis descriptions not linked to a SNOMED CT code, records were excluded if the description exactly matched a SNOMED CT term in the SNOMED CT dataset, as these would certainly cause false positives in the performance results.

In reality, free text often does not exactly match the corresponding terms.(32, 33) As the dataset was imbalanced, with 97.05% of the diagnosis descriptions exactly matching the SNOMED CT terms, we split the dataset into an exact-match dataset and a comparable-match dataset. The comparable-match dataset was used as a simulation of the practical setting. The exact-match dataset included diagnosis descriptions that were identical to the SNOMED CT terms (e.g. the diagnosis description "pneumonia" with the SNOMED CT term "pneumonia") and the comparable-match dataset included diagnosis descriptions that matched the SNOMED CT terms on similar meaning (e.g. the diagnosis description "not fusing of fracture" with the SNOMED CT term "pseudoarthrosis"). 6,353 (97.05%) of the records with a corresponding SNOMED CT code fell in the exact-match dataset and 193 (2.95%) in the comparable-match dataset. The records without a SNOMED CT code were randomly allocated to the exact-match and comparable-match datasets. In total, the exact-match dataset contained 10,465 records and the comparable-match dataset 318 records. Both datasets were further split into a 90% training set and a 10% test set. Figure 3 shows the flowchart of creating the annotated datasets.

Free-text dataset

From the healthcare providers' notes 2017, we also extracted the modified free-text diagnosis descriptions to create the free-text dataset. All modified free-text diagnosis descriptions were included. This resulted in a free-text dataset containing 74,633 records.

Figure 2 Flowchart of excluding SNOMED CT terms. Of the 2,411,303 terms in the SNOMED CT 2017 dataset, 2,367,196 English terms were excluded (44,107 remaining), then 778 duplicated terms with the same code (43,329 remaining) and 326 identical terms with different codes, resulting in a final SNOMED CT 2017 dataset of 43,003 terms.

Figure 3 Flowchart of creating the annotated datasets from the dataset extracted from the EHR system Epic, Amsterdam UMC (1-1-2017 – 31-12-2017; n = 290,275 records). Unique diagnosis description records: n = 11,861; blank descriptions removed: n = 1 (11,860 remaining), of which 7,414 contained a SNOMED CT code and 4,346 did not. Excluded from the records with a code: 963 records whose code was unfindable in the SNOMED CT corpus and 5 records with a term match but a different SNOMED CT code (6,546 remaining). Excluded from the records without a code: 110 records with a findable term match (4,236 remaining). The resulting 10,783 records were divided into the exact-match dataset (n = 10,465: 6,353 with and 4,112 without a SNOMED CT code) and the comparable-match dataset (n = 318: 193 with and 125 without a SNOMED CT code), each split into a 90% training set (n = 9,419 and n = 286) and a 10% test set (n = 1,047 and n = 32).

4.2.2. Medical concept normalization – Pipeline

Figure 4 shows the pipeline for MCN used in this study, which consisted of three modules: language model fine-tuning, preprocessing and classification.

Fine-tuning the language model

Datasets

For fine-tuning, we used the NHG standards, SNOMED CT and the healthcare providers' notes 2019. These three datasets were useful for fine-tuning, as the NHG standards contained a large amount of free medical-domain text, SNOMED CT contained the terms onto which free medical text should be mapped, and the healthcare providers' notes 2019 contained extra healthcare providers' comments with longer sentences than the 2017 dataset. The datasets were divided into a 90% training set and a 10% test set.

Programs & parameters

As the NHG standards and SNOMED CT were publicly available, fine-tuning on these was done with the use of Google Colaboratory's Tesla P100 PCIE 16GB GPU. Fine-tuning on the healthcare providers' notes 2019 was done locally on a computer with 8 CPUs, because of the sensitive nature of the notes. We used Python 3, PyTorch 1.5.0, Transformers 2.6.0, a maximum of 510 (sub)words per sentence (512 minus the classification (CLS) and separator (SEP) tokens that mark the beginning and end of a sentence respectively) and 24 epochs. An epoch is one pass of the dataset through the model.

Preprocessing

Preprocessing consisted of two steps. First, Dokter BERTje was used to obtain word embeddings. Second, pooling strategies were used to derive a fixed-size sentence embedding from the word embeddings. We preprocessed each diagnosis description in the annotated datasets and each term in SNOMED CT in the following way:

‐ Dokter BERTje:

o Tokenizing: Dokter BERTje's tokenizer broke a sentence down into a list of (sub)words, called tokens. Next, it added the CLS and SEP tokens at the beginning and the end of the sentence respectively.

o Encoding: Using Dokter BERTje's vocabulary, the tokens per sentence were converted into vocabulary indices, creating sentences with encoded tokens.

o Creating word embeddings: By running the sentences with encoded tokens through Dokter BERTje, a word embedding was created for each token in a sentence.

‐ Pooling: There are different pooling strategies to derive a fixed-size sentence embedding out of word embeddings. We experimented with three types of pooling strategies:

o Using the CLS-token vector, which is the vector corresponding to the first token of a sentence. This token is specific to classification tasks.

o Computing the average vector (AVG) of all the word embeddings in a sentence.

o Computing the maximum vector (MAX) of all the word embeddings in a sentence.

Classification

Cosine similarity was used for classification. Classification was performed per pooling strategy in the following way for each diagnosis description in the annotated datasets and each term in SNOMED CT:

‐ Similarity calculation: the similarity between two sentences (a diagnosis description and a SNOMED CT term) was calculated with the cosine formula described in section 3.3 on the sentence embeddings.

‐ Sorting from similar to dissimilar: for each diagnosis description in the annotated dataset, the cosine results for all SNOMED CT terms were sorted from largest to smallest value.

‐ Returning the result: from the sorted cosine results, the nearest term and its cosine result were returned (i.e. the top 1). A condensed code sketch of these steps is given below.
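The following sketch condenses the preprocessing and classification modules, assuming the fine-tuned model is available locally under the (hypothetical) path dokter-bertje; it uses a recent Transformers API, and the toy term list only illustrates the nearest-term lookup, which in the actual pipeline runs over all 43,003 SNOMED CT terms.

```python
# Sketch of the preprocessing and classification modules: embed a sentence
# with Dokter BERTje, pool the word embeddings into one sentence embedding
# (CLS / AVG / MAX), and return the nearest SNOMED CT term by cosine
# similarity. The local model path "dokter-bertje" is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dokter-bertje")
model = AutoModel.from_pretrained("dokter-bertje")
model.eval()

def sentence_embedding(text: str, strategy: str = "AVG") -> torch.Tensor:
    # Tokenizing + encoding: the tokenizer adds the CLS and SEP tokens itself.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, 768) embeddings
    if strategy == "CLS":
        return hidden[0]                 # vector of the first (classification) token
    if strategy == "MAX":
        return hidden.max(dim=0).values  # element-wise maximum over the tokens
    return hidden.mean(dim=0)            # AVG: element-wise mean over the tokens

def classify(description: str, snomed_terms: list[str], strategy: str = "AVG"):
    query = sentence_embedding(description, strategy)
    term_vectors = torch.stack([sentence_embedding(t, strategy) for t in snomed_terms])
    scores = torch.nn.functional.cosine_similarity(query.unsqueeze(0), term_vectors)
    best = int(scores.argmax())
    return snomed_terms[best], float(scores[best])  # top-1 term and its similarity

print(classify("pneumonie", ["pneumonie", "pseudoartrose"]))
```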


Figure 4 Pipeline for MCN. Language model fine-tuning: the NHG standards, healthcare providers' notes and SNOMED CT terms are used to fine-tune BERTje into Dokter BERTje. Preprocessing: healthcare providers' notes and SNOMED CT terms are converted into word embeddings and pooled into sentence embeddings. Classification: cosine similarity between the sentence embeddings, with results between 0 and 1.

4.3. Data analysis

Performance rate

The performance was evaluated using the confusion matrix to compute performance metrics such as accuracy, precision, recall and F1-score. Receiver operating characteristic (ROC) analysis was performed on the training set to assess the model accuracy. The area under the ROC curve (AUROC) was considered excellent for values between 0.9 and 1, good between 0.8 and 0.9, fair between 0.7 and 0.8, poor between 0.6 and 0.7 and failed between 0.5 and 0.6. As the results from the cosine similarity were not binary, the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) were defined to assess the AUROC. Table 2 gives the definitions of TP, FP, FN and TN. The optimal cutoff point was defined with the use of the Youden index, which is the point on the ROC curve closest to perfect sensitivity and specificity. To evaluate the performance of the different pooling strategies, we calculated the F1-score on the test set, which is computed from precision and recall. Precision is the number of TP divided by the number of all positive results, and recall is the number of TP divided by the number of samples that should have been positive. The F1-score can be interpreted as a weighted average of precision and recall; it reaches its best value at 1 and its worst at 0. Precision, recall and F1 can be calculated with the following formulas:

$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN} \qquad F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Table 2 Definitions of the confusion matrix variables.
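A sketch of this evaluation with scikit-learn is given below; the label and score arrays are placeholders standing in for the Table 2 definitions applied to the cosine-similarity output.

```python
# Sketch of the evaluation: AUROC on cosine-similarity scores, a Youden-index
# cutoff, and precision / recall / F1 at that cutoff. y_true holds the binary
# labels from the Table 2 definitions; both arrays are placeholders.
import numpy as np
from sklearn.metrics import (f1_score, precision_score, recall_score,
                             roc_auc_score, roc_curve)

y_true = np.array([1, 1, 0, 1, 0, 0])                     # gold: correct match yes/no
scores = np.array([0.99, 0.97, 0.95, 0.96, 0.40, 0.88])   # top-1 cosine similarities

print("AUROC:", roc_auc_score(y_true, scores))

fpr, tpr, thresholds = roc_curve(y_true, scores)
youden = thresholds[np.argmax(tpr - fpr)]   # Youden index J = sensitivity + specificity - 1
y_pred = (scores >= youden).astype(int)

print("cutoff:", youden)
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```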

Out-of-Vocabulary

We evaluated the usefulness of Dokter BERTje for medical text. We defined a cutoff value of <10% OOV words as useful. OOV words are words that are not in the vocabulary of a model. For the usefulness evaluation, we analyzed the number of OOV words, the number of unique OOV words and the frequency with which these OOV words appeared in the dataset. We evaluated OOV words on the annotated datasets, SNOMED CT and the free-text dataset.

We analyzed the OOV words by running Dokter BERTje's tokenizer on the datasets. If a token was OOV, it was tokenized as the unknown token (UNK). By counting the number of UNK tokens, we computed the percentage of (unique) OOV words that appeared in the different datasets. Furthermore, the top 20 most frequent OOV words were listed and categorized by type, such as abbreviations, medical terms or typing mistakes.
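The following is a minimal sketch of this analysis; the corpus file name and the local model path are placeholders.

```python
# Sketch of the OOV analysis: count words that Dokter BERTje's tokenizer can
# only represent with the unknown token (UNK). The corpus path and the model
# path are placeholders.
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dokter-bertje")
oov_counts, total_words = Counter(), 0

with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        for word in line.split():
            total_words += 1
            if tokenizer.unk_token in tokenizer.tokenize(word):
                oov_counts[word] += 1  # word contains at least one UNK piece

oov_total = sum(oov_counts.values())
print(f"total OOV: {oov_total / total_words:.2%}")
print(f"unique OOV: {len(oov_counts)} words")
print("top 20:", oov_counts.most_common(20))
```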

5. Results

Performance rate

The exact-match dataset had excellent AUROCs for all three pooling strategies (0.928, 0.927 and 0.927 for CLS, AVG and MAX respectively). For the comparable-match dataset, on the other hand, the AUROCs were lower than 0.5 for all three pooling strategies (0.438, 0.448 and 0.422 for CLS, AVG and MAX respectively). Figure 5 shows the AUROCs of the exact-match and comparable-match datasets. For the exact-match dataset, the similarity threshold chosen with the Youden index was 1 for all pooling strategies. AVG had the highest F1-score (0.920). For the comparable-match dataset, the similarity threshold chosen with the Youden index was above 0.9 for all pooling strategies, with the lowest threshold being 0.929 for CLS. The F1-scores for all pooling strategies on the comparable-match dataset were lower than on the exact-match dataset, with the highest F1-score being 0.500 for CLS.

Figure 5 The AUROCs of the exact-match and comparable-match dataset on three types of pooling strategies (CLS, AVG and MAX).


Table 3 Results of all pooling strategies for both annotated datasets.

Out-of-vocabulary words

Table 4 shows the numbers and percentages of OOV words in the different datasets. Only SNOMED CT had under 10% OOV words, with 16,039 (9.11%) of the 176,009 words. However, SNOMED CT did have more than 10% unique OOV words, with 3,119 (18.58%) of the 16,789 unique words. The exact-match, comparable-match and free-text datasets all had more than 10% total OOV words as well as more than 10% unique OOV words. The free-text dataset had the highest percentages of OOV words (15.14%) and unique OOV words (22.69%).

The top 20 most frequent OOV words are shown in Table 5. In SNOMED CT, the top 20 most frequent OOV words were all medical terms. The top 20 of both the exact-match dataset and the comparable-match dataset contained more medical terms than abbreviations, while that of the free-text dataset contained more abbreviations than medical terms. Abbreviations in the table are given in blue. The exact-match dataset had 11 medical terms vs 9 abbreviations, the comparable-match dataset 12 medical terms vs 7 abbreviations and the free-text dataset 5 medical terms vs 14 abbreviations.

Table 4 The number of OOV words in downstream task datasets.


Table 5 Top 20 most frequent OOV words per dataset. Medical terms are given in black, abbreviations in blue and characters in orange.

6. Discussion

In this study, we wanted to determine whether Dokter BERTje could be successfully applied as a performant component for Dutch MCN, and to determine its usefulness. We therefore conducted an experiment in which we fine-tuned BERTje on medical text, yielding Dokter BERTje, for extracting vector representations of diagnosis descriptions and SNOMED CT terms. First, we analyzed Dokter BERTje on classifying the diagnosis descriptions onto SNOMED CT terms with the use of the cosine similarity method. Second, we analyzed the percentage of OOV words.

Dokter BERTje had high F1-scores on the exact-match dataset, with the highest F1-score for the AVG pooling strategy (0.920), indicating good performance. Dokter BERTje performing well on the exact-match dataset for all three pooling strategies shows that it is able to link SNOMED CT terms to diagnosis descriptions using cosine similarity on sentence embeddings. However, as the exact-match dataset was based on SNOMED CT terms that were identical to the diagnosis descriptions, this indicates the maximum attainable performance and says little about how the model will perform in a practical setting. In practice, free-text diagnosis descriptions are almost never identical to the SNOMED CT terms;(32, 33) we therefore used the comparable-match dataset to simulate a practical setting. However, Dokter BERTje performed very poorly on the comparable-match dataset, with the highest F1-score being 0.500 for the CLS pooling strategy. This shows that Dokter BERTje cannot yet be employed in practice on Dutch medical text. This could probably be improved by adjusting the parameters for fine-tuning the language model, for example the number of epochs.


The total percentage of OOV words in SNOMED CT was lower than 10% (9.11%), while the total percentages of OOV words in both annotated datasets and the free-text dataset were higher than 10%. The lower percentage of OOV words in SNOMED CT can be due to SNOMED CT containing fewer OOV words on average. Despite the high percentage of OOV words in the exact-match dataset, Dokter BERTje still managed to classify the diagnosis descriptions onto the right SNOMED CT terms. This can be explained by the way BERT models create embeddings. BERT models create an embedding based on the surrounding words in a sentence. In this way, OOV words in a sentence can be compensated by the surrounding words that do occur in the BERT model's vocabulary. For example, in "acute duodenal ulcer with hemorrhage", where "ulcer" is the only OOV word, the OOV word will be compensated by the other words in the sentence. This compensation forms a "weak" sentence embedding. Because the sentence embeddings of a diagnosis description and its corresponding SNOMED CT term in the exact-match dataset are identical, these two "weak" sentence embeddings still have the closest distance to each other. In short sentences consisting of only OOV words, the OOV words are not compensated, for example "aneurysma aortae", in which both "aneurysma" and "aortae" are OOV words. This is a plausible reason why the AUROCs for the pooling strategies on the exact-match dataset were not 1. On the other hand, "weak" sentence embeddings in the comparable-match dataset make MCN unreliable. Having a "weak" sentence embedding for one or both of the diagnosis description and the corresponding SNOMED CT term influences the distance between the two. This may cause another SNOMED CT term to be returned as the nearest SNOMED CT term. Given the AUROCs of 0.5 and lower for all three pooling strategies on the comparable-match dataset, this probably occurred often. Of all four datasets, the free-text dataset had the highest total percentage of OOV words, which can be due to clinical text containing grammar mistakes and clinical scores.

The percentage of unique OOV words was above 10% for all datasets (SNOMED CT, the exact-match, the comparable-match and the free-text dataset). This indicates that Dokter BERTje contains limited variation of (sub)words in its vocabulary. To solve the problem of OOV words, Dokter BERTje needs to be further pre-trained instead of fine-tuned. Fine-tuning a language model does not expand the model's vocabulary; it only changes the weights for the context of (sub)words already in the vocabulary. BERTje is pre-trained on natural Dutch language, which we fine-tuned further on Dutch medical text. It is preferable to pre-train further on Dutch medical corpora to expand the model's vocabulary with words used in the medical domain.

The top 20 most frequent OOV words in SNOMED CT, the exact-match and the comparable-match dataset were medical words, while the top 20 most frequent OOV words in the free-text dataset were abbreviations. Given this difference between the comparable-match dataset and the free-text dataset (more medical words versus more abbreviations), the comparable-match dataset appears not to be a good representation of the actual practical setting. In practice, free clinical texts contain rich information on symptoms and disease observations supported by medication and surgical procedures, in which abbreviations are pervasive.(32, 33) For future research, an open annotated benchmark dataset should be created from free clinical Dutch text to be able to evaluate and compare the performance of future models.

A limitation of this study was the scarcity of local computing resources. Because of the sensitive nature of the notes, fine-tuning on the free text of the healthcare providers' datasets had to be done locally. Fine-tuning BERTje on the healthcare providers' notes 2019, which contained 8,998 records, took about a day. Fine-tuning BERTje on the healthcare providers' notes 2017, which is 32 times bigger than the 2019 dataset, would take longer than a month. For this reason, we only fine-tuned BERTje on the smallest set of healthcare providers' notes, next to the NHG standards and SNOMED CT.

One strength of this study is that it is the first experiment with MCN in Dutch. Another strength is the ability to link terms to diagnosis descriptions in an unsupervised manner with the use of cosine similarity. If performance is improved, this will open up ways of doing MCN in an unsupervised manner.


In our work, we researched only one language model for MCN, in which a high percentage of OOV words was found after fine-tuning. Future research should check the number of OOV words first before deciding whether to pre-train a model further or to only fine-tune it. To improve Dokter BERTje on MCN with the cosine similarity method, Dokter BERTje needs to be further pre-trained on medical Dutch language. In this way, new words can be added to the vocabulary to lower the number of OOV words. Alternatively, other Dutch language models, such as Multilingual BERT and RobBERT, can be examined for their OOV words and their MCN performance compared to that of Dokter BERTje.(15, 34) Multilingual BERT is based on 104 languages including Dutch, and RobBERT is another monolingual Dutch version of BERT, but based on a robustly optimized BERT approach. Furthermore, as most SNOMED CT terms are in English, future research can use a multilingual model instead of a monolingual model for MCN.

7. Conclusion

Dokter BERTje can link diagnosis descriptions to SNOMED CT terms that exactly match, but performs poorly on linking diagnosis descriptions to SNOMED CT terms in a practical setting where diagnosis descriptions are not identical to SNOMED CT terms. Therefore, Dokter BERTje cannot yet be successfully employed as a performant component for Dutch MCN. However, this is the first study to explore MCN in Dutch free text. The high percentage of OOV words indicates that the model is not yet useful. Further research should be conducted into lowering the number of OOV words by further pre-training the language model.

8. References

1. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health information science and systems. 2014;2(1):3.

2. Luo JS. Electronic medical records. Primary Psychiatry. 2006;13(2):20-3.

3. Weed L. Medical records that guide and teach. N Engl J Med. 1968;278:652-7.

4. Salmon P, Rappaport A, Bainbridge M, Hayes G, Williams J, editors. Taking the problem oriented medical record forward. Proceedings of the AMIA Annual Fall Symposium; 1996: American Medical Informatics Association.

5. Simons SM, Cillessen FH, Hazelzet JA. Determinants of a successful problem list to support the implementation of the problem-oriented medical record according to recent literature. BMC medical informatics and decision making. 2016;16(1):102.

6. Wright A, McGlinchey E, Poon E, Jenter C, Bates D, Simon S. Ability to generate patient registries among practices with and without electronic health records. Journal of medical Internet research. 2009;11(3):e31.

7. Schmittdiel J, Bodenheimer T, Solomon NA, Gillies RR, Shortell SM. Brief report: the prevalence and use of chronic disease registries in physician organizations. Journal of general internal medicine. 2005;20(9):855-8.

8. Grant RW, Cagliero E, Sullivan CM, Dubey AK, Estey GA, Weil EM, et al. A controlled trial of population management: diabetes mellitus: putting evidence into practice (DM-PEP). Diabetes care. 2004;27(10):2299-305.

9. Jaffe IS, Chiswell K, Tsalik EL, editors. A Decade On: Systematic Review of ClinicalTrials.gov Infectious Disease Trials, 2007–2017. Open forum infectious diseases; 2019: Oxford University Press US.

10. Wang Y, Kung L, Byrd TA. Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change. 2018;126:3-13.

11. Holmes C. The problem list beyond meaningful use: part I: the problems with problem lists. Journal of AHIMA. 2011;82(2):30-3.

12. Liu H, Wagholikar K, Wu ST-I. Using SNOMED-CT to encode summary level data–a corpus analysis. AMIA Summits on Translational Science Proceedings. 2012;2012:30.


13. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J, editors. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems; 2013.

14. Luo Y, Song G, Li P, Qi Z, editors. Multi-task medical concept normalization using multi-view convolutional neural network. Thirty-Second AAAI Conference on Artificial Intelligence; 2018.

15. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.

16. Miftahutdinov Z, Tutubalina E. Deep neural models for medical concept normalization in user-generated texts. arXiv preprint arXiv:1907.07972. 2019.

17. Ji Z, Wei Q, Xu H. Bert-based ranking for biomedical entity normalization. AMIA Summits on Translational Science Proceedings. 2020;2020:269.

18. Peterson KJ, Liu H. Automating the Transformation of Free-Text Clinical Problems into SNOMED CT Expressions. AMIA Summits on Translational Science Proceedings. 2020;2020:497.

19. de Vries W, van Cranenburgh A, Bisazza A, Caselli T, van Noord G, Nissim M. BERTje: A Dutch BERT Model. arXiv preprint arXiv:1912.09582. 2019.

20. Aronson AR, editor. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of the AMIA Symposium; 2001: American Medical Informatics Association.

21. Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association. 2010;17(5):507-13.

22. Kang N, Singh B, Afzal Z, van Mulligen EM, Kors JA. Using rule-based natural language processing to improve disease normalization in biomedical text. Journal of the American Medical Informatics Association. 2013;20(5):876-81.

23. D’Souza J, Ng V, editors. Sieve-based entity linking for the biomedical domain. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers); 2015.

24. Ren K, Lai AM, Mukhopadhyay A, Machiraju R, Huang K, Xiang Y. Effectively processing medical term queries on the UMLS Metathesaurus by layered dynamic programming. BMC medical genomics. 2014;7(S1):S11.

25. Leaman R, Khare R, Lu Z. Challenges in clinical natural language processing for automated disorder normalization. Journal of biomedical informatics. 2015;57:28-37.

26. Limsopatham N, Collier N, editors. Normalising medical concepts in social media texts by learning semantic representation. 2016: Association for Computational Linguistics.

27. Tutubalina E, Miftahutdinov Z, Nikolenko S, Malykh V. Medical concept normalization in social media posts with recurrent neural networks. Journal of biomedical informatics. 2018;84:93-102.

28. Tahmasebi AM, Zhu H, Mankovich G, Prinsen P, Klassen P, Pilato S, et al. Automatic normalization of anatomical phrases in radiology reports using unsupervised learning. Journal of digital imaging. 2019;32(1):6-18.

29. Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676. 2019.

30. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al., editors. Attention is all you need. Advances in neural information processing systems; 2017.

31. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. 2014.

32. Jensen K, Soguero-Ruiz C, Mikalsen KO, Lindsetmo R-O, Kouskoumvekaki I, Girolami M, et al. Analysis of free text in electronic health records for identification of cancer patient trajectories. Scientific reports. 2017;7:46226.

33. Moon S, Pakhomov S, Liu N, Ryan JO, Melton GB. A sense inventory for clinical abbreviations and acronyms created using clinical notes and medical dictionary resources. Journal of the American Medical Informatics Association. 2014;21(2):299-307.

34. Delobelle P, Winters T, Berendt B. RobBERT: a Dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286. 2020.


Appendix

Appendix A – Cosine similarity

Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. The similarity can be calculated with the following formula, where A and B are two vectors in a multi-dimensional space:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Here θ is the angle between the vectors A and B measured in radians, A · B is the dot product of A and B, and ‖A‖ and ‖B‖ represent the lengths of the vectors A and B respectively. The dot product of the vectors A and B is defined as:

$$A \cdot B = \sum_{i=1}^{n} A_i B_i = A_1 B_1 + A_2 B_2 + \dots + A_n B_n$$

Here Σ denotes the sum and n the dimension of the vector space. The length of a vector A is defined as:

$$\|A\| = \sqrt{A_1^2 + A_2^2 + \dots + A_n^2}$$

For example, two three-dimensional vectors, X = [2, 4, 4] and Y = [3, 2, 6], would give a dot product of:

$$X \cdot Y = 2 \cdot 3 + 4 \cdot 2 + 4 \cdot 6 = 6 + 8 + 24 = 38$$

lengths of:

$$\|X\| = \sqrt{2^2 + 4^2 + 4^2} = \sqrt{36} = 6 \qquad \|Y\| = \sqrt{3^2 + 2^2 + 6^2} = \sqrt{49} = 7$$

and a similarity of:

$$\frac{38}{6 \cdot 7} = \frac{38}{42} \approx 0.905$$
