
Natural language processing for Dutch medical language

A method for evaluating the value of currently available NLP tools for annotating Dutch medical free texts


Student

D.S. Westerbeek

Meibergdreef 9

1055 AZ Amsterdam Zuidoost

Student number 5871557

d.s.westerbeek@amc.uva.nl

Mentor

Dr. ir. R. Cornet

Department of Medical Informatics

Academic Medical Center, University of Amsterdam

Tutor

Dr. F.J. Wiesman

Department of Medical Informatics

Academic Medical Center, University of Amsterdam

Period


Contents

1 Introduction
1.1 Current situation
1.2 Problems
1.2.1 Reuse of data
1.2.2 Interchangeability between different languages
1.2.3 Deploying structured data for reuse and interoperability
1.3 Challenge
1.4 Use case and research question
1.5 Outline of the thesis
2 Background
2.1 Natural language processing
2.1.1 Scope
2.1.2 Natural language processing in the medical domain
2.1.3 Natural language processing for Dutch medical language
2.2 Electronic health records
2.3 Structured versus unstructured data
3 Materials
3.1 Dutch medical lexicons
3.1.1 Pinkhof Geneeskundig woordenboek
3.1.2 Thesaurus Zorg en Welzijn
3.1.3 Diagnosethesaurus
3.2 Dutch medical free text
3.3 Development and running materials
3.4 Golden standard materials
4 Methods
4.1 Usable NLP tools
4.2 Configuration options for a pipeline
4.3 Outcome measures
4.4 Golden standard
4.5 Assessing a lexicon’s value in a pipeline setting
4.6 Experimenting with NLP tools applicable in a pipeline
5 Results
5.1 Usable NLP tools
5.2 Configuration options for a pipeline
5.2.1 Inputs and desired outputs
5.2.2 Order of the pipeline components
5.2.3 Configurable parameters
5.3 Outcome measures
5.3.1 Handling partially descriptive annotations
5.3.2 Example
5.3.3 Recall, precision, and the F1-score
5.3.4 Applying recall, precision, and the F1-score to our case
5.3.5 Defining a single outcome measure
5.4 Golden standard
5.5 Assessing a lexicon’s value in a pipeline setting
5.6 Experimenting with a rudimentary pipeline
6 Discussion
6.2 Strengths and weaknesses
6.3 Implications
6.4 Future research
7 Conclusion
8 Acknowledgments
References


Abstract

BACKGROUND Clinicians prefer free text format when entering data in an electronic health record (EHR). This gives more freedom of expression, but severely limits the reusability of the data and introduces irrelevant or redundant data in the EHR. In order to increase reusability without having to analyze every input text by hand, techniques from natural language processing (NLP) could be applied to automatically annotate data using standardized concepts from some lexicon or terminology system. For the combination of Dutch language and the medical domain, little evidence exists on the effectiveness of existing NLP tools, which are mostly developed for English or are general-purpose.

GOAL The question this thesis aims to answer is: which methods and measures can determine the quality level that currently available NLP tools provide to annotate Dutch medical free text using a standardized medical lexicon or terminology system? As NLP is commonly done through orchestration of various types of tools (sentence splitters, part-of-speech taggers, etc.) in a so-called pipeline, this question boils down to (1) which NLP tools are of use in the pipeline? (2) which configuration options affect the outcome of the pipeline? (3) which outcome measures are valid? (4) what golden standard can be used to assess the quality of the pipeline output? (5) what is the value of standardized Dutch medical lexicons in a pipeline? (6) is it practically possible to implement and evaluate a usable pipeline?

METHODS To answer question (1), I investigated literature on the subject, searched the World Wide Web, and selected tools based on availability and usability. Question (2) is answered by examining literature and the configuration options for the selected tools and by implementing a rudimentary version of the pipeline. Answering question (3) required investigation of literature. Question (4) was answered by an expert-based manual annotation. For question (5) I evaluated the degrees of coverage of three Dutch medical lexicons based on the golden standard. Question (6) was answered by setting up a rudimentary version of the pipeline.

RESULTS (1) I found the following tools that are available and usable for the Dutch language: Alpino (parser), Frog (parser), JOrtho (spellchecker), and Hunspell (spellchecker). In order to construct a pipeline, tools have to be written to let one tool's output be used as another's input, as well as a tool to annotate the relevant concepts from the free texts using the contents of a lexicon. (2) Multiple configuration options are important for the outcome of the pipeline: the order of applying the used tools, the dictionaries used for spellchecking, and the extent of fuzzy search when annotating concepts. (3) Recall and precision are the standard outcome measures in the NLP field. Since these two measures can be calculated for the found relevant concepts in the input texts, as well as for the annotated concepts, a single F-measure is not easily calculated. I propose to use a measure derived from the two F-measures, called the overall relevance. (4) The golden standard can be achieved by letting domain experts manually annotate concepts, but this requires a large number of man hours and a usable annotation tool in order to produce enough results. (5) Depending on the specific lexicon, the degree of coverage ranges from .49 to .61. (6) An initial implementation of the pipeline revealed that connecting the various NLP tools is challenging, but doable.

DISCUSSION The methods provided a way to evaluate the quality of currently available NLP tools applied to Dutch medical free text. The fact that these results are supported by a rudimentary implementation provides confidence in their applicability. This does not mean, however, that all obstacles that could occur when applying these methods have been encountered. For future research, I suggest applying the design of the pipeline to real-world medical data and evaluating the outcomes of different configurations, using the evaluation method proposed in this thesis. The outcome of the highest scoring configuration would provide an indication of the usefulness of currently available natural language processing tools in a Dutch medical setting. This can influence how we treat currently existing medical data, or maybe even how clinicians are asked to enter data into an EHR.

CONCLUSION I have provided a method to evaluate the quality of currently available NLP tools applied to Dutch medical free text and shown that Dutch medical lexicons are suitable to be used in a pipeline setting. Applying the methods and evaluating the quality of the NLP tools is the next step.

keywords natural language processing, medical language, Dutch, NLP pipeline, lexicon

Summary

BACKGROUND Clinicians prefer free-text entry when recording data in an electronic health record (EHR). This offers more freedom, but at the same time severely limits the reusability of that data. Free-text entry also introduces irrelevant or redundant data into the EHR. To increase the potential for reuse without requiring manual work, techniques from natural language processing (NLP) can be deployed. With these, data can automatically be annotated with concepts from a standardized medical lexicon. Little is known about the effectiveness of existing NLP tools in the Dutch medical domain. Most NLP tools have been developed for the English language and are domain independent.

GOAL The question this research aims to answer is: which methods and outcome measures can determine the quality level of currently available NLP tools deployed to annotate Dutch medical free text using a standardized medical lexicon or terminology system? Since NLP is normally applied by having different kinds of tools cooperate in a so-called pipeline, the question comes down to: (1) which NLP tools are useful in a pipeline? (2) which configuration options affect the outcomes of a pipeline? (3) which outcome measures are valid? (4) what can serve as a golden standard to assess the quality of the pipeline's outcomes? (5) what is the value of Dutch medical lexicons in a pipeline? (6) is it practically possible to implement and assess a pipeline?

METHODS To answer question (1), I carried out a literature study on the subject and searched the Internet for tools, followed by selecting tools based on availability and usability. Question (2) is answered by studying the literature and the configuration options of the selected tools. Answering question (3) called for a literature study. Question (4) is answered by having experts carry out a manual annotation. For question (5) I measured the degrees of coverage of three Dutch medical lexicons using the golden standard. Question (6) I answered by building an experimental version of the pipeline.

RESULTS (1) The following available and usable tools for the Dutch language were found: Alpino (parser), Frog (parser), JOrtho (spellchecker), and Hunspell (spellchecker). To build a pipeline, the output of one tool must be usable as input for another. (2) There are multiple configuration options that matter for the outcomes of the pipeline, namely: the order of the tools used, the dictionaries used for spellchecking, and the extent to which ‘fuzzy search’ is used when annotating concepts. (3) Recall and precision are the standard outcome measures in NLP. These two measures can be calculated both for the relevant concepts found in the input and for the annotated concepts. A single F-measure is therefore not easily calculated. Instead, a single outcome measure is calculated from the two F-measures, called the ‘overall relevance’. (4) The golden standard can be achieved by having domain experts manually annotate relevant concepts, but this comes down to a large number of required man hours and an annotation tool. (5) Depending on the lexicon, the degree of coverage lies between .49 and .61. (6) An experimental implementation of the proposed pipeline revealed that connecting the various NLP tools is challenging, but possible.

DISCUSSION The described methods offer a way to evaluate the quality of currently available NLP tools, applied to Dutch medical free text, using real clinical data. This offers a springboard for future research. The fact that the experimental setup showed that it is possible does not mean that all problems of a practical implementation have come to light. For future research I recommend deploying the described methods and building a pipeline that, with real clinical data, can render a verdict on the state of NLP tools in the Dutch medical domain. The outcomes of the different pipeline configurations can then be compared with each other using the golden standard. This can influence how we handle medical data, or perhaps even have an impact on how clinicians are required to enter data into records.

CONCLUSION I have determined methods and outcome measures to assess the quality of currently available NLP tools when applied to Dutch medical free text. In addition, I have concluded that three Dutch medical lexicons are useful in a pipeline setting. The next step is to deploy the methods and outcome measures found.

keywords natural language processing, medical language, Dutch, NLP pipeline, lexicon

1 Introduction

This thesis focuses on the usefulness of currently available natural language processing tools when applied to Dutch medical free text. This introduction provides insight into the current situation, the problems that arise there, and a possible solution with its challenges.

1.1 Current situation

As aging becomes a greater issue in our society and as economic hardship presents itself ever more, the incentive for searching for ways to improve the quality of health care, as well as reducing its costs, grows. Electronic health records (EHRs) have already played a substantial role in both aspects, since EHRs have been shown to improve quality by providing greater accessibility of health records and to reduce costs both directly, by eliminating certain logistical aspects such as physically having to move the health records from one place to another, and indirectly through the effects of improved quality of care, such as reduced length of stay.[1]

Clinicians have accepted the use of EHRs in their daily workflows and have embraced the advantages that come with EHRs. Part of this acceptance has to do with the fact that their preferences regarding the usage of EHRs have been largely adopted by developers and implementation teams. One of these preferences, with wide support among clinicians, is to be able to enter data in the form of free text. This means they can use natural language, that is, informal human language, without restrictions. An example of free text in the form of natural language would be: Patient probably suffers from bronchitis. Clinicians prefer entering data in free text format, because it provides them with more freedom of expression than when limited to predefined data structures.[2]

1.2 Problems

Multiple problems arise from entering data in the form of free text. One of them is that it makes it harder to reuse data. Another is that it limits interoperability between different languages. Both problems call for a solution that includes a standardized way of storing the data, so that computers can interpret and exchange the data.

1.2.1 Reuse of data

In order for an EHR to have the potential of improving quality and reducing costs, the data stored in the EHR needs to be reusable.[3] Apart from the efficiency that the collect once, use many principle achieves,[4] data that is collected once can be of critical value later. An example would be a patient’s allergy to penicillin collected during a visit to a general practitioner, and used years later when that same patient is admitted to the emergency department of a hospital, while unable to mention the allergy.

In this example, the general practitioner is unlikely to use the same EHR as the emergency department of the hospital. This makes interoperability between multiple EHRs a requirement when it comes to reusing data across those EHRs. Interoperability means the EHRs should be able to exchange data. The exchange of data requires that the EHRs communicate according to a specified set of rules, which include specified data formats.

Data formats can be divided into two categories: structured and unstructured. A structured data format is a specified way of denoting some particular data. This could be a date formatted as dd/mm/yyyy, a patient's height formatted as three digits that specify the number of centimeters, or a diagnosis formatted as a letter followed by two digits that references a concept identifier in the ICD-10 classification.1 An unstructured data format has no specified way of denoting the data. This could be a diagnosis that is written in free text format.
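To make concrete what a computer gains from such a specified syntactic organization, the following sketch checks strings against the three example formats just named. It is an illustrative snippet, not part of the thesis's pipeline; the class name and the deliberately simplified patterns are assumptions.

import java.util.regex.Pattern;

// Toy validation of the three structured formats named above: a dd/mm/yyyy
// date, a three-digit height in centimeters, and an ICD-10-style code of
// one letter followed by two digits. The patterns are simplified.
public class FormatCheck {

    static final Pattern DATE   = Pattern.compile("\\d{2}/\\d{2}/\\d{4}");
    static final Pattern HEIGHT = Pattern.compile("\\d{3}");
    static final Pattern ICD10  = Pattern.compile("[A-Z]\\d{2}");

    public static void main(String[] args) {
        System.out.println(DATE.matcher("24/06/2013").matches());  // true
        System.out.println(HEIGHT.matcher("178").matches());       // true
        System.out.println(ICD10.matcher("J20").matches());        // true
        System.out.println(ICD10.matcher("bronchitis").matches()); // false: free text does not conform
    }
}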

Entering data in free text format, which is preferred by clinicians, complicates exchanging data. Outside of reusing EHR data for direct clinical purposes, the data may also be valuable for managerial, audit, or research purposes. For these purposes, it may be desired that the data is available in an aggregated format. In the case of data that is stored and exchanged in free text format, however, computers cannot simply aggregate the data that lies within the free text and, as a result, reuse of that data is obstructed.

1.2.2 Interchangeability between different languages

The second problem concerns the reuse of clinical data across multiple languages. If a patient history is exchanged in free text format between two EHRs that are used within the same language, the user of the receiving EHR will most likely have no problem with interpreting the data (barring context issues). If, however, the EHRs have user bases that speak different languages, interpretation of the data is made much more difficult and error prone.

1.2.3 Deploying structured data for reuse and interoperability

Because of these problems with both reuse and interoperability, free text data is detrimental to reuse and interoperability and should be avoided. To make data aggregation and translation of the data possible, structured data is required.[5, 6, 7] Some tools that support the medical professional in the care process, such as decision support systems, also rely on the availability of structured data.[8] A patient's history, stored in free text, could also be regarded as a list of symptoms, diagnoses, and treatments, each with a date. To this end, coding systems and terminology systems exist that contain all of these concepts.

Coding systems contain universal medical code numbers for every concept that is included in the coding system. The description of the concept can be translated to multiple languages, but the code number remains the same throughout all languages. Multiple synonyms within a language are also linked to the same code number. We call these code numbers concept identifiers. An example of a coding system is ICD-10. Terminology systems define concepts through the relations between them. There are multiple relation types, such as is-a relations and has-a relations. Since a kidney is an organ, the kidney concept has an is-a relation to the concept organ. The retina is part of the eye, so a has-a relation exists between the eye concept and the retina concept. The concepts in a terminology system can also contain synonyms and definitions. Again, all concepts have unique concept identifiers. All concept identifiers are per definition a structured data type.
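To make the notions of concept identifiers, synonyms, and typed relations concrete, here is a minimal data-structure sketch. The class layout, identifiers, and terms are illustrative assumptions, not taken from ICD-10, SNOMED CT, or any other real system.

import java.util.Arrays;
import java.util.List;

// Minimal sketch of terminology-system entries: each concept has a unique
// identifier and synonyms, and concepts are linked by typed relations
// (is-a, has-a). All identifiers and terms here are made up.
public class Terminology {

    static class Concept {
        final String id;             // unique, language-independent concept identifier
        final String term;           // preferred term
        final List<String> synonyms;

        Concept(String id, String term, String... synonyms) {
            this.id = id;
            this.term = term;
            this.synonyms = Arrays.asList(synonyms);
        }
    }

    public static void main(String[] args) {
        Concept organ  = new Concept("C002", "organ");
        Concept kidney = new Concept("C001", "kidney", "ren");
        Concept eye    = new Concept("C003", "eye", "oculus");
        Concept retina = new Concept("C004", "retina");

        // The kidney is an organ; the retina is part of the eye.
        System.out.println(kidney.id + " is-a "  + organ.id);
        System.out.println(eye.id    + " has-a " + retina.id);
    }
}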

By storing and exchanging the respective concept identifiers that are linked to such a system instead of storing free text, the language problem is eliminated and data can be aggregated.

However, as mentioned in section 1.1, clinicians require a certain amount of freedom of expression when entering the data. This clashes with the requirements of reusing the data.

1.3 Challenge

To overcome the problem of storing structured data without limiting the freedom of expression of clinicians, one way is to automatically map the entered free text to coding systems and terminology systems. In order to achieve this, techniques from the research field of natural language processing (NLP) can be employed. NLP analyzes natural language in an automated manner. NLP has been applied to the medical field with varied success. Software has been used to analyze the free text, and recognized concepts are mapped to existing coding systems and terminology systems with a certain level of confidence. NLP is still not a solved problem, in the sense that it cannot approximate human capability in a significant way.

Most NLP tools are created specifically for the English language, so mapping natural language to coding systems and terminology systems is even more of a challenge in non-English speaking countries. Efforts are being made to translate concepts of the most widely used coding systems and terminology systems to different natural languages,2 but this is only slowly taking place and a full translation to Dutch of the concepts of one of the most notable and most widely used terminology systems, SNOMED CT, is not available yet. Until such a translation is available, it would be useful if natural language can at least be mapped to standardized Dutch lexicons, which could in turn be mapped to standardized English-based coding systems and terminology systems.

Recently, a study was carried out to make an inventory of available NLP tools that can be applied to Dutch medical language.[9] That particular study has been the inspiration for this thesis, in which I try to determine a useful method to map medical Dutch free text to standardized Dutch medical lexicons as part of the greater mapping process to structured data in English based coding and terminology systems.

This method will make use of a pipeline setup. This means that NLP tools will be run sequentially to process free text as input, with the text's annotated concepts as output.

1.4 Use case and research question

During the writing of this thesis, the Academic Medical Center (AMC) in Amsterdam is undergoing a transition from the currently used EHR to a new EHR.3 Part of this transition means migrating the currently stored data. Since a large part of the data is stored in Dutch medical free text, concessions have to be made when choosing whether and how to store it in the new EHR. These choices depend on a number of factors, including the importance of the data, the context of the data, and the time it takes to manually migrate the data to a structured format.

2 http://www.ihtsdo.org/develop/documents/translating-snomed-ct/
3 https://www.amc.nl/web/Het-AMC/Nieuws/Nieuwsoverzicht/Nieuws/

The data to be migrated is a realistic basis for the goal of this thesis: to determine a method which can be used to automatically annotate medical Dutch free text with concepts from standardized Dutch medical lexicons, and to determine a measure to evaluate the quality of NLP tools that contribute to the automation of such a method.

This brings us to the research question:

Which methods and outcome measures can evaluate the quality level of currently available natural language processing tools in the process of annotating Dutch medical written free text from the AMC’s current EHR using a standardized medical lexicon?

I will have reached my goal when I have determined a method to evaluate the quality of currently available NLP tools, and I have provided a measure for this evaluation, including a golden standard.

To answer the research question, multiple smaller questions have to be answered. They can be split up into the following six questions:

1. Which NLP tools are of use in a pipeline setup?

2. Which configuration options affect the outcome of a pipeline setup?

3. Which outcome measures are valid?

4. What golden standard can be used to assess the quality of a pipeline output?

5. What is the value of standardized Dutch medical lexicons in an NLP pipeline?

6. Is it practically possible to set up the pipeline and apply the measures?

1.5 Outline of the thesis

The next section will provide background information on various relevant concepts and topics. If you, the reader, are not familiar with some of the concepts mentioned in this introduction (e.g. NLP), it is highly recommended to read the background in section 2.

Section 3 will provide insight into the materials needed and used for this thesis. After that, in section 4, I will discuss the methods used in this study by which I will try to answer the research question. This ranges from setting up a pipeline to finding a golden standard.

The results of applying those methods will be provided in section 5, followed by an analysis and discussion on the meaning of those results in section 6. There, I will also address the limitations of this study and recommended future research.

2 Background

This section will expand on certain topics that are discussed in the thesis. First, I will explain some basics about natural language processing in section 2.1, which is followed by a section on electronic health records (section 2.2).

2.1 Natural language processing

For those who are not familiar with natural language processing, this section provides information on its scope (section 2.1.1), its use and research in the medical domain (section 2.1.2), and specifically on its use and research applied to Dutch medical texts (section 2.1.3).

2.1.1 Scope

Natural language processing (NLP) is a field of artificial intelligence that has been active since the 1950s.[10] It focuses both on processing natural language as input, to make it interpretable by a pre-determined computer application, and on creating human-like natural language as output (i.e. natural language generation). Natural language can exist in the form of spoken or written text. Liddy defines NLP as follows:

“Natural language processing is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications.”[11]

This thesis will only deal with the analytical part of NLP. This means that natural language will be taken as input, instead of produced as output. The ultimate goal of this part of NLP is to extract meaning from natural language.[10]

It is possible to define subproblems of the NLP field. These subproblems are:

• parsing: the act of analyzing language according to its rules of formal grammar;

• noun phrase recognition: this could be regarded as a subproblem of parsing. It focuses on recognizing words that all say something about the same noun;

• named entity recognition (NER): this could also be regarded as a subproblem of parsing. It focuses on the recognition of named entities, such as a human name, but also the name of a disease or bacterium;

• word sense disambiguation (WSD): where parsing determines the role of a word, word sense disambiguation focuses on determining the meaning of a word, which is dependent on context;

• abbreviation handling: this could be regarded as a subproblem of WSD. It focuses on determining the full form for which an abbreviation was used. This also depends entirely on context;

• negation handling: uncertainty and negation can change the entire meaning of a piece of language. Negation handling focuses on recognizing it and determining its scope.

2.1.2 Natural language processing in the medical domain

The meaning of natural language is highly dependent on the context in which the natural language is applied. Different technical domains have different sublanguages. According to Spyns, a sublanguage can be defined as a technical language that is used by the various actors in the technical field to pass specific messages. A technical language differs from the general language by presenting some characteristics that are unique to that technical language. An example of such a characteristic is a specific vocabulary that is exclusively used in the particular technical domain, or words that take on a different meaning when used in the technical domain as opposed to using the same words in the general language.[12]

The medical domain can be regarded as one of those technical domains with its own sublanguage. In geographical areas where English is used as the general language, the medical domain uses the sublanguage of medical English. One of its characteristics that differentiates it from the general language is its use of words that are composed of Latin or Greek parts.

An example of where the meaning of a word in the medical language differs from the meaning of the same word in another sublanguage is the word sinus in medical Dutch. In the medical sublanguage it refers to a cavity, whereas it refers to a trigonometric function in the mathematical sublanguage. Another example is monster in medical Dutch, where it always means sample, whereas it can also refer to a frightening imaginary creature in the general Dutch language.

There are many instances where NLP has been applied specifically to the medical domain. Among those are:

• (1990) SPRUS had the goal to extract and encode the findings in a radiologist's report;[13]

• (1994) Friedman et al. also built and evaluated an extracting system for radiology reports;[14]

• (1995) MedLEE was the successor of the system of Friedman et al. and has been used successfully in radiology practice;[15]

• (1996) a prototype system to index radiology and pathology reports of lung cancer patients was built by Taira et al.;[16]

• (1999) Menelas was built as a pilot system to analyze discharge summaries;[17]

• (2001) Lussier et al. used MedLEE to map free text to SNOMED codes and evaluated the feasibility of using it in practice;[18]

• (2001) MetaMap was created to map (non-domain specific) medical texts to the UMLS Metathesaurus;[19]

• (2002) MPLUS was designed to demonstrate the usefulness of Bayesian networks in medical text processing;[20]


• (2003) IndexFinder was built to evaluate how much quality had to be sacrificed in order to map medical concepts to the UMLS Metathesaurus in a real-time fashion;[21]

• (2003) the RODS project was brought to life with the goal to facilitate outbreak and disease surveillance;[22, 23, 24]

• (2004) Friedman et al. evaluated a new version of MedLEE to map non-domain specific clinical documents to the UMLS Metathesaurus;[25]

• (2006) Meystre et al. built a system to manage a patient's problem list using the MetaMap Transfer application;[26]

• (2008) Goryachev et al. developed a system to extract a patient’s family history from medical texts;[27]

• (2009) Coden et al. built a system that extracts cancer disease characteristics from pathology reports;[28]

• (2009) MedLEE was also evaluated specifically when applied to nursing narratives;[29]

• (2009) ONYX was developed to translate spoken dental examinations into chartable findings;[30]

• (2010) cTAKES aims to be usable in a broader context by providing a system for information extraction from unstructured electronic medical record clinical texts;[31]

• (2011) a general framework for NLP, GATE, was applied to nursing narratives.[32]

Besides these systems that are designed to work in medical practice, multiple challenges have taken place to facilitate progress of NLP in the medical domain.[33, 34, 35, 36]

Spyns gives an overview of the most relevant instances up until 1996,[12] while Nadkarni et al. give notable examples from 1996 up until 2011.[10] Meystre et al. also give an overview of NLP research in the medical domain from 1995 to 2008.[8]

2.1.3 Natural language processing for Dutch medical language

As section 2.1.2 makes clear, quite some research has been carried out regarding NLP in the medical domain. When looking at the Dutch medical domain specifically, the amount of previous research disappoints compared to the English medical domain. Spyns et al. did quite some work in the 1990s on their medical language processor,[37, 38, 39, 40] but from there on out, it has been relatively quiet in the Netherlands regarding the topic. That is, until 2012, when Cornet et al. performed an inventory of NLP tools that could be applied to Dutch medical language.[9] This study showed that it would be feasible to put the tools into use and perform the study on which this thesis is based.

2.2 Electronic health records

Before the era of enterprise-affordable computers, all health-related data of a hospital patient was stored in paper-based records. The records were physically carried around the hospital by supporting staff and physicians. When a patient would see a physician, the paper-based record had to be located and transported to wherever the physician would see the patient, so that the physician could prepare by reading up on the history of that specific patient and later update the record. In case of an emergency visit by a patient, this sometimes presented trouble, because the attending physician would receive the record too late to see information relevant for an emergency treatment of the patient, for instance an allergy to penicillin while the physician wants to fight an infection according to standard procedure. This is just one of the properties of paper-based records that electronic health records (EHRs) have an advantage over. More examples are:[5, 41, 42]

• physical availability to only one person at a time;

• manpower required to find and move the record to where it is needed;

• space required to store the record;

• illegibility through handwriting;

• difficult aggregation of data for auditing, quality control, or research;

• no automatic triggers for decision support tools;

• inability to force care givers to abide by a standardized structure within the file.

The combination of these disadvantages and the emergence of affordable computing power encouraged health care organizations to think about and act upon the possibility of digitizing health records. Around the end of the 1960s and the beginning of the 1970s, there was a lot of optimism about the speed with which these EHRs would be implemented and would completely have replaced paper-based records.[5, 43] This expectation was not met, largely because of several barriers to adopting an EHR:[5, 42]

• high initial costs;

• scattered data in different formats on many isolated ’islands’ (different departments within a health care provider);

• concerns about confidentiality and security of patient data;

• a high risk of going bankrupt for EHR developers and a risk for potential buyers because of little use of standards;

• resistance by physicians to use an EHR.

This last barrier needs a bit of elaboration. Physicians seem to feel that using an EHR during patient visits disrupts their workflows.[42] Multiple factors contribute to this belief. These factors include:[44, 45] interrupted communication with patients, computers being simply too far away, loss of eye contact, slow computers, not being able to type fast enough, EHR usability issues, and not being able to express themselves freely enough in the rigid structure of the EHR.

2.3 Structured versus unstructured data

Most of the aforementioned barriers to adopting an EHR still exist, which is why not all the potential of EHRs is utilized. For example, many countries still have no structure in place for healthcare providers to share all patient data with each other when necessary. This has to do with the concerns about confidentiality and security of patient data, but also with the use of different data formats on the many isolated islands (but in this case, the islands are the different healthcare providers within a country). Another example is the use of aggregated data for research, auditing, quality control or the automated triggering of decision support when a certain diagnosis is entered. This requires the diagnosis to be entered in a way that the computer ’understands’ it.[5] The lack of freedom of expression that arises from this requirement is still not accepted by physicians, because entering a diagnosis in a computer-understandable way simply costs more time than entering it in free-text form.[5, 46, 47, 2]

Rosenbloom et al. refer to this as the ‘tension between the needs of busy healthcare providers and of those reusing data from healthcare information systems’.[2] The form of data that the latter group prefers is often denoted by the term structured, which makes the form of data that the former group prefers unstructured. Structured data can be defined as conforming to a predefined or conventional syntactic organization,[2] which makes unstructured data non-conforming to those properties.

3 Materials

For developing a method and a measure to evaluate the quality of currently available NLP tools, multiple materials are required, which are specified in this section. The Dutch medical lexicons used will be discussed in section 3.1. Section 3.2 will explain the data used to set up an experimental implementation of the found methods. The experimental implementation also requires certain development and running materials, which are discussed in section 3.3. The materials used for experimental results of the golden standard are addressed in section 3.4.

3.1 Dutch medical lexicons

I use three lexicons with different purposes and content. All three are used in developing the golden standard and in an experimental implementation of the found methods. The three lexicons are all current and frequently updated. They are introduced below.

3.1.1 Pinkhof Geneeskundig woordenboek

Pinkhof Geneeskundig woordenboek4 is an extensive Dutch dictionary that contains over 54,000 medical terms, their definitions, synonyms, variants in spelling, common mistakes, related terms, and abbreviations. I was provided with a digital copy of the dictionary in XML format, which was specifically prepared by the publisher for this research. The definitions of the terms are left out and terms from the following domains are missing: corporate health care, statistics and epidemiology, nutrition, nursing, informatics, and health care law.

3.1.2 Thesaurus Zorg en Welzijn

Thesaurus Zorg en Welzijn5 (TZW) is a Dutch thesaurus, a product of Stimulansz6. It contains more than 30,000 concepts from the social health domain and the disease domain. TZW provides synonyms, related concepts, and hierarchical relations between concepts.

3.1.3 Diagnosethesaurus

The Diagnosethesaurus7 is produced and maintained by Dutch Hospital Data (DHD). This thesaurus consists of an extensive collection of diagnoses, where each diagnosis has been mapped to an ICD-10 code as well as to a SNOMED CT code. There is also a mapping to the DBC code system (part of the Dutch health care financing system).

4 http://www.pinkhof.nl/geneeskundig-woordenboek
5 http://www.thesauruszorgenwelzijn.nl/
6 http://www.stimulansz.nl/

3.2 Dutch medical free text

To perform the evaluation on a realistic data set, Dutch medical written free text from real use in everyday practice is required. To this aim, data was obtained from the currently used EHR system in the AMC. It contains the anonymized conclusion/diagnosis fields of 1578 records that were entered in the last three years at the ear, nose, and throat (ENT) outpatient clinic. Some anonymized examples of the contents of the conclusion/diagnosis fields are:

• ‘Chronische rhinosinusitis zonder neuspoliepen.’

• ‘forse ABG bdz.’

• ‘OME’

• ‘vitamine D vrijwel op peil Hb minimaal verlaagd geen aanwijzingen voor andere sensitisatie dan HSM in RAST’

• ‘tel; wil graag operatie neus. komt langs op spreekuur.’

3.3 Development and running materials

In order to put the found methods to a practical test, an experimental implementation of them will be set up. This requires a platform to run the NLP tools on, as well as development tools to create an automated method to invoke the tools.

A platform that is compatible with all found NLP tools is required. Looking at the requirements of these tools, Linux presents itself as the most supported platform and since the supplied Dutch medical data has to remain in a secure environment, a secure Linux server at the AMC, running a Red Hat Enterprise Linux (RHEL) environment, is used.

In order to automatically invoke the tools in the RHEL environment, the scripting language of the Bash Unix shell is used. Bash scripting offers a way to call Linux commands in sequential but dynamic fashion. This way, the different tools can be triggered and parameterized from a single starting point, meaning the possibility to construct a pipeline.

In order to make the output of one NLP tool usable for the other, Java SE with its Java Development Kit is used to develop intermediate tools. Everything that results from this is executable on Java Runtime Environment 7 and up.
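To illustrate the kind of intermediate plumbing this involves, the sketch below invokes a command-line tool, captures its standard output, and normalizes it for the next stage. It is a minimal sketch: the invoked command (echo as a stand-in) and the trimming step are placeholders for a real call to, for example, Hunspell or Alpino and for a real format conversion.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of an intermediate pipeline tool: run one NLP tool,
// capture its standard output line by line, and normalize it so that it
// can serve as input for the next tool in the pipeline.
public class PipelineStage {

    public static List<String> run(String... command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout for simplicity
        Process process = pb.start();
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            reader.close();
        }
        process.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder invocation; a real pipeline would call a tool such as Hunspell here.
        for (String line : run("echo", "voorbeeld output")) {
            System.out.println(line.trim()); // placeholder normalization between two tools
        }
    }
}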

3.4 Golden standard materials

To come to a golden standard, a human being needs to be able to pick concepts from the different lexicons in a practical manner. To this aim, ITEM DiTo is used: a user interface for searching concepts in structured lexicons. The user enters a term in a search box and ITEM DiTo lists concepts from the selected lexicon that are lexically close to the entered term. It also shows synonyms and related terms. The user then clicks on the term that seems to be the searched concept, and this is saved to a database. This way, Pinkhof, TZW, and the Diagnosethesaurus are easily searchable for relevant concepts.

4 Methods

In this section, I will discuss the methods used to answer the six subquestions as formulated in section 1.4. First, I will describe the search strategy for available NLP tools that could be usable in a pipeline setting. Second, I discuss the way to identify the configuration options of an NLP pipeline. Third, I explain how to determine a valid outcome measure for the pipeline performance. Fourth, I discuss the way to identify a valid and usable golden standard. Fifth, the methods used to determine the value of three standardized Dutch lexicons in an NLP system are discussed. Finally, the methods for putting theory into practice are discussed.

4.1 Usable NLP tools

These methods describe the way to answer the following question: which currently available NLP tools are applicable and useful in a pipeline setting to annotate Dutch medical free text?

In order to come to an answer, the following concepts need to be made explicit: currently available, useful in a pipeline to annotate, and for Dutch (medical) free text.

Currently available can be defined as downloadable from the World Wide Web, with a license that permits use for research purposes, and available without costs. Useful in a pipeline means that a tool should be applicable from a command line, since a pipeline cannot handle a graphical user interface designed for human interaction only. Usable for Dutch (medical) free text means that a tool should be either specific to the Dutch language or language independent, depending on the role it fulfils in the pipeline.

The search for NLP tools that adhere to these requirements is based on Cornet et al., who surveyed possible NLP tools applicable to Dutch medical written free text.[9] This search is extended by searching the World Wide Web with a great variety of search terms to investigate if any other tools are available for one or more of the subtasks of an NLP pipeline: parsing, noun phrase recognition, NER, spell checking, WSD, abbreviation handling, and negation handling (see section 2.1). For each tool found in either of these two ways, I evaluated whether it met the set requirements of availability, usefulness, and language specificity.

The resulting set of tools can be used to construct an experimental pipeline to evaluate if any obstacles emerged in the way they are used or in the output each of them produces. This experimental pipeline is described as the result of the next subquestion.

4.2 Configuration options for a pipeline

This section describes the methods to come to an answer to the question: which configuration options affect the output of an NLP pipeline?

The tools that resulted from searching for currently available and usable NLP tools all have specific configuration options that affect their performance. Not all of these configuration options have an effect on the content of the output of that tool. Sometimes they simply affect the mode in which the tool runs (e.g. limiting used resources) or the representation of the output.

Fig. 1: The annotation process. Both the tagging of relevant concepts and the output of annotating them can be evaluated, which calls for a single outcome measure.

The configuration options were examined to determine which of the options affect the content of the output.

Besides these configuration options for the tools, the order in which the tools are executed, or which tools are left out, can have a large effect on the final output of the pipeline. Thinking from output to input, I composed an overview of the various sequences of tools that are possible to run.

Finally, the list of configuration options can be combined with the list of executing orders, resulting in a list of all possible ways the pipeline can provide output.

4.3 Outcome measures

Here, the methods for answering the following question are described: which outcome measures for the performance of the pipeline configurations are valid?

To answer this, a search for articles on studies with a comparable design was carried out: evaluating the output of an automated system, in the field of NLP. Before this thesis, a preliminary background search for literature on the subject was carried out. Using bibliography mining, commonly used outcome measures or consensus on best practice were identified.

Since the annotation process of the input text consists of both tagging relevant concepts (marking them for annotation) and the annotation itself, based on existing lexicons, the performance of these two aspects can both be measured (see figure 1). This requires a way to combine both measures into a single unit that is relevant for both aspects.

4.4 Golden standard

Since the found outcome measure requires the output of the pipeline to be compared to a golden standard in order to determine the pipeline’s performance, the following question needs to be answered: what is the golden standard to which an NLP pipeline can be compared?

As with any area within the field of artificial intelligence, the automated process' output should approximate the output that would result from a human executing the same process. To minimize the effect of human errors as well as of intra- and inter-rater variability in annotating medical terms using standardized lexicons, two experts in the field of medical annotation were asked to annotate part of the supplied EHR data (see section 3.2) using the three available lexicons (see section 3.1). This choice was based on similar golden standards found in literature on NLP. No medical staff was used in the process, because they are often not familiar enough with annotating concepts with a standardized lexicon.


The outcome measures were applied to the output of the pipeline and the output of the golden standard to assess if this golden standard is valuable for analysis of pipeline performance.

4.5 Assessing a lexicon’s value in a pipeline setting

Even when all of the discussed elements of setting up a pipeline of NLP tools, with its configuration options and the evaluation of its performance, are valid, it would be of no use if the lexicons that are used to annotate the input texts cover too few of the relevant concepts. Therefore, an analysis of the degree of coverage gives context to the evaluation of a pipeline configuration's performance. Using the results of the annotation exercise that was carried out by the two medical annotation experts, the degree of coverage of each lexicon can be measured for the domain from which the input texts are generated.

The degree of coverage is defined as the number of noun phrases of the input text tagged as relevant for which an annotation was found, divided by the total number of noun phrases of the input text tagged as relevant:

\[ \text{degree of coverage} = \frac{\#\,\text{tagged noun phrases for which an annotation was found}}{\#\,\text{tagged noun phrases}} \]

For the three lexicons available in this study, the degree of coverage was measured for the ENT domain by analyzing the relevant noun phrases and their annotations by the experts for 79 randomly picked conclusions/diagnoses. This number made up exactly 5% of the conclusions/diagnoses that were made available for this research. No higher number was analyzed because of time constraints.
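As a sketch of how this measure could be computed once the experts' annotations are available, assuming each tagged noun phrase carries a weight of 1.0 (exact annotation), 0.5 (partial annotation, in the spirit of section 5.3.1), or 0.0 (no annotation found); the class and the sample values are illustrative:

// Illustrative calculation of a lexicon's degree of coverage from
// per-noun-phrase annotation weights (1.0 exact, 0.5 partial, 0.0 none).
public class Coverage {

    public static double degreeOfCoverage(double[] annotationWeights) {
        double found = 0.0;
        for (double w : annotationWeights) {
            found += w; // weighted count of tagged phrases for which an annotation was found
        }
        return found / annotationWeights.length; // divided by all tagged phrases
    }

    public static void main(String[] args) {
        // Hypothetical sample: four tagged noun phrases, one partial match, one miss.
        double[] weights = {1.0, 0.5, 1.0, 0.0};
        System.out.printf("degree of coverage = %.2f%n", degreeOfCoverage(weights)); // prints 0.63
    }
}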

4.6 Experimenting with NLP tools applicable in a pipeline

To validate that the answers to the subquestions are not only usable in theory, but also in practice, an experimental pipeline has been set up. The setup applies one of the possible configurations and compares its output to the output of the golden standard. Using the proposed outcome measures, an evaluation of the performance of the pipeline configuration is performed. The evaluation enables a validation of the methods proposed in this thesis.

This configuration is used in the experimental setup: use Hunspell as a spellchecker with a general Dutch dictionary as well as a medical Dutch dictionary. For correcting possibly misspelled words, a Levenshtein distance of 1 is used. The Alpino parser is utilized, and the tagged noun phrases are annotated using the Thesaurus Zorg & Welzijn.

5 Results

In this section, I will discuss the results that follow from the methods used for each subquestion (see section 4).

5.1 Usable NLP tools

From Cornet et al.[9], four NLP tools met the set requirements. Searching the World Wide Web did not yield any more available and usable tools. The four tools all are available without costs and can be executed from a Unix command line. They are the following:

Alpino8[48] is a Dutch language parser, which includes the subtasks of noun phrase recognition and NER. It distinguishes between the roles that a word can have within a sentence. Alpino takes a sentence as input and provides output in the form of a dependency tree, with every token of the sentence tagged with the syntactic role it has relative to the other tokens. It also provides a stemmed version of the token, if available. Although it is not specific to the medical domain, it provides potential value in the pipeline.

Frog9[49] is another Dutch language parser. It provides the same functionality as Alpino, but since they do not process their input in the exact same manner, they both provide different results.[49]

Hunspell10 is a spellchecker for various languages, including Dutch. It takes a token, checks if one of the selected dictionaries contains that token, and provides this as output. If the token is not found within the dictionaries, Hunspell provides suggestions based on the lexical distance between the token and the words available in the dictionaries. Hunspell also works with custom dictionaries. For this research project, I added the Dutch Myspell Medical dictionary by Hannie Steegstra11 to Hunspell.

JOrtho12 is another spellchecker. Because the dictionaries used by JOrtho are based on Wiktionary projects, the suggestions and outcome could differ from those of Hunspell.

In an experimental pipeline setup (see section 5.6), I have applied both Alpino and Hunspell. They are both confirmed usable in such a setting. The documentation of Frog and JOrtho suggests the same for those two.

5.2 Configuration options for a pipeline

To determine the best order and configuration of tools in the pipeline, it should be possible to configure how the pipeline processes data. The different configurations will then lead to potentially different outputs from processing the same input. I will first describe the inputs and outputs of the pipeline in section 5.2.1 and then the structure of the pipeline in section 5.2.2.

8 http://www.let.rug.nl/vannoord/alp/Alpino/
9 http://ilk.uvt.nl/frog/
10 http://hunspell.sourceforge.net/
11 http://archive.services.openoffice.org/pub/mirror/OpenOffice.org/contrib/dictionaries/README_nl_NL.txt
12 http://jortho.sourceforge.net/


5.2.1 Inputs and desired outputs

To think about a structure and configuration options for the pipeline, we need to know its inputs and outputs. The former are already defined by the conclusion/diagnosis data that are available for this research (see section 3.2), but the latter require some decision making. In the use case of this study, the mapping is done in light of the overarching research goal: being able to determine how well Dutch medical free text can be mapped to a coding or terminology system. Every step towards this goal is meaningful. That is why the outputs are focused more on lexical mapping and less on semantic mapping. I try to map concepts to similar lemmas or their synonyms found in the lexicons, and I ignore context for now, because of the scope of this research. I restrict the concepts suitable for mapping to findings and procedures, because these seem to be the most relevant concepts in a conclusion/diagnosis field.

Findings and procedures generally seem to be written in the form of noun phrases. A noun phrase is a part of a sentence that functions as a subject, object, or prepositional object. It can be one word or multiple words, at least one of which is a noun. For example fever, yellow fever, or fever for one week. From this point on, we will refer to findings and procedures as noun phrases.

5.2.2 Order of the pipeline components

Since the output of the pipeline should be the input text annotated with relevant concepts, the annotation process should come last. Before that, the relevant concepts should be identified by analyzing the role of the words in the conclusion/diagnosis texts. This means parsing the sentence and extracting the noun phrases. The spellchecking process should be integrated before the extraction of the relevant concepts, so as to be able to correctly parse the sentences. After each step, the output of the various possible tools should be standardized to be used as input for the next step in the pipeline. This results in a structure as shown schematically in figure 2.

5.2.3 Configurable parameters

There are two configuration options of the NLP tools that seem to significantly influence the output of a pipeline: the number of dictionaries used when spellchecking and the lexical distance used for correcting misspelled words (which can be expressed as the Levenshtein distance). Looking at the Levenshtein distance, however, it was easily observed that a distance of 3 or higher would over-correct words in so many cases that it cannot be outbalanced anymore by valid corrections. Since a Levenshtein distance of 0 would mean that not a single word is corrected, only 1 and 2 are options for this configuration setting if a spellchecker is used.
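For reference, the Levenshtein distance between two words is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. The sketch below shows the standard dynamic-programming computation; whether the pipeline computes this itself or filters the spellcheckers' own suggestions is an implementation detail not specified here.

// Standard dynamic-programming computation of the Levenshtein distance.
public class Levenshtein {

    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("sinus", "sinis")); // 1: one substitution
    }
}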

This presents us with the following varieties in the configuration settings if spellchecking is enabled.

• Spellchecking against a dictionary provides us with three options:

– Use a general Dutch dictionary

– Use a medical Dutch dictionary

– Use both a general and a medical Dutch dictionary

• Choosing the Levenshtein distance for correcting words provides two options:

– Use a Levenshtein distance of 1

– Use a Levenshtein distance of 2

When looking at the possible sequence orders for a pipeline, a greater number of varieties is discovered. The following option ranges can be combined.

• Spellchecking provides five options:

– Use no spellchecker

– Use only Hunspell

– Use only JOrtho

– Use both spellcheckers, first Hunspell and then JOrtho

– Use both spellcheckers, first JOrtho and then Hunspell

• Parsing provides two options:

– Use Alpino as a parser

– Use Frog as a parser

• Annotating relevant concepts with standardized concepts provides seven options:

– Use Pinkhof only

– Use Diagnosethesaurus only

– Use Thesaurus Zorg & Welzijn only – Use Pinkhof and Diagnosethesaurus

– Use Pinkhof and Thesaurus Zorg & Welzijn

– Use Diagnosethesaurus and Thesaurus Zorg & Welzijn

– Use Pinkhof, Diagnosethesaurus, and Thesaurus Zorg & Welzijn Permuting all these options, without the configuration options if no spellcheck-ing is applied, results in a total number of 350 configurations for runnspellcheck-ing the pipeline.
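A short sketch that enumerates these configurations and verifies the count; the option labels are illustrative, not the actual configuration syntax.

```python
from itertools import product

spellcheckers = ["Hunspell", "JOrtho", "Hunspell>JOrtho", "JOrtho>Hunspell"]
dictionaries = ["general", "medical", "general+medical"]
distances = [1, 2]
parsers = ["Alpino", "Frog"]
lexicons = ["P", "DT", "TZW", "P+DT", "P+TZW", "DT+TZW", "P+DT+TZW"]

# Configurations with spellchecking: 4 * 3 * 2 * 2 * 7 = 336
with_sc = list(product(spellcheckers, dictionaries, distances, parsers, lexicons))
# Without spellchecking, dictionary and distance do not apply: 2 * 7 = 14
without_sc = list(product(parsers, lexicons))

print(len(with_sc) + len(without_sc))  # 350
```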

5.3 Outcome measures

After reviewing the literature on evaluating NLP systems, a clear best practice emerges: recall and precision can be measured for system output when compared to the output of a golden standard.[50]


5.3.1 Handling partially descriptive annotations

Since it is possible that an annotated concept only partly describes the original noun phrase, the golden standard assigns a value of 1.0 or 0.5 when annotating the relevant noun phrases. A value of 1.0 is assigned when the annotated concept is deemed to mean exactly the same as the original noun phrase. A value of 0.5 is assigned when the annotated concept only partially represents the meaning of the original noun phrase; such a concept counts only as half of a found candidate when calculating recall and precision. This value-assigning method is partly based on core-term matching as described by Tsai et al.[51]

Whenever a formula below mentions a number of matches or noun phrases, this can include half points. For example, 2.5 matches is a possible value for the numerator or denominator.

5.3.2 Example

To clarify this, consider this example. The expert finds the noun phrase 'hivtest' relevant to encode, which is a procedure to test for HIV. In Pinkhof, he/she finds the concept 'hivtest'. Since this exactly matches the noun phrase, a value of 1.0 is given to the found concept. However, the expert also finds 'hiv', which is of course not a procedure, but a finding. Since this might still be interesting to encode if 'hivtest' had not been found (which can happen with an automated pipeline), the concept is given a value of 0.5 and counts as half of a found candidate. The value of 0.5 does not count if the full concept was annotated. A maximum of 1.0 can be assigned to a concept.

5.3.3 Recall, precision, and the F1-score

Recall is defined as the number of found candidates from a reference list (true positives), divided by the total number of candidates from that reference list (true positives + false negatives), or:

recall = TP / (TP + FN) [50]

Precision is defined as the number of found candidates from a reference list (true positives), divided by the total number of found candidates, even those outside of the reference list (true positives + false positives), or:

precision = TP / (TP + FP) [50]

The F1-score combines the precision and the recall as follows:

F1 = 2 · (recall · precision) / (recall + precision)
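In code, these measures could look as follows; a minimal sketch, in which the counts may be fractional because partial matches contribute 0.5 (see section 5.3.1).

```python
def recall(tp: float, fn: float) -> float:
    return tp / (tp + fn)

def precision(tp: float, fp: float) -> float:
    return tp / (tp + fp)

def f1_score(r: float, p: float) -> float:
    return 2 * r * p / (r + p)

# The tagging example from section 5.3.4: 2 of the 3 expert-tagged noun
# phrases were also tagged by the pipeline, with no extra pipeline tags.
r, p = recall(tp=2.0, fn=1.0), precision(tp=2.0, fp=0.0)
print(round(f1_score(r, p), 2))  # 0.8
```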

However, there are two functionalities of the pipeline for which recall and precision (and thus the F1-score) can be measured. The first is tagging the relevant noun phrases of the input text that the pipeline should annotate (selecting the relevant concepts). The second is finding the right annotation for the tagged concepts.


5.3.4 Applying recall, precision, and the F1-score to our case

The 'reference list' (the denominator in calculating recall) in the noun phrase tagging process consists of the noun phrases the experts tagged for matching (true positives + false negatives). The 'found candidates from that reference list' (the numerator in calculating both recall and precision) are the noun phrases from the reference list that the pipeline also tagged for matching (true positives). In this tagging process, the 'found candidates' (the denominator in calculating precision) are all the noun phrases that the pipeline tagged for matching, whether they were also tagged by the experts or not (true positives + false positives).

So in the process of tagging:

recall = (# of noun phrases tagged by the pipeline as well as by the experts) / (# of noun phrases the experts tagged for matching)

precision = (# of noun phrases tagged by the pipeline as well as by the experts) / (# of noun phrases the pipeline tagged for matching)

For example, if both 'hivtest' and 'aids' (both 1.0 points) were tagged by the pipeline as well as by the experts, but 'alternative therapy' was only tagged by the experts, the recall would be 2/3, or 0.67, since the pipeline tagged only two of the three concepts that the experts tagged. The precision would be 2/2, or 1.0, since every concept that the pipeline tagged was also tagged by the experts. The F1-score would be 2 · ((0.67 · 1.0) / (0.67 + 1.0)), or 0.80.

In the process of finding a match for a tagged noun phrase, the noun phrases incorrectly tagged by the pipeline (the false positives in the earlier process) are disregarded in all calculations. Keeping this in mind, the 'reference list' (the denominator in calculating recall) consists of the matches found by the experts (true positives + false negatives). The 'found candidates from that reference list' (the numerator in calculating both recall and precision) are the matches from the reference list that the pipeline also found (true positives). Here, the 'found candidates' (the denominator in calculating precision) represent all the matches that the pipeline found, including both the ones that equal a match by the experts and the ones that do not.

So in the process of matching (including only noun phrases that were in the reference list of the tagging process):

recall = (# of matches for noun phrases found by the pipeline as well as by the experts) / (# of matches for noun phrases found by the experts)

precision = (# of matches found by the pipeline as well as by the experts) / (# of matches for noun phrases found by the pipeline)

For example, if the experts found full matches for both 'hivtest' and 'aids' for 1.0 points each, but also regarded 'hiv' as a match worth 0.5 points, and the pipeline only found 'hiv' and 'aids' as matches, the recall would be 1.5/2.0, or 0.75. If, next to those matches, the pipeline also found 'alternating current', which was not regarded as a possible match by the experts, the precision would come down to 1.5/3.0, or 0.5, since the pipeline matched three concepts, but only 1.5 of them were valid matches. The F1-score would be 2 · ((0.75 · 0.5) / (0.75 + 0.5)), or 0.60.
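Plugging these numbers into the metric definitions reproduces the values; a minimal self-contained check:

```python
r = 1.5 / (1.5 + 0.5)  # recall: 1.5 of the 2.0 expert match points found
p = 1.5 / (1.5 + 1.5)  # precision: 1.5 valid points out of 3 pipeline matches
print(r, p, 2 * r * p / (r + p))  # 0.75 0.5 0.6
```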


Fig. 3: The outcome measures.

5.3.5 Defining a single outcome measure

As discussed in section 4.3, there are actually two types of outcome from both the pipeline process and the golden standard. Recall and precision can be measured for the way that relevant concepts are tagged for annotation in the input text, but they can also be measured for the correctness of the actual annotations. Because it is desirable to be able to compare different pipeline configurations, a single measure by which to evaluate a configuration is needed. We call this the overall relevance (OR) and calculate it by taking the concepts correctly matched by the pipeline (the numerator from the matching process calculations) and dividing that by the number of concepts tagged by the experts:

OR = (# of matches found by the pipeline as well as by the experts) / (# of noun phrases the experts tagged for matching)

This value of OR can be used to compare the performance of the different pipeline configurations (see figure 3), because it says something about how well the pipeline performed, looking only at the possible annotations according to the experts, and disregarding annotations that were found by the pipeline but are not a good match according to the experts.

When we take our example numbers from the previous sections, we would get 1.5/3.0, or 0.5 as an OR value.
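A one-line sketch of this calculation, using the example numbers:

```python
def overall_relevance(valid_pipeline_matches: float, expert_tagged: float) -> float:
    # Correct matches found by the pipeline, divided by the noun phrases
    # the experts tagged for matching (half points allowed).
    return valid_pipeline_matches / expert_tagged

print(overall_relevance(1.5, 3.0))  # 0.5
```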

5.4 Golden standard

The golden standard was set by two experts in the field of annotating medical data with the help of standardized terminology systems. They received the same test dataset as is used for running the pipeline experiment. The dataset was evenly split between the experts. They marked the findings and procedures that they deemed relevant and would want to encode. After this, they manually picked acceptable concepts from each lexicon related to these noun phrases. This was done using ITEM DiTo, a tool for searching concepts in structured lexicons.


The user enters a term in a search box, and ITEM DiTo lists concepts from the selected lexicon that are lexically close to the entered term. It also shows synonyms and related terms. The user clicks on the term that seems to carry the same meaning as the searched-for concept, and the annotation is saved to the database. The user is also able to assign a value of 1.0 or 0.5 to an annotation, for which ITEM DiTo was specifically altered. This way, Pinkhof, TZW, and the Diagnosethesaurus were more accessible for searching for appropriate annotations.

The two experts together found 742 concepts as annotations for relevant noun phrases from 158 input texts (conclusions and diagnoses, see section 3.2), which contained a total of 2648 words.

5.5 Assessing a lexicon's value in a pipeline setting

For Pinkhof, Diagnosethesaurus, and Thesaurus Zorg & Welzijn, the degree of coverage was measured. In 79 conclusions/diagnoses, the experts found a total of 200 relevant noun phrases. The degree of coverage is calculated by:

degree of coverage = (# of full matches + 0.5 · # of partial matches) / 200

Table 1 shows how many noun phrases the experts were able to annotate fully, partially, or not at all.

Tab. 1: The degree of coverage per lexicon as annotated by the experts

Lexicon   Fully annotated   Partially annotated   Not annotated   Degree of coverage
Pinkhof   104               37                    59              .61
DT        92                24                    84              .52
TZW       82                33                    85              .49
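As a check, the degree-of-coverage column of Table 1 can be recomputed directly from the counts:

```python
counts = {  # lexicon: (fully annotated, partially annotated)
    "Pinkhof": (104, 37),
    "DT": (92, 24),
    "TZW": (82, 33),
}
for lexicon, (full, partial) in counts.items():
    coverage = (full + 0.5 * partial) / 200
    print(lexicon, coverage)  # 0.6125, 0.52, 0.4925
```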

5.6 Experimenting with a rudimentary pipeline

A rudimentary pipeline was built to experiment with the proposed pipeline setup. In this setup, I used Hunspell as the spellchecker and Alpino as the parser. A Levenshtein distance of 1 was configured, and only the medical dictionary was utilized. I wrote the tools that were needed to make one tool's output suitable as the next tool's input, as well as the software needed to annotate the extracted noun phrases with the lexicons. I also wrote a Bash script to pipe the data between tools and to configure each tool.

The end result was a positive one. Inputting data in bulk resulted in automatically annotated concepts at the end. For 79 of the conclusions/diagnoses, the approximate running time was about 10 minutes for each configuration. Because of time constraints, I did not succeed in extracting the best annotation of all the found annotations, so I could not compare the output of the pipeline to that of the golden standard. However, nothing was found that indicates this last step would be impossible to automate.


6 Discussion

The results cannot be interpreted without taking their context and its implications into account. This section discusses the main findings of this study, followed by its strengths and weaknesses. Then the meaning of the study is discussed, followed by recommendations for future research.

6.1 Statement of principal findings

The goal of this thesis was to investigate which methods and measures can determine the quality level that currently available NLP tools provide to annotate Dutch medical free text, using a standardized medical lexicon or terminology system.

Looking at question (1), we have found four eligible tools to use in a pipeline setting: Alpino and Frog for parsing, and Hunspell and JOrtho for spellchecking.

Question (2) was answered by looking at the configurable settings of the pipeline, which include a Levenshtein distance for determining how close a correction for a spellchecked word must be to replace the original, as well as which dictionaries to include in the spellchecking. The pipeline's order of tools should also be configurable. Using the proposed pipeline setup, the pipeline can be run with all possible configuration combinations, generating multiple outputs from the same input.

Question (3) was answered with a measure for comparing the different pipeline outputs to each other, provided in the form of the overall relevance. The overall relevance can be derived from two F1-scores, which follow from the relevant concepts tagging step and the annotation step.

The golden standard was determined by answering question (4). In order for the outcome measure to be calculated, two human experts in annotating medical texts with standardized lexicons were asked to annotate part of the provided Dutch medical free texts with the available Dutch medical lexicons, using ITEM DiTo. This provided a usable golden standard.

Question (5) provided us with the degree of coverage of three available standardized Dutch medical lexicons in a pipeline setup. This showed us that the Diagnosethesaurus has a degree of coverage of 0.520 in the ENT domain of findings and procedures, while Thesaurus Zorg & Welzijn has a degree of coverage of 0.4925, and Pinkhof has a degree of coverage of 0.6125. This means that a large portion of the noun phrases that the experts deemed important enough to find a match for is not covered by the lexicons. However, many noun phrases are covered by the lexicons, and the adage 'something is better than nothing' applies here. Every noun phrase that is covered by any one of the lexicons brings us a step closer to matching the concept to a standardized medical coding system. The numbers should be interpreted with caution, though, since TZW brings quite a limiting factor to the table: its content mostly seems suited to the social domain. This could be regarded as a benefit if the text to be mapped also mostly comprises that same domain, but it can also be a limiting factor when taking data from multiple or other domains. This theoretical limiting factor seems to be of no great influence on the degree of coverage in our calculations, since the difference between TZW and the Diagnosethesaurus is just 0.03. Another important influence on the usefulness of one of the lexicons
