
Artificial Intelligence in the Doctor's Office

Decreasing the Administrative Burden With Natural Language Processing

Student

Marieke Meija van Buchem
Student number: 12242144

Email: m.m.van_buchem@lumc.nl, m.vanbuchem@amsterdamumc.nl

Mentors

MSc. Simone Cammel
Email: s.cammel@lumc.nl
LUMC, Department of Information Technology & Digital Innovation

MD. PhD. Martijn Bauer

Email: m.p.bauer@lumc.nl

LUMC, Department of Internal Medicine

Tutor

MSc. Erik Joukes

Email: e.joukes@amsterdamumc.nl

Amsterdam UMC, location AMC, Department of Medical Informatics

Location of Scientific Research Project

LUMC, Department of Information Technology & Digital Innovation
Albinusdreef 2
2333 ZA Leiden
The Netherlands

Practice Teaching Period


Contents

Summary
Chapter 1: Introduction
    Objectives and Research Questions
Chapter 2: Background
    Medical Consultation
    Spontaneous Speech
    Natural Language Processing
Chapter 3: Methods
    Recording
    Automatic Speech Recognition
    Knowledge Extraction
Chapter 4: Results
    Description of Data
    Word Error Rate
    Knowledge Extraction
Chapter 5: Discussion
    Main Findings
    Strengths and Limitations
    Future Research
    Conclusion
References


Summary

Introduction: Health care is becoming increasingly expensive while the administrative burden on physicians keeps growing; a digital scribe could help address both issues. A digital scribe is a system that transcribes a conversation between physician and patient, extracts knowledge from it, and turns it into a summary. In this project we present a proof-of-concept of a digital scribe that transcribes Dutch clinical conversations and extracts keywords from them.

Methods: We recorded and automatically transcribed conversations between physicians and patients at the Internal Medicine department. The resulting transcripts were used to fit a term frequency-inverse document frequency (TF-IDF) model, which we used to extract the most important words per conversation. The automatic transcripts were compared with gold standard transcripts to compute a word error rate (WER). The keywords were compared with a gold standard summary using the ROUGE-1 metric. We also examined the effect of the WER on the ROUGE-1 score.

Results: A total of 158 conversations were recorded. The average WER was 40.4%. The extracted keywords reached a precision of 0.44, a recall of 0.13, and an F-score of 0.18. A higher WER led to a much lower ROUGE-1 score.

Discussion: It is possible to transcribe and extract keywords from Dutch clinical conversations, although this is only the first step towards creating a digital scribe. Future research should focus on increasing the amount of data, improving the transcripts, and using smarter algorithms to extract a full summary from the conversations.

Keywords: automatic speech recognition; knowledge extraction; digital scribe; clinical conversations.



Chapter 1: Introduction

Health care is becoming more expensive every year (1). In the Netherlands, costs have increased by 250% over the past two decades and are expected to double in the next two (2). This is mostly due to improvements in health care, which lead to a higher life expectancy but also to more people suffering from one or more chronic diseases. The most expensive health care sector is hospital care, which takes up 31% of the total health care expenditure (2).

On the other hand, the administrative burden on medical personnel, especially medical specialists, is increasing. This is mostly due to the introduction of the electronic health record (EHR), which caused an increase in quality registrations and indicators for supervision and transparency (3). In the Netherlands, a questionnaire completed by over 3000 medical specialists (in training) revealed that specialists spend an average of 40% of their time on administration, of which they perceive only 36% as useful (3). The administrative burden in the Netherlands is such a major issue that a task force has been created ((Ont)regel de Zorg) and the problem has been placed high on the political agenda.

The Netherlands is not the only country struggling with administrative burden. An American study (4) looked into its impact on academic physicians and found clinical documentation to be one of the most burdensome tasks. Furthermore, physicians in this study reported that administrative tasks negatively influenced their work by impeding their ability to deliver high-quality care and decreasing their career satisfaction. A related study (5) examined the relationship between physicians' career satisfaction and patients' satisfaction with care: patients treated by physicians who were satisfied with their career were more positive about the care they received.

Apart from the additional administrative tasks the EHR has brought about, the switch from written to typed clinical notes has also changed communication between physician and patient. Shachak et al. (2009) studied the effect of EHR use on physicians' communicative behavior (6). Although physicians reported that the EHR reduced the cognitive load they experienced during clinical tasks, a large majority of physicians (92%) found that the EHR disturbed communication with the patient, with physicians looking at their screen 25-55% of the time. Furthermore, while written notes follow a narrative structure, the EHR requires discrete data items, which leads to a loss of information.

To summarize, while health care costs are rising, physicians are spending an increasing amount of their time on administration, partly due to the EHR. As a result, physicians are less satisfied with their career, and both physicians and patients feel that it negatively affects the quality of care by decreasing patient-centeredness. One possible solution that is gaining popularity in the United States is the medical scribe: a person (often a premedical student) who is present during consultations and carries out the registration and clinical documentation. Several studies have looked at the effect of using medical scribes to alleviate the administrative burden, and almost all show an increase in productivity and physician satisfaction (7-9). However, 1 in 6 working people in the Netherlands (approximately 1.3 million) already works in the care sector, and the shortage of health care personnel is expected to grow to 100,000 in the coming years (1). The feasibility of having a medical scribe in every consultation room is therefore questionable.


Nonetheless, with the current advancements in automatic speech recognition (ASR) and natural language processing (NLP), it might be possible to create a digital scribe. Our idea of a digital scribe would be a system that transcribes a conversation between physician and patient, extracts the most important information, and uses this to create a structured summary, which can be edited by the physician. The use of this digital scribe would (i) decrease the physician’s administrative burden, by reducing the time spent on clinical documentation, and (ii) improve patient-centeredness by decreasing the time physicians have to look at the computer monitor. This would shift the focus during a consultation back to the communication between physician and patient, improving this relationship and, hopefully, improving the care itself.

The demand for such a digital scribe is high: recent commentaries by Lin et al. (2018) and Verghese et al. (2018) describe the same issues as outlined above, expressing the need for artificial intelligence (AI) to alleviate physicians' burden (10, 11). A perspective by Quiroz et al. (November 2019) presents the challenges of creating a digital scribe and includes a list of at least 12 companies that are currently working on one (12).

Despite the popularity of the subject, literature on this topic is scarce. Klann and Szolovits (2009) presented a proof-of-concept of a system that can transcribe a conversation and extract symptoms from it (13). However, this system can only process one speaker, was tested in a controlled setting, and used headsets to record the conversation, making it unsuitable for clinical practice. Finley et al. (2018) present a system that performs ASR, knowledge extraction, and natural language generation (14), turning the conversation into a structured summary, but they do not present any results. Du et al. (2019) infer symptoms from manually transcribed clinical conversations and determine whether each symptom is present in the patient (15). Lastly, Amazon recently released Amazon Transcribe Medical (16), a system that transcribes a medical conversation, and Amazon Comprehend Medical (17), a system that extracts structured information from free-text clinical notes. As far as we know, however, Amazon has not published any article describing the techniques or performance of these systems.

Although the systems described above seem promising, most of them focus on extracting only specific information from the conversation instead of presenting an overview. Furthermore, none of them demonstrates a working system that combines ASR with knowledge extraction or summarization. Moreover, these systems only work with the English language, limiting their applicability in Dutch hospitals.

Objectives and Research Questions

We aim to create a system that records and transcribes a Dutch conversation between physician and patient, extracts the most important information, and combines this in a comprehensive, structured summary. The project consists of multiple phases:

1. Automatic speech recognition
2. Knowledge extraction
3. Summarization

In this article we describe a proof-of-concept of the first two phases (ASR and knowledge extraction), applied to outpatient visits at the Internal Medicine department. Our main objective is to analyze the performance of the system in these two phases to determine the feasibility of creating the whole system for the Dutch language. The main research question can be defined as follows:

• Is it possible to create a system that accurately extracts important information from a Dutch conversation between physician and patient, using existing ASR software and NLP techniques?

The research questions per phase can be defined as follows:

Automatic speech recognition:

• Can existing ASR software be used to accurately transcribe the conversation between physician and patient?

Knowledge extraction:

• Is it possible to extract the most important information from a Dutch conversation, with accuracy levels approximating those of English knowledge extraction algorithms?


Chapter 2: Background

Medical Consultation

The data used in this project is conversational data from medical consultations at the Internal Medicine Outpatient Clinic at Leiden University Medical Center (LUMC). LUMC is part of the tertiary care system, which provides highly specialized care. Patients visiting the Internal Medicine Outpatient Clinic have been referred by their general practitioner or a medical specialist from a different department or hospital. Our data includes intake consultations as well as follow-up consultations from the General Internal Medicine, Endocrinology, Geriatric Medicine and Infectious Disease sub-departments. The typical structure of an intake consultation is as follows (18):

Question clarification
• Reason for contact
• Chief complaint

Diagnostics
• Investigate the seven dimensions of the chief complaint (onset and chronology, position and radiation, quality, quantification, related symptoms, setting, transforming factors)
• Hypothesis testing
• Physical examination

Conclusion
• Summarize findings from phase 1 and 2
• Explain (differential) diagnosis
• Propose course of action
• Discuss execution of chosen course of action
• Conclude consultation

The physicians from the Internal Medicine Department have 45 minutes for an intake consultation. In those 45 minutes, they have to address all of the above components, type their findings into the patient's record, form a differential diagnosis, decide on a course of action, and build a relationship with the patient. Most consultations are follow-up consultations, which take 15 minutes and usually involve discussing diagnostic results or checking up on the patient. Consequently, follow-up consultations usually do not follow as clear a structure as the one presented above.

A study by Edwards et al. (2014) examined the quality of outpatient physician notes at the Internal Medicine Department (19). They report high heterogeneity in the notes, from the method of reporting and the length of the note to the presence or absence of important subsections. For example, in 10.8% of the notes the reason for visit was not described, and of the notes that included a new symptom, only 28.4% gave a full description of that symptom. Edwards et al. also express concern about the effect of the EHR on clinical note quality, stating that 'a lack of narrative, inappropriate use of copy-and-paste, increasing length and redundancy and poor formatting' are factors that degrade the quality of the clinical note.


Spontaneous Speech

The conversation between physician and patient that takes place during a consultation can be defined as spontaneous speech. Unlike prepared speech, which includes news reports and speeches, spontaneous speech is unplanned. As a result, it has some distinctive characteristics (20):

- Filled pauses ('eh', 'hm', etc.)
- Restarts
- Mispronunciations
- Ungrammatical constructions
- No clear punctuation

Because of these characteristics, spontaneous speech presents problems for automatic speech processing, which is mostly based on written text or prepared speech. Shriberg (2005) describes the following problems: recovering hidden punctuation, coping with disfluencies, allowing for realistic turn-taking, and perceiving more than words (21).

Recovering hidden punctuation

As stated before, spontaneous speech usually does not have clear punctuation. Unfortunately, many summarization algorithms require separate sentences as input, which creates a need for algorithms that detect sentence boundaries. This task is complicated by people's tendency to talk without pausing between sentences, while pauses, conversely, may indicate hesitation or disfluency rather than an actual sentence boundary.

Coping with disfluencies

Disfluencies include filled pauses, repairs, repetitions, and false starts. Because the large corpora on which models are trained mostly consist of written text, one way to deal with disfluencies is to remove them. This can be challenging, especially if the disfluency could also be a word in itself. If, for example, someone says 'is that a motor.. oh no it's a sailboat', it is clear to us that this person was going to say 'motorboat' but corrected himself. However, 'motor' is also a word in itself, which makes it hard for an algorithm to recognize it as a disfluency.

Allowing for realistic turn-taking

In spontaneous speech, like in consultations, multiple people are involved in the conversation. This makes it useful to define speaker turns: who says what at which point in time? The number of microphones used to record the conversation determines the difficulty of defining speaker turns: one microphone for everyone is more difficult than one microphone per person. However, even a conversation between two people recorded by two microphones can be challenging, because people tend to speak before their turn and make sounds such as 'hm' or 'yes' while the other person is talking. Because of these challenges, overlapping speech is often removed from the conversation altogether.

Perceiving more than words

During a consultation, patients or physicians might use facial expressions or gestures, like nodding and pointing, to express themselves. This information is not picked up by the microphone, while it may be essential. For example, a physician asks the patient if he is in pain and the patient nods, or the patient tells the physician that it hurts 'here' while pointing to his chest. In addition, a patient's emotions can be heard on the recording but will not show up in the transcript, even though they might be important to include in the report.

The four problems described above are demonstrated in a study (22) that examined the quality of existing ASR systems on clinical conversations. This study reports word error rates (WER) ranging from 34% up to 65%, meaning that up to 65% of the words were substituted, missing, or wrongly inserted. This illustrates the difficulty of transcribing spontaneous speech, let alone spontaneous medical speech. Moreover, the ASR systems in this study were tested on mock consultations, so the WER in actual clinical practice might be even higher.

Natural Language Processing

Natural language processing (NLP) consists of different techniques to preprocess, analyze, and extract information from text data. In this project, we use NLP techniques to preprocess the data, so this section will explain the different preprocessing steps used in NLP.

The main goal of preprocessing in NLP is to turn the input data into a format that can be processed by machine learning algorithms. As the input data in NLP is text, and algorithms only process numbers, the text has to be converted to numerical values without losing its meaning. The first step is to gather all the input data into a corpus, which is a collection of texts. In our case, the corpus includes all the transcripts of the clinical conversations.

The next step is removing noise from the corpus. Any word or character that is irrelevant to the question at hand can be removed. A commonly used technique is the removal of stop-words: to direct the algorithm's attention towards the informative words, all non-informative words, such as short function words or connecting words, are removed from the text. Multiple stop-word lists are available online per language, based on the most prevalent words in different corpora. Non-alphanumeric characters, such as '-', '!', and ':', can also be removed. This step is not always performed, as the punctuation, especially sentence boundaries, might be of significance.

Types of noise that are more difficult to remove are the different ways in which words can be written. One easy first step is lowercasing all letters, which ensures that 'Diabetes' and 'diabetes' are seen as the same word. However, words can also have various conjugations and word forms. For example, the sentences 'I have a head ache', 'my head is aching', and 'it aches' all include a different form of the word base 'ache'. If 'ache', 'aching', and 'aches' are given to a machine learning algorithm, they will be seen as three totally different words and represented as such. However, if we apply lemmatization, these words will all be turned into their word base or lemma, 'ache'.

After the noise has been removed, the text has to be split up into n-grams. An n-gram is a text fragment consisting of n words. The most used n-grams are unigrams (1-grams) and bigrams (2-grams). The sentence 'my head is aching' has four unigrams ('my', 'head', 'is', 'aching') and three bigrams ('my head', 'head is', 'is aching'). The splitting of text into n-grams is called tokenization, as the resulting text fragments are called tokens. From these tokens, a lexicon can be built: a list of all the unique tokens in the corpus. All the aforementioned preprocessing steps are aimed at reducing the size of the lexicon.
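As a minimal sketch of these steps, the Python fragment below lowercases a sentence, strips non-alphanumeric characters, removes stop-words, and produces unigrams and bigrams. The stop-word list here is a toy placeholder, not an official Dutch list.

```python
import re

# Toy stop-word list for illustration only; real Dutch lists contain
# a few hundred entries.
STOP_WORDS = {"mijn", "is", "het", "een", "ik", "heb"}

def preprocess(text):
    """Lowercase, drop non-alphanumeric characters, and remove stop-words."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """All n-grams of length n, joined into space-separated strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = preprocess("Mijn hoofd doet pijn!")  # -> ['hoofd', 'doet', 'pijn']
print(ngrams(tokens, 1))  # unigrams: ['hoofd', 'doet', 'pijn']
print(ngrams(tokens, 2))  # bigrams: ['hoofd doet', 'doet pijn']
```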


Chapter 3: Methods

Recording

Set-Up

We equipped four consultation rooms at the Internal Medicine Department at Leiden University Medical Center (LUMC) with recording devices: a Focusrite Scarlett 2i2 audio interface in combination with two Shure cardioid boundary microphones, to be able to separate the physician's speech from the patient's. The audio interface linked the microphones to the computer, where the physician could start and stop recording via a web-based application. This application was made in cooperation with Cloud Technology Solutions, a company specialized in creating Google Cloud-based applications. The current application consists of a front-end (Firebase) and a back-end (Google Cloud Platform). Figure 1 shows a scheme of the system's architecture. Once a recording is started, it is shown in the front-end (Firebase), while the back-end starts saving the audio file in Cloud Storage. After the recording has been stopped by the physician, the Cloud Functions are triggered and activate the Cloud Pub/Sub. This Pub/Sub sends a message to the App Engine, which contains the splitting and transcription algorithms. Once these algorithms have run, the results are saved in Cloud Storage and can be used for NLP via the Natural Language application programming interface (API). The transcript of the conversation is also fed back to the front-end (see Figure 2). The front-end shows the recording, the transcript, and the result of the analysis (not used in this proof-of-concept).
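To illustrate the hand-off described above, the sketch below shows how a Cloud Function fired by a Storage 'finalize' event could publish a Pub/Sub message for the App Engine worker. The project ID, topic name, and function name are hypothetical placeholders; the application's actual code is not reproduced in this report.

```python
import json
from google.cloud import pubsub_v1

# Hypothetical project and topic names, for illustration only.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "recording-finished")

def on_recording_finalized(event, context):
    """Background Cloud Function: fires when a new audio file is finalized
    in Cloud Storage and notifies the transcription worker via Pub/Sub."""
    payload = json.dumps({"bucket": event["bucket"], "name": event["name"]})
    publisher.publish(topic_path, payload.encode("utf-8"))
```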


Physicians

We included 10 physicians who held consultations in one of the four rooms and were willing to participate. They received a short training in how to use the application, after which they were able to apply the recording procedure in their consultations.

Patients

All patients who were seen by one of the 10 physicians were called a few days before their appointment to explain the project and ask for their consent to participate. If patients consented, they received a more extensive information letter and an informed consent (IC) form, which they could sign and bring with them to the appointment if they wanted to participate. We aimed to include all patients, except for non-Dutch-speaking and legally incapacitated patients. Once a patient presented the physician with their IC form, the physician could start the recording.

Privacy

The project was approved by LUMC’s medical ethics review committee. Before and during the project, we had regular contact with the security officer of our department (Information Technology and Digital Innovation), to ensure that the whole process of recording and storing patient conversations in the Google Cloud conformed to the national as well as the LUMC’s privacy and security regulations.

Figure 2: Screenshot of the front-end. The recording is shown at the top left, the 2-channel transcription on the right, and the result of the analysis (not used in this proof-of-concept).


Automatic Speech Recognition

Speech-To-Text

After the signed IC form was handed to the physician, the physician started a new recording, which he/she stopped at the end of the consultation. The two microphones together recorded a stereo file, where the left channel represented the physician’s voice and the right channel that of the patient. To improve the audio quality, the single audio file was split into the left and right channel, creating two distinct audio files. These audio files were separately processed, maximizing the volume by using a gain node for each channel.

In the next step, noise was removed from the audio files. People tend to speak at different volumes, which complicates the separation of voices. For example, if one person speaks louder than the other, this person's channel will have his or her voice clearly in the foreground with the other person's voice as background noise, while the other person's channel will contain two equally loud voices. The noise removal algorithm compared every millisecond of sound between the left and the right channel. If both channels were capturing sound at the same time, this could indicate background noise, so the algorithm muted the softer-sounding channel, keeping only one speaker at a time. The processed audio streams were stored and sent to the Speech-to-Text API of the Google Cloud.
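The gating idea can be sketched as follows. This is a simplified NumPy version under the assumptions stated in the text (1 ms windows, mute the softer channel); the production algorithm ran in the cloud back end and may differ in detail.

```python
import numpy as np

def rms(segment):
    """Root-mean-square energy of an audio segment."""
    segment = segment.astype(np.float64)
    return np.sqrt(np.mean(segment * segment)) if segment.size else 0.0

def gate_softer_channel(left, right, sample_rate, window_ms=1):
    """Compare the two channels window by window (1 ms by default) and
    mute whichever is softer, so each channel keeps only its own speaker."""
    left, right = left.copy(), right.copy()
    window = max(1, int(sample_rate * window_ms / 1000))
    for start in range(0, len(left), window):
        seg = slice(start, start + window)
        if rms(left[seg]) >= rms(right[seg]):
            right[seg] = 0
        else:
            left[seg] = 0
    return left, right
```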

The API detected Dutch words in the audio, creating a transcript of both audio streams. As the API did not detect punctuation, the speaker turns were taken to reflect sentence boundaries. Finally, the two separate transcriptions were combined into one transcript, where the left channel was referred to as speaker A and the right channel as speaker B.
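For illustration, a minimal transcription call with the Google Cloud Speech-to-Text Python client might look like the sketch below. The bucket URI and sample rate are placeholders, and the use of the long-running endpoint is an assumption (it is required for audio longer than one minute, which applies to most consultations); the project's actual back-end code is not shown in this report.

```python
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,  # placeholder; must match the audio file
    language_code="nl-NL",    # Dutch
)
# One mono channel (e.g. the physician's), already stored in Cloud Storage.
audio = speech.RecognitionAudio(uri="gs://example-bucket/consult_left.wav")

# Long-running recognition is required for audio longer than one minute.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)
transcript = " ".join(r.alternatives[0].transcript for r in response.results)
```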

Word Error Rate

To determine the accuracy of the transcripts, we calculated the word error rate (WER): a metric that compares two texts and counts the number of substitutions, deletions, and insertions, leading to a total number of word errors. We chose the WER because it is a commonly used metric for measuring the accuracy of a transcript, which facilitates comparison with other studies on automatic transcription. To calculate the WER for the ASR transcripts, we manually transcribed part of the conversations as the gold standard. Subsequently, we removed the stop-words from both transcripts, to correct for the many difficulties in transcribing spontaneous speech (as described in the Background) and to take into account only the transcription of informative words. We then matched the ASR and gold standard transcripts per speaker turn, calculated the WER per speaker turn, and computed a per-conversation average in which each speaker turn was weighted by its length.
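For reference, a straightforward implementation of this per-turn, length-weighted WER could look like the sketch below (the exact script used in the project is not reproduced here).

```python
import numpy as np

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed with the standard Levenshtein dynamic program over words."""
    d = np.zeros((len(reference) + 1, len(hypothesis) + 1), dtype=int)
    d[:, 0] = np.arange(len(reference) + 1)
    d[0, :] = np.arange(len(hypothesis) + 1)
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            sub = d[i - 1, j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / max(len(reference), 1)

def conversation_wer(turns):
    """Length-weighted average of per-turn WERs. `turns` is a list of
    (reference_words, hypothesis_words) pairs, one per speaker turn."""
    weights = [len(ref) for ref, _ in turns]
    wers = [word_error_rate(ref, hyp) for ref, hyp in turns]
    return float(np.average(wers, weights=weights))
```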

Knowledge Extraction

Preprocessing

Both the automatic and manual transcripts were preprocessed using NLP techniques. After changing all letters to lowercase, we removed non-alphanumeric characters, insofar as there were any, since punctuation was not included in the transcripts. Thereafter we performed lemmatization, stop-word removal, and tokenization using the NLP package from Stanford University (stanfordnlp (23)).
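A sketch of this preprocessing with the stanfordnlp package is shown below. The stop-word list is a toy placeholder; the list actually used in the project is not reproduced here.

```python
import stanfordnlp

# stanfordnlp.download('nl')  # one-time download of the Dutch models
nlp = stanfordnlp.Pipeline(lang="nl", processors="tokenize,pos,lemma")

STOP_WORDS = {"de", "het", "een", "en", "van", "dat", "ik"}  # toy placeholder

def preprocess(transcript):
    """Lowercase, lemmatize, and remove stop-words from one transcript."""
    doc = nlp(transcript.lower())
    lemmas = (word.lemma for sentence in doc.sentences
              for word in sentence.words)
    return [lemma for lemma in lemmas if lemma and lemma not in STOP_WORDS]
```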


Keyword Extraction

Term frequency-inverse document frequency (TF-IDF) is a technique used in information retrieval to calculate the importance of an n-gram in a specific text, compared to a corpus of texts. The term frequency (tf) is the total number of occurrences of a certain n-gram (i) in the current text (j). For example, the word 'have' will have a high tf score, as it is used often. The inverse document frequency (idf) is calculated by dividing the total number of texts (N) by the number of texts that contain the n-gram (df_i) and taking the logarithm. 'Have' will probably be present in almost every text, leading to a low idf score. The total tf-idf score is the product of tf and idf; see the full formula below. The resulting score is high for less common words that occur often in a text, while very common words get low scores.

In this project, we used TF-IDF to extract the most important n-grams per conversation. We chose this algorithm because it is simple, easily interpretable, and underlies many other knowledge extraction and summarization methods. We calculated the TF-IDF for n-grams of lengths 1 to 5 and fitted the final TF-IDF model on the preprocessed ASR transcripts (model 1).

After calculating the TF-IDF score for every n-gram in the lexicon, we extracted the n-grams with the highest TF-IDF score per conversation: the keywords of the conversation. We determined the optimal number of n-grams extracted per conversation by trying all numbers from 10 to 40 in steps of 5 and choosing the number that led to the most accurate keywords.

As the goal of a clinical note is to capture the most important clinical information, we filtered out all n-grams that consisted solely of common Dutch words (list extracted from the Corpus Gesproken Nederlands, or Corpus of Spoken Dutch). The full list contained 4997 Dutch words, sorted by prevalence. As the list also contained clinically relevant words like 'bacteria', 'coughing', and 'psychiatrist', we searched for the optimal length of the list of common Dutch words, again by trying all numbers (from 300 to 1000, in steps of 50) and choosing the number that led to the most relevant keywords.

Afterwards, we removed all overlapping n-grams, keeping only the longest one, as it contains the most information. For example, if 'ct scan', 'ct lung', and 'ct scan lung' were all present, only 'ct scan lung' would be included in the final set of n-grams. The n-grams that remained were used as the conversation's keywords. See Figure 3 for a visualization.

w_(i,j) = tf_(i,j) × log(N / df_i)
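A sketch of this extraction step with scikit-learn is shown below. Note that sklearn's TfidfVectorizer uses a smoothed variant of the formula above, and `transcripts` is assumed to hold the preprocessed conversations, one string each; the project's own code is not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# `transcripts`: one preprocessed string per conversation (assumed available).
vectorizer = TfidfVectorizer(ngram_range=(1, 4))  # n-grams of length 1 to 4
tfidf = vectorizer.fit_transform(transcripts)     # shape (n_docs, n_ngrams)
vocabulary = vectorizer.get_feature_names_out()

def top_keywords(doc_index, top_n=25):
    """The top_n n-grams by TF-IDF score for one conversation, before the
    common-word filter is applied."""
    row = tfidf[doc_index].toarray().ravel()
    best = row.argsort()[::-1][:top_n]
    return [vocabulary[i] for i in best if row[i] > 0]

def drop_overlapping(ngrams):
    """Keep only n-grams not contained in a longer extracted n-gram,
    e.g. keep 'ct scan lung' and drop 'ct scan'."""
    kept = []
    for ng in sorted(ngrams, key=len, reverse=True):
        if not any(f" {ng} " in f" {k} " for k in kept):
            kept.append(ng)
    return kept
```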


Gold Standard

To measure the accuracy of the extracted keywords, a medical student listened to all the audio files and extracted the most important information, which functioned as the gold standard. We chose a student who was in her final year of medical school and had multiple years of experience in conducting medical consultations. We defined the most important information as the reason for contact and the chief complaint, as all consultations include at least one of these. Apart from this, she also extracted the most important information in the following categories: general anamnesis, symptoms, conclusion, and course of action. As the keywords were extracted from the ASR transcript, we instructed the student to write down the most important information in the exact phrasing used by the patient or physician.

We did not use the physicians' reports because of the issues with clinical notes described in the Background. Furthermore, the keywords were extracted literally from the transcript, whereas the wording in the physicians' reports was not necessarily the phrasing used during the conversation.

Metrics

To measure the accuracy of the extracted keywords, we used the ROUGE-1 metric (11). ROUGE-1 measures the accuracy of an automatic summary (in our case the keywords) by comparing it to a gold standard summary and counting the number of matching unigrams. It consists of two parts, precision and recall, which are combined in a harmonic mean called the F-score. In our case, the precision represents the percentage of keywords that match a word in the gold standard. The recall, on the other hand, represents the percentage of words in the gold standard that match a keyword. The F-score ranges from 0 (no overlap between keywords and gold standard) to 1 (total overlap). See Figure 4 for a visualization of ROUGE-1. To gain insight into what kind of keywords the algorithm extracts, we also calculated the scores per category (reason for contact, chief complaint, symptoms, general anamnesis, conclusion, and course of action).
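A minimal sketch of this computation is given below; since ROUGE-1 counts unigram matches, the multi-word keywords are first flattened into unigrams.

```python
def rouge_1(keywords, summary_words):
    """ROUGE-1 between extracted keywords and a gold standard summary.
    `keywords` may contain multi-word n-grams; both inputs are flattened
    to sets of unigrams before matching."""
    keyword_unigrams = {w for kw in keywords for w in kw.split()}
    summary_unigrams = set(summary_words)
    matches = len(keyword_unigrams & summary_unigrams)
    precision = matches / len(keyword_unigrams) if keyword_unigrams else 0.0
    recall = matches / len(summary_unigrams) if summary_unigrams else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```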

Figure 3: Visualization of the process of keyword extraction. The automatic transcript is preprocessed, after which the TF-IDF model is fitted on the preprocessed text. Per conversation, the top n keywords are extracted and filtered using the list of common Dutch words.


Effect of WER on ROUGE-1

Because the audio files were automatically transcribed, we were interested in the effect of the WER on the ROUGE-1 score. If the keywords extracted from a perfectly transcribed conversation yielded a much higher ROUGE-1 score, it would be worthwhile to work on improving the WER. We measured this by training two models on the subset of conversations that had been manually transcribed: one on the ASR transcripts (model 2a) and the other on the gold standard transcripts (model 2b).

See Figure 5 for an overview of the whole process and metrics.

Figure 4: Visualization of ROUGE-1. Precision = matches / number of keywords; recall = matches / number of words in the gold standard summary; F-score = 2PR / (P + R), where P is precision and R is recall.

Figure 5: Process description from conversation to keywords. The ASR transcript is compared with the gold standard (GS) transcript to compute the WER; the extracted keywords are compared with the GS summary to compute the ROUGE-1 score.


Chapter 4: Results

Description of Data

In total, we recorded 158 conversations between 10 physicians and their patients, from March 1st to November 30th, 2019. See Figure 6 for a flowchart of patient inclusion. The proportion of patients who refused to participate over the phone was 5.6% (n = 22). After the patients were called, it was up to the physician to collect the informed consent (IC) form and start the recording. For the 113 patients who consented by phone but were not recorded in the end, either the patient forgot to bring the IC form, the physician forgot to start the recording, or something went wrong in the recording process itself.

Because of time constraints, the medical student made a gold standard summary for 89 of the 158 recordings; the selection was made at random. Due to a problem with the Google Cloud Platform, 21 recorded conversations were not transcribed because they exceeded the maximum length the current architecture could handle. The final number of conversations with both an ASR transcript and a gold standard summary was 72. See Figure 6 for a full overview of the inclusion.

Figure 6: Inclusion flowchart. IC: informed consent form. GS: gold standard. KE: knowledge extraction. Of 395 patients called, 271 consented by phone (22 declined for privacy reasons, 102 were unreachable); 158 conversations were recorded (113 consenting patients were not recorded: missing IC form, physician forgot to record, or a recording error); 137 conversations were automatically transcribed (input for model 1); 89 received a GS summary and 20 a GS transcription; 72 conversations entered KE model 1 and 14 entered KE model 2 (input for model 2b).


The average duration of the 158 recorded conversations was 18:30 minutes (±13:11). The average duration of the 72 conversations included in the keyword extraction was 13:30 minutes (±8:05). Of the conversations included in the keyword extraction, 47 (65.3%) included a female patient; 51 (70.8%) were held with an internist, 19 (26.4%) with an endocrinologist, and 2 (2.8%) with an infectious disease specialist.

Word Error Rate

Of the 158 recorded conversations, 12.6% (n = 20) were transcribed verbatim. This selection was balanced across physicians and patient gender. The average WER of these 20 conversations was 40.4% (±8.6%), ranging from 26% to 62%.

Knowledge Extraction

We fitted the TF-IDF model on all 137 available transcripts. The optimal n-gram length range was 1 to 4, the optimal number of extracted n-grams was 25, and the optimal length of the common spoken Dutch list was 500. The final vocabulary consisted of 298,176 n-grams. The average number of keywords extracted per conversation was 9.1 (±2.3). An example of a gold standard summary and list of keywords can be seen in Figure 7. Because of the lemmatization, some keywords were turned into nonsense words. For example, 'milt' (spleen) was changed into 'millen', because it was interpreted as the third-person singular form of the non-existent verb 'millen'.

The precision of the extracted keywords was 0.44, meaning that 44% of the keywords were found in one of the six gold standard categories. The overall recall was 0.13, meaning that the keywords covered 13% of the gold standard summary. The overall ROUGE-1 F-score was 0.18. As can be seen in Table 1, not all conversations included all categories, and precision, recall, and ROUGE-1 score differed per category.

Gold standard summary:
Reason for coming in: Results MRI scan pituitary gland.
Symptoms: Regularly severe headache in the back.
Conclusion: No growth micro-adenoma. Tension headache.
Course of action: New MRI scan in 5 years.

Extracted keywords: viaan, mri scan, antibody, in a moment, further, micro-adenoma, ct scan lung, tension headache

Figure 7: Example gold standard summary (as written by the medical student) and the keywords extracted from the same conversation.


Table 1: Results per category. Available: number of conversations included per category. Length: number of words per category. Matches: number of keywords found in the text.

Category            Available  Length        Matches     Precision     Recall        F-score
Total summary       72         43.0 (±42.7)  3.9 (±2.0)  0.44 (±0.21)  0.13 (±0.13)  0.18 (±0.11)
Reason for contact  64         2.3 (±1.5)    0.8 (±0.8)  0.09 (±0.10)  0.37 (±0.39)  0.13 (±0.13)
Chief complaint     20         1.5 (±0.6)    0.5 (±0.5)  0.06 (±0.07)  0.33 (±0.41)  0.10 (±0.12)
General anamnesis   37         8.7 (±8.5)    1.1 (±0.9)  0.13 (±0.13)  0.17 (±0.19)  0.12 (±0.09)
Symptoms            60         22.9 (±36.7)  1.7 (±1.8)  0.19 (±0.22)  0.14 (±0.20)  0.11 (±0.09)
Conclusion          65         7.7 (±8.0)    1.5 (±1.4)  0.16 (±0.15)  0.25 (±0.28)  0.16 (±0.15)
Course of action    67         8.4 (±5.9)    1.7 (±1.5)  0.19 (±0.16)  0.23 (±0.22)  0.18 (±0.14)

Effect of WER on ROUGE-1

In total, 14 conversations had both a gold standard transcript and a gold standard summary. Models 2a and 2b were fitted with the same parameters as model 1. See Table 2 for the results. The number of matches, precision, recall, and F-score were all higher for model 2b than for model 2a.

Table 2: Comparison of the ASR model and the gold standard model. Length: total number of words in the gold standard summary. Matches: number of keywords found in the gold standard summary. Nr. of keywords: number of keywords extracted per model.

Model                     Length        Matches     Precision     Recall        F-score       Nr. of keywords
Model 2a (ASR)            51.1 (±43.9)  0.8 (±1.2)  0.11 (±0.16)  0.01 (±0.02)  0.02 (±0.03)  7.2 (±1.8)
Model 2b (gold standard)  51.1 (±43.9)  4.1 (±2.2)  0.43 (±0.24)  0.10 (±0.08)  0.15 (±0.10)  9.4 (±2.6)


Chapter 5: Discussion

Main Findings

In this study, we performed a proof-of-concept to see whether it is possible to extract important information from a Dutch conversation between physician and patient at the outpatient clinic of the Internal Medicine Department, using automatic speech recognition (ASR) and natural language processing (NLP) algorithms. This is the first step towards a system that automatically transcribes and summarizes the conversation between physician and patient. The main finding is that it is possible to extract relevant keywords from a conversation using TF-IDF, even with a high word error rate (WER). Although the ROUGE-1 F-score, representing the accuracy and completeness of the keywords, is low, this is mostly due to the recall score. With the current algorithm, which extracts only keywords instead of whole sentences, the precision score is more interesting to look at, because it reflects the relevance of the extracted keywords. With almost half of the keywords marked as relevant, we believe a promising first step towards a summarization algorithm for Dutch clinical conversations has been made.

When comparing our results to previous studies on summarization, it is notable that, while our recall and F-score are lower (0.13 and 0.18, respectively), our precision is relatively high (0.44). In a study by Gillick et al. (2009), summarization of meeting transcripts is performed using an Integer Linear Programming (ILP) approach (24); their maximum recall, precision, and F-score are 0.26, 0.33, and 0.28, respectively. Lee et al. (2009) use Non-negative Matrix Factorization (NMF) to summarize documents and reach a recall, precision, and F-score of 0.28, 0.36, and 0.31 (25). One of the reasons for our high precision and low recall is that we extracted keywords instead of sentences. By extracting only keywords, our 'automatic summary' is shorter than the gold standard summary and includes only the most important words. The precision score looks at the proportion of keywords that match words from the gold standard summary; with this difference in length, the chance of finding a keyword in the gold standard summary is high. The recall score, on the other hand, tries to match all the words in the gold standard summary to the keywords, which produces a lower score. To increase the interpretability of our current 'summary', it would be interesting to extract full sentences instead of keywords. Gillick et al. (24), as well as many other studies, have used this technique, called extractive summarization, to summarize texts. Because our data consist of spoken language that has been automatically transcribed, punctuation is missing, which makes the data unsuitable for this type of summarization. It is, however, possible to add punctuation during the transcription process, so this would be an interesting feature for the next version of the system.

Another reason for the low recall and F-score is the high WER, which seems to greatly influence the ROUGE-1 score. Even with the small number of gold standard transcripts, there was a large difference between the ROUGE-1 scores, with F-scores of 0.02 versus 0.15 (model 2a versus model 2b). Because the recall looks at the number of words in the gold standard summary that are covered by the keywords, it makes sense that a high WER greatly impairs the algorithm's ability to extract a complete set of keywords. As model 2b (trained on the gold standard transcripts) performed almost as well as model 1 (0.15 versus 0.18), we expect the ROUGE-1 score to improve even more when the WER decreases. Although there is not much literature on this topic, one study by Zechner and Waibel (2000) describes a similar relationship between the WER and summary accuracy (26). They use a Maximal Marginal Relevance (MMR) summarization method on automatically transcribed conversations and describe a 15% increase in accuracy for every 10% decrease in WER.

Because of the small number of gold standard transcripts, it is not possible to point to conversation characteristics that could have led to a lower WER. There are, however, a few factors which we believe led to the high WER. As described in similar studies (12, 27, 28), the volume of speech, the distance to the microphone, and the direction in which patients and physicians speak can differ markedly during the conversation, which in our case could have hindered the algorithm responsible for splitting the audio into a physician and a patient channel. In addition, the spontaneity of the speech, which includes incomplete words, pauses, self-edits, and disfluencies, could have impaired transcription even more. A study by Szaszák et al. (2016) describes a system that automatically transcribes and summarizes highly spontaneous speech (29). Their automatic transcription attains a WER of 44% while using an ASR system reported to have a WER of 10.5%. This is similar to our case: the Google Speech-to-Text application programming interface (API) is reported to transcribe speech with a WER of 5.6% but attained 40.4% here. It should be noted that this figure concerns the English API; no average WER could be found for the Dutch API.

A final interesting finding is the presence (or absence) of the categories in some of the conversations. Although most categories were present in the majority of the conversations, it is important to note that not all conversations included all categories. This is especially significant for future work on creating an automatic, structured summary. In this proof-of-concept we looked at the most important words in the whole conversation, but in a future version it might be better to extract keywords or sentences per category. One promising result is that the extracted keywords were distributed across all categories, giving a good overview of the conversation.

Strengths and Limitations

As far as we know, the present study is the only empirical investigation into a system that automatically transcribes and summarizes a conversation in the doctor's office. Moreover, because transcription algorithms and NLP are very much language-specific, we are the first to describe a system that could be implemented in the Dutch health care system. With rising health care costs and increased time spent on administration, we believe that even a system that makes suggestions or provides a draft summary that the physician can edit can decrease the burden on physicians. As a result, physicians will have more time to interact with the patient rather than with the patient's electronic health record (EHR), improving the experience for both of them and, hopefully, improving the provided care.

One limitation of our study is the small number of recorded conversations relative to the number of physicians and specialties that participated. It might have been easier to start with a small group of patients who could be classified according to their diagnosis; in our dataset, the variety of diagnoses was too high to create a classifier. On the other hand, the current study was a proof-of-concept, and due to this large variety we gained a deep understanding of clinical practice, which we believe will facilitate the next steps towards our goal. Furthermore, the small number of gold standard transcripts limited our ability to draw conclusions about the effect of the WER on the ROUGE-1 score. Although we have a good indication of the effect of the WER on the accuracy of the keywords, it would have been interesting to see how much the accuracy would improve with a large sample of gold standard transcripts.


One type of bias that might be present in the data is the Hawthorne effect (30). Physicians as well as patients might have acted differently because of the presence of the microphones, leading to an unrepresentative dataset. However, we do not believe this would have affected our results: our main goal was to extract keywords from a conversation, and a different conversation would simply lead to different keywords, without affecting their accuracy. As the Hawthorne effect might lead to differences in the conversations themselves, it would be important to examine it in future studies, especially when studying the effect of implementing such a system.

Future Research

As the project consists of automatic speech recognition, knowledge extraction, and summarization, we will discuss the implications and future research per phase. An additional section is dedicated to the clinical implications.

Automatic Speech Recognition

Although higher WERs seem inherent to spontaneous speech, the next phase of the project should focus on decreasing the WER. This can be approached from multiple angles: different microphones, an ASR system that performs better on spontaneous (medical) speech, or post-processing of the transcript by, for example, adding punctuation. By improving these aspects, we expect the ROUGE-1 score to increase.

For the next phase, we will try different ASR systems to see if the WER improves. It is also possible to add a speaker diarization algorithm to the system. This would make the use of two microphones superfluous, as such an algorithm is able to determine who says what without the need for multiple microphones. Shafey et al. (2019) describe a speaker diarization algorithm combined with an ASR system (28); although diarization improves, the WER stays the same. Hence, we would have to try such an algorithm to see if it improves the WER or ROUGE-1 in our case.

Correcting disfluencies and adding sentence boundaries have been studied by Liu et al. (2006) and Zayats et al. (2019), among others (31, 32). Zayats et al. propose a system that takes linguistic (textual) as well as prosodic (audio) cues into account for the correction of disfluencies. Liu et al. describe a system that, apart from disfluency correction, also restores punctuation. It would be interesting to incorporate such a system into our workflow to post-process the transcript. Restoring punctuation is especially important if we want to extract sentences instead of keywords.

Knowledge Extraction

There are multiple ways in which we can improve the knowledge extraction. Du et al. (2019) focus on extracting symptoms and their status from the conversation (33). This could be an interesting first step towards structured knowledge extraction, although Du et al. only determine whether a symptom is present, without elaborating. A different approach is assigning a topic to every sentence, which is called topic modeling. Park et al. (2019) present a neural network that can accurately perform topic modeling on clinical conversations (34). However, their clinical conversations consist of primary care office visits concerning mental health, making their labels inappropriate for our setting. It would be possible to use their modeling architecture on our own data, changing the topic labels to the categories of our summary (reason for contact, chief complaint, general anamnesis, symptoms, conclusion, course of action). All the sentences labeled with a certain category could then be aggregated, and the next step would be creating a summary per category.


Summarization

Our current model only extracts keywords, while the ultimate goal is to produce a full, structured summary. There are two ways in which this goal can be reached. The first is to try extractive summarization as described above (ILP, NMF, MMR). This could be a first step towards a more complete summary, but it will not lead to a structured one. We believe a model architecture called Bidirectional Encoder Representations from Transformers (BERT) might help us accomplish this (35). Such a model can be pre-trained on, for example, the complete Dutch Wikipedia, using the knowledge it acquires to form a language model. This language model can be used to understand relations between words and sentences, decide which words are synonyms, or even find the lay term for a given medical term. When pre-trained on the right corpus, this model could be used to form a graph of the patient's complaints, which could in turn be used to create a summary (abstractive summarization). Ideally, the model is connected to a medical ontology such as SNOMED CT (36) to automatically structure the data and present it to the EHR as such.

Finally, it is important to record more conversations and create more gold standard transcripts and summaries, to be able to train future models on our own dataset. As seen in the comparison between models 1 and 2a, increasing the amount of data seems to greatly improve the ROUGE-1 score. By combining a lower WER with a larger number of conversations, a valuable dataset of medical conversations can be assembled, forming a solid basis for future studies. If the recording of conversations does not yield enough data, it would also be possible to augment the data using an algorithm as described by Liu et al. (2019), who augment their dataset of nurse-patient conversations using a neural network (37).

Clinical Implications

Apart from these technical implications, it is important to study the effect of implementing such a system in clinical practice. As mentioned before, the presence of microphones in the doctor's office might affect the conversation: patients might hold back sensitive information and physicians might be extra careful with their choice of words, disturbing the intimate nature of the conversation. However, based on the very low proportion of patients who refused to participate (5.6%), we do not expect this to be an issue. Moreover, an increasing number of patients are recording their consultations (38), so our system could also benefit patients by making the recording available to them in the hospital's patient portal. In addition, once a working system is in place, the effect of using it on physicians' time spent on administration should be investigated. It would also be interesting to look at the difference in patients' experience of a conversation with and without the system, to see if the system improves physician-patient contact.

Conclusion

In this study, we developed a system that automatically records, transcribes, and extracts keywords from a conversation between physician and patient. Although both the WER and the ROUGE-1 score need to be improved, this proof-of-concept shows that it is possible to distill relevant keywords from Dutch medical conversations. With more data and smarter algorithms, we believe it is possible to alleviate the administrative burden on physicians, increase patient-centeredness, and improve care. The next steps in this project will be decreasing the WER, trying other algorithms to increase the applicability of the extracted keywords or sentences, and increasing the number of recorded conversations.


References

1. Centraal Bureau voor de Statistiek. Health expenditure; functions, financing [Data file]. 2019 Jun 21. Retrieved from http://statline.cbs.nl/StatWeb/publication/?VW=T&DM=SLNL&PA=80416NED&LA=NL

2. Rijksinstituut voor Volksgezondheid en Milieu. Volksgezondheid Toekomst Verkenning 2018: Een gezond vooruitzicht. 2018 Jun.

3. De Argumentenfabriek. Administratiedruk medisch specialisten [Internet]. 2017. Available from: https://www.demedischspecialist.nl/sites/default/files/20171117_DEF%20Rapport-administratiedruk-specialisten.pdf

4. Rao SK, Kimball AB, Lehrhoff SR, Hidrue MK, Colton DG, Ferris TG, Torchiana DF. The impact of administrative burden on academic physicians: results of a hospital-wide physician survey. Acad Med. 2017 Feb 1;92(2):237-43.

5. Haas JS, Cook EF, Puopolo AL, Burstin HR, Cleary PD, Brennan TA. Is the professional satisfaction of general internists associated with patient satisfaction? J Gen Intern Med. 2000 Feb 1;15(2):122-8.

6. Shachak A, Hadas-Dayagi M, Ziv A, Reis S. Primary care physicians’ use of an electronic medical record system: a cognitive task analysis. J Gen Intern Med. 2009 Mar 1;24(3):341-8.

7. Shuaib W, Hilmi J, Caballero J, Rashid I, Stanazai H, Tawfeek K, Amari A, Ajanovic A, Moshtaghi A, Khurana A, Hasabo H. Impact of a scribe program on patient throughput, physician productivity, and patient satisfaction in a community-based emergency department. Health Informatics  J. 2019 Mar;25(1):216-24.

8. Bank AJ, Gage RM. Annual impact of scribes on physician productivity and revenue in a cardiology clinic. Clinicoecon Outcomes Res. 2015;7:489.

9. Heaton HA, Nestler DM, Barry WJ, Helmers RA, Sir MY, Goyal DG, Haas DA, Kaplan RS, Sadosty AT. A Time-Driven Activity-Based Costing Analysis of Emergency Department Scribes. Mayo Clin Proc. 2019 Mar 1;3(1):30-4.

10. Lin SY, Shanafelt TD, Asch SM. Reimagining Clinical Documentation With Artificial Intelligence. In: Mayo Clinic Proceedings 2018 May (Vol. 93, No. 5, p. 563). Elsevier.

11. Verghese A, Shah NH, Harrington RA. What this computer needs is a physician: humanism and artificial intelligence. JAMA. 2018 Jan 2;319(1):19-20.

12. Quiroz JC, Laranjo L, Kocaballi AB, Berkovsky S, Rezazadegan D, Coiera E. Challenges of developing a digital scribe to reduce clinical documentation burden. NPJ Digit Med. 2019 Nov 22;2(1):1-6.

13. Klann JG, Szolovits P. An intelligent listening framework for capturing encounter notes from a doctor-patient dialog. BMC Med Inform Decis Mak. 2009 Dec;9(1):S3.

14. Finley G, Edwards E, Robinson A, Brenndoerfer M, Sadoughi N, Fone J, Axtmann N, Miller M, Suendermann-Oeft D. An automated medical scribe for documenting clinical encounters. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations 2018 Jun (pp. 11-15).

15. Du N, Chen K, Kannan A, Tran L, Chen Y, Shafran I. Extracting Symptoms and their Status from Clinical Conversations. arXiv preprint arXiv:1906.02239. 2019 Jun 5.

16. Amazon Transcribe Medical [Internet]. Amazon Web Services, Inc. 2019 [cited 14 December 2019]. Available from: https://aws.amazon.com/transcribe/medical/

17. Amazon Comprehend Medical [Internet]. Amazon Web Services, Inc. 2019 [cited 14 December 2019]. Available from: https://aws.amazon.com/comprehend/medical/

18. Schouten JAM. Anamnese en advies. Nieuwe richtlijnen voor de informatie-uitwisseling tussen arts en patiënt. 2nd ed. Alphen aan den Rijn/Brussel: Samson Stafleu; 1988.

19. Edwards ST, Neri PM, Volk LA, Schiff GD, Bates DW. Association of note quality and quality of care: a cross-sectional study. BMJ Qual Saf. 2014 May 1;23(5):406-13.

20. Ward W. Understanding spontaneous speech. In: Proceedings of the Workshop on Speech and Natural Language 1989 Feb 21 (pp. 137-141). Association for Computational Linguistics.

21. Shriberg E. Spontaneous speech: How people really talk and why engineers should care. In: Proceedings of the Ninth European Conference on Speech Communication and Technology 2005.

22. Kodish-Wachs J, Agassi E, Patrick Kenny III J. A systematic comparison of contemporary automatic speech recognition engines for conversational clinical speech. In: AMIA Annual Symposium Proceedings 2018 (Vol. 2018, p. 683). American Medical Informatics Association.

23. StanfordNLP 0.2.0 - Python NLP Library for Many Human Languages | StanfordNLP [Internet]. Stanfordnlp.github.io. 2019 [cited 14 December 2019]. Available from: https://stanfordnlp.github.io/stanfordnlp/

24. Gillick D, Riedhammer K, Favre B, Hakkani-Tur D. A global optimization framework for meeting summarization. In: Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing 2009 Apr 19 (pp. 4769-4772).

25. Lee JH, Park S, Ahn CM, Kim D. Automatic generic document summarization based on non-negative matrix factorization. Inform Process Manag. 2009 Jan 1;45(1):20-34.

26. Zechner K, Waibel A. Minimizing word error rate in textual summaries of spoken language. In: Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics 2000.

27. Chiu CC, Tripathi A, Chou K, Co C, Jaitly N, Jaunzeikare D, Kannan A, Nguyen P, Sak H, Sankar A, Tansuwan J. Speech recognition for medical conversations. arXiv preprint arXiv:1711.07274. 2017 Nov.

28. Shafey LE, Soltau H, Shafran I. Joint speech recognition and speaker diarization via sequence transduction. arXiv preprint arXiv:1907.05337. 2019 Jul.

29. Szaszák G, Tündik MÁ, Beke A. Summarization of Spontaneous Speech using Automatic Speech Recognition and a Speech Prosody based Tokenizer. In: Proceedings of the International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management 2016 Nov 9 (pp. 221-227). SCITEPRESS-Science and Technology Publications, Lda.

30. McCarney R, Warner J, Iliffe S, van Haselen R, Griffin M, Fisher P. The Hawthorne Effect: a randomised, controlled trial. BMC Med Res Methodol. 2007 Jul 3;7:30.

31. Liu Y, Shriberg E, Stolcke A, Hillard D, Ostendorf M, Harper M. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Trans Audio Speech Lang Process. 2006 Aug 21;14(5):1526-40.

32. Zayats V, Ostendorf M. Giving Attention to the Unexpected: Using Prosody Innovations in Disfluency Detection. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 2019 Jun (pp. 86-95).

33. Du N, Chen K, Kannan A, Tran L, Chen Y, Shafran I. Extracting Symptoms and their Status from Clinical Conversations. arXiv preprint arXiv:1906.02239. 2019 Jun.

34. Park J, Kotzias D, Kuo P, Logan IV RL, Merced K, Singh S, Tanana M, Karra Taniskidou E, Lafata JE, Atkins DC, Tai-Seale M. Detecting conversation topics in primary care office visits from transcripts of patient-provider interactions. J Am Med Inform Assoc. 2019 Sep 17;26(12):1493-504.

35. Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018 Oct 11.

36. SNOMED Home page [Internet]. SNOMED. 2019 [cited 15 December 2019]. Available from: http://www.snomed.org/

37. Liu Z, Lim H, Suhaimi NF, Tong SC, Ong S, Ng A, Lee S, Macdonald MR, Ramasamy S, Krishnaswamy P, Chow WL. Fast Prototyping a Dialogue Comprehension System for Nurse-Patient Conversations on Symptom Monitoring. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers) 2019 Jun (pp. 24-31).

38. Elwyn G, Barr PJ, Grande SW. Patients recording clinical encounters: a path to empowerment? Assessment by mixed methods. BMJ Open. 2015;5:e008566.

Abbreviations

AI - Artificial intelligence
API - Application programming interface
ASR - Automatic speech recognition
EHR - Electronic health record
IC - Informed consent
NLP - Natural language processing
WER - Word error rate
