
Healthcare information models (HCIM) data quality checker tool: Improving quality of BgZ registration in hospitals


Academic year: 2021


Healthcare information models (HCIM) data quality checker

tool: Improving quality of BgZ registration in hospitals

Steyn Kahrel

Master thesis

Name student: Steyn Kahrel
Student number: 10545417
Email address:

Mentor: W. Mennen

Tutor: F. Wiesman


ABSTRACT

This thesis aims to assess to what extent BasisgegevenssetZorg (BgZ) registrations can benefit from automated data quality assessment. Currently, data quality assessment of EHR systems is done by hand. Having high-quality data is essential for providing high-quality care in today's era of technology. Being able to assess data quality in an automated way makes these checks less time consuming for hospital staff. It was found that four dimensions are most commonly used when assessing data quality: completeness, accuracy, consistency and timeliness. Each dimension can be assessed in multiple different ways. The tool employs the data quality assessment methods element presence, validity checks, constraint adherence and timestamp comparisons.

To assess data quality in an automated way, a tool was developed which extracts data from a hospital's EHR and assesses the data quality. The tool was built to assess the quality of BgZ registrations. The BgZ consists of 28 zibs which comprise the patient data that is important to all healthcare providers along the care path. Zibs are healthcare information models which offer a standard for registering healthcare information. The 28 zibs consist of 276 data elements which are formatted using 13 datatypes. The tool is able to extract 213 of these 276 data elements and assess the accuracy of 108 data elements. There are several reasons for the tool's inability to assess all data elements, aside from not being able to extract the data. The main reason is that some of the datatypes used are harder to assess in an automated way. Another reason is the difference in answer options for coded data elements between the EHR and the BgZ.

The tool offers hospitals that use HiX as their EHR a way to automatically extract and assess BgZ registrations. Having a fast and simple way to assess the data quality of one's BgZ registrations may prompt hospitals to perform checks more often, identifying areas that require improvement and thus improving the overall data quality. Furthermore, being able to easily extract data, along with the hypothesised increase in data quality, will further the ability to exchange patient information, which may improve patient care across and within healthcare organisations. In conclusion, BgZ registrations can benefit from automated data quality assessment through improved identification of problem areas, providing hospitals with ways to improve how they register data and providing Nictiz and Registratie aan de Bron with feedback on where they might be able to improve the BgZ standard itself.


SAMENVATTING

The aim of this thesis is to determine to what extent BasisgegevenssetZorg (BgZ) registrations can benefit from automated data quality assessment. At present, the data quality of EHR systems is assessed by hand. High-quality data is essential for providing high-quality care. Being able to assess data quality in an automated way makes these tasks less time consuming for hospital staff. A literature study established that four dimensions are commonly used when assessing data quality: completeness, accuracy, consistency and timeliness. Each dimension can be assessed in several different ways. The tool uses the assessment methods element presence, validity checks, constraint adherence and timestamp comparisons.

To assess data quality in an automated way, a tool was developed that makes it possible to extract data from an EHR system and assess its quality. The tool was built to assess the quality of BgZ registrations. The BgZ consists of 28 zibs which together comprise the patient data that is important throughout a patient's entire care path. Zibs are information models with which medical data can be modelled for the purpose of data reuse. The 28 zibs consist of a total of 276 data elements formatted using 13 different datatypes. The tool is able to extract 213 of the 276 data elements and can determine the data quality of 108 of these elements. Apart from the inability to extract a specific data element, there are several reasons why the tool cannot assess the data quality of all data elements. The main reason is that some datatypes are harder to assess than others. Another reason is a difference in answer options for coded data elements between HiX and the BgZ.

The tool offers hospitals that use HiX as their EHR a way to extract data and assess the quality of BgZ registrations in an automated manner. Having a fast and simple way to assess the data quality of BgZ registrations may lead hospitals to check the state of these registrations more often. This could result in more frequent identification of areas that require improvement, thereby improving the overall data quality of BgZ registrations. In addition, being able to extract data from the system more easily, together with the improved data quality, will make it easier to exchange patient information within and between hospitals. In conclusion, BgZ registrations can benefit from automated data quality assessment because problem areas are identified more often. This gives hospitals ways to work specifically on improving how they register data, and gives Nictiz and Registratie aan de bron feedback for improving the BgZ itself.


TABLE OF CONTENTS

1.1. Introduction ...5

1.2. Description of the SRP Project ...6

1.3. Research Questions ...6

1.4. Thesis format ...6

2. Chapter 2: Data quality assessment ...7

2.1. Introduction ...7

2.2. Methods ...8

2.3. Results ...9

2.4. Discussion ... 14

3. Chapter 3: Information exchange ... 15

3.1. Introduction ... 15
3.2. Methods ... 16
3.3. Results ... 17
3.4. Discussion ... 21
4. Chapter 4: Tool ... 22
4.1. Introduction ... 22
4.2. Methods ... 22
4.3. Results ... 23
4.4. Discussion ... 27

5. Chapter 5: Thesis Conclusion ... 28

5.1. Discussion ... 28

5.2. Conclusion ... 31

References ... 32

Appendix ... 36

Appendix 1: Definition BgZ based on Zibs release 2017... 36

Appendix 1.1: Dutch. ... 36

Appendix 1.2: English... 37

Appendix 2: Template tool output... 38

1.1. Introduction

With the increase of digitalisation of information in healthcare, it has become increasingly important to register information in a standardised format. To improve the registration of information within the healthcare process, Nictiz and the programme Registratie aan de bron devised so-called zibs (zorginformatiebouwstenen). Zibs are building blocks designed to standardise how and what information needs to be registered in the healthcare process [1, 2]. The goal of this standardisation is to increase semantic interoperability, to further information exchange within the healthcare process but also for the benefit of, for example, quality registrations and research [2]. For simplicity, semantic interoperability is defined as "What is sent by system 1 is the same as what is understood by system 2" [3]. The Zibs are developed in conformance with Detailed Clinical Models (DCM), which are based on ISO TS 13972 [4].

Within the Zibs there is a subcategory called "BasisgegevenssetZorg" (BgZ): the Zibs that make up the minimum set of patient information that is of importance to all specialties and professions involved in the care process [5, 6]. The elements building up the BgZ are shown in appendix 1.¹ How and where the BgZ is registered differs between electronic health records (EHRs). Currently there is no easy way to extract the data or assess the quality of BgZ registrations in an automated way. This results in extra work for hospitals when delivering information to quality registrations and transferring patient information between hospitals, because the information has to be extracted and assessed by hand. Ensuring high data quality is important for healthcare organisations, as low-quality data can lead to medical errors. Being able to assess the quality of BgZ registrations in an automated way could help improve their overall quality, increasing semantic interoperability and assisting in the process of quality reviews.

¹ For this study the BgZ specification based on the Zib release 2017 was used, because at the time of writing it was the most recent release.

1.2. Description of the SRP Project

The goal of this project was to determine whether BgZ registrations could benefit from automatic data quality assessment. To reach this goal literature was searched to determine what data quality is and how the different elements of it can be assessed both in general and in the context of the BgZ. Next, a tool was developed to automatically assess the data quality of BgZ registrations.

The tool was developed to work in hospitals that use HiX as their EHR. HiX is an EHR developed by ChipSoft. This EHR was chosen because it is in use in the majority of hospitals in the Netherlands [7]. The pilot hospital for development of the tool was Leiden University Medical Centre (LUMC). This hospital was chosen because it has several projects in progress regarding the BgZ, so its experiences so far helped with the development of the tool.

1.3. Research Questions

This thesis is built up to answer one main question. To answer the main question there are four sub-questions which together will provide the insights needed to answer it.

Main question

• To what extent can BgZ registrations benefit from automatic data quality assessment?

Sub-questions

• How can data quality be assessed?

• What elements of BgZ-registrations are suitable for automatic data quality assessment?
• What are the capabilities and limitations of the assessment made by the tool?

• What are the practical issues which hinder implementation of the tool?

1.4. Thesis format

Chapters 2 and 3 will focus on data quality and the BgZ, exploring assessment methods and dimensions with which data quality can be assessed in general and in the context of the BgZ. Chapter 4 will focus on the tool that was developed, assessing its accuracy and limitations and the practical issues related to implementing automated data quality assessment. Finally, chapter 5 will close the thesis with conclusions and discussion points.


2. CHAPTER 2: DATA QUALITY ASSESSMENT

2.1. Introduction

Good quality data is necessary for healthcare providers to deliver the best possible care. Turchin et al. found that discrepancies in electronic medication prescriptions can lead to adverse events [8]. Sittig and Singh reported similar results and looked beyond prescription errors: they reported that any errors regarding health information, human or otherwise, can be potentially harmful to patient safety when not caught or handled properly [9]. Murphy et al. concluded that misinformation not only leads to medical errors, but is also time consuming for healthcare providers, resulting in less time for actual patient care [10].

In healthcare especially, being able to ensure that data is of good quality is essential for patient care and research. McCormack et al. found that clinicians are more likely to use a clinical decision support system (CDSS) if they believe that the data used by the CDSS is of high quality; clinicians especially want data to be complete, accurate and reliable [11]. Ahmadian et al. found that the lack of standardized data was a major obstacle in CDSS development and implementation. Furthermore, they reported that the success of any CDSS depends on the quality of the data it utilizes and that the majority of data quality issues stem from issues with semantic interoperability [12].

Not just in the field of CDSS is it important to have good quality data. Zwaanswijk et al. found that the quality of patient data was a major issue perceived by healthcare professionals regarding electronic information exchange [13]. Furthermore, Simpao et al. reported on the importance of data quality when exchanging patient data among anaesthesiologists. They stated that the quality of data in anaesthesia information management systems is crucial for anaesthesiologists to properly perform their duties [14].

To develop an automated data quality assessment tool, it must first be established what dimensions data quality consists of and how those dimensions can be assessed. This chapter aims to answer the sub-question "How can data quality be assessed?" and to show its importance.

2.2. Methods

First, literature was searched to establish what data quality is and what dimensions it consists of. To find articles multiple online sources were searched. These sources were Pubmed, Google Scholar and UvA CataloguePlus. Search terms like “Data quality”, “Data quality dimensions”, “Information quality”, “Health information quality” and other paraphrased versions of those terms were used. Articles that were published before the year 2000 were used as background information and to provide context but were not used in the analysis. Relevance of the articles was determined based on the abstracts. The dimensions of data quality that were found in the articles were then summarised in a table. The dimensions were then counted to see which dimensions were most commonly used to identify the main dimensions of data quality.

Second, after the different dimensions of data quality were identified, literature was searched to find methods of assessment for the identified dimensions. The same sources were searched as when searching for the dimensions. Some of the previously found articles already described methods of assessment; additional articles were searched using the search terms "Data quality assessment (methods)", "Information quality assessment (methods)", "EHR data quality assessment (methods)" and other paraphrased versions of those terms. The assessment methods found were then summarised in a table along with the dimension of data quality they assess. Lastly, the assessment methods that were found were reviewed to determine which high-level datatypes they are suitable for assessing and put in a table with the corresponding datatypes.

2.3. Results

2.3.1. Data quality dimensions

A great deal has been written regarding different data quality dimensions, and many authors use different dimensions when assessing data quality. The dimensions found in the literature are summarised in table 2.1.

Table 2.1: Data quality dimensions.

Dimension | Reference(s) | Number of references
Completeness | 17, 21, 19, 45, 23, 20, 16, 40, 15, 41, 42, 30 | 12
Accuracy (Validity, Correctness) | 17, 18, 19, 45, 20, 16, 40, 15, 41, 42, 30 | 11
Consistency (Concordance) | 17, 19, 46, 20, 16, 40, 15, 41, 42, 30 | 10
Timeliness (Currency, Volatility) | 17, 18, 19, 45, 20, 26, 12, 15, 42, 30 | 10
Plausibility (Credibility) | 17, 19, 16 | 3
Comparability | 18, 30 | 2
Granularity | 19, 30 | 2
Accessibility | 41, 30 | 2
Understandability | 41, 30 | 2
Security | 41, 30 | 2
Privacy | 41, 30 | 2
Fragmentation | 19 | 1
Signal-to-noise | 19 | 1
Structuredness | 19 | 1

As can be seen in table 2.1 there are four dimensions of data quality that are used by nearly all authors which are:

• Data Completeness
• Data Accuracy
• Data Consistency
• Data Timeliness

These four dimensions of data quality recur most often in the literature on this topic and are therefore examined most closely.

Data accuracy is often defined as the extent to which data are correct, reliable, and certified free of error. In other words, accuracy is the proximity between a value and the "true" representation that the value is supposed to represent [15-20]. There are two types of accuracy: syntactic accuracy and semantic accuracy. Syntactic accuracy is defined as the proximity of a value to the elements of the corresponding definition domain. Semantic accuracy is defined as the proximity of a value to its true value [15,20]. Semantic accuracy is sometimes referred to as correctness.

For example, suppose the name of a patient is Steve, but in his medical record his name is registered as Carl. In this example the registered name is syntactically accurate, as the definition domain consists of names and Carl is a name. However, the registered name is semantically inaccurate, as it is not close or equal to the true value. Semantic accuracy is typically more complex to measure, especially when it concerns medical data, because the help of a medical expert is required to assess whether a data element is in fact correct. Syntactic accuracy, on the other hand, is easier to assess, because the definition domain is all that is needed to assess a value's proximity to it [15,20].
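The distinction between the two types of accuracy can be sketched in a few lines of Python. This is a hypothetical illustration, not part of the tool described in this thesis; `KNOWN_NAMES` stands in for the definition domain of names.

```python
# Syntactic accuracy: is the value a member of the definition domain?
# Semantic accuracy: is the value equal to the true value?
KNOWN_NAMES = {"Steve", "Carl", "Anna"}  # hypothetical definition domain


def syntactically_accurate(value: str, domain: set) -> bool:
    """A value is syntactically accurate if it belongs to the definition domain."""
    return value in domain


def semantically_accurate(value: str, true_value: str) -> bool:
    """A value is semantically accurate if it matches the true value."""
    return value == true_value


# "Carl" is a valid name (syntactically accurate) but is not the
# patient's real name (semantically inaccurate).
registered, truth = "Carl", "Steve"
```

As the sketch shows, the syntactic check needs only the definition domain, whereas the semantic check needs the true value, which for medical data usually requires an expert.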

Data completeness is commonly defined as: “the extent to which data is of sufficient depth and breadth for the task at hand” [16-21,23]. It is important to understand that data completeness is context driven. This means that a certain value may be missing simply because an event did not occur [16]. For example, in the context of alcohol use, it is generally required to record a start and/or stop date for when a patient started and/or stopped drinking. However, if a patient never drank alcohol in his/her life these data elements will not be registered. Therefore, a missing value may not mean a record is incomplete given a certain context. Data completeness is the most straightforward and most used dimension of data quality [17]. However, there is no general agreement on the proportion of missing data which is deemed acceptable [16].

Data consistency, also referred to as data concordance, refers to a set of semantic rules, most often called "constraints", that certain data elements have to conform to [15,16,19,20]. Commonly, there are two types of constraints: intra-relation constraints and inter-relation constraints. Intra-relation constraints are constraints that are placed on a single data element, whereas inter-relation constraints are placed between data elements. An example of an intra-relation constraint would be: "The value of age has to be between 0 and 120". An inter-relation constraint could, for example, be: "If a patient is taking insulin, they must have diabetes". Sometimes these constraints are built into a database, for example with foreign keys and primary keys, but this is not always the case [20].
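Both kinds of constraint are straightforward to automate. The sketch below uses hypothetical field names (not the record structure of any actual EHR) to check one intra-relation and one inter-relation constraint:

```python
# Constraint adherence sketch: intra-relation constraints apply to a single
# data element; inter-relation constraints relate two or more elements.
def check_intra(record: dict) -> list:
    """Intra-relation constraint: age must lie between 0 and 120."""
    errors = []
    if not (0 <= record.get("age", -1) <= 120):
        errors.append("age out of range")
    return errors


def check_inter(record: dict) -> list:
    """Inter-relation constraint: a patient on insulin must have diabetes."""
    errors = []
    if ("insulin" in record.get("medications", [])
            and "diabetes" not in record.get("diagnoses", [])):
        errors.append("insulin without diabetes diagnosis")
    return errors


patient = {"age": 130, "medications": ["insulin"], "diagnoses": []}
# This hypothetical record violates both constraints.
```

In a real system such rules would be read from a configuration rather than hard-coded, so that new constraints can be added without changing the checker itself.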

Data timeliness is composed of data currency and data volatility. Data currency refers to whether a data point was registered and/or made available within a given timeframe of data collection. Data currency can also refer to whether a certain value is the most recent value available at any given time, or whether a value was measured recently enough to be medically relevant [15,16,20,23]. Data volatility is characterised by how frequently data elements change over time; in other words, data volatility indicates how long a certain value stays relevant given the context [15,16,20]. For example, if a certain data point has a high volatility (the data point is frequently subject to change), the registered data has to be current to be considered timely.
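A timestamp comparison for currency could look like the following sketch. The volatility windows are assumed illustrative values, not ones taken from the BgZ or from any clinical guideline.

```python
from datetime import date, timedelta

# Timeliness sketch: a value is considered current if it was measured within
# the element's volatility window (how long a value stays relevant).
VOLATILITY_DAYS = {
    "blood_pressure": 30,   # assumed: highly volatile, must be recent
    "blood_type": 36500,    # assumed: essentially never changes
}


def is_timely(element: str, measured: date, today: date) -> bool:
    """Compare the measurement timestamp against the element's window."""
    window = timedelta(days=VOLATILITY_DAYS[element])
    return today - measured <= window
```

A high-volatility element such as a blood pressure reading thus fails the check sooner than a stable element such as blood type, matching the definition above.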

2.3.2. Assessment methods

After identifying the different dimensions that data quality consists of, literature was searched on how to assess these dimensions. There are two main categories of assessment methods: quantitative and qualitative. Quantitative methods commonly use statistical approaches to assess the quality of data. Qualitative methods include, for example, interviews in which data users are asked about their perception of the data quality. Table 2.2 shows the different assessment methods for each of the four main dimensions of data quality.

Table 2.2: Assessment methods for each of the four main dimensions of data quality.

Dimension | Assessment methods
Completeness | Compare to gold standard, Distribution comparison, Element presence (percentage of missing data), Data element agreement (Internal consistency checks), Compare to alternate (trusted) data source (Triangulation among multiple sources), Interviews, Observations, Documentation reviews, Questionnaires
Accuracy | Compare to gold standard, Data element agreement (Internal consistency checks), Data source agreement (External consistency checks), Distribution comparison, Validity checks (Validation rules, Constraint adherence), Log review, Compare to alternate (trusted) data source (Triangulation among multiple sources), Interviews, Observations, Documentation reviews, Questionnaires
Consistency | Data element agreement (Internal consistency checks), Data source agreement (External consistency checks), Compare to alternate (trusted) data source (Triangulation among multiple sources), Validity checks (Validation rules, Constraint adherence), Interviews, Observations, Documentation reviews, Questionnaires
Timeliness | Compare to gold standard, Element presence (percentage of missing data), Data element agreement (Internal consistency checks), Compare to alternate (trusted) data source (Triangulation among multiple sources), Log review, Interviews, Observations, Documentation reviews, Questionnaires, Availability of data at a certain deadline, Timestamp comparisons

A commonly used method to assess completeness and accuracy is comparison to a gold standard. This method involves having a dataset of known data quality against which the dataset to be assessed is quantitatively compared. Comparing to a gold standard gives an accurate assessment of the completeness and accuracy of a dataset; however, acquiring a gold standard dataset to compare against can be difficult [20]. Another commonly used method to assess the completeness of a dataset is called "element presence". Element presence is a quantitative method: the percentage of missing values is measured for all variables in a dataset and compared to a predefined threshold percentage to determine the level of completeness [19].
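Element presence can be sketched as follows. The records are hypothetical, and the 0.9 threshold is an assumed value, since the literature reports no general agreement on an acceptable proportion of missing data.

```python
# Element presence: fraction of records in which a variable is filled in,
# compared against a predefined completeness threshold.
def element_presence(records: list, variable: str) -> float:
    """Return the fraction of records in which `variable` is present."""
    present = sum(1 for r in records if r.get(variable) not in (None, ""))
    return present / len(records)


records = [
    {"name": "A", "birth_date": "1980-01-01"},
    {"name": "B", "birth_date": ""},          # registered but empty
    {"name": "C"},                            # element missing entirely
    {"name": "D", "birth_date": "1975-05-05"},
]

THRESHOLD = 0.9  # assumed acceptable completeness level
complete_enough = element_presence(records, "birth_date") >= THRESHOLD
```

Here `birth_date` is present in only half of the records, so it falls below the assumed threshold and would be flagged as incomplete.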

A method that can be used to assess completeness, accuracy and consistency is called "data element agreement". This method uses internal consistency checks to assess these different dimensions of data quality. For example, completeness can be assessed by counting the number of missing values given the value of a different variable (patients with diabetes but no glucose medication prescriptions). Accuracy and consistency can be assessed by checking the value of a variable given the value of another variable (patients with low blood pressure receiving beta blockers are probably inaccurate and inconsistent recordings) [19].
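The diabetes example above can be written as a simple internal consistency check. The record structure is hypothetical and kept deliberately minimal:

```python
# Data element agreement: flag records where one variable is implausible or
# incomplete given the value of another variable.
def diabetes_without_glucose_meds(records: list) -> list:
    """Completeness check: diabetics with no glucose-lowering medication."""
    return [r["id"] for r in records
            if "diabetes" in r["diagnoses"] and not r["glucose_meds"]]


records = [
    {"id": 1, "diagnoses": ["diabetes"], "glucose_meds": ["metformin"]},
    {"id": 2, "diagnoses": ["diabetes"], "glucose_meds": []},   # flagged
    {"id": 3, "diagnoses": ["hypertension"], "glucose_meds": []},
]
```

Note that such a check only identifies records worth reviewing; whether record 2 is actually incomplete would still need a clinical judgement.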

Accuracy and consistency can furthermore be assessed with a method called "validity checks". This method uses validation rules to assess whether a value is accurate and consistent. A validation rule to assess accuracy could, for example, be "The value of the variable Age has to be between 0 and 120". A validation rule to assess consistency could be "When variable X equals A, the value of variable Y must be between B and C" [20]. Consistency can also be assessed using a method called "data source agreement". This method compares two records of the same patient and checks whether the values of the different elements of the records are consistent with each other. This method can also be used to assess completeness by checking whether any element that is available in one record is also available in the other [23].
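A minimal sketch of data source agreement, assuming two hypothetical records of the same patient coming from different sources:

```python
# Data source agreement: shared elements of two records must agree
# (consistency), and an element present in one record should also be
# present in the other (completeness).
def source_agreement(rec_a: dict, rec_b: dict):
    """Return (disagreeing shared keys, keys present in only one record)."""
    disagreements = [k for k in rec_a.keys() & rec_b.keys()
                     if rec_a[k] != rec_b[k]]
    missing = list(rec_a.keys() ^ rec_b.keys())  # completeness gaps
    return disagreements, missing


hospital_a = {"blood_type": "A+", "allergy": "penicillin"}
hospital_b = {"blood_type": "0+"}
```

In this hypothetical example the two sources disagree on the blood type, and the allergy element is missing from one of them, illustrating both uses of the method.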

Timeliness can be assessed using a few different methods. "Log review" involves checking system logs to see whether data was made available in a timely manner. A similar method involves comparing timestamps of different data elements to see if they match a known date of incidence [18, 19]. In addition to these quantitative methods, timeliness can also be assessed in a qualitative way. The most commonly used qualitative data quality assessment methods are interviews and questionnaires. These methods ask data users, either in an interview or a questionnaire, about their perceptions of the different dimensions of data quality. These qualitative methods can therefore be used to assess all four main dimensions of data quality [20]. Other qualitative methods are comparing data to observations made and documentation reviews [19,20,23].

Additionally, there are some data quality assessment methods that are more comprehensive and assess data quality in general instead of measuring the separate dimensions. These methods are often a combination of the aforementioned methods. Some of these methods are: AIMQ [24], CDQM [20, 25], COLDQ [26], DQA [37], TDQM [28] and TQdM [29]. These methods are out of the scope of the current project because all they add is a process for assessing data quality, not new methods of data quality assessment.

Data can be registered in different ways. For example, some data is registered as numbers, other data is registered as free text and some data is registered as true or false. As can be expected some assessment methods are more suitable for assessing certain “datatypes” than other assessment methods. Table 2.3 shows the five main high-level datatypes along with the assessment methods that are suitable for assessing the corresponding datatypes. Some assessment methods are suitable for all datatypes as they simply compare the values in two different datasets (e.g. compare to gold standard) or because they simply check if a value is present (e.g. element presence).


Table 2.3: High-level datatypes with suitable assessment methods.

High-level datatype | Example data | Assessment methods
Boolean | True, False | Compare to gold standard, Element presence, Data element agreement, Data source agreement, Compare to alternate data source, Validity checks, Interviews, Observations, Documentation reviews, Questionnaires
Numeric | 123 | Compare to gold standard, Element presence, Data element agreement, Data source agreement, Compare to alternate data source, Validity checks, Interviews, Observations, Documentation reviews, Questionnaires, Distribution comparison
Enumerator | Low, Medium, High | Compare to gold standard, Element presence, Data element agreement, Data source agreement, Compare to alternate data source, Validity checks, Interviews, Observations, Documentation reviews, Questionnaires
Textual | Diabetes, Drinks on weekends | Compare to gold standard, Element presence, Data element agreement, Data source agreement, Compare to alternate data source, Validity checks, Interviews, Observations, Documentation reviews, Questionnaires
Timestamp | 02/07/2019 | Compare to gold standard, Element presence, Data element agreement, Data source agreement, Compare to alternate data source, Validity checks, Interviews, Observations, Documentation reviews, Questionnaires, Availability of data at a certain deadline, Timestamp comparisons, Log review

2.4. Discussion

The literature shows there is no single clear definition of data quality; however, four dimensions of data quality are mentioned frequently: completeness, accuracy, consistency and timeliness. Completeness is the extent to which data is of sufficient depth and breadth for the task at hand. Accuracy is the extent to which data are correct, reliable, and certified free of error. Consistency refers to a set of semantic rules, most often called "constraints", that certain data elements have to conform to. Timeliness refers to whether a data point was registered and/or made available within a given timeframe of data collection.

There are two main categories of data quality assessment methods: quantitative methods and qualitative methods. The most commonly used quantitative methods are: comparison to a gold standard, element presence, data element agreement and validity checks. The most commonly used qualitative methods are: interviews and questionnaires. Quantitative methods are used more often than qualitative methods when assessing data quality [30].

Qualitative methods often result in data that is difficult to assess in an automated way. Therefore, the focus for the remainder of this thesis will be on quantitative methods. Because quantitative methods provide data that are easier to analyse in an automated way, developing a tool that implements quantitative methods is a good starting point.

No single method assesses all four of the main dimensions of data quality. Therefore, multiple of the aforementioned data quality assessment methods must be combined if a full assessment of all four main dimensions is required.


3. CHAPTER 3: INFORMATION EXCHANGE

3.1. Introduction

The BgZ consists of multiple different data elements which can have different datatypes. As established in chapter 2, different datatypes require different assessment methods. Furthermore, some datatypes and assessment methods are more suitable for automated assessment than others. This chapter provides a summary of what the BgZ looks like and establishes which elements of the BgZ are suitable for automated assessment.

3.1.1. Benefits of good information exchange

The development of standards such as the BgZ is meant to improve the exchange of health information between healthcare providers. Improved health information exchange (HIE) processes can have a number of benefits. Kierkegaard et al. found that improved HIE leads to an increase in information completeness and quality, information availability and more timely information. It was found that these improvements had a number of benefits including but not limited to improved clinical care, improved workplace efficiency and improved assessment and planning [31].

LaBorde et al. found that although there are costs associated with implementing HIE solutions into existing workflows, the benefits outweigh the costs. They studied data from crossover patients between two hospitals and found that both short stay patients with more acute conditions and patients with chronic diseases that visited both hospitals over a longer period of time benefited from improved HIE. For patients with little time between visits efficient HIE can help in the prevention of duplicate testing. For chronic patients with longer times between visits historical data may be significant in identifying treatment needs [32].

Improved HIE provides patients with a more active role in care management. Furthermore, Mäenpää et al. found that both patients and healthcare professionals experienced benefit from improved regional HIE. For example, it was perceived that healthcare professionals had access to more complete and timely data and patients felt that they had more responsibilities and were more involved in the care process. Lastly, the study showed that HIE usage was recognised to have improved regional collaboration between different organisations in healthcare.

3.1.2. BgZ

The BgZ is a dataset of patient information that is almost always required for continuity of care. The BgZ was conceived by an initiative called ‘Registratie aan de bron’ (Registration at the source) to improve health information exchange. Registratie aan de bron is funded by several organisations from the Dutch healthcare sector, including but not limited to the Dutch Federation of University Medical Centres (NFU) and the Dutch Union of Hospitals (NVZ) [5]. The goal of this initiative, and of the BgZ in particular, is to promote clear and unambiguous registration of patient information, ensuring that data only has to be registered once and can then be reused for other purposes such as patient transfer and research.

The BgZ consists of several zorginformatiebouwstenen or Zibs. Zibs meticulously describe what needs to be registered about a certain subject in the care path. A Zib entails a number of agreements on a medical concept, such as a diagnosis or an operation. An example of such an agreement is that when a blood pressure reading is taken, the date of the measurement, the circumstances of the measurement and who performed the measurement must also be recorded. Zibs are created with the idea that the same information about a certain subject is relevant to all personnel involved in the care process, each having their own use for the data [1].

Registratie aan de bron introduced the BgZ in July 2016 after which an increasing number of parties have started working with it. In January 2018 the BgZ and Zibs were designated as the nationwide standard for exchange of healthcare information. Since January 2018 a number of programs have started to help hospitals and other healthcare facilities implement the BgZ in their respective electronic record systems [5].

3.2. Methods

To assess what elements of the BgZ are suitable for automated data quality assessment, first the latest BgZ specification [33] was summarised to determine what data and datatypes are specified in this standard. All BgZ elements were then put in a table along with their respective datatypes.

Second, all datatypes of the BgZ were matched with the data quality assessment methods found in chapter 2. This matching was done by comparing the datatypes from the BgZ to the high-level datatypes found in chapter 2. The assessment methods were then put in the table alongside the BgZ datatypes they are suitable for. Lastly, the articles that were used in chapter 2 were reviewed once more to assess which methods can be used in an automated way and which datatypes are suitable for automated assessment.

3.3. Results

The BgZ consists of 28 Zibs which are relevant to all healthcare professionals involved in the care process of a given patient. Examples of some of these Zibs are: patient information (name, date of birth, etc.), allergies and medication. The full list of Zibs building up the BgZ is shown in appendix 1.

Each Zib consists of a number of “data elements” that must be registered in order to comply with the BgZ. For example, the Zib “BodyHeight” consists of “HeightValue”, “HeightDateTime”, “Comment” and “Position”. Each of these data elements has a certain datatype attached which states how the data element should be registered. There are 13 different possible datatypes in the BgZ, which are shown in table 3.1. Table 3.2 shows each datatype and the number of data elements that utilise it.

Table 3.1: Possible datatypes in the BgZ (adapted from [46])

Datatype    Description
Container   Container. A collection of related elements, which can occur together one or more times.
REF         Reference. A reference to another HCIM.
ANY         Any. Any of the datatypes mentioned below can be used.
BL          Boolean. Data element that only represents the values 'Yes/No' or 'True/False'.
CD          Coded Descriptor. Data element that is coded with code systems like SNOMED-CT, LOINC, G-Standaard.
CO          Coded Ordinal. Data element that uses the same code systems as datatype CD. However, these elements have values on an ordinal scale (e.g. low – medium – high).
ED          Encoded Data. Data element that represents e.g. images or document blobs.
II          Instance Identifier. Data element that uniquely identifies a thing or object, e.g. a driving license number, a passport number, etc.
INT         Integer. Data element that contains any of the natural numbers, the negatives of these numbers, or zero.
PQ          Physical Quantity. Data element that expresses a quantity and a unit of measure, typically the result of a measurement. The units of measure are to be selected from the UCUM system [47].
ST          String. Data element that contains textual information.
TS          Timestamp. Data element that contains a date and/or a time.


Table 3.2: Datatypes and number of occurrences in BgZ

Datatype    Number of occurrences
Container   16
REF         42
ANY         3
BL          6
CD          85
CO          1
ED          1
II          4
INT         2
PQ          10
ST          67
TS          39
Error       0

In total there are 276 data elements in the different Zibs building up the BgZ. Some Zibs contain a large number of data elements, e.g. health professional, which contains 34 data elements. Other Zibs, such as marital status, contain only one data element. Table 3.3 shows the data elements of the Zib patient information and their respective datatypes. The data elements and respective datatypes of the entire BgZ are shown in appendix 3.


Table 3.3: Data elements with respective datatypes of the Zib patient information (Dutch).

1. Demografie en Identificatie
1.1 Patientgegevens              Datatype
Zib Patient
::Naamgegevens                   Container
    Voornamen                    ST
    Initialen                    ST
    Roepnaam                     ST
    Voorvoegsels                 ST
    Achternaam                   ST
    VoorvoegselsPartner          ST
    AchternaamPartner            ST
    Naamgebruik                  CD
::Adresgegevens                  Container
    Straat                       ST
    Woonplaats                   ST
    Gemeente                     ST
    Huisnummer                   ST
    Huisnummerletter             ST
    Huisnummertoevoeging         ST
    AanduidingBijNummer          CD
    Land                         CD
    Postcode                     ST
    Adressoort                   CD
    AdditioneleInformatie        ST
::Contactgegevens                Container
    Telefoonnummer               ST
    TelecomType                  CD
    NummerSoort                  CD
    EmailAdres                   ST
    EmailSoort                   CD
PatientIdentificatienummer       II
Geboortedatum                    TS
Geslacht                         CD
MeerlingIndicator                BL
OverlijdensIndicator             BL
DatumOverlijden                  TS
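A specification like the one in table 3.3 can be represented directly in code as element/datatype pairs, which is the first step towards automated element-presence checking. The sketch below is illustrative only: the dictionary covers a fragment of the Zib, and the function name and record format are hypothetical, not part of any existing tool.

```python
# Illustrative fragment of the Zib "patient information" spec as
# element -> datatype pairs (element names as in table 3.3).
ZIB_PATIENT_SPEC = {
    "Voornamen": "ST",
    "Achternaam": "ST",
    "Geboortedatum": "TS",
    "Geslacht": "CD",
    "PatientIdentificatienummer": "II",
}

def element_presence(record: dict, spec: dict) -> dict:
    """For every element in the spec, report whether the record holds
    a non-empty value (the 'element presence' assessment method)."""
    return {element: bool(record.get(element)) for element in spec}

record = {"Voornamen": "Carl", "Achternaam": "Jansen", "Geboortedatum": "1956-03-01"}
presence = element_presence(record, ZIB_PATIENT_SPEC)
# presence["Geslacht"] is False: the element is missing from the record
```

Because the check only asks whether a value is present, the same sketch works regardless of datatype.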

Chapter 2 showed the different data quality assessment methods and which datatypes they are able to assess. In the context of the BgZ, table 3.4 shows the datatypes of the BgZ and the different data quality assessment methods with which each of these datatypes can be assessed. The datatypes Container, ANY and Error were left out of this analysis: Error contains no relevant data, Containers are built up from multiple different datatypes and ANY can be any of the other datatypes.


Table 3.4: Datatypes of the BgZ with assessment methods suitable for each type.

Datatype    Data quality assessment methods
REF         Compare to gold standard, Data element agreement, Element presence
BL          Compare to gold standard, Data element agreement, Compare to alternate (trusted) data source, Data source agreement, Element presence
CD          Compare to gold standard, Data element agreement, Compare to alternate (trusted) data source, Data source agreement, Element presence, Validity checks
CO          Compare to gold standard, Data element agreement, Compare to alternate (trusted) data source, Data source agreement, Element presence, Validity checks
ED          Compare to gold standard, Interviews, Observations, Documentation reviews
II          Compare to gold standard, Data element agreement, Compare to alternate (trusted) data source, Data source agreement, Element presence
INT         Compare to gold standard, Data element agreement, Compare to alternate (trusted) data source, Data source agreement, Distribution comparison
PQ          Compare to gold standard, Data element agreement, Compare to alternate (trusted) data source, Data source agreement, Element presence
ST          Compare to gold standard, Data element agreement, Compare to alternate (trusted) data source, Data source agreement, Element presence
TS          Compare to gold standard, Data element agreement, Availability of data at a certain deadline, Timestamp comparisons, Log review, Interviews, Observations, Documentation reviews, Element presence

As can be seen in table 3.4, some data quality assessment methods are able to assess multiple datatypes. For example, comparing to a gold standard is suitable for each of the datatypes as it compares the values of two data sets irrespective of datatype. Similarly, data element agreement checks whether two values within a data set are in agreement and can therefore be used for multiple datatypes. Another data quality assessment method that can be used for multiple datatypes is element presence, which simply checks whether a value that is supposed to be present is actually present. In comparison, the assessment method distribution comparison requires numerical data and can therefore only be used to assess the datatypes INT and PQ (in the case of the BgZ).
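As a minimal illustration of why distribution comparison only applies to numerical datatypes such as INT and PQ, the sketch below flags a batch of measurements whose mean deviates strongly from an assumed reference distribution. The reference statistics and readings are illustrative values, not drawn from any real dataset.

```python
# Flag a batch of numerical values (datatype INT or PQ) whose mean lies
# far outside an assumed reference distribution. Reference statistics
# and readings are illustrative only.
from statistics import mean

def distributions_differ(values, ref_mean, ref_sd, tolerance=2.0):
    """True when the batch mean is more than `tolerance` reference
    standard deviations away from the reference mean."""
    return abs(mean(values) - ref_mean) > tolerance * ref_sd

# Systolic blood pressure readings in mmHg (illustrative)
batch = [118, 126, 131, 140, 122, 135]
print(distributions_differ(batch, ref_mean=125, ref_sd=15))  # False: comparable
```

A batch of free-text values has no mean or spread, which is exactly why this method cannot be applied to datatype ST.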

In terms of automated assessment, some datatypes are harder to assess in an automated way than others. Datatypes which consist of free text, primarily ST, are hard to assess if the value an item is supposed to have is not known (no standard or secondary data source to compare to). For example, the item “First Name” from the Zib patient information can contain the value “Carl”; without a secondary dataset confirming that this value is true, it is near impossible to assess its accuracy in an automated way. Contrarily, the accuracy of datatypes which consist of numerical data or coded elements, such as CD, CO, INT and PQ, is easier to assess in an automated way. For example, if an item with datatype CD has a value which is not in accordance with the coding system the item uses, then the value is incorrect.
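A validity check for a coded datatype can be sketched as a simple membership test against the value set of the code system. The value set below is a hypothetical example, not an actual BgZ or SNOMED-CT value set:

```python
# Membership test against a code system's value set: the essence of a
# validity check for coded datatypes (CD/CO). The value set below is a
# hypothetical example.
ALLOWED_GESLACHT_CODES = {"M", "F", "UN", "UNK"}

def valid_code(value, allowed):
    """A coded value is syntactically valid only if it occurs in the value set."""
    return value in allowed

print(valid_code("M", ALLOWED_GESLACHT_CODES))  # True
print(valid_code("X", ALLOWED_GESLACHT_CODES))  # False: not in the value set
```

Note that this establishes syntactic accuracy only: a value can be in the value set and still be wrong for the patient.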

Similar to datatypes, some data quality assessment methods are harder to implement in an automated way than others. For example, comparing to a gold standard can easily be done in an automated way; however, it requires a gold standard to be present and available. Assessment methods which produce qualitative data, e.g. interviews and observations, are harder to automate for the same reasons that free text is hard to assess in an automated way. The data quality assessment method element presence is the easiest to automate because the system simply has to check whether a value is present or not.

3.4. Discussion

The 28 Zibs that the BgZ consists of contain a total of 276 data elements. These data elements can have any of 13 different datatypes, ranging from free text (string) to coded elements (coded descriptor and coded ordinal). Of these 13 datatypes, 3 are not suitable for assessment or can be any of the other datatypes. Datatype CD is the most recurrent, with 85 occurrences.

Some assessment methods are suitable for multiple datatypes whereas other assessment methods are tailor-made for specific datatypes. For example, the assessment method “compare to gold standard” is suitable for any of the datatypes as it simply compares whether the data element in one dataset is the same as the data element in the gold standard. In comparison, the method “distribution comparison” requires numerical datatypes to establish a distribution and is therefore not suitable for assessing textual data.

Some datatypes and data quality assessment methods are more suitable for automated data quality assessment than others. For free text datatypes such as strings it is harder to assess the accuracy of a value than for numerical datatypes such as integers. Assessment methods where the value of an item is compared to the value of another item (internal or external) are easy to implement in an automated way, given that a secondary data source is present and available. Furthermore, quantitative assessment methods, such as distribution comparison or timestamp comparison, produce numerical output which can be assessed in an automated way. Contrarily, qualitative assessment methods are harder to automate due to the free text nature of their output.

Assessing completeness is possible for all elements of the BgZ as it only requires the assessment method element presence, which is suitable for all datatypes. Accuracy, consistency and timeliness are more difficult to assess due to the datatypes that their assessment methods are suitable for. Of the 13 datatypes found in the BgZ, seven are suitable for automated assessment. These seven datatypes account for 148 of the 276 (54%) total data elements in the BgZ.


4. CHAPTER 4: TOOL

4.1. Introduction

To get a picture of how the tool can contribute to improving BgZ registrations it is important to fully understand the capabilities and limitations of the tool that is being developed. Currently version 1.0 of the tool is finished. The tool is still undergoing further development, but for the purposes of this thesis version 1.0 will be used.

As discussed in chapter 3, some datatypes and data quality assessment methods are more suitable for automated assessment than others. But what happens when this theoretical information is applied in a practical situation? The aim of this chapter is to answer the last two sub questions, providing insight into the capabilities and limitations of the tool and into the issues behind those limitations.

4.2. Methods

4.2.1. Construction

The tool was developed in Microsoft Visual Studio, using the language VB.NET as a base. There are plans to replace the VB.NET base with C# code with the same functionality, to improve the performance, maintainability and reusability of the code base. Figure 4.1 shows all components of the tool. The base program, which from now on will be referred to as the toolbox, consists of two tools: “AuthorisationScanner” and “DataQualityChecker”. The AuthorisationScanner tool was developed in conjunction with the DataQualityChecker tool by another student of the master Medical Informatics and falls outside the scope of this thesis.

Besides the VB.NET base toolbox, the DataQualityChecker tool implements a number of MSSQL scripts and SQLite scripts. Before these scripts were made, the database structure of the EHR first had to be mapped. After initial development the tool was examined to establish what parts of the BgZ it is able to assess and what parts it is unable to assess. This examination resulted in a list of items that the tool is and is not able to assess. To assess the data quality of BgZ registrations the tool utilises three of the data quality assessment methods discussed in chapter 2.

To assess completeness the tool currently utilises element presence to determine whether a record is complete. The elements present are then compared to the requirements of the BgZ to assess the accuracy, consistency and timeliness of the data. Accuracy and consistency are checked using validity checks based on the BgZ, by filtering out improbable data values, e.g. a patient with an age of 200. The timeliness of the data is assessed by comparing the timestamps of extracted data elements. The results section of this chapter goes into further depth on how these methods are performed by the tool.
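A timestamp comparison of the kind described above can be sketched as follows; the field names, readings and the 24-hour window are hypothetical and not taken from the tool:

```python
# Timestamp comparison: pick the most recently registered value and
# check that two related elements were recorded within a set window.
# Field names, readings and the 24-hour window are hypothetical.
from datetime import datetime, timedelta

def most_recent(entries):
    """entries: (timestamp, value) tuples; return the latest value."""
    return max(entries, key=lambda entry: entry[0])[1]

def within_window(ts_a, ts_b, window=timedelta(hours=24)):
    """True when two registration timestamps agree within the window."""
    return abs(ts_a - ts_b) <= window

readings = [
    (datetime(2019, 1, 10, 9, 0), 120),   # older blood pressure reading
    (datetime(2019, 2, 3, 14, 30), 131),  # most recent reading
]
print(most_recent(readings))  # 131
```

Selecting by latest timestamp ensures the extracted value is the most recently registered one; the window check flags related elements that were not recorded together.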

4.2.2. Validation

During development the tool was continuously tested on a copy of the database from the LUMC, and the data from this database was used to further develop the tool. Because the tool is intended to work in any hospital that uses HiX, it was also tested in a secondary hospital: Alrijne hospital in Leiderdorp. This validation was done to see what problems arise when using the tool in a different hospital. The data used in both tests came from all patients who had been in the hospital in January or February 2019. Alrijne was chosen because ChipSoft sells two different packages of HiX to their customers: a standard package, which is identical in every hospital that uses it, and a package that allows hospitals to change parts of their EHR to their liking. Alrijne has the standard package; the LUMC has the adaptable package. During these tests the time the tool needed to assess both databases was recorded, as well as memory usage because of the large amounts of data.

4.3. Results

4.3.1. Tool

Figure 4.1 shows the components of the Furore HiX toolbox. The HiX toolbox component is the user interface of the tool; here the user selects which tool to run and what settings to use. The DataQualityChecker uses the settings it is given to extract data from the HiX database. The DB browser for SQLite is used to run transformations on the extracted data, and the data is visualised using a macro in Excel.

The process of the tool follows three steps to assess the data quality of BgZ registrations in hospitals that use HiX as their EHR. These three steps are:

- Data extraction
- Data assessment
- Data visualisation

Before these three steps can be performed the user has to input the time period from which data will be extracted. The data extraction phase uses a number of MSSQL scripts to extract the data from the hospital database into a SQLite database, so transformations can take place without the risk of altering any data in the hospital database itself. The assessment of the data quality is done at data element level. Over time each patient generates more data, and some data elements are updated more frequently (high volatility). Timeliness is assessed during extraction by matching the timestamps of the data elements (timestamp comparison). Matching timestamps ensures that the extracted values are the most recently registered values and that the values are in accordance with one another. After the data is extracted into the SQLite database, a number of SQLite scripts assess the data quality of the registrations.
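The extraction step can be illustrated with the sketch below. The real tool uses MSSQL scripts against the HiX database; here a few illustrative rows are loaded into a local SQLite database so that later transformations cannot touch the hospital data. Table and column names are hypothetical.

```python
# Load illustrative extracted rows into a local SQLite database; the
# hospital database itself is never written to. Schema is hypothetical.
import sqlite3

rows = [  # (patient_id, element, value, registered_at) - illustrative
    (1, "Geboortedatum", "1956-03-01", "2019-01-05"),
    (1, "Geslacht", "M", "2019-01-05"),
    (2, "Geboortedatum", None, "2019-02-11"),  # missing value
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE bgz_elements ("
    "patient_id INTEGER, element TEXT, value TEXT, registered_at TEXT)")
con.executemany("INSERT INTO bgz_elements VALUES (?, ?, ?, ?)", rows)
con.commit()
print(con.execute("SELECT COUNT(*) FROM bgz_elements").fetchone()[0])  # 3
```

Working on a separate copy in this way is what makes the assessment phase safe to run repeatedly.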

The SQLite scripts perform four tasks which form the data assessment phase. First, a count is made of the number of patients who had a visit or were hospitalised during the selected time period. Second, for each of the data elements of the BgZ the number of missing and present records is counted (element presence) to assess the completeness of the BgZ registrations. Third, elements with datatypes CD and CO are checked for values that are not allowed according to the BgZ and are therefore incorrect (validity checks). These elements are then counted to obtain the number of correct and incorrect values in the database. Other elements that have a set format for what a value should look like (e.g. dates and social security numbers) are assessed using regular expressions. Lastly, numerical data such as age and blood pressure readings are checked for values that are unlikely or impossible, e.g. a patient with an age of 200.
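These assessment tasks can be illustrated for a single data element with the hypothetical schema below; the sketch shows an element-presence count, a regular-expression validity check for dates and a range check for improbable ages. It is a sketch of the approach, not the tool's actual SQLite scripts.

```python
# Element presence count, regex validity check and range check for a
# single data element, against a hypothetical schema with illustrative rows.
import re
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bgz_elements (patient_id INTEGER, element TEXT, value TEXT)")
con.executemany("INSERT INTO bgz_elements VALUES (?, ?, ?)", [
    (1, "Geboortedatum", "1956-03-01"),   # present and well-formed
    (2, "Geboortedatum", None),           # missing
    (3, "Geboortedatum", "01/03/1956"),   # present but wrong format
])

# Element presence: present vs. missing records for one element
present, missing = con.execute(
    "SELECT SUM(value IS NOT NULL), SUM(value IS NULL) "
    "FROM bgz_elements WHERE element = 'Geboortedatum'").fetchone()

# Validity check: dates must match the yyyy-mm-dd format
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
valid = sum(1 for (v,) in con.execute(
    "SELECT value FROM bgz_elements WHERE value IS NOT NULL") if DATE_RE.match(v))

# Range check: filter out improbable numerical values
def plausible_age(age):
    return 0 <= age <= 130

print(present, missing, valid)  # 2 1 1
```

The per-element counts of present, missing, valid and invalid values are exactly what the visualisation phase aggregates per Zib.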

In the data visualisation phase the result of the assessment is exported to Excel by the toolbox. This results in a workbook with a sheet for each of the 28 Zibs building up the BgZ. After the export is complete the toolbox runs a macro to visualise the data according to the BgZ template found in appendix 2. A component diagram of the toolbox is shown in figure 4.1.

4.3.2. Outcomes tool

Of the 276 data elements found in the BgZ specification (appendix 4), the tool is able to extract 213 (77%). Of these 213 data elements, the tool is able to assess the accuracy of 108 (51%). This means the tool is able to assess the accuracy of only 39% of the 276 total elements found in the BgZ. For reasons of confidentiality no numbers regarding the data quality of the tested database are given.

At the LUMC, where the tool was developed, the latest version of the tool had a runtime of 12 minutes and 58 seconds with a maximum memory usage of 881 MB. When the tool was tested at the secondary hospital the maximum memory usage was 964 MB. However, the test at the secondary hospital was eventually cancelled because the tool had already been running for 1 hour and 57 minutes and other appointments had to be kept.

4.3.3. Issues with development

In theory, hospitals should register all data elements specified in the BgZ. However, the tool is unable to extract 23% of the total data elements specified in the BgZ, because those data elements are not registered in the HiX database. For example, the Zib LaboratoryTestResult contains data elements regarding specimen specifics. However, these details are registered in the separate lab's electronic record system and are therefore not directly extractable from the HiX database. Other examples of data elements which cannot be extracted are HouseNumberLetter, HouseNumberAddition and HouseNumberIndication. These data elements are all registered in a single field in the HiX database called HouseNumber. Furore has attempted to extract and split data from this field; however, because of the large variety of conventions regarding house numbers in the Netherlands, it appears impossible to extract these data elements separately.
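The difficulty can be illustrated with a sketch of such a splitting attempt. The regular expression below is hypothetical, not Furore's actual implementation; it handles simple cases but misreads common Dutch conventions such as 'bis':

```python
# Attempt to split HiX's single HouseNumber field into the separate BgZ
# elements. Hypothetical pattern; it handles simple forms but misreads
# many real Dutch addressing conventions.
import re

PATTERN = re.compile(r"^(?P<number>\d+)\s*(?P<letter>[A-Za-z])?\s*(?P<addition>.*)$")

def split_house_number(raw):
    match = PATTERN.match(raw.strip())
    if not match:
        return None
    return (match.group("number"), match.group("letter"), match.group("addition"))

print(split_house_number("12a"))     # ('12', 'a', '') - as intended
print(split_house_number("12 bis"))  # ('12', 'b', 'is') - 'bis' wrongly split
```

Every refinement of the pattern fixes one convention while breaking another, which is why the field could not be split reliably.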

As stated in chapter 3, it should theoretically be possible for the tool to assess accuracy for 54% of the total data elements of the BgZ. However, it was found that the tool is only able to assess 39% of the total data elements. There are multiple reasons for this difference. The first, obvious, reason is that some of the data elements cannot be extracted from the HiX database. The second reason is that for some of the data elements that are registered in the HiX database the datatypes differ from what is specified in the BgZ. For example, the data element AlcoholStartDate from the Zib AlcoholUse is specified in the BgZ as datatype TS. However, in the HiX database this field is a free text field. Because of this, it is common to see answers like “Only drinks in the weekend” or “Started when patient was 16”, and these answers are, according to the BgZ specification, always incorrect because a date is expected. The third reason is that in HiX some data elements with the datatype CD use different code systems than those specified in the BgZ. This means they are always incorrect according to the BgZ even though they may contain information that is valuable to patient care. Table 4.1 shows an example of such a case.


Table 4.1: Example of difference in use of code systems between the BgZ and HiX.

BgZ                                        HiX
Current drinker of alcohol (finding)       Yes
Current non-drinker of alcohol (finding)   No
Ex-drinker (finding)                       Sometimes
Lifetime non-drinker (finding)             Never
Other                                      Other

To add to the complexity of the problem, table 4.1 shows that the difference is not only in the choice of wording: some options differ completely in meaning as well. The different wordings and meanings make it impossible to fully map the options available in HiX to the options specified in the BgZ. Furthermore, within HiX itself there are differences in the options for these data elements as well. In some hospitals it is possible to create new questionnaires and, with them, new options for these data elements.
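A partial mapping based on table 4.1 can be sketched as follows; options whose meaning differs between the code systems are deliberately left unmapped. The dictionary is illustrative, not a mapping used by the tool:

```python
# Partial mapping of HiX answer options to BgZ code-system options for
# the alcohol-use element (cf. table 4.1). Illustrative only.
HIX_TO_BGZ = {
    "Yes": "Current drinker of alcohol (finding)",
    "No": "Current non-drinker of alcohol (finding)",
    "Never": "Lifetime non-drinker (finding)",
    "Other": "Other",
    # 'Sometimes' describes frequency while 'Ex-drinker (finding)'
    # describes history: the meanings differ, so no faithful mapping exists.
    "Sometimes": None,
}

def map_option(hix_value):
    return HIX_TO_BGZ.get(hix_value)

unmappable = [option for option, code in HIX_TO_BGZ.items() if code is None]
print(unmappable)  # ['Sometimes']
```

Any registration using an unmappable option cannot be converted without losing or distorting clinical meaning, which is why an automated check must count it as incorrect under the BgZ.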

Because of this, some of the time-consuming mapping process has to be done again. This also means that when using the tool on another database, for example from a different EHR system, that database would have to be mapped first. The initial mapping of the database took roughly two months.

During development and testing, the tool sometimes registered extremely high memory loads. Breaking up the extraction into one Zib at a time, as well as optimising the SQL scripts, reduced the memory load to acceptable levels. During the tests at both hospitals two months’ worth of data was extracted. A possible reason for the difference in runtimes between the two hospitals could be that the LUMC has superior hardware compared to the Alrijne hospital. This large difference in runtime is currently being investigated at Furore, as are possible optimisations to the tool itself to improve runtimes.

Another issue encountered during the test at the secondary hospital is that data elements obtained through questionnaires have different ids between the hospitals. This is because the LUMC does not have the standard package; its package allows the construction of custom questionnaires on top of the standard questionnaires provided by ChipSoft. These new questionnaires are then assigned new ids by HiX, so these new ids have to be identified before the tool can be run.

4.4. Discussion

The process of the tool can be described in three phases. The first phase is data extraction, which is done by a series of MSSQL scripts. The second phase is data assessment, which is done by a series of SQLite scripts in a SQLite database in order to avoid tampering with the original hospital database. The third and final phase is data visualisation, in which the data is exported to Excel and visualised using a macro.

The tool is able to assess the accuracy of 51% of the 77% of total data elements that it is able to extract from the LUMC HiX database. This means that it is only able to assess 39% of the total data elements for accuracy, whereas chapter 3 showed that automated assessment of the accuracy of the BgZ should theoretically be possible for 54% of the total data elements.

The main reason for the difference in theoretical and practical suitability of the data elements is that some data elements which should theoretically be suitable for automated data quality assessment cannot be extracted from the HiX database. Another reason is differing datatypes between HiX and the BgZ. A more complex problem is a difference in use of code systems between HiX and the BgZ. This difference makes mapping the answers found in HiX to the code systems found in the BgZ virtually impossible.


5. CHAPTER 5: THESIS CONCLUSION

5.1. Discussion

The goal of this chapter is to first answer the sub questions proposed in chapter 1 before summarising the answers to these questions into an answer to the main question: To what extent can BgZ registrations benefit from automatic data quality assessment?

How can data quality be assessed?

Data quality in the literature commonly comprises four main dimensions: completeness, accuracy, consistency and timeliness. Completeness refers to whether a value that is supposed to be present is actually present. There are two types of accuracy: semantic accuracy and syntactic accuracy. Semantic accuracy is the proximity of a value to the true value of that data element. Syntactic accuracy is the proximity of a value to the elements of the definition domain. Semantic accuracy is in this sense “stricter” for assessing accuracy than syntactic accuracy. Consistency is whether a value conforms to certain constraints. There are two types of constraints: intra-relation constraints and inter-relation constraints. Timeliness refers to whether information was available when it needed to be.

Out of the four data quality dimensions, completeness is the easiest to assess as it simply requires an element to be present, irrespective of whether the entered value is correct. The most commonly used assessment method for completeness is element presence. Accuracy is harder to assess, especially for data elements recorded as free text. The best way to assess accuracy is to compare to a gold standard; however, obtaining a gold standard dataset is often not possible. Commonly, researchers use similar techniques such as comparing data to an alternate source or triangulating a comparative dataset from multiple sources. Assessing consistency is commonly done using validity checks and internal consistency checks. Timeliness is mostly assessed by checking whether data was available within a certain timeframe or by comparing timestamps of different data elements.

What elements of BgZ-registrations are suitable for automatic data quality assessment?

Different assessment methods are suitable for different datatypes. The BgZ consists of 276 data elements with 13 datatypes. Three of the 13 datatypes are not suitable for assessment for various reasons. Some assessment methods are suitable for all the remaining datatypes found in the BgZ, whereas others are only suitable for assessing a single datatype, such as distribution comparison. Furthermore, some of the datatypes are harder to assess in an automated way; strings are an example of this. Because of the free text nature of strings, it is hard to assess the accuracy of a data element with this datatype in an automated way. As a result, 7 out of the 13 datatypes in the BgZ are suitable for automated assessment, which means that 54% of the data elements in the BgZ should theoretically be suitable for automated data quality assessment.

What are the capabilities and limitations of the assessment made by the tool?

The tool developed during this study is part of a base toolbox in which the user decides which of the HiX tools developed at Furore will be run. The DataQualityChecker tool is able to extract 77% of the total data elements in the BgZ from HiX, of which it can assess 51%. This means the tool is able to assess the data quality of 39% of the total data elements, a difference of 16 percentage points with the theoretically suitable 54%. Another limitation is that the output of the tool still requires interpretation by a consultant to give a definitive assessment of the data quality. Despite these limitations, the tool provides hospitals using HiX with a way to easily extract data from their EHR as well as some insight into the data quality. These insights will help hospitals ensure and improve their overall data quality.

What are the practical issues which hinder implementation of the tool?

The main reason for the difference between the theoretical and practical suitability for automated data quality assessment is that the tool is only able to extract 77% of the data elements found in the BgZ from the HiX database. Some data elements, like lab data, are stored in separate systems. Vaccinations, for example, are mostly administered by general practitioners or specialised clinics, which means that hospitals only store whether a patient has been vaccinated and not all the extra data elements found in the Zib Vaccination. Another reason for the difference is that some data elements which are split in the BgZ are stored in a single field in the HiX database. A more complex problem is that some data elements within HiX use different code systems than are specified in the BgZ. This means they are inaccurate according to the BgZ even though they contain data that is relevant for patient care and might not be interpreted as inaccurate by a healthcare professional. Even within HiX the code systems for the same data element differ between departments, making mapping the values to the code systems required in the BgZ nearly impossible.

5.1.1. Strengths

This study provides a scientific basis for the DataQualityChecker tool. No tools existed which allowed hospitals to assess the data quality of BgZ registrations specifically. Furthermore, this study identifies what the tool is currently able to do and where it still needs improvement, providing a basis for further development. All in all, this study provides insight into the development of a tool to assess data quality in an automated way and shows which problems might be encountered during development. It also offers a way of assessing the quality of HCIM registrations: a substantial amount of research has been done to assess and/or assure the quality of how HCIMs are constructed, but little research has been done regarding how HCIMs are used in practice [34-37].

The tool was developed using real data and was designed to work with the EHR that is currently the most widely used in the Netherlands, which means the findings of this study are applicable in more hospitals. The tool was tested in two different hospitals; although the tool did work at the second hospital, the runtime was very long. Furthermore, the tool described in this study was built and is planned for implementation by multiple parties, and is not just a proof of concept.

5.1.2. Limitations

This study does not show the impact that the tool will have after being implemented. Hypotheses are made about its impact, but no actual study was performed to test the impact after implementation. Furthermore, the tool only works on ChipSoft's EHR HiX. The problems encountered when developing such a tool might be different when the tool is developed for another EHR; however, the types of problems could well be the same, just for different data elements. Further research with different EHRs is required to establish the general problems one might encounter when developing an automated data quality assessment tool.

Testing of the tool was done at only two hospitals, one of which was the hospital where the tool was developed. This could also be a factor in the difference in performance between the two hospitals. To get a definitive assessment of the tool's performance across hospitals, the tool should be tested at more hospitals that were not in any way involved in its development.


5.1.3. Future work

Further study is needed to identify the impact of the tool after implementation. Currently, its benefits and drawbacks can only be hypothesised; these will have to be investigated after implementation. When the tool has been developed further, its capabilities and limitations will have to be examined again to evaluate the development process.

Research is needed into applying artificial intelligence and machine learning techniques to assess some of the datatypes which the tool currently cannot assess. The tool currently utilises only straightforward data quality assessment methods, whereas more complex methods may yield better results. For example, natural language processing techniques may allow for the assessment of free-text datatypes such as strings. The tool may also be expanded to assess more than the BgZ; for example, it could be extended to assess the data quality of all registered data within a hospital department.
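As a very first step towards assessing free-text elements, even a simple heuristic can flag entries that are empty or contain only placeholder text. The sketch below is an illustration of that idea, not part of the tool; the placeholder list and length threshold are assumptions, and a real implementation would use proper NLP techniques instead.

```python
# Hypothetical heuristic for free-text (string) data elements: flag values
# that are empty, too short, or a known placeholder. The placeholder set
# and minimum length are illustrative assumptions for this sketch.
PLACEHOLDERS = {"n.v.t.", "nvt", "onbekend", "zie dossier", "-"}

def flag_free_text(value: str, min_length: int = 3) -> bool:
    """Return True when the free-text value looks uninformative."""
    text = value.strip().lower()
    return len(text) < min_length or text in PLACEHOLDERS

print(flag_free_text("  "))       # empty: flagged
print(flag_free_text("n.v.t."))   # placeholder: flagged
print(flag_free_text("penicilline-allergie, vastgesteld in 2017"))  # kept
```

Such a heuristic only measures a crude form of completeness; judging whether free text is accurate or consistent is exactly where more advanced NLP methods would be needed.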

5.2. Conclusion

Assessing data quality in an automated way is harder and more diverse than it sounds. Data quality has multiple dimensions and data come in different datatypes, so there is no single assessment method that covers all aspects of data quality. Furthermore, hospitals do not always register all the data required to fully assess the quality of BgZ registrations, which means that a lower percentage of data elements can be assessed than should theoretically be possible. However, this study provides a basis for further development of the tool and an opportunity for further research into the impact of implementing automated data quality assessment.

The main question of this thesis is: to what extent can BgZ registrations benefit from automated data quality assessment? However, one can also ask to what extent the BgZ itself is suitable for automated data quality assessment. Some specifications regarding datatypes or coded elements make automated data quality assessment a difficult task. That being said, the tool offers hospitals using HiX as their EHR a way to automatically extract and assess BgZ registrations.

Being able to extract data and assess its quality in an automated way makes these actions less time-consuming. Having a fast and simple way to assess the data quality of one's BgZ registrations may prompt hospitals to perform checks more often, identifying areas that require improvement and thus improving overall data quality. Furthermore, being able to easily extract data, along with the hypothesised increase in data quality, will further the ability to exchange patient information, which may improve patient care across and within healthcare organisations. In conclusion, BgZ registrations can benefit from automated data quality assessment through improved identification of problem areas, by providing hospitals with ways to improve how they register data, and by providing Nictiz and Registratie aan de Bron with feedback on where the BgZ standard itself might be improved.


REFERENCES

1. Registratie aan de bron. (2018). Zorginformatiebouwstenen. [online] Available at: https://www.registratieaandebron.nl/wat-is-registreren-aan-de-bron/de-kern-van-registreren-aan-de-bron/zorginformatiebouwstenen/ [Accessed 2 Nov. 2018].

2. Nictiz. (2018). Wat is een Zib? - Nictiz. [online] Available at: https://www.nictiz.nl/standaardisatie/Zib-centrum/wat-is-een-Zib/ [Accessed 3 Nov. 2018].

3. Eichelberg, M. (2015). Interoperability Guideline for eHealth Deployment Projects. 1st ed. [ebook] eStandards.

4. Nictiz. (2019). (zorg)Architectuur vol. 1 Basisdocument: De grondbeginselen van zorginformatiebouwstenen en hoe ze gebruikt kunnen worden. 1st ed. [ebook] Nictiz. Available at: https://Zibs.nl/images/2/2a/Architectuurdocument_Registratie_aan_de_bron_-_Volume_1_v1.1.pdf.

5. Registratie aan de bron. (2018). Basisgegevensset Zorg. [online] Available at: https://www.registratieaandebron.nl/wat-is-registreren-aan-de-bron/de-kern-van-registreren-aan-de-bron/basisgegevensset/ [Accessed 5 Nov. 2018].

6. Basisgegevensset Zorg. (2018). [PDF presentation] Registratie aan de bron, pp. 2-5. Available at: https://www.registratieaandebron.nl/pdf/Basisgegevensset_Zorg_v1_0.pdf.

7. Partners, M. (2018). EPD overzicht. 1st ed. [ebook] Zorgvisie. Available at: https://www.zorgvisie.nl/content/uploads/sites/2/2018/04/Epd-overzicht2018.pdf.

8. Turchin, A., Shubina, M. and Goldberg, S. (2011). Unexpected effects of unintended consequences: EMR prescription discrepancies and hemorrhage in patients on warfarin. AMIA Annual Symposium Proceedings, 2011, pp. 1412-7.

9. Sittig, D.F. and Singh, H. (2011). Defining Health Information Technology-Related Errors: New Developments Since To Err Is Human. Archives of Internal Medicine, 171(14), pp. 1281-1284.

10. Murphy, A.R. and Reddy, M.C. (2014). Identification and management of information problems by emergency department staff. AMIA Annual Symposium Proceedings, 2014, pp. 1845-54.

11. McCormack, J. and Ash, J. (2012). Clinician Perspectives on the Quality of Patient Data Used for Clinical Decision Support: A Qualitative Study. [online] AMIA Annual Symposium Proceedings.

12. Ahmadian, L., van Engen-Verheul, M., Bakhshi-Raiez, F., Peek, N., Cornet, R. and de Keizer, N. (2011). The role of standardized data and terminological systems in computerized clinical decision support systems: Literature review and survey. International Journal of Medical Informatics, 80(2), pp. 81-93.
