Quantitative measure of the similarity of structure definitions expressed in HL7 FHIR

Academic year: 2021

Medical Informatics

Scientific Research Project

Quantitative measure of the similarity of structure definitions expressed in HL7 FHIR

Author:

Tessa van Alphen, BSc

Mentor:

Marten Smits, MSc

Tutor:

Dr. ir. Ronald Cornet

Student

T.C. van Alphen
Student number: 11160985
Email: t.c.vanalphen@amc.uva.nl

Mentor

M. Smits, MSc
Software Engineer
Furore

Tutor

Dr. ir. R. Cornet
Faculty of Medicine
Department of Medical Informatics, AMC-UvA

Location of Scientific Research Project

Furore
Software Engineering Department
Bos en Lommerplein 280

Abstract

Introduction:

In recent years, multiple healthcare standards have been developed to increase the interoperability between information systems. Within HL7 FHIR, medical information is defined in structure definitions. In FHIR, these definitions can be extended and adapted to handle variability in healthcare. Multiple structure definitions (profiles) of one concept may therefore be created, which might impede clinical information exchange due to the use of incompatible profiles. In this thesis we examine to which extent structure definitions can be structurally compared and whether the structural similarity between structure definitions can be quantitatively expressed.

Methods:

We applied the tree edit distance to FHIR profiles to quantitatively measure the similarity between profiles in our prototype tool, using the Wagner-Fischer algorithm. An analysis of the ElementDefinition datatype was performed to identify critical, invariant and non-computable aspects of the elements in profiles. The results were used to assign varying weights to these aspects in our comparison method. Our method was tested by creating a ranking of the profiles in our training set based on their tree edit distances to a comparison profile. The outcome of our method was compared to a ranking of this set created by four FHIR experts after a discussion meeting. The results were used to adapt the weights in our comparison method. A test set was used to evaluate our final comparison method in the same way as with the training set.

Results:

Aspects that were considered critical were the cardinality, the datatype, the value range and the binding aspects. The invariant aspects contained information for tools and infrastructure or descriptive information. Non-computable aspects were constraint, condition and binding. From the discussion meeting with FHIR experts it was concluded that the semantics of elements could not be ignored when comparing profiles. Our final method resulted in a ranking that slightly differed from the ranking of the FHIR experts. The Spearman rank order correlation between the two rankings was 0.725 (p > 0.05).

Conclusion:

Our research shows that the extent to which FHIR profiles could be structurally compared is quite low, since some critical aspects for a structural comparison were considered non-computable and the FHIR experts concluded that the semantics of elements could not be ignored when assessing the similarity between profiles. Future research is required to address these problems. Nevertheless, the Spearman rank order correlation of 0.725 is quite high, which might indicate that our method does provide an estimate of the human judgement of the structural similarity.

Summary

Introduction:

In recent years, a number of healthcare standards have been developed to increase the interoperability between information systems. In the HL7 FHIR standard, medical information is defined in structure definitions. Within FHIR, these definitions (or profiles) can be extended and adapted to meet varying requirements within healthcare. As a result, different profiles of the same medical concept are used, which can lead to problems when information has to be shared. To assess whether profiles are similar, a method to compare profiles has to be developed. In this thesis we investigated to which extent profiles can be structurally compared and whether the structural similarity can be expressed in a quantitative measure.

Methods:

We applied the tree edit distance to FHIR profiles to express the similarity between profiles as a quantitative measure in our prototype tool, using the Wagner-Fischer algorithm. The resource ElementDefinition was analysed to identify aspects that are critical, irrelevant or non-computable with respect to a structural comparison. This was used to assign weights to the aspects in our comparison method. The method was tested by ranking the profiles in our training set based on their tree edit distance to a comparison profile. The outcome was compared with a ranking made by FHIR experts after a discussion meeting. The results were used to adjust the weights in our method. Finally, the test set was used to evaluate our final comparison method in the same way as was done with the training set.

Results:

The critical aspects were the cardinality, the datatype, the value range and the binding. Information for tools and infrastructure and the descriptive aspects were irrelevant. The non-computable aspects were constraint, condition and binding. In the discussion meeting with FHIR experts it was concluded that the semantics of elements cannot be ignored when comparing profiles. Our comparison method resulted in a ranking that differed slightly from the ranking of the experts. The Spearman rank order correlation between the two was 0.725 (p > 0.05).

Conclusion:

Our results show that the extent to which FHIR profiles can be structurally compared is not very high. After all, a number of aspects that are critical for a structural comparison were also considered non-computable, and the FHIR experts concluded that the semantics of elements cannot be ignored when comparing profiles. Additional research is needed to solve these problems. Nevertheless, the Spearman correlation of 0.725 was fairly high, which may indicate that our method does provide a reasonable estimate of the structural similarity between profiles.

Contents

1 Introduction
  1.1 Context
  1.2 Research
2 Background
  2.1 Glossary
  2.2 HL7 FHIR
  2.3 The tree edit distance
  2.4 The Wagner-Fischer algorithm
3 Methods
  3.0 Literature Search
  3.1 Representation of FHIR profiles
    3.1.1 Materials
    3.1.2 Representing FHIR profiles as tree structures
    3.1.3 Simplifying FHIR profiles for initial comparison
  3.2 Creating a method to compute the similarity between profiles
    3.2.1 Using the tree edit distance
    3.2.2 Applying the tree edit distance to FHIR profiles
    3.2.3 Linear comparison of ordered profiles
    3.2.4 Using the Wagner-Fischer algorithm
    3.2.5 Analysis of the resource ElementDefinition and determining initial weights
  3.3 Optimization and evaluation of our method
    3.3.1 Creating a training set and a test set of profiles
    3.3.2 Ranking the profiles in the training set and test set
    3.3.3 Compute the similarity between the structure definitions and adapt algorithm
    3.3.4 Final evaluation and validation of our method
4 Results
  4.1 Analysis of the aspects of the “ElementDefinition”
    4.1.1 Critical aspects
    4.1.2 Invariant aspects
    4.1.3 Non-computable aspects
  4.2 Initial weights assigned to the aspects of the ElementDefinition
  4.3 Interrater-agreement between the experts ranking the profiles
  4.4 Conclusions of discussion meeting about ranking profiles
  4.5 Preferred ranking training set compared to results from initial method
  4.7 Preferred ranking test set compared to results from final method
5 Discussion
  5.1 Main findings
  5.2 Strengths and weaknesses of the study
    5.2.1 Unresolved problems encountered when creating our comparison method
  5.3 Additional comments
  5.4 Meaning of the study
  5.5 Unanswered questions and future research
References
Appendix

1 Introduction

1.1 Context

The organization of healthcare is increasingly changing to support a more multidisciplinary approach to care delivery, resulting in an increasing need for sharing medical information. Sharing medical information is necessary for an efficient care process, for reducing the duplication of tests and for preventing medical errors. Therefore, healthcare records are increasingly being digitized to enable timely access to all relevant patient information, with the aim of improving the quality of healthcare. To have access to all relevant information, patient information should be shared between different systems and applications, regardless of the healthcare providers’ location. After all, healthcare providing organizations, such as hospitals, typically use many subsystems and applications for varying tasks. Each subsystem is usually a standalone system which supports healthcare professionals in a specific task. The electronic health record (EHR) is therefore often a distributed system, consisting of multiple components working together to serve the overall functions of the system. So, interoperability between subsystems and applications is required for the successful deployment of electronic health records.[1]

To exchange medical information in healthcare and increase the interoperability between systems, healthcare standards have been developed. Uniform documentation of clinical information is considered to facilitate the exchange of information between systems, applications or institutions and to reduce the risk of losing or misinterpreting clinical information due to the use of different data definitions. So, to enhance interoperability, standardization is required. This standardization should encompass a description of the information required and a specification of the clinical data elements. Furthermore, standardization should include the terminology to be used, the manner in which data should be stored, and the way in which exchange of data should take place.

Multiple standards, such as the standards of Health Level 7 (HL7v3, CDA, and FHIR), have been developed to structurally describe medical information and to create standard information models and transactions. [1,2] The most recently developed standard of Health Level 7 is FHIR (see Section 2.2). Within the specification of this standard medical information is structurally defined, but FHIR has also defined a framework for extending and adapting the existing definitions to handle variability and satisfy diverse health processes. So, while one standard or information model (FHIR) is used, the structure definitions of medical concepts can still differ in different contexts with varying requirements. For example, structure definitions of the same concept, such as “patient”, can differ between countries, organizations or use cases within an organization due to context-specific requirements. Specific elements of the information model may be mandatory in one situation but not in another. When the FHIR standard is used, extensions and constraints can be defined and applied to the base structure definition of a clinical concept, which is described in the FHIR specification, to fulfill the context-specific requirements. As a result, multiple structure definitions of a concept will be created for different situations, although the same standard or information model was used.[3]

The fact that different structure definitions of one medical concept are used, or could be used, by different countries, institutions or applications could impede clinical information exchange due to the use of incompatible structure definitions. So, although the flexibility that FHIR provides is useful for handling variability in health processes, it could also hamper interoperability.[1] The impediment occurs when communicating parties use different structure definitions for one concept: one party may expect a specific field to be exchanged while the sending party does not include this field in the structure definition of the exchanged concept. For example, in the structure definition of a Dutch patient, the “Burgerservicenummer”, a unique person-identifying number, may be a mandatory element to include when transferring data, while in other countries this number does not exist and will not be included in the structure definition of a patient. Specifying one structure definition for the information exchange makes it possible to check whether all required data is present and whether the correct datatypes and structure are used. In the described example, in which a Dutch healthcare provider wants to exchange information with a foreign healthcare provider, the Burgerservicenummer should not be a mandatory element (or a default value should be specified) and other optional identifiers (alternatives to the Burgerservicenummer) should be included in the structure definition used.

Summarizing, to improve the interoperability between multiple parties, the proliferation of multiple similar structure definitions should be avoided and the re-use of existing structure definitions should be encouraged. Since some of the existing structure definitions might be similar to a certain degree, it might be possible to merge these similar structure definitions into one structure definition. Therefore, it is useful to determine whether structure definitions within one information model can be compared and whether the similarity between the structure definitions can be quantitatively expressed. After all, when it is possible to compare structure definitions, one can search and find similar structure definitions in a repository, such as the Registry Simplifier.net[4] for FHIR structure definitions. Furthermore, when it is possible to compare structure definitions, one can determine the amount of modifications required to harmonize the structure definitions and enable cooperation. In this way, the proliferation of multiple similar structure definitions can be avoided.

1.2 Research

A method to find similar structure definitions in a repository is thus needed to avoid the proliferation of multiple similar structure definitions and to encourage the re-use of existing structure definitions. A first requirement for finding similar structure definitions is the existence of a method to compare structure definitions regarding their similarity.

Since FHIR structure definitions can be represented as tree structures, two kinds of similarity can be considered: structural similarity and semantic similarity. Semantic similarity measures describe to which extent two terms or two sets of terms are similar regarding their meaning. [5] Within structure definitions, different terms or sets of terms can be used to specify aspects of elements or to specify extensions. For example, in the description aspect of an element different terms can be used to describe its meaning. Another example is when coded elements in structure definitions are bound to different terminologies. However, different terms can still have the same meaning. To check whether this is the case, the semantic similarity should be determined. In the biomedical domain, multiple semantic similarity measures have already been developed, with varying performance. [6]

However, before we can start thinking about semantic similarity, the first step in comparing structure definitions is determining the structural similarity between two structure definitions. The structural similarity determines the level of technical interoperability between two structure definitions: a higher structural similarity means a higher level of technical interoperability between structure definitions. It is essential to first examine whether two structures have elements with similar basic aspects, such as the cardinality or datatype. When that is the case, a next step could be to assess whether the elements, and therefore the whole structure, are also semantically similar.

Hence, the aim of this study is to investigate to which extent structure definitions expressed in the same information model, in this case HL7 FHIR, can be compared regarding their structural similarity. Therefore, in this thesis we will create a method to structurally compare structure definitions expressed in FHIR. This method should compare the structures of the structure definitions and the aspects of the elements (datatype, minimum value, maximum value, etcetera) of the structure definitions. We will investigate the extent to which this comparison is possible and the problems that occur when comparing two structure definitions. As a first requirement, we need to provide a detailed definition of the “similarity” of two structure definitions, so that a system will be able to compute the similarity measure.

The following sub questions will be answered in this thesis:

• Can we identify aspects of the FHIR information model that are critical for a comparison?

• Can we identify aspects of the FHIR information model that are invariant with respect to comparison?

• Can we identify aspects of the FHIR information model that cannot be computably compared?

The results of the Scientific Research Project and thesis will be a working prototype version of a tool that quantitatively measures the structural similarity between two structure definitions in FHIR. The prototype aims to answer multiple questions:

• How similar are the structure definitions expressed in a quantitative measure?

• Which aspects are different?

2 Background

2.1 Glossary

Information model: An information model is a framework to structure the information content of a (typically constrained) domain. It is an abstract and formal representation of concepts with their properties, relationships, constraints and operations. An information model specifies the data semantics for a specific domain to provide an interoperable, stable framework of information requirements within that domain. [7,8]

Concept: A concept is an abstraction or general cognitive unit representing the basic or fundamental characteristics of what it represents in the real world. A concept is instantiated by its actual instances, which are real world objects or phenomena. Each concept has a definition in which the characteristics of the concept are described and which differentiates it from other concepts. [7]

Concept definition: A concept definition is a description of the concept and its characteristics, which differentiates it from other concepts. [9]

Structure definition: Structure definitions define the content of a concept or entity and specify which data elements should or could be included. A structure definition describes a structure, which is a set of element definitions and their associated rules of usage. Element definitions specify well-described data fields and include the name, cardinality and data type of the element. They also include definitions, usage notes, requirements, default or fixed values, constraints, length limits, usage rules, terminology bindings and mappings to other specifications. [10,11]

Tree: A tree is an acyclic graph in which any two vertices or nodes are connected by exactly one path.

Aspects: The aspects are the characteristics of an element, which define the element. These aspects are described in the resource “ElementDefinition”.

Attributes: The attributes are the aspects of an element, which are included in the node representing that element when structure definitions are represented as tree structures.

2.2 HL7 FHIR

Fast Healthcare Interoperability Resources (FHIR) is a standards framework created by HL7, which is designed to enable the exchange of clinical information. FHIR is based on RESTful principles and uses open internet standards as much as possible.[12]

FHIR has a composition approach towards information modelling and aims to define the main entities and concepts associated with healthcare information exchange. Within the FHIR specification these entities and clinical concepts are represented as resources, which can be thought of as “forms” reflecting different kinds of information.[13] They are used for reflecting clinical content, such as vital signs or diagnostic test results, but also for infrastructure and conformance statements. Resources are essentially the common building blocks for all information exchanges and consist of hierarchical structures with well-defined fields and data types. They are represented in either XML or JSON. Each resource representing a clinical concept defines a small set of focused data. Besides, a resource can contain explicit references to other resources to constitute a graph of clinical data. [12-14]

Figure 1: base resource “Patient”, a hierarchical structure with well-defined fields. In this figure only the name, cardinality and type are included.

The philosophy of FHIR is that the majority of common use cases within healthcare can be reflected with a base set of resources, which together represent the information model. Resources should, either by themselves or when combined by the use of resource references, satisfy the common use cases. After all, these resources will define the information content and structure of the information set shared in most use cases. [14] To satisfy the common use cases, resources need to be applicable to slightly varying contexts and circumstances. So, the base resources should not be too complex. Broadly, FHIR has a rule that resources will only include data elements if it is expected that a high percentage of the implementations will use that specific data element. Therefore, all data elements are discussed extensively and investigated regarding their potential use. So, data elements that will not be frequently used in systems will be omitted from the base resources. [11, 13]

Each resource consists of a specified ResourceType, such as Patient, Practitioner, Procedure or Organization, and an id. The id identifies the resource and is always included when a resource is exchanged. After the id, the metadata is provided. The metadata contains the context data of the resource, such as the version, the date it was last updated and the profile it conforms to. Also, each resource has a text element, containing a human-readable representation of the resource in XHTML. Furthermore, the resource contains a set of common data elements defined for each type of resource. [13]

So, all resources have these characteristics in common. Each resource also has an identifying URL and an extensibility framework. This latter feature is needed due to the fact that resources are quite generic. After all, resources only include data elements that are used by about 80 percent of the implementations. To support the varying requirements for data exchange and to enable systems to capture all relevant data, the extensibility framework is used. FHIR extensions can be defined and added to a base resource to capture this additional data in specified data fields.[11] Thus, FHIR does not specify detailed models containing every aspect of clinical records, but uses an extensibility framework to complement the existing resources. Furthermore, FHIR provides the possibility to adjust the resources to fulfill context-specific requirements by enforcing constraints on the base resource. These adjusted resources can express terminology restrictions, element cardinality constraints, data type constraints, etcetera. These constrained base resources are called FHIR profiles and each profile has a unique URL to identify the profile. Thus, FHIR profiles are used in particular contexts with additional requirements to constrain and extend base resource definitions and maintain semantic consistency between the information exchanging parties.[11-15]
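As an illustration of what an element cardinality constraint amounts to, the following is a minimal sketch and not part of the prototype described in this thesis. The class `Cardinality` and the function `narrows` are hypothetical names, and the check assumes that a profile may only narrow the cardinality of the base resource (raise the minimum, lower the maximum).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cardinality:
    """Hypothetical helper: min..max of an element; max=None stands for '*'."""
    min: int
    max: Optional[int]


def narrows(profile: Cardinality, base: Cardinality) -> bool:
    """True if the profile cardinality stays within the base cardinality."""
    if profile.min < base.min:
        return False          # a profile may not lower the minimum
    if base.max is None:
        return True           # base is unbounded, any maximum fits
    return profile.max is not None and profile.max <= base.max


# Illustrative values: an identifier that is optional (0..*) in the base
# resource and made mandatory (1..*) in a profile.
base_identifier = Cardinality(0, None)
dutch_identifier = Cardinality(1, None)
print(narrows(dutch_identifier, base_identifier))   # allowed narrowing
print(narrows(Cardinality(0, None), Cardinality(0, 1)))  # widening, rejected
```

The sketch only covers cardinality; terminology and data type constraints would need their own checks.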

2.3 The tree edit distance

The tree edit distance between two trees is the cost of the minimum-cost sequence of edit operations that transforms one tree into the other. With the tree edit distance the edit operations are directed at the nodes of the tree. These edit operations basically encompass deleting and inserting nodes or updating nodes by changing their value or label (figure 1). [16]

Figure 1: the edit operations

A cost is assigned to each edit operation, and the cost of a sequence of edit operations is the sum of the costs of its operations. For example, transforming tree 1 into tree 2 in the figure below (figure 2) requires 3 edit operations. First node Y is deleted from tree 1, then node U is inserted as a child of node C and finally node Z is updated or relabeled to node W. If all edit operations have a cost of 1, the edit distance is 3. The edit distance between two trees defines the minimal cost for transforming one tree into the other by performing multiple edit operations. Therefore, it is a measure of similarity: the lower the edit distance, the more similar two trees are (fewer edit operations are required).

Figure 2: transforming tree 1 into tree 2
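The cost of the edit sequence in this example can be written out as a minimal sketch. The names `EditOp`, `UNIT_COSTS` and `script_cost` are illustrative and not taken from the prototype. Note that the function only sums the cost of a given sequence; it does not search for the minimum-cost sequence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EditOp:
    """Hypothetical record of one edit operation on a tree node."""
    kind: str                        # "delete", "insert" or "update"
    node: str                        # label of the node being edited
    new_label: Optional[str] = None  # only used for "update"


# Assumption from the example: every operation costs 1.
UNIT_COSTS = {"delete": 1.0, "insert": 1.0, "update": 1.0}


def script_cost(ops, costs=UNIT_COSTS):
    """Sum the costs of all operations in an edit sequence."""
    return sum(costs[op.kind] for op in ops)


# The three operations that turn tree 1 into tree 2.
script = [
    EditOp("delete", "Y"),
    EditOp("insert", "U"),
    EditOp("update", "Z", new_label="W"),
]
print(script_cost(script))  # 3.0
```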

In some use cases, the tree edit distance has varying costs for the edit operations depending on the position of the node. For example, a greater weight can be given to the higher nodes in a tree by assigning higher costs to each edit operation for nodes with a higher position than for nodes with a lower position[17].
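One possible way to make the cost depend on the position of a node is sketched below. The geometric decay per level and the name `depth_weighted_cost` are assumptions chosen for illustration; the cited work may use a different weighting scheme.

```python
def depth_weighted_cost(base_cost: float, depth: int, decay: float = 0.5) -> float:
    """Cost of an edit on a node at `depth` (root = 0).

    Assumed scheme: the cost shrinks by the factor `decay` for every level
    below the root, so edits near the root weigh more than edits near the leaves.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return base_cost * decay ** depth


for level in range(3):
    print(level, depth_weighted_cost(1.0, level))
```

With the default decay of 0.5, an edit on the root costs 1, an edit on one of its children 0.5, and an edit on a grandchild 0.25.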

Also, extended tree edit distances exist with more edit operations, such as mapping, reversing, moving or copying nodes. [18, 19] However, these additional operations are not strictly needed to compute a distance, since any tree can be transformed into another tree by only deleting and inserting nodes. Even the update operation is not required. However, this operation is often included in methods for computing the tree edit distance, since some nodes are very similar to one another, although the value of an attribute or the label differs. For these instances, the update operation is used to change the node. This edit operation often results in a lower cost than consecutively deleting the node and inserting a new node, since it only requires one edit operation instead of two. Whether the cost is truly lower depends on the costs assigned to each edit operation. However, a lower cost for this edit operation seems more logical because of the high similarity between the initial two nodes.

2.4 The Wagner-Fischer algorithm

The Wagner-Fischer algorithm (textbox 1) is used to linearly compare two strings, which can be seen as two lists of characters. Using this algorithm, a matrix holding the edit distances between all substrings (or sub-lists) is filled in an iterative manner. For example, when comparing string A (sTrings) to string B (string), the algorithm first compares each substring of string B, starting with the first character and adding an additional character to the substring in each comparison, to a substring of A consisting of only the first character. The distances between all substrings of B and this substring of A are stored in the first row of the matrix (Table 1, green cells). Subsequently, the second character is added to the substring of A and all substrings of B are compared to this substring of A. The distances between all substrings of B and this substring of A are stored in the second row of the matrix. This process is repeated until the full string A is compared to all substrings of B. Thus, the matrix is filled row by row and from left to right, ending at the bottom-right. The distance between the full strings is the value in the bottom-right corner of the matrix. Each cell contains a number, which represents the edit distance between a substring of string A and a substring of string B.

Table 1: matrix with edit distances between substrings of string A (sTrings, rows) and string B (string, columns), when the algorithm is at the point of comparing the second character of both strings.

              s     t     r     i     n     g
        0     1     2     3     4     5     6
  s     1     0     1     2     3     4     5
  T     2     1     0.5   1.5   2.5   3.5   4.5
  r     3     2     1.5   0.5   1.5   2.5   3.5
  i     4     3     2.5   1.5   0.5   1.5   2.5
  n     5     4     3.5   2.5   1.5   0.5   1.5
  g     6     5     4.5   3.5   2.5   1.5   0.5
  s     7     6     5.5   4.5   3.5   2.5   1.5

The green cells contain the distances between all substrings of B and the substring of A consisting of only the first character. The blue cell is D[2,2], which shows that at this point the algorithm compared the second character of string A to the second character of string B. The bold red numbers represent the edit distances (between previously compared substrings) that are used to compute the minimal edit distance between the current substrings in the blue cell. The red arrow represents an update operation and the black arrows represent the insert and delete operations. The grey numbers within the matrix are not computed yet when the algorithm is at the point of computing D[2,2].

For example, the blue cell in table 1 represents the distance between the substring of A consisting of the first two characters of string A (s and T) and the substring of B consisting of the first two characters of string B (s and t). At this point the algorithm computes the edit distance between both substrings by first comparing the second character of each substring and computing the cost for an update operation (γ(A[i] → B[j])), for a delete operation (γ(A[i] → Λ)), and for an insert operation (γ(Λ → B[j])). In this example, the cost for updating a character when only its case differs is 0.5, the cost for updating a character into another character is 1, and the costs for deleting or inserting a character, γ(A[i] → Λ) and γ(Λ → B[j]), are also 1. The algorithm is at D[2,2] (blue cell in table 1) and the second character of A (T) is compared to the second character of B (t). To compute the edit distance between the substrings (see textbox 1), the computed cost for an update operation is added to D[1,1] (m1 = D[i-1, j-1] + γ(A[i] → B[j])), the computed cost for a delete operation is added to D[1,2] (m2 = D[i-1, j] + γ(A[i] → Λ)) and the computed cost for an insert operation is added to D[2,1] (m3 = D[i, j-1] + γ(Λ → B[j])). The operation resulting in the minimal edit distance between the two substrings is used and its result is stored in the blue cell. In this case the update operation (m1) is chosen, since this results in 0.5 (red arrow in table 1: 0 + 0.5 = 0.5), while the insert and delete operations (with a cost of 1) both result in an edit distance of 2 (black arrows in table 1: 1 + 1 = 2). Subsequently, the edit distance between the substring of A consisting of the first two characters and the substring of B consisting of the first three characters is computed. This is done in the same manner by first computing the costs required to insert or delete a character or to update the last character of substring A into the last character of substring B and adding these costs to the surrounding cells as described in textbox 1.
When continuing this process, one finally ends up at the bottom-right, which is the edit distance between the full strings (1.5 in this example).
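Written out with the costs of this example, the computation of the cell D[2,2] reads as follows (same symbols as in the text, with Λ denoting the empty element):

```latex
\begin{align*}
m_1 &= D[1,1] + \gamma(\mathrm{T} \to \mathrm{t}) = 0 + 0.5 = 0.5,\\
m_2 &= D[1,2] + \gamma(\mathrm{T} \to \Lambda) = 1 + 1 = 2,\\
m_3 &= D[2,1] + \gamma(\Lambda \to \mathrm{t}) = 1 + 1 = 2,\\
D[2,2] &= \min(m_1, m_2, m_3) = 0.5.
\end{align*}
```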

This linear comparison algorithm can also be used on lists of objects, instead of characters (strings). In this case, instead of substrings, sub-lists are compared in an iterative manner. For example, this algorithm can be used for comparing ordered trees. Ordered trees have a strict order and structure of nodes. Therefore, ordered sequences of nodes are formed when these trees are traversed in a structured manner, such as in depth-first order. These ordered lists of nodes can be compared linearly by using the Wagner-Fischer algorithm[20] in textbox 1. Using this algorithm, a matrix holding the edit distances between all sub-lists is filled in the same iterative manner as with the string comparison.

Textbox 1: Wagner-Fischer algorithm[20].

// D is the matrix for holding edit distances between sub-lists
D[0, 0] := 0;

// fill the first column of the matrix with distances between each sub-list of A and an empty list by deleting all objects of A
for i := 1 to |A| do D[i, 0] := D[i-1, 0] + γ(A[i] → Λ);

// fill the first row of the matrix with distances between each sub-list of B and an empty list by inserting all objects of B
for j := 1 to |B| do D[0, j] := D[0, j-1] + γ(Λ → B[j]);

// subsequently fill each row of the matrix with distances between all sub-lists with the minimum number of edits (leading to the lowest total cost)
for i := 1 to |A| do
  for j := 1 to |B| do begin
    m1 := D[i-1, j-1] + γ(A[i] → B[j]);  // update operation with a cost γ(A[i] → B[j])
    m2 := D[i-1, j] + γ(A[i] → Λ);       // delete operation with a cost γ(A[i] → Λ)
    m3 := D[i, j-1] + γ(Λ → B[j]);       // insert operation with a cost γ(Λ → B[j])
    D[i, j] := min(m1, m2, m3);          // use the edit operation resulting in the minimum distance
  end;
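The pseudocode in textbox 1 can be sketched in runnable form as follows. This is a minimal Python illustration, not the C# implementation of the prototype tool; the function names `edit_distance` and `case_aware_cost` are hypothetical, and the cost values (0.5 for a case-only difference, 1 otherwise) are taken from the string example above.

```python
def case_aware_cost(a, b):
    """Update cost γ(a → b): 0 if equal, 0.5 if only the case differs, else 1."""
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return 0.5
    return 1.0


def edit_distance(A, B, update_cost,
                  delete_cost=lambda a: 1.0,
                  insert_cost=lambda b: 1.0):
    """Wagner-Fischer edit distance between two sequences A and B."""
    n, m = len(A), len(B)
    # D[i][j] holds the distance between the first i items of A
    # and the first j items of B.
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: delete all items of A
        D[i][0] = D[i - 1][0] + delete_cost(A[i - 1])
    for j in range(1, m + 1):          # first row: insert all items of B
        D[0][j] = D[0][j - 1] + insert_cost(B[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            m1 = D[i - 1][j - 1] + update_cost(A[i - 1], B[j - 1])  # update
            m2 = D[i - 1][j] + delete_cost(A[i - 1])                # delete
            m3 = D[i][j - 1] + insert_cost(B[j - 1])                # insert
            D[i][j] = min(m1, m2, m3)
    return D[n][m]


# A case-only difference in the second character costs 0.5.
print(edit_distance("sT", "st", case_aware_cost))  # 0.5
```

Because the function only indexes into `A` and `B`, the same code works for lists of arbitrary objects, provided suitable cost functions are passed in.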


3 Methods

In this study, we investigated the extent to which FHIR structure definitions (profiles) can be structurally compared. Hence, we aimed to create a method to quantitatively measure the structural similarity. This quantitative measure reflects the level of technical interoperability between two profiles. After our literature search, the following steps were taken to be able to create a method to quantitatively measure the structural similarity:

• Representation of FHIR profiles
  o Selection of the tree edit distance to quantitatively express structural similarity
  o Representing FHIR profiles as tree structures
• Creating a method to compute the similarity between profiles
  o Application of the tree edit distance to FHIR profiles
  o Analysis of the FHIR resource ‘ElementDefinition’
  o Determining initial weights to be used in our application of the tree edit distance
• Optimization and evaluation of our method
  o Creating a training set and test set of profiles
  o Comparison of the outcome of our quantitative measure to the human judgement of the similarity using the training set
  o Determining final weights to be used in our application of the tree edit distance
  o Validation of the outcome of our quantitative measure using the test set

3.0 Literature Search

A literature search was performed to identify papers regarding the quantitative assessment of the structural similarity between two structure definitions (profiles). As FHIR is still a relatively new standard, the comparison of profiles has not been covered in scientific literature. Therefore, additional literature was searched regarding the comparison of templates (HL7 CDA) and the comparison of archetypes (OpenEHR) to identify possible problems. Literature regarding the comparison of tree structures or hierarchical structures and XML documents was also searched, since FHIR profiles are expressed in XML (or JSON) and can be presented as tree structures.

PubMed, Ovid and Google Scholar were searched for relevant information and methods regarding the comparison of FHIR profiles or hierarchical structures (trees). We also performed a limited literature search for possibly relevant methods in other domains, such as methods for identifying similar DNA sequences and for measuring music similarity. The literature from these domains was scanned quickly to identify possibly relevant comparison methods.

We used the following queries:

• FHIR AND profile* AND (comparison OR similarity OR “edit distance”)
• Archetype* AND OpenEHR AND (comparison OR similarity OR “edit distance”)
• Template* AND HL7 AND CDA AND (comparison OR similarity OR “edit distance”)
• (“tree structures” OR “hierarchical data”) AND (similarity OR “edit distance”)
• ("similarity measure" OR "comparison method") AND “XML document”
• measurement AND similarity AND music OR (“musical similarity” OR “rhythm similarity”)
• DNA AND sequences AND ("similarity measure" OR "comparison method")

To complement the scientific literature with grey literature and commonly implemented practices or methods, stackoverflow.com was also searched for methods to compare tree structures and XML documents (which are also trees). The Stack Overflow results pointed to articles on tree similarity, after which more relevant literature was searched regarding the (tree) edit distance [21]. From the literature found, the articles regarding comparison methods or edit distances were extracted. The references of the extracted articles were also analyzed to identify additional relevant articles.

3.1 Representation of FHIR profiles

3.1.1 Materials

A FHIR profile is expressed in XML (or JSON) and consists of metadata and a set of ElementDefinitions each describing an element of the structure. For example, the base structure definition of “Patient” includes the elements name, gender, telecom and birthdate. These elements are defined by ElementDefinitions consisting of element-defining aspects, such as a definition, the cardinality (min and max) and datatype(s).[22] Textbox 2 provides an example of how elements are defined in a FHIR profile in XML. The path aspect of an element identifies the element. The path is used to check whether elements that are being compared are similar elements. The path is expressed as a list of ancestor elements. So, profiles have a hierarchical structure, which could be derived from the path aspect. For example, the path “Patient.contact.name” identifies this element as a name element, which is a child-element of contact, which is a child-element of Patient. Thus, the contact element is a child-element of patient and a parent-element of name.

Textbox 2: example of how elements are defined in a FHIR profile in XML.

<element>
  <path value="Patient.name" />
  <definition value="A name associated with the individual." />
  <comments value="A patient may have multiple names with different uses or applicable periods." />
  <requirements value="Need to be able to track the patient by multiple names. Example: an official name " />
  <min value="0" />
  <max value="*" />
  <base>
    <path value="Patient.name" />
    <min value="0" />
    <max value="*" />
  </base>
  <type>
    <code value="HumanName" />
  </type>
  <isSummary value="true" />
  <mapping>
    <identity value="rim" />
    <map value="name" />
  </mapping>
</element>
<element>
  <path value="Patient.name.family" />
  <short value="Family name (often called 'Surname')" />
  <definition value="The part of a name that links to the genealogy." />
  <alias value="surname" />
  <min value="1" />
  <max value="*" />
  <base>
    <path value="Patient.name.family" />
    <min value="0" />
    <max value="*" />
  </base>
  <type>
    <code value="string" />
  </type>
</element>


3.1.2 Representing FHIR profiles as tree structures

Profiles are represented as tree structures by representing all elements of a profile as nodes. As is shown in figure 4, each element is depicted as a separate node, such as name (path=Patient.name) or gender (path=Patient.gender). The hierarchical structure of the profile is derived from the identifying path aspect. All aspects defining the element within the ElementDefinition are included as attributes within the node representing the element. In figure 4, only a few aspects are depicted in each node by way of example. When comparing profiles, the ElementDefinitions of the profiles are compared, and the similarity between these ElementDefinitions depends on the values of the aspects. Hence, the similarity between two nodes depends on the values of all the aspects included as attributes. In this way, the similarity between profiles can be assessed in a top-down manner, starting at the root nodes of both profiles and ending at the rightmost leaves. After all, the child nodes in this representation do not influence the similarity between two parent nodes.

Figure 4: An example of how a profile is represented as a tree structure.
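The derivation of the tree from the path aspect can be illustrated with a short Python sketch. The names `Node`, `build_tree` and `depth_first` are hypothetical and not taken from the prototype tool, and the sketch assumes that every parent element is listed before its children, as is the case in the profile excerpt of textbox 2.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    path: str                                     # identifying path aspect
    aspects: dict = field(default_factory=dict)   # remaining aspects as attributes
    children: list = field(default_factory=list)


def build_tree(elements):
    """Build a tree from (path, aspects) pairs; parents must precede children."""
    nodes = {}
    root = None
    for path, aspects in elements:
        node = Node(path, aspects)
        nodes[path] = node
        parent_path = path.rpartition(".")[0]     # "Patient.contact.name" -> "Patient.contact"
        if parent_path == "":
            root = node                           # no ancestor: this is the root element
        else:
            nodes[parent_path].children.append(node)
    return root


def depth_first(node):
    """Yield the nodes of the tree in depth-first (pre-order) order."""
    yield node
    for child in node.children:
        yield from depth_first(child)


elements = [
    ("Patient", {}),
    ("Patient.name", {"min": 0, "max": "*"}),
    ("Patient.gender", {"min": 0, "max": "1"}),
    ("Patient.contact", {"min": 0, "max": "*"}),
    ("Patient.contact.name", {"min": 0, "max": "1"}),
]
root = build_tree(elements)
print([n.path for n in depth_first(root)])
```

Running through the tree in depth-first order returns the elements in the order in which they were listed, which is the ordered sequence of nodes that is compared later on.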

3.1.3 Simplifying FHIR profiles for initial comparison

The method used to quantitatively express the similarity is the tree edit distance (see Background). The tree edit distance can be applied to ordered and unordered trees. However, the algorithm for ordered trees is less complex and more efficient than the algorithm for unordered trees, which is highly complex[16, 17, 23-27]. FHIR profiles are essentially ordered trees, although an unordered part may exist within the ordered tree due to the inclusion of multiple extensions or a sliced element[11, 28]. Slices may be ordered or unordered, which is defined in the ordered Boolean of the slicing aspect. Extensions, however, are always unordered elements. Therefore, when comparing profiles that may both contain unordered slices or extensions, the algorithm for determining the similarity between these profiles will be quite complex. Before implementing highly complex algorithms in our prototype tool for measuring structural similarity, it is useful to first investigate whether a comparison of simplified profiles with only ordered elements yields meaningful results that agree with the human judgement of the structural similarity. So, due to their complexity, extensions and slices are out of scope for this research and are therefore removed from the FHIR profiles before comparing them.

3.2 Creating a method to compute the similarity between profiles

3.2.1 Using the tree edit distance

The tree edit distance was formalized and applied to FHIR profiles. The edit operations included in our method are inserting, deleting and updating nodes. Additional operations, such as reversing, copying, moving and mapping[18, 19], are excluded from our method, both to keep the algorithm for computing the similarity less complex and because they are not useful for this purpose. If the profiles to be compared are valid structure definitions, reversing and mapping will not be needed, since the elements in structure definitions are defined in a fixed order and hierarchy. Copying is not a relevant operation either, since a profile will not contain two identical nodes. Moving is changing the position of a node among its sibling nodes. This operation is only useful when two ordered trees are being compared and the order of the nodes differs between the two trees. We do not use the move operation in our method, since we removed extensions and slices, which are the only elements that might be defined in a different order. The remaining elements of a profile are defined in a fixed order and hierarchy. Furthermore, in our method the costs for edit operations are not affected by the position of the nodes being compared. For the structural similarity or level of technical interoperability, differences in parent nodes or child nodes are equally important. Therefore, no greater weight is given to higher nodes in the tree.

3.2.2 Applying the tree edit distance to FHIR profiles

The tree edit distance was formalized to quantitatively measure the similarity between FHIR profiles in our prototype tool, which was developed using C#. A number of adaptations were made to the tree edit distance to make it suitable for comparing FHIR profiles. These adaptations are explained in this section.

3.2.2.1 Costs assigned to edit operations

All edit operations (insert, delete and update) were assigned a maximum cost of 1. These costs are commonly used in examples of computing the edit distance.[16,17,27,29] The costs assigned to inserting or deleting an element depend on the cardinality of the element being inserted or deleted. Deleting or inserting an optional element should have a lower cost than inserting or deleting a mandatory element. The difference between the edit costs for an optional element versus a mandatory element was determined in a discussion meeting with 3 FHIR experts. In this meeting the importance of differences in cardinality was discussed (see “3.2.5 Analysis of the resource ElementDefinition and determining initial weights”). The difference between a mandatory element and a prohibited element and the difference between an optional element and a prohibited element were each assigned a weight between 0 and 1. Here, 0 means the difference has no impact on structural similarity and 1 means the difference has a high impact on structural similarity. These weights were used as the edit cost for deleting or inserting a mandatory element and for deleting or inserting an optional element, respectively. After all, an insert or delete operation is used when the element does not exist (is prohibited) in one profile, but does exist in the other profile (mandatory or optional).
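As an illustration, the cardinality-dependent insert and delete cost could look as follows in Python. The function name `indel_cost` and both weight values are hypothetical placeholders: the actual weights were set by the FHIR experts and are not reproduced here. The sketch also assumes that prohibited elements have already been removed from the profiles, so only the minimum cardinality needs to be inspected.

```python
# Hypothetical weights, each between 0 and 1.
W_MANDATORY_VS_PROHIBITED = 1.0   # cost of inserting/deleting a mandatory element
W_OPTIONAL_VS_PROHIBITED = 0.5    # cost of inserting/deleting an optional element


def indel_cost(min_card: int) -> float:
    """Cost γ(A[i] → Λ) or γ(Λ → B[j]) based on the element's minimum cardinality."""
    if min_card >= 1:
        return W_MANDATORY_VS_PROHIBITED   # mandatory element (min >= 1)
    return W_OPTIONAL_VS_PROHIBITED        # optional element (min = 0)


print(indel_cost(1), indel_cost(0))
```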

3.2.2.2 Limiting the possibility for an update operation

The insert or delete operation is used when one of the profiles contains elements that do not exist in the other profile. The update operation is included in our method for matching elements that do exist in both profiles, but slightly differ from one another on their defining aspects. The update operation results in a lower edit cost than consecutively deleting an element and inserting an element to match both profiles. Therefore, the possibility to use an update operation should be limited to elements that are really similar to one another (thus, of the same kind). Within FHIR profiles the path aspect identifies the element, so one can tell which elements of the base resource are being constrained. Hence, when comparing elements, the path aspect indicates whether the elements that are being compared are of the same kind. An element with a certain path should only be updated when it is compared to an element with the same path. Otherwise, the elements being compared should be assessed as 100 percent different and insert or delete operations should be used to match the profiles. After all, it seems logical to only use an update operation on elements that are of the same kind, and to use a delete and an insert operation, resulting in higher edit costs, when this is not the case. Besides, in practice one will never change one element, such as name, into another element, such as gender. Instead, one would probably insert the missing elements to harmonize both profiles. On the other hand, one probably does want to update the aspects of an element when the elements are of the same kind. For example, one might want to update the cardinality of a name element when the name is optional in one profile but mandatory in the other. Summarizing, as a first adaptation to the commonly used method for computing the tree edit distance, the possibility for an update operation is limited. An update operation can only be carried out when the elements being compared have the same identifying path aspect, instead of allowing updates for all elements that have the same position in the tree structure.

3.2.2.3 Multiple attributes

Furthermore, according to the tree edit distance a node needs to be updated against a certain cost when the node is similar to a node in the other tree, but has a different label (attribute)[16]. So, when the tree edit distance is applied to profiles, an element (node) needs to be updated when the aspects defining the element (attributes of the node) differ from the aspects of its counter-part. As elements are defined by multiple aspects, multiple attributes need to be compared to assess the similarity between nodes. However, as described in the commonly used tree edit distance[16] the update operation (updating one attribute) has a specified cost. This cost does not vary with respect to the number of attributes that need to be matched. Therefore, the tree edit distance is slightly extended to make it suitable for comparing structure definitions. The cost for an update operation depends on the number of attributes that need to be updated and the importance of each of these attributes. Since the aspects are not equally important for the structural similarity between profiles, each aspect is assigned a certain weight. This weight determines the contribution to the total cost for updating the element (with a maximum of 1). The total cost for updating a node is the sum of the weights of each updated attribute.
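A weighted update cost of this kind might be sketched in Python as shown here. The aspect weights in `ASPECT_WEIGHTS` are made-up values for illustration only, and capping the sum with `min(..., 1.0)` is an assumption about how the maximum of 1 is enforced; the prototype tool may enforce it differently, for example by choosing weights that sum to 1.

```python
# Hypothetical weights per aspect; 0 means a difference is irrelevant.
ASPECT_WEIGHTS = {
    "type": 0.5,
    "min": 0.25,
    "max": 0.25,
    "definition": 0.25,
    "short": 0.0,
}


def update_cost(a: dict, b: dict, weights=ASPECT_WEIGHTS) -> float:
    """Cost γ(A[i] → B[j]): sum of the weights of all differing aspects, at most 1."""
    total = 0.0
    for aspect, weight in weights.items():
        if a.get(aspect) != b.get(aspect):   # a missing aspect counts as a difference
            total += weight
    return min(total, 1.0)                   # assumption: cap the total at 1


name_a = {"type": "HumanName", "min": 0, "max": "*"}
name_b = {"type": "HumanName", "min": 1, "max": "*"}
print(update_cost(name_a, name_b))  # 0.25: only the min aspect differs
```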

3.2.2.4 Complex aspects

Some aspects of an element consist of multiple sub-aspects, which also differ in importance regarding the structural similarity and technical interoperability. Examples are the binding, type and mapping aspects. Differences in some sub-aspects have more impact on the level of technical interoperability between profiles than differences in other sub-aspects. Besides, elements can also differ on only one of the sub-aspects, but not on the others. The elements should then be assessed as more similar than when all sub-aspects differ. Therefore, the sub-aspects are compared separately and each sub-aspect is also assigned a weight related to the impact a difference has on the structural similarity. In this case, when not all sub-aspects differ, only a percentage of the weight assigned to the main aspect will be used in computing the total cost for updating one of the elements.

3.2.2.5 Aspects with multiple instances

Some aspects can be instantiated more than once. For example, an ElementDefinition can contain multiple codes, aliases or types. So, when two elements are being compared, the lists of instances for each of these aspects should be compared. These two lists should be checked for similar instances and missing instances. When differences are found, one of the two lists needs to be updated to match the other. The cost for this update operation should also be variable. So, some adjustments are made to the method, such that the costs for updating these kinds of aspects are based on the percentage of differences between the two lists of instances.
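The partial costs for complex aspects and for aspects with multiple instances can be sketched together in Python. Both functions are hypothetical illustrations. In particular, the text does not define how the percentage of differences between two lists is computed; the sketch assumes it is the share of distinct instances that occur in only one of the two lists.

```python
def list_difference(xs, ys) -> float:
    """Fraction (0..1) of distinct instances that occur in only one of the lists."""
    sx, sy = set(xs), set(ys)
    union = sx | sy
    if not union:
        return 0.0                           # two empty lists are identical
    return len(sx ^ sy) / len(union)         # symmetric difference over union


def complex_aspect_cost(a: dict, b: dict, aspect_weight: float, sub_weights: dict) -> float:
    """Share of the main aspect's weight that corresponds to the differing sub-aspects."""
    total = sum(sub_weights.values())
    differing = sum(w for name, w in sub_weights.items() if a.get(name) != b.get(name))
    return aspect_weight * differing / total


# Two alias lists that share one of three distinct instances.
print(list_difference(["surname", "last name"], ["surname", "family"]))

# A binding that differs on the value set only (hypothetical sub-weights).
binding_a = {"strength": "required", "valueSet": "http://example.org/vs/a"}
binding_b = {"strength": "required", "valueSet": "http://example.org/vs/b"}
print(complex_aspect_cost(binding_a, binding_b, 0.4, {"strength": 3, "valueSet": 1}))
```

The list-based fraction can be multiplied by the weight of the aspect in the same way as the sub-aspect share, so that a partly matching list contributes only part of the aspect's weight to the update cost.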

3.2.2.6 Cardinality defined in attributes

The min and max aspects of an element together define the cardinality of the element and therefore whether an element is prohibited, optional or mandatory. Thus, whether the element actually exists is defined in the attributes of the node. When an element has a maximum cardinality of zero, other aspects do not matter, since the element does not exist and should be ignored in the comparison of profiles. Furthermore, the descendants (child nodes etc.) of this node should be ignored in the comparison of profiles, as these will also not exist in this case. Therefore, some adjustments were made. When comparing profiles, the profiles are first checked for elements with a cardinality of zero to zero. These elements and their descendants are removed before computing the tree edit distance between the profiles.
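This preprocessing step can be illustrated with a small Python sketch. The function name `prune_prohibited` and the dictionary layout of an element are assumptions made for the example; descendants are recognised by their path starting with the path of the prohibited element followed by a dot.

```python
def prune_prohibited(elements):
    """Remove elements with max = 0 and all of their descendants.

    `elements` is a list of dicts with at least a 'path' and a 'max' key.
    """
    removed = []   # paths of prohibited elements found so far
    kept = []
    for el in elements:
        path = el["path"]
        # Skip descendants of an element that was removed earlier.
        if any(path.startswith(p + ".") for p in removed):
            continue
        if el.get("max") == "0":
            removed.append(path)   # prohibited: drop it and remember its path
            continue
        kept.append(el)
    return kept


profile = [
    {"path": "Patient", "max": "*"},
    {"path": "Patient.contact", "max": "0"},
    {"path": "Patient.contact.name", "max": "1"},
    {"path": "Patient.communication", "max": "*"},
]
print([el["path"] for el in prune_prohibited(profile)])
```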

3.2.2.7 Different base resource

Some base resources partly contain the same elements, such as the resources Patient and Practitioner. These resources both contain a name, telecom, address, gender, birthdate, etcetera. However, these elements always differ on the identifying path aspect, since the root node, which determines the first part of the path, differs (Patient versus Practitioner). Thus, the comparison of these profiles leads to an edit distance equal to the costs for deleting all elements of one profile and subsequently inserting all elements of the other profile. So, the similarity between these profiles could be computed. However, one is often comparing profiles to find similar profiles or to determine the amount of modifications required to harmonize the profiles and enable cooperation. For these purposes, only profiles on the same concept are of interest. Profiles on a different concept can be ignored. Although the edit distance between profiles on a different concept could be computed and would probably be high, an adaptation was made to the commonly used method for computing the tree edit distance. For efficiency reasons, a difference in the path attributes of the root nodes of the two trees results in an infinite edit distance instead of computing the costs for the edit operations. This infinite edit distance indicates that profiles on different concepts are being compared and that one of the two profiles is probably not of interest.


3.2.3 Linear comparison of ordered profiles

The simplified profiles used in this study are ordered trees, since any unordered parts are removed. The quantitative measure being used is the tree edit distance, in which the position of a differing node does not influence the cost for an edit operation. So, no matter where the node is positioned within the tree, the update, insert or delete operation has the same cost. According to Tai[29], Zhang[17] and Bille[16], the ordered tree edit distance can be introduced as a generalization of the string edit distance problem. The string edit distance is a linear comparison of two sequences of characters, in which the position of a difference between the sequences also does not influence the cost for an edit operation[20]. As ordered trees have a strict order and structure of nodes, ordered sequences of nodes are formed when trees are run through in a structured manner, such as a depth-first order. Since in our method neither the position of the nodes being compared nor any differences in the parent nodes of these particular nodes influence the edit cost, these ordered sequences of nodes can also be compared linearly (instead of hierarchically). So, these ordered sequences can be compared in the same way as strings, with the only difference that nodes are being compared instead of characters. Therefore, an algorithm for comparing linear sequences, such as strings, can also be used for comparing ordered trees. In this study, the algorithm of Wagner and Fischer for computing the edit distance between strings[20] was used for comparing ordered profiles.

3.2.4 Using the Wagner-Fischer algorithm

3.2.4.1 Applying the Wagner-Fischer algorithm to FHIR profiles

As a first step, FHIR profiles (tree structures) are transformed into ordered lists of elements by running through these profiles in a depth-first order. These ordered lists are compared linearly by using a slightly extended Wagner-Fischer algorithm[20], which is shown in textbox 3. The extensions were added to comply with our implementation of the tree edit distance: they limit the possibility for an update operation and compute the cost for an update operation based on the differing attributes. So, an update operation is only used when the nodes being compared have the same identifying path aspect, and a function is included that compares all aspects of the elements and computes the update cost. Using this algorithm, a matrix holding the edit distances between all sub-lists is filled in an iterative manner. The reader is referred to the “Background” section for a more extensive explanation of this iterative process of the Wagner-Fischer algorithm. The distance between the full lists is the last value added to the matrix, which can be found in the bottom-right corner of the matrix.


Textbox 3: extended Wagner-Fischer algorithm[20]. *The cost for the update operation depends on the aspects that differ between the two elements and varies between 0 and 1. **The cost for the insert and delete operation depends on the cardinality of the element being inserted or deleted and varies between 0 and 1.

As discussed in “3.2.2 Applying the tree edit distance to FHIR profiles”, the cost for updating an element to match its counterpart should be related to the importance of the differing aspects and the number of differing aspects. Therefore, a function is added to the algorithm of Wagner and Fischer to assess the difference between two elements by comparing all aspects. The difference between the elements is quantitatively expressed as the size of the update cost (γ(A[i] → B[j]), ranging between 0 and 1). The costs for inserting or deleting an element depend on the cardinality of the element being inserted or deleted (ranging between 0 and 1). The cost for inserting or deleting a mandatory element is higher than the cost for inserting or deleting an optional element. The extended Wagner-Fischer algorithm in textbox 3 was implemented in C# software and can be found in appendix 1.

3.2.4.2 Example of comparing two profiles using the Wagner-Fischer algorithm

As an example, we will compare the two simplified ordered profiles on “observation” in figure 5. When running through these trees in a depth-first order, two ordered lists of elements will be formed. When only showing the identifying path aspect of each element, these ordered lists are:

Profile1 = {observation, observation.subject, observation.value[x], observation.referenceRange, observation.referenceRange.low, observation.referenceRange.high, observation.referenceRange.type, observation.related, observation.related.type, observation.related.target}

Profile2 = {observation, observation.subject, observation.code, observation.referenceRange, observation.referenceRange.low, observation.referenceRange.high, observation.referenceRange.type, observation.related, observation.related.type, observation.related.target, observation.component, observation.component.code, observation.component.value[x]}

The unique elements, which exist in only one of the two profiles (bold in the original layout), are observation.value[x] in profile 1 and observation.code, observation.component, observation.component.code and observation.component.value[x] in profile 2. The subject elements (italic in the original layout) exist in both profiles, but differ on their defining aspects.

// D is the matrix for holding edit distances between sub-lists of elements
D[0, 0] := 0;

// fill the first column of the matrix with distances between each sub-list of A and an empty list by deleting all elements of A
for i := 1 to |A| do D[i, 0] := D[i-1, 0] + γ(A[i] → Λ);

// fill the first row of the matrix with distances between each sub-list of B and an empty list by inserting all elements of B
for j := 1 to |B| do D[0, j] := D[0, j-1] + γ(Λ → B[j]);

// subsequently fill each row of the matrix with distances between all sub-lists with the minimum number of edits (leading to the lowest total cost)
for i := 1 to |A| do
  for j := 1 to |B| do begin
    m2 := D[i-1, j] + γ(A[i] → Λ);  // delete operation with a cost γ(A[i] → Λ) **
    m3 := D[i, j-1] + γ(Λ → B[j]);  // insert operation with a cost γ(Λ → B[j]) **
    // update operation only possible when elements have the same path aspect
    if (A[i].Path == B[j].Path)
      m1 := D[i-1, j-1] + γ(A[i] → B[j]);  // update operation with a cost γ(A[i] → B[j]) *
    else
      // m1 should be bigger than m2 and m3, such that the update operation will not be selected in the final step
      m1 := m2 + m3;
    D[i, j] := min(m1, m2, m3);  // use the edit operation resulting in the minimum distance
  end;

Figure 5: example profiles on observation

In this example, all elements of profile 1 and profile 2 (figure 5) with the same path aspect are exactly the same, except for the subject elements. One of the subject elements requires an update operation, since these elements have an additional type aspect which differs. The cost for an update operation depends on the aspects that differ. In this example, we use a cost of 0.5 to update the type aspect, such that the updated element matches its counterpart. Furthermore, in this example all elements have a cardinality of 1 to 1, so all elements are mandatory (not shown in figure 5). The cost for deleting or inserting a mandatory element that we use in this example is 1.

We used the adapted Wagner-Fischer algorithm described in textbox 3. A matrix with edit distances between all sub-lists of elements, which is shown in Table 2, will be created.


Table 2: matrix with edit distances between all sub-lists of elements of profile 1 and profile 2.

P1 \ P2  Λ        o        o.s      o.c      o.r      o.r.l    o.r.h    o.r.t    o.re     o.re.t   o.re.ta  o.co     o.co.c   o.co.v
Λ        0        1        2        3        4        5        6        7        8        9        10       11       12       13
o        1        0        1        2        3        4        5        6        7        8        9        10       11       12
o.s      2        1        0.5      1.5      2.5      3.5      4.5      5.5      6.5      7.5      8.5      9.5      10.5     11.5
o.v      3        2        1.5      2.5      3.5      4.5      5.5      6.5      7.5      8.5      9.5      10.5     11.5     12.5
o.r      4        3        2.5      3.5      2.5      3.5      4.5      5.5      6.5      7.5      8.5      9.5      10.5     11.5
o.r.l    5        4        3.5      4.5      3.5      2.5      3.5      4.5      5.5      6.5      7.5      8.5      9.5      10.5
o.r.h    6        5        4.5      5.5      4.5      3.5      2.5      3.5      4.5      5.5      6.5      7.5      8.5      9.5
o.r.t    7        6        5.5      6.5      5.5      4.5      3.5      2.5      3.5      4.5      5.5      6.5      7.5      8.5
o.re     8        7        6.5      7.5      6.5      5.5      4.5      3.5      2.5      3.5      4.5      5.5      6.5      7.5
o.re.t   9        8        7.5      8.5      7.5      6.5      5.5      4.5      3.5      2.5      3.5      4.5      5.5      6.5
o.re.ta  10       9        8.5      9.5      8.5      7.5      6.5      5.5      4.5      3.5      2.5      3.5      4.5      5.5

The first row (profile 2) and the first column (profile 1) contain the abbreviations of the path aspects of the elements. The blue cell indicates the first point in the matrix where the algorithm starts with comparing elements (D[i,j]=D[1,1]).

The green cells highlight the edit operations required to turn one profile into the other with a minimum edit distance. When the edit distance is unchanged, an update operation with a cost of 0 is carried out.

In this example, the first elements are exactly the same (the only aspect is path, which is equal). So, these elements require an update operation with a cost of 0, which means no update is needed. A delete or insert operation on these elements costs 1, since these elements are mandatory. Thus, in this particular case, γ(A[i] → Λ) and γ(Λ → B[j]) in textbox 3 have a value of 1 and γ(A[i] → B[j]) has a value of 0. According to textbox 3, m1 is computed with the formula m1 = D[i-1, j-1] + γ(A[i] → B[j]), since the elements have the same path aspect. Looking at Table 2, one can see that D[i-1, j-1] (thus D[0,0], as the blue cell is D[i,j] = D[1,1]) has the value 0. So, in this case m1 also has the value 0 (0+0=0). As stated in textbox 3, m2 and m3 are computed with the formulas m2 = D[i-1, j] + γ(A[i] → Λ) and m3 = D[i, j-1] + γ(Λ → B[j]). Looking at Table 2, one can see that D[0,1] (D[i-1, j]) and D[1,0] (D[i, j-1]) both have the value 1. So, in this case m2 and m3 have the value 2 (1+1=2). Therefore, m1 is selected as the minimum edit distance and the blue cell in table 2 gets the value 0.

When the algorithm is at D[2,2], the point of comparing the subject elements of both profiles (abbreviated as o.s in table 2), the update cost γ(A[i] → B[j]) is 0.5. After all, as shown in figure 5 these elements differ on the type aspect. The cost for updating this aspect is 0.5, while the costs for inserting and deleting the subject element (γ(A[i] → Λ) and γ(Λ → B[j])) are 1. So, when looking at textbox 3, m1 leads to 0.5, since D[1,1] has the value 0 (0+0.5=0.5). As D[1,2] and D[2,1] both have the value 1, m2 and m3 lead to 2 (1+1=2). Therefore, m1 is selected as the minimum edit distance.

When the algorithm is at D[2,3], the second element of profile 1 is compared to the third element of profile 2 (subject and code, abbreviated as o.s and o.c in table 2). An update operation is not used, because the path aspects differ. According to textbox 3, m1 gets the value of the sum of the edit distances m2 and m3. So, m2 or m3 will be selected as the minimum edit distance, since these edit distances are lower than m1. When looking at textbox 3 and table 2, m2 results in an edit distance of 3, since D[1,3], which is D[i-1, j] in this case, has the value 2 and the cost for deleting the element is 1 (2+1=3). The edit distance resulting from m3 is 1.5, since D[2,2] (D[i, j-1]) has the value 0.5 and inserting the element has a cost of 1 (0.5+1=1.5). So, m3 is selected as the minimum edit distance between the two sub-lists of profile 1 and profile 2. When continuing this process, one finally ends up at the bottom-right, which is the edit distance between the profiles. In this example, the edit distance between the profiles is 5.5.
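The worked example can be reproduced with a compact Python sketch of the extended algorithm in textbox 3. This is not the C# code of appendix 1. The class `Element`, the helper functions and the type values of the two subject elements are hypothetical; the costs (1 for inserting or deleting a mandatory element, 0.5 for a differing type aspect) are those of the example, while the cost of 0.5 for an optional element is a placeholder that is not used here. The check on the root paths corresponds to the adaptation described in “3.2.2.7 Different base resource”.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Element:
    path: str
    min_card: int = 1                              # all example elements are mandatory (1..1)
    aspects: dict = field(default_factory=dict)


def indel_cost(el):
    return 1.0 if el.min_card >= 1 else 0.5        # 0.5 is a hypothetical optional weight


def update_cost(a, b):
    return 0.0 if a.aspects == b.aspects else 0.5  # example: only the type aspect can differ


def distance_matrix(A, B):
    """Fill the matrix D of textbox 3 for two non-empty element lists."""
    n, m = len(A), len(B)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + indel_cost(A[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel_cost(B[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            m2 = D[i - 1][j] + indel_cost(A[i - 1])      # delete
            m3 = D[i][j - 1] + indel_cost(B[j - 1])      # insert
            if A[i - 1].path == B[j - 1].path:           # update only for equal paths
                m1 = D[i - 1][j - 1] + update_cost(A[i - 1], B[j - 1])
            else:
                m1 = m2 + m3                             # never smaller than m2 or m3
            D[i][j] = min(m1, m2, m3)
    return D


def profile_distance(A, B):
    if A[0].path != B[0].path:                     # profiles on different base resources
        return math.inf
    return distance_matrix(A, B)[-1][-1]


common_tail = ["observation.referenceRange", "observation.referenceRange.low",
               "observation.referenceRange.high", "observation.referenceRange.type",
               "observation.related", "observation.related.type",
               "observation.related.target"]
profile1 = ([Element("observation"),
             Element("observation.subject", aspects={"type": "Patient"}),   # hypothetical type
             Element("observation.value[x]")]
            + [Element(p) for p in common_tail])
profile2 = ([Element("observation"),
             Element("observation.subject", aspects={"type": "Group"}),     # hypothetical type
             Element("observation.code")]
            + [Element(p) for p in common_tail]
            + [Element("observation.component"), Element("observation.component.code"),
               Element("observation.component.value[x]")])

D = distance_matrix(profile1, profile2)
print(D[1][1], D[2][2], D[2][3])             # 0.0 0.5 1.5
print(profile_distance(profile1, profile2))  # 5.5
```

The printed cells correspond to the three steps discussed above, and the final value is the bottom-right entry of table 2.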


3.2.5 Analysis of the resource ElementDefinition and determining initial weights

3.2.5.1 Focus group with FHIR experts

An analysis of the FHIR resource “ElementDefinition” was carried out to group the aspects, to identify the critical aspects of elements that have a big impact on the structural similarity between profiles when differences occur, and to identify aspects that are invariant with respect to the structural similarity. These latter aspects have no impact on the structural similarity when differences occur. All aspects were also analyzed regarding their computability. This analysis was carried out in a focus group and was based on the opinion of three FHIR experts. During a discussion meeting, these FHIR experts discussed the impact of a difference in each aspect on the structural similarity and on the level of technical interoperability between profiles. They assigned a weight between 0 and 1 to each aspect to quantitatively express the impact of a difference in that aspect and to show its relative importance with respect to the other aspects. An assigned weight of 0 indicated that a difference in that aspect was considered irrelevant and a weight of 1 indicated that a difference in that aspect was considered to have a major impact on the structural similarity.

3.2.5.2 Analyzed scenarios and aspects of the ElementDefinition

Multiple scenarios were discussed for varying differences in the cardinality, which is defined by the min and max aspects, and for varying differences in the value ranges (the range in which the value of an instance should fall), which is defined by the minValue[x] and maxValue[x] aspects[22]. The experts assigned a weight between 0 and 1 to each scenario.

The scenarios for the cardinality when comparing two elements are:

1) both elements are prohibited

2) one element is mandatory, the other is prohibited

3) one element is prohibited, the other is optional

4) one element is mandatory, the other is optional

5) both elements are mandatory, but the minimum number of instances of one element is higher than the maximum number of instances of the other element. For example, one has a cardinality of 1-to-3 and the other of 4-to-5.

6) both elements are mandatory, but the cardinality aspects of the elements differ such that the minimum cardinality of the first element falls within the cardinality range of the second element, but the maximum cardinality does not fall within this range. For example, one has a cardinality of 1-to-3 and the other of 2-to-5.

7) both elements are mandatory or both elements are optional, but the cardinality aspects of the elements differ such that only the minimum cardinalities differ or only the maximum cardinalities differ. For example, one has a cardinality of 1-to-3 and the other 2-to-3.

The scenarios for the value range when comparing two elements are:

1) the range defined in one element falls within the range defined in the other element

2) the ranges defined in both elements only differ on the lower or upper limit

3) the range defined in one element falls outside the range defined in the other element

4) the ranges defined in both elements differ on the upper and lower limits, but the ranges partially overlap one another.
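To make the four value range scenarios concrete, the following Python sketch classifies a pair of ranges. It is an illustration only: the function name `classify_ranges`, the representation of a range as a `(low, high)` tuple and the returned labels are hypothetical. Two interpretations are assumptions: "falls outside" is read as the ranges having no value in common, and a pair that shares one limit is assigned to scenario 2 rather than scenario 1, although one range then also lies within the other. No weights are returned, since these were assigned by the experts.

```python
def classify_ranges(a, b):
    """Classify two value ranges, each given as a (low, high) tuple with low <= high."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    if (a_lo, a_hi) == (b_lo, b_hi):
        return "identical"                      # no difference on this aspect
    if a_hi < b_lo or b_hi < a_lo:
        return "scenario 3: outside"            # assumption: no common value
    if a_lo == b_lo or a_hi == b_hi:
        return "scenario 2: one limit differs"  # checked before scenario 1 (assumption)
    if (b_lo < a_lo and a_hi < b_hi) or (a_lo < b_lo and b_hi < a_hi):
        return "scenario 1: within"
    return "scenario 4: partial overlap"


print(classify_ranges((2, 5), (1, 10)))
print(classify_ranges((1, 5), (1, 10)))
print(classify_ranges((1, 3), (4, 8)))
print(classify_ranges((1, 5), (3, 8)))
```

The order of the checks matters: swapping the scenario 1 and scenario 2 tests would change the outcome for ranges that share a limit, so a real implementation would need to follow whatever precedence the experts intended.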

Furthermore, the following aspects of elements were analyzed [22] and received a weight between 0 and 1:

• representation
• label
• code
• short
• definition
• comments
