Comparing information models to FAIRify ICU quality registry data sources and implementing the optimal model


A FAIRly NICE Thesis

Student

Rowdy de Groot

Email: Rowdy.degroot@amsterdamumc.nl Student number: 11036508

Title

Comparing information models to FAIRify ICU quality registry data sources and implementing the optimal model

Supervision

Dr. Nirupama Benis (mentor)

Dr. Ferishta Raiez (mentor)

Dr. ir. Ronald Cornet (tutor)

Prof. dr. Nicolette de Keizer (formal mentor)

Location

Department of Medical Informatics, Amsterdam UMC & NICE quality registry, Meibergdreef 15, 1105 AZ Amsterdam

Period


Index

Abstract

Introduction

Background

Aims

Methods

Overview of the three models

SARI subset OMOP CDM representation

SARI subset ContSys representation

OMOP to ContSys mapping

MDS in OMOP CDM representation

Implementing the MDS ETL

Results

Overview of the three models

SARI subset OMOP CDM representation

SARI subset ContSys representation

Comparison of time taken for mapping to both models

OMOP to ContSys mapping

MDS in OMOP CDM representation

Implementing the MDS ETL

Discussion

OMOP CDM and ContSys experience comparison

OMOP CDM and ContSys interoperability

Discussion of the findings in relation to existing literature

Recommendations

Strength and weaknesses

Future research questions

Conclusion

References


Abstract

Introduction

NICE (National Intensive Care Evaluation) is a Dutch quality registry that collects data on Intensive Care Unit (ICU) admissions from all the ICUs in the Netherlands. Different databases are likely to have varied data structures, which can impede efficient data analysis. Making the data FAIR (Findable, Accessible, Interoperable, Reusable) can help with this. The information model is an important component in the process of making data FAIR. Three commonly used models, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) and ISO13940:2015 (ContSys) from the International Organization for Standardization, were examined to find the most suitable model for ICU quality registries and individual ICUs.

Methods

Based on a literature search the three models were compared for certain characteristics and an overlap of domains. Based on the comparison it was decided to map two modules of the NICE dataset to OMOP CDM and one module to ContSys. The OMOP representation was mapped to ContSys for the one overlapping module to determine the interoperability between the two models. One of the OMOP representations was implemented.

Results

The comparison of the models was done with characteristics found in the literature and some which were considered to be important in the context of ICUs. An overview of the overlap of domains between the three models showed which domains were alike. Of the two chosen models, OMOP CDM was the easier model to map. A guidebook, forum, and tools are supplied to help users with the OMOP CDM implementation. ContSys, being a more conceptual model, lacks all of these. In OMOP CDM 94.6% of data values were mapped; in ContSys it was 93.5%. The results of the OMOP to ContSys mapping make it apparent that the two models are interoperable. One of the OMOP representations was implemented and a test was successfully performed.

Conclusion

The evaluation of the three information models makes apparent that OMOP CDM is the most usable for ICU quality registries. NICE is recommended to use the OMOP CDM information model to make the NICE data FAIR. ICUs could potentially benefit most from ContSys. However, ContSys could be problematic for making data FAIR due to the lack of implementation guidelines. OMOP and ContSys could both be considered for hospital data.


Introduction

More observational data is becoming available in the form of electronic health care data and insurance claims data. As a consequence, more observational research in outcomes and (pharmaco)epidemiology is performed (1). All these databases, e.g. databases underlying Electronic Healthcare Records (EHR) in hospitals or databases of national quality registries, are likely to have varied data structures. This makes data analysis hard, as observational studies have to deal with different table structures, missing data, free text and a lack of consistent data definitions (2, 3). The NICE (National Intensive Care Evaluation) registry is one such organization that collects observational data. NICE is a Dutch quality registry which collects data on Intensive Care Unit (ICU) admissions from all the ICUs in the Netherlands (4). NICE provides feedback reports, which ICUs can use to monitor and compare their performance with similar ICUs and national averages in order to improve quality of care. Data from the NICE registry can also be used by participating ICUs for clinical research (5). Analyses are done by or under the supervision of the NICE team, to make sure that the data is correctly analyzed and interpreted. Other countries have ICU quality registries similar to NICE and there is a strong intention to conduct joint research. However, this is hampered because every registry has its own dataset and data definitions.

With the increasing need to improve the infrastructure supporting the reuse of data, more attention is being given to making data FAIR (Findable, Accessible, Interoperable, Reusable) (6). OHDSI (Observational Health Data Sciences and Informatics) is an initiative that invites data sources to become partners in a network to make their data sources FAIR. This makes it possible to answer large multisite research and policy questions by reusing data (7).

The information model is an important component in the process of making data FAIR. It is a representation of concepts, relationships, constraints, rules and operations which are used to specify data semantics in a certain domain (8). Information models clearly define the data items in the database and the relationships between them. Therefore, information models are useful for interoperability (the ‘I’ of FAIR) between databases.

NICE wants to make its data FAIR to support a wider use of the data, not only nationally for the participating Dutch ICUs but also internationally with other quality registries, and therefore NICE data should be represented using a common information model. Ideally there would be no central data collection by NICE and the ICUs would make their data FAIR, so that NICE can gather information with a Personal Health Train (PHT) (9, 10) if NICE has permission to do so. Unfortunately, that does not seem realistic in the short term and therefore NICE has to collect data in a central database.

Three commonly used information models are the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) and ISO13940:2015, also known as ContSys, from the International Organization for Standardization (11-13). The question is which model is most suitable for ICU quality registries and the individual ICUs.


Background

OHDSI is an initiative that is heavily involved in leading and organizing the evolution and adoption of the OMOP CDM (14). The original focus for OMOP was drug safety, but over the years the OMOP CDM evolved into also supporting analytical use cases, including comparative effectiveness of medical interventions and health system policies (15). The OMOP CDM is a person-centric relational model mostly suitable for observational data (2, 7). OMOP contains classes, classes contain domains and the domains contain columns. Domains and columns all have their own definitions. For example, OMOP has a class “Person” that contains the domain “Observation period” and that domain has a column “observation_period_id”. The definition for the domain “Observation period” in OMOP is “The OBSERVATION_PERIOD table contains records which uniquely define the spans of time for which a Person is at-risk to have clinical events recorded within the source systems, even if no events in fact are recorded (healthy patient with no healthcare interactions)” (16). The definition for the column “observation_period_id” is “A unique identifier for each observation period” (16).

CDISC is a non-profit organization that develops data standards to transform unstructured data into a framework for generating clinical research data (17). Over the years, CDISC gained more interest from regulatory authorities such as the U.S. Food and Drug Administration and the Japanese Pharmaceutical and Medical Devices Agency. One of the standards is SDTM, which defines a standard structure for study data tabulations (18). CDISC SDTM is meant to support clinical trials and therefore supports mostly clinical data (18). CDISC SDTM contains classes which contain domains, and the domains contain variables.

ISO13940 or ContSys is an information model made by the International Organization for Standardization (ISO), which defines a system of concepts for different aspects of the provision of healthcare. The model has been designed to include professional healthcare, self-care, care by a third party and all aspects of social care, so the model was intentionally kept broad (19). ContSys is meant to support all (health)care related data due to its exceptionally broad scope. Because ContSys is a conceptual model it is not oriented towards implementation, which is in contrast to OMOP CDM and CDISC SDTM. ContSys contains clauses, which contain concepts. Figure 1 shows the hierarchical structure and relationship between the models.

The NICE database contains many modules, like the Minimal Dataset (MDS), Sequential Organ Failure Assessment (SOFA) and Severe Acute Respiratory Infection (SARI). The MDS is the core dataset from NICE and is used to record the demographics, admission and discharge details, physiological reasons for admission and severity of illnesses during the first 24 hours of IC admission, together with the outcome measures IC and hospital mortality and length of stay. The SARI dataset is used to report to the RIVM (Dutch National Institute for Public Health and the Environment). The goal of the SARI dataset is to monitor epidemics of SARIs and influenza-like illnesses. SOFA is a daily data collection for recording organ failure. Physiological parameters are recorded daily for organ systems, whereby the dysfunction or failure is established (20).

Figure 1. Hierarchical structure and relationship of the data models. The OMOP CDM classes, CDISC SDTM classes and ContSys clauses are on the same class level. The OMOP CDM domains, CDISC SDTM domains and ContSys concepts are on the same domain level. The OMOP CDM columns and CDISC SDTM variables are on the same detail level. ContSys has nothing comparable to the OMOP CDM columns and CDISC SDTM variables on the detail level. ISO13606 EHRCom (21) or ISO12967 HISA (22) could be used to complement ContSys on the detail level.


Aims

The aim of this research project is to evaluate three information models, i.e. OMOP CDM, CDISC SDTM, and ContSys, on their usability for registry data and hospital data in general. In addition, it will be investigated how suitable these models are to model (a subset of) the NICE dataset. The subset will be represented in the two most suitable models. Based on the results, the entire NICE dataset will be represented in the most suitable model(s) to check the integrity for longitudinal ICU data for that particular model. Integrity of a data model is the extent to which associations in the evaluated data model match the specific project needs (23). Furthermore, this research project includes an investigation of the data interoperability between the three different models. This includes the possibility to transform hospital data in one data model to NICE data in another model. It is to be expected that not everyone will use the same data model. Therefore, it is interesting and important to know how interoperable certain data models are. This will give an indication of the extent to which data can be queried from one FAIR data source to another.


Methods

Overview of the three models

A literature search was performed with PubMed and Google Scholar to find characteristics of the information models that could be compared and methods to compare them. The following terms or a combination of these terms were used for searches: ‘OMOP’, ‘CDM’, ‘CDISC’, ‘ISO 13940’, ‘ContSys’, ‘comparing’ and ‘comparison’. Titles were checked for relevance for each result of the search queries. A title was considered relevant when it contained at least one of the names of the information models. The abstract was read when the title was considered relevant. The whole article was read if the abstract mentioned the evaluation of one of the information models which were used as a search term, or if the information models in question were compared with other information models.

To compare the data models, the OMOP domains, CDISC domains and ContSys concepts were compared with each other. These levels were chosen because they are the lowest level without being too model specific (see Figure 1). For the comparison of the three data models, the OMOP definition for a domain was taken as a basis definition (24). The OMOP definition that was used as basis was generalized if the definition was too specific to OMOP. For example, the definition for the domain “Observation period” in OMOP is “The OBSERVATION_PERIOD table contains records which uniquely define the spans of time for which a Person is at-risk to have clinical events recorded within the source systems, even if no events in fact are recorded (healthy patient with no healthcare interactions)” (16). This definition was generalized to “Period of observation when person is at risk of recording clinical events” in the comparison.

Next, the list of CDISC SDTM domains (25) was checked to find domain descriptions which fitted or overlapped with a (generalized) definition taken from OMOP. If the CDISC domain definition matched or overlapped with the OMOP definition, then the domains were matched. The same process was followed for the ContSys concepts (13) with the OMOP domains. The whole process was then repeated, with the definitions of the CDISC domains compared to the ContSys definitions. It was possible that more than one domain definition from each model would fit the generalized OMOP definition.

SARI subset OMOP CDM representation

The SARI dataset is a subset of the NICE Minimal Data Set and was used in the initial mappings to OMOP CDM and ContSys.

The SARI dataset was first represented in the OMOP CDM. Figure 2 shows the steps to map a dataset to the OMOP CDM. The guide from OHDSI was used to conduct the process of making the Extract, Transform, Load (ETL) design for the OMOP CDM (15). The process started by scanning the SARI dataset file using White Rabbit version 0.8.0, a tool provided by OHDSI (26). White Rabbit generated a scan report, which contained information about the fields and frequency distributions of the values. The scan report created by White Rabbit was loaded into Rabbit-in-a-Hat version 0.9.0, which is also provided by OHDSI (26). Rabbit-in-a-Hat provides a graphical user interface to help the user create an ETL specification document according to the OMOP template. In Rabbit-in-a-Hat, the SARI dataset and columns were mapped to the OMOP CDM tables and columns. Usagi is the application provided by OHDSI to create mappings between source terms and the Vocabulary standard concepts (27). Figure 2 also shows the steps to implement the ETL and to run ATLAS. However, those steps were not performed at this stage. ATLAS is used to design and execute observational analyses to generate evidence from patient level observational data.
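To illustrate the kind of information a scan report holds (fields and the frequency distribution of their values), the following is a minimal Python sketch. It is not White Rabbit itself and does not reproduce its output format; the function name `scan_fields`, the column names and the three sample rows are hypothetical and only serve the illustration.

```python
import csv
import io
from collections import Counter


def scan_fields(csv_text):
    """Return {field name: Counter of values} for a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: Counter() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            counts[name][value] += 1
    return counts


# Hypothetical three-row extract; column names and values are illustrative only.
sample = "gender,died_in_hospital\nM,0\nF,0\nM,1\n"
report = scan_fields(sample)
for field, counter in report.items():
    # Print each field with its values ordered by frequency.
    print(field, counter.most_common())
```

Such a per-field frequency overview helps to decide which source values need a vocabulary mapping before the ETL is designed.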


A file with source codes and definitions from the SARI dataset was loaded into Usagi. Usagi suggests mappings based on textual similarities or code descriptions. Usagi gives the suggestions a matching score from 0.00 to 1.00, where 0.00 means no match and 1.00 a certain match. The automated suggestion for a target concept was approved when it was an appropriate target match. If the automated suggestion was not correct, then a manual search was performed for an appropriate target concept. When there was no suitable match for a source concept, the concept was mapped to CONCEPT_ID = 0. If a source term was too complex to map to one concept, multiple concepts were mapped to it. The SARI dataset contains a column “APACHE 4 diagnosis” which includes 445 possible values for admission diagnosis. All of these values were mapped to SNOMED CT codes. It was possible to filter on vocabularies in Usagi, so when the SNOMED CT vocabulary filter was applied, Usagi only suggested SNOMED CT concepts for the data values. A reviewer checked all mappings and revisions were made when both the mapper and the reviewer agreed on the changes. When the mappings were approved, they could be used in the ETL design. To check the accuracy of Usagi, it was counted how many times the suggestion from Usagi was accepted, how many suggested mappings had a matching score of 1.0 and how many times it was not possible to map. Usagi only suggests one concept, so when more than one concept is needed, the user needs to choose the additional concepts manually. For example, for the value “Thyroidectomy and parathyroidectomy”, Usagi only suggested “Parathyroidectomy”. A concept for “Thyroidectomy” had to be manually added to completely represent the source term. Therefore, these cases were treated as if the user had to manually pick (other suggested) concepts and were not counted as an accepted suggestion. The SARI set contains many of these complex data values.
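The three counts used to check the accuracy of Usagi can be sketched as a simple tally. This is a minimal Python illustration under assumptions: the `MappingDecision` record, its field names and the three example records (including their scores) are hypothetical and are not taken from the actual SARI mapping or from Usagi's export format.

```python
from dataclasses import dataclass


@dataclass
class MappingDecision:
    source_term: str
    match_score: float         # score of Usagi's suggestion, 0.00 to 1.00
    suggestion_accepted: bool  # True only if the single suggestion was used as-is
    mapped: bool               # False when the term ended up at CONCEPT_ID = 0


def tally(decisions):
    """Count accepted suggestions, score-1.0 suggestions and unmapped terms."""
    return {
        "accepted": sum(d.suggestion_accepted for d in decisions),
        "score_1": sum(d.match_score == 1.0 for d in decisions),
        "unmapped": sum(not d.mapped for d in decisions),
    }


# Illustrative records only. The second one needed a manually added concept,
# so it does not count as an accepted suggestion even though it was mapped.
decisions = [
    MappingDecision("Parathyroidectomy", 1.0, True, True),
    MappingDecision("Thyroidectomy and parathyroidectomy", 0.62, False, True),
    MappingDecision("Example term without target", 0.20, False, False),
]
print(tally(decisions))
```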

Figure 2. Steps to map a dataset to the OMOP CDM. The blue boxes represent actions that are needed to map a dataset to the OMOP CDM and the purple boxes are deliverables in the process. The yellow star is a Data Partner milestone. Source: OHDSI EHDEN. The book of OHDSI.

SARI subset ContSys representation

The contsys.org website (13) was used to guide the process of representing the SARI dataset in ContSys. Figure 3 shows the steps to map to ContSys. It was first checked which clauses (that contain the concepts) represented the SARI dataset the most. Next, the concepts within these clauses were chosen to represent the column names from the SARI dataset. Descriptions and examples of a ContSys concept were checked to see if the column name and use would fit the description.


Concepts with a description that matched the column name were placed in a cell of considered concepts. Once all concepts were checked and the list of considered concepts was complete, a decision was made on which of the considered concepts represented the column name best. The reason for it being the best choice was also recorded. For each mapping an Electronic Health Record Communication (EHRCOM) (21) type of data value was noted, which represents the type of data value from the SARI dataset. Comments or logic were added to the mapping when necessary. The ContSys representation was also checked by a reviewer and revisions were made when both the mapper and the reviewer agreed on the changes.

SNOMED CT has a very close alignment with ContSys. SNOMED CT can represent clinical statements within a record while ContSys establishes the relationship within the healthcare business (19). Therefore, the decision was made to map all SARI data values of the data items to SNOMED CT codes. ContSys does not provide any tools for the mapping process. However, Usagi from OHDSI was used for the vocabulary mapping with the SNOMED CT vocabulary filter. The OMOP CDM term mapping was mostly reused for the ContSys term mapping. Only the SARI column names that were not mapped to a SNOMED CT code were changed to a suitable SNOMED CT code to make the whole term mapping compliant with ContSys. For both the OMOP and ContSys representation the time needed to complete the mapping was noted to compare the mapping experience. The models were also compared on how they handle negative findings, an important part of the NICE database.

OMOP to ContSys mapping

Both the representation of SARI in OMOP and in ContSys were used to determine the interoperability between OMOP and ContSys. This was done to check if hospital data in ContSys can be transformed to NICE data in OMOP. The used OMOP domains, columns and corresponding definitions were noted from the SARI in OMOP representation. Then the mapped SARI terms from the SARI in OMOP representation and the data value types were noted. Lastly, the ContSys concepts and their definitions were noted which were mapped to the SARI term in the SARI in ContSys representation. Logic for mapping data values from OMOP to ContSys and comments were added if necessary. Figure 4 shows the steps to map from OMOP to ContSys.


MDS in OMOP CDM representation

The MDS is the core dataset of the NICE database and was represented in the OMOP CDM. The same methods that were used for the SARI in OMOP CDM representation were also applied to the MDS in OMOP CDM representation. The only difference is that White Rabbit version 0.9.0 was used instead of version 0.8.0. The MDS also contains the 445 APACHE 4 values. The APACHE 4 source value mapping from the SARI dataset was reused for the MDS source value mapping. The APACHE 4 values were not taken into account when evaluating the accuracy of the Usagi suggestions. No vocabulary filter was applied for this mapping.

Implementing the MDS ETL

The process was guided by the Extract, Transform and Load course from the EHDEN academy (28). PostgreSQL was chosen for the implementation. PostgreSQL is one of the supported databases for Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems (ACHILLES) and will be the only supported database in later versions for ATLAS. The ETL designed with Rabbit-in-a-Hat was used as design for the implementation in PostgreSQL. The implementation of ACHILLES is also possible instead of ATLAS as was shown in Figure 2. ACHILLES was implemented in RStudio and connected with the database in PostgreSQL. ACHILLES was used to characterize the data in the database. ACHILLESHEEL is part of ACHILLES and was used to assess the data quality.


Results

Overview of the three models

In the literature search three articles were found regarding the comparison of data models. Characteristics defined for data model comparisons were used from these articles. These characteristics were complemented with characteristics that were considered to be important. See Table 1 for the characteristics.

As part of this study the three models were compared to decide which models were most suitable for the NICE registry and the hospitals that deliver data to NICE. Suitability was determined with the results for the characteristics. Table 1 provides an overview of the three information models with their evaluation for each characteristic. These characteristics are defined in Appendix Table 1. Based on the results it was decided to map the SARI dataset to OMOP CDM and ContSys and not to CDISC. This was mostly decided on the results for the purpose of model and strengths characteristics and the characteristics defined by Garza et al. (29), Kahn et al. (30) and Moody et al. (23).

Table 1 Overview of the characteristics of information models.

Characteristic | OMOP CDM V6 | CDISC SDTM | ISO13940 - ContSys
Type of data | Observational data (7): administrative claims and electronic health records (2) | (Clinical) study data (31) | (Health)care data (32)
Purpose of model | Observational research purposes (15) | Collecting, preparing and analyzing study data (31) | Support continuity of care (33)
Structure of model | Person-centric relational model (2) | A standard structure for representing the planned sequence of activities and the treatment plan for the trial | System of concepts
Strengths | Tools, clear guide, supports cohorts | Support for clinical trials | Also supports self-care, healthcare third parties and extends to include all aspects of social care (19)
Weaknesses | Has required fields which need to be filled in | Some domains are for very specific purposes | Issues with concepts, terms and definitions (34); very generic concepts
Number of domains or concepts | 38 domains (15) | 46 domains (25) | 173 concepts
Ontology/vocabulary | The OMOP Standardized Vocabularies, which contain 111 vocabularies of which 78 have been externally adopted, like SNOMED CT (15) | CDISC Controlled Terminology (25) | SNOMED CT has a relationship with ContSys (19)
Integrity (23) | 100% (Garza et al. (29)); 100% for SARI | 100% (86% incorporates the SUPPQUAL domain) (29) | 100% for SARI
Extensibility/Scalability | | |
Ease of querying the model: a. No. of nested queries, b. No. of table joins, c. Estimated query performance (29) | a. 1; b. 2-5; c. Faster (29) | a. 1; b. 4-12; c. Slower (29) | Not applicable
Ease of anonymization and de-identification (29) | Medium (29) | Difficult (29) | Easy
Integration (23, 29, 30) | 100% (Garza et al. (29)); 94.6% (452 out of 478) for SARI | 67% (29) | 93.5% (447 out of 478) for SARI
Field experience (30) | Release 2009 | Release 2004 | Release 2007
Stability (model updates in last 2 years; a major update is a new version, a minor update is an updated version) (30) | 1 major, 3 minor | 0 major, 2 minor | 0 major, 0 minor
Adoption (number of adopters) (30) | 5* (29); the network consists of over 100 standardized databases (15) | >348 (29) | In the Netherlands, a few adopters (35)
Tooling | Athena: look up concepts in the vocabulary; White Rabbit: perform a scan of the source data; Rabbit-in-a-Hat: define logic from data source to OMOP; Usagi: create code mapping; ATLAS: analysis tool; ACHILLES: database characterization; ARACHNE: facilitates distributed network analyses (15) | Not available | Not available
Time spent on SARI dataset modeling | 8 hours (does not include mapping with Usagi) | Not applicable | 8 hours (does not include mapping with Usagi)

* There is no list available of adopters for OMOP. Garza et al. only found 5 adopters. There are most likely many more adopters, therefore this should be seen as a minimum number of adopters.

Appendix A shows the overlap in definitions between the domains from OMOP CDM, CDISC SDTM and ContSys (although domains in ContSys are called concepts). The classes from the three models are divided into different domains (note that classes in ContSys are actually called clauses). The clauses from ContSys act the same way as the classes from OMOP CDM and CDISC SDTM. Hence, the concepts from ContSys were treated as the same level as the domains from OMOP CDM and CDISC SDTM. Appendix A was made to research how similar the three models are and on which domains they mostly overlap.


SARI subset OMOP CDM representation

All of the variables from the SARI dataset were mapped to the OMOP CDM tables and columns. Appendix B contains the scan report generated by White Rabbit which was loaded into Rabbit-in-a-Hat. All variables but one fitted the OMOP CDM columns without adjustments. The SARI subset has the age of a patient (nice_age) at the time of ICU admission. OMOP CDM does not use age but uses birth day, birth month and birth year, of which only birth year is required. Birth year can be calculated by subtracting the age from the year of the hospital admission date. Consequently, the birth year can be one year off if the admission date falls before the patient's birthday in that year. Birth month and birth day will be missing, but those are not essential for OMOP CDM. Appendix C contains the Word file generated by Rabbit-in-a-Hat, which contains the SARI to OMOP CDM representation.
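The birth year derivation and its possible one-year error can be shown with a short worked example. This is a minimal Python sketch; the function name `estimate_birth_year` and the example dates are hypothetical and are not part of the actual ETL.

```python
from datetime import date


def estimate_birth_year(admission_date: date, age_at_admission: int) -> int:
    """Estimate year of birth as admission year minus age in whole years."""
    return admission_date.year - age_at_admission


# Hypothetical patient: aged 60 at admission on 1 March 2020.
admission = date(2020, 3, 1)
print(estimate_birth_year(admission, 60))  # 1960

# If that patient was in fact born on 15 June 1959, the birthday falls after
# the admission date within the year, so the estimate is one year too late.
true_birth = date(1959, 6, 15)
had_no_birthday_yet = (admission.month, admission.day) < (true_birth.month, true_birth.day)
age = admission.year - true_birth.year - had_no_birthday_yet
print(age, estimate_birth_year(admission, age) - true_birth.year)  # 60 1
```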

OMOP CDM, on the other hand, has required fields, and the SARI dataset does not include the data to fill all of them. Table 2 shows which required fields in the used OMOP CDM tables could not be filled with the data from the SARI dataset. A workaround for the missing required fields was found on a forum from the community. For instance, the SARI dataset does not contain data about the race or ethnicity of patients, while these fields are required in the OMOP CDM. In that case the workaround is to set RACE_CONCEPT_ID and ETHNICITY_CONCEPT_ID to zero (36). Required fields that require a date were set to 01-01-1700 if NICE does not provide data for the required field.
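The workaround for required fields without source data can be sketched as follows. This is a minimal Python illustration, not the implemented ETL: the helper names `to_person_row` and `fill_required_date` and the example values are hypothetical, while the concept id 0 and the placeholder date 01-01-1700 follow the text.

```python
from datetime import date

UNKNOWN_CONCEPT_ID = 0               # used when the source has no matching data
PLACEHOLDER_DATE = date(1700, 1, 1)  # placeholder for required dates NICE lacks


def to_person_row(person_id, gender_concept_id, year_of_birth):
    """Build a PERSON record; race and ethnicity are absent from SARI."""
    return {
        "person_id": person_id,
        "gender_concept_id": gender_concept_id,
        "year_of_birth": year_of_birth,
        "race_concept_id": UNKNOWN_CONCEPT_ID,
        "ethnicity_concept_id": UNKNOWN_CONCEPT_ID,
    }


def fill_required_date(value):
    """Return the source date, or the placeholder when it is missing."""
    return value if value is not None else PLACEHOLDER_DATE


# Illustrative values only; 8507 is the OMOP standard concept id for male.
row = to_person_row(person_id=1, gender_concept_id=8507, year_of_birth=1960)
print(row["race_concept_id"], row["ethnicity_concept_id"])
print(fill_required_date(None), fill_required_date(date(2020, 3, 1)))
```

A visibly implausible placeholder date keeps the record loadable while making it easy to exclude such values in later analyses.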

Table 2 Required fields from OMOP CDM which could not be filled with data from the SARI dataset. The top row lists the domains and the rows below list the columns within those domains. The type concept ids could not be filled in with SARI data. However, these were filled in with a standard concept id from OMOP.

Person | Observation period | Visit occurrence | Condition occurrence | Fact relationship | Measurement | Observation | Procedure occurrence
Person id | Person id | Person id | Condition occurrence id | Domain concept id 1 | Measurement id | Observation id | Procedure occurrence id
Race concept id | Period type concept id | Visit concept id | Person id | Domain concept id 2 | Person id | Person id | Person id
Ethnicity concept id | | | Condition start date | Relationship concept id | Measurement date | Observation date | Procedure date
 | | | Condition concept type id | | Measurement type concept id | Observation type concept id | Procedure type concept id

In Usagi, out of 33 source terms, only one term was unmapped (97% mapped). Eight source terms (24.24%) were mapped to two concept terms and two source terms (6.06%) were mapped to three concept terms. Appendix D contains the source terms to concept terms mapping. The APACHE 4 diagnoses values were mapped separately, as these would also be used for the ContSys representation. There were 445 APACHE diagnoses of which 26 could not be mapped.

With 33 source terms and 445 APACHE diagnoses together, a total of 478 terms were used for mapping and of these 26 terms were unmapped (94.6% mapped). In total 96 suggestions from Usagi were accepted. Out of the 478 terms, 58 suggestions had a matching score of 1.0 and 56 of these suggestions were accepted. The only two suggestions with a matching score of 1.0 that were not accepted were for “male” and “female”. Usagi suggested LOINC codes for these, but the OHDSI guide states that both “male” and “female” must be mapped to the standard OMOP codes. Therefore, the Usagi suggestions were rejected and the terms were manually mapped to the OMOP standard codes. The suggestion with the lowest matching score that was accepted had a matching score of 0.47.
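The reported percentages follow directly from the counts given in the text, as the following minimal Python sketch shows. The helper name `percent_mapped` is hypothetical; all numbers are the ones stated above.

```python
def percent_mapped(total, unmapped):
    """Share of mapped terms as a percentage, rounded to one decimal."""
    return round(100 * (total - unmapped) / total, 1)


total_terms = 33 + 445  # source terms plus APACHE 4 diagnoses
print(total_terms)                      # 478
print(percent_mapped(33, 1))            # 97.0
print(percent_mapped(total_terms, 26))  # 94.6

# Shares of source terms mapped to two and to three concept terms.
print(round(100 * 8 / 33, 2), round(100 * 2 / 33, 2))  # 24.24 6.06
```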


SARI subset ContSys representation

All SARI column names were mapped to a ContSys concept, see Figure 5. The APACHE 4 diagnosis column was mapped to two ContSys concepts, because the APACHE 4 diagnosis contains both diagnoses and procedures. The “Died in hospital” and “Chronic renal insufficiency” columns in SARI were mapped to “Excluded condition” and “Working diagnosis” to represent the Boolean value of those columns. If the Boolean value is zero, it is an “Excluded condition”. When the Boolean value is one, it is a “Working diagnosis”. “Died in hospital” is also mapped to “Observable condition” and “Chronic renal insufficiency” is also mapped to “Health issue”. This is done to make a distinction between these two mappings: “Died in hospital” is more an observable condition and “Chronic renal insufficiency” is more a health issue. Each mapping was made unique whenever possible. However, that was not always possible, since certain column names from SARI are too similar for ContSys, which uses very general concepts. Table 3 shows pairs of SARI columns of which both elements were mapped to the same ContSys concept. Appendix E contains the complete ContSys representation. Of the 478 data columns or values in the mapping from OMOP, sixteen were not mapped to SNOMED CT codes and were therefore remapped to a SNOMED CT code to make the mapping compliant with ContSys.
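The Boolean logic for the two columns described above can be sketched as a small lookup. This is a minimal Python illustration of the rule as stated in the text; the function name `contsys_concepts` is hypothetical and the sketch is not part of the actual ContSys representation in Appendix E.

```python
# Additional concept that distinguishes the two Boolean SARI columns.
DISTINGUISHING_CONCEPT = {
    "Died in hospital": "Observable condition",
    "Chronic renal insufficiency": "Health issue",
}


def contsys_concepts(column, value):
    """Return the ContSys concepts for a Boolean SARI column and its value."""
    if value not in (0, 1):
        raise ValueError("expected a Boolean value of 0 or 1")
    status = "Working diagnosis" if value == 1 else "Excluded condition"
    return [status, DISTINGUISHING_CONCEPT[column]]


print(contsys_concepts("Died in hospital", 0))
print(contsys_concepts("Chronic renal insufficiency", 1))
```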

Table 3 SARI columns which are mapped to the same ContSys concepts

SARI column names | ContSys concept
Hospital number & Intensive care number | Healthcare organization
Date of birth & Gender | Demographic element
Hospital admission date & Intensive care admission date | Initial contact
Hospital discharge date & Intensive care discharge date | Healthcare activity period
Maximum ventricle heart rate & Maximum creatinine | Healthcare information

Figure 5 shows the SARI dataset represented in ContSys and the relations between the concepts used. Most concepts could be linked to each other with at most one unmapped concept in between. The “Healthcare treatment” concept was too far removed from the other mapped concepts and is therefore represented without any relations. The “Healthcare information” concept does not have any relationships in ContSys. The “Element” and “Demographic element” concepts are shown separately, as they are concepts from ISO 13606-1 Health informatics - Electronic health record communication - Part 1: Reference model (21), a complementary model for ContSys.


Comparison of time taken for mapping to both models

The OMOP and ContSys representations of the SARI set both took about eight hours to complete. These eight hours do not include the value mapping with Usagi, as ContSys is not supported by the Usagi tool. For the ContSys representation, many of the hours were spent on checking the descriptions of the concepts and deciding which definitions were fitting, and then on deciding which concepts fit best. The OMOP process was more straightforward. For both OMOP and ContSys, the mapping of the SARI values to the OMOP vocabulary and to SNOMED CT was also timed. However, the value mapping from OMOP was mostly reused for ContSys, and Usagi was also used for ContSys. In reality, the time spent on the ContSys representation would therefore be higher. Mapping the APACHE 4 values with Usagi took twelve hours to complete. Mapping half of the APACHE 4 values manually has been done previously and took one dedicated month to complete. This would likely be the case with ContSys if Usagi was not used. A dataset with many data values is mapped much more quickly to OMOP with the help of the tools, which ContSys lacks.

A notable difference between OMOP and ContSys is how they handle negative findings. OMOP is not designed to handle negative findings. ContSys has specific concepts to represent negative findings such as “excluded condition” and “considered condition”.

Figure 5 SARI represented in ContSys. This figure contains the SARI to ContSys mapping and the relationships between the ContSys concepts used. The white boxes below a ContSys concept box with a continuous border show the SARI column names mapped to that ContSys concept. The boxes with discontinuous borders are concepts which are not used in the SARI mapping, but form a link between concepts that are mapped. The colors of the boxes represent the clauses from the ContSys model. Light blue is “Responsibility”, purple is “Healthcare actor”, light green is “Healthcare matter”, dark green is “Time”, dark blue is “EHRCOM Reference model”, red is “Activity” and “Process”, yellow is “Electronic health record component”, and white with a yellow border means that the concept does not belong to a clause. The bigger white arrows point to the superclass of a concept. The edges represent the relationships between the concepts, with cardinality specifications taken from the ContSys website.


OMOP to ContSys mapping

All SARI terms from the SARI in OMOP representation were mapped to the corresponding SARI terms from the SARI in ContSys representation. Therefore, the OMOP columns from the OMOP representation were also mapped to the ContSys concepts from the ContSys representation. The OMOP column definitions fit with the ContSys concept definitions. Appendix F contains the OMOP to ContSys mapping. The table is structured with OMOP in mind, as the mapping is from OMOP to ContSys. Most of the logic concerns transforming the OMOP IDs to SNOMED CT codes. The comments mostly note that certain OMOP columns are mapped to the same ContSys concept and are thus redundant in ContSys. These OMOP columns fulfil different roles in OMOP, but become redundant in ContSys.

MDS in OMOP CDM representation

Appendix G contains the scan report generated by White Rabbit for the MDS, which was loaded into Rabbit-in-a-Hat. All 204 variables from the MDS were mapped to the OMOP CDM tables and columns. As in the SARI in OMOP representation, only nice_age needed an adjustment to fit the year_of_birth column in OMOP. The workaround that was used in the SARI in OMOP representation for required OMOP columns that could not be filled with data was also applied to the MDS. Appendix H contains the Word file generated by Rabbit-in-a-Hat, which contains the MDS to OMOP CDM mapping.
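The adjustment of nice_age is not spelled out in this section. As an illustration only, the sketch below assumes that nice_age holds the age in whole years at admission and that year_of_birth is derived by subtracting that age from the admission year. Both the function name and this derivation rule are assumptions and not the documented ETL logic.

```python
from datetime import date


def year_of_birth_from_age(age_years: int, admission_date: date) -> int:
    """Approximate year_of_birth from an age in whole years (assumed rule)."""
    if age_years < 0:
        raise ValueError("age must not be negative")
    # The result can be off by one year, because the birthday may fall
    # after the admission date within the admission year.
    return admission_date.year - age_years
```

Any such derivation loses precision compared with a real date of birth, which is acceptable here because OMOP only requires the year.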

The file that was loaded into Usagi with all of the source terms contained 817 variables that needed to be mapped. The APACHE 4 values from the SARI source term mapping were mapped automatically, which left 372 values to map. Another 25 values were excluded, as they were no longer used in the MDS or do not have meaning outside the NICE dataset. In total, 347 values therefore had to be mapped in Usagi. For 35 values no fitting mapping was found and these were left unmapped (89.9% mapped). Usagi gave 69 (19.9%) values a matching score of 1.0. Again, the suggestions for “male” and “female” were rejected, as Usagi did not suggest the mappings recommended by OHDSI for “male” and “female”. In total, 124 (35.7%) suggestions from Usagi were accepted. The accepted suggestion with the lowest matching score had a matching score of 0.38. Appendix I contains the source terms to concept terms mapping for the MDS.

Implementing the MDS ETL

The implementation of the NICE MDS in the OMOP CDM was completed. Appendix J contains a written guide with details and personal experiences that are not available in the Book of OHDSI. The required id columns were filled with the corresponding concept ids from the Usagi source to concept mapping. Usagi created the SOURCE_TO_CONCEPT table, which is used to look up the concept id that corresponds to a source code. However, source codes are not used in the NICE database. Therefore, an extra table, CODES, was created that links source values in the MDS to the corresponding source codes given in Usagi. With the CODES table it was possible to fill the OMOP CDM tables with the correct concept ids.
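The two-step lookup through the CODES table can be sketched with an in-memory SQLite database. This is a simplified illustration, not the actual ETL: the table layouts are reduced to the columns needed for the lookup, the source codes 'G1' and 'G2' are hypothetical, and returning 0 for an unmapped value is an assumption that mirrors the OMOP convention of concept id 0 for "no matching concept".

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create reduced, hypothetical versions of the two lookup tables."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE source_to_concept_map (
            source_code       TEXT PRIMARY KEY,
            target_concept_id INTEGER NOT NULL
        );
        CREATE TABLE codes (
            source_value TEXT PRIMARY KEY,
            source_code  TEXT NOT NULL
        );
        INSERT INTO source_to_concept_map VALUES ('G1', 8507), ('G2', 8532);
        INSERT INTO codes VALUES ('male', 'G1'), ('female', 'G2');
        """
    )
    return con


def concept_id_for(con: sqlite3.Connection, source_value: str) -> int:
    """Resolve an MDS source value to a concept id via CODES."""
    row = con.execute(
        """
        SELECT m.target_concept_id
        FROM codes AS c
        JOIN source_to_concept_map AS m ON m.source_code = c.source_code
        WHERE c.source_value = ?
        """,
        (source_value,),
    ).fetchone()
    return row[0] if row is not None else 0
```

The extra join is the only change needed compared with a direct source code lookup, which keeps the Usagi output usable as generated.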

Required fields in the OMOP CDM for which the MDS did not contain data were set to 0, or to 01-01-1700 if the field was a date. Many of the OMOP CDM tables require a concept type id. Table 4 shows the concept type id chosen for each OMOP CDM table used. The ACHILLES and ACHILLESHEEL tests were also completed. Results can be found in Appendix K for ACHILLES and in Appendix L for ACHILLESHEEL.
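The placeholder rule for required fields can be sketched as follows. The function name and the way required columns are passed in are hypothetical; only the placeholder values 0 and 01-01-1700 come from the text.

```python
from datetime import date
from typing import Any, Dict, Iterable

PLACEHOLDER_ID = 0
PLACEHOLDER_DATE = date(1700, 1, 1)


def fill_required(
    row: Dict[str, Any],
    required_ids: Iterable[str],
    required_dates: Iterable[str],
) -> Dict[str, Any]:
    """Return a copy of row with missing required fields set to placeholders."""
    filled = dict(row)
    for name in required_ids:
        if filled.get(name) is None:
            filled[name] = PLACEHOLDER_ID
    for name in required_dates:
        if filled.get(name) is None:
            filled[name] = PLACEHOLDER_DATE
    return filled
```

A date far outside the plausible range makes placeholder rows easy to filter out in later analyses, which is presumably why such a value is preferred over a realistic date.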


Table 4 Chosen concept type id for each used OMOP CDM table in the implementation of the MDS ETL

OMOP CDM table | Concept type id
Care site | 706367 Intensive care unit
Condition occurrence | 44786627 Primary Condition
Measurement | 5001 Test ordered through EHR
Observation | 38000280 Observation recorded from EHR
Observation period | 44814724 Period covering healthcare encounters
Procedure occurrence | 44786630 Primary Procedure
Visit occurrence | 32024 Visit derived from encounter on medical professional claim


Discussion

With an increase in the amount of observational data there is an increase in varied data structures. This requires improved and more complex infrastructure to support the reuse of data. This is where making data FAIR can help. An information model is essential for data sources to become FAIR. However, it was unknown which information model is most suitable for quality registries such as NICE and to what extent information models that contain hospital data or ICU quality registry data are interoperable. Finding these optimal information models and determining the interoperability will help data sources become FAIR, which will allow answering large and multisite research and policy questions.

The evaluation of the three information models, i.e. OMOP CDM, CDISC SDTM and ContSys, makes apparent that OMOP CDM is the most usable for registries and that ContSys is the most suitable for hospital data, based on the results in Table 1. OMOP CDM and ContSys both proved to be usable as an information model for the SARI dataset. However, in general OMOP CDM is the most usable for registries. The MDS was also represented in OMOP CDM, and the implementation showed that this representation is compatible with ACHILLES. Based on the results, it is concluded that OMOP CDM and ContSys are interoperable. Therefore, data can be transformed from ContSys to OMOP CDM with minimal data loss.

OMOP CDM and ContSys experience comparison

Of the two models, OMOP CDM was the easier model to map. OHDSI provides a guidebook, forum, and tools to help users apply the OMOP CDM (15). In contrast to OMOP, ContSys is not implementation oriented because it is a conceptual model. Consequently, ContSys does not provide a guidebook or tools to support its use. The OMOP classes and domains are also more specific than the clauses and concepts from ContSys. This made it much clearer which SARI item should be connected to a certain OMOP column. This was sometimes problematic in ContSys, because the definitions of the concepts are very general. In many cases, a SARI column name would fit many concepts. For the SARI representation in ContSys an effort was made to use unique concepts; however, in certain cases that was impossible. This could be problematic when implementing the ContSys representation. On the other hand, ContSys does not provide any guidelines for the implementation, and the user can therefore freely decide how to implement the model. This allows the user to decide how to deal with this and other problems. This is, for example, also the reason why “ease of anonymization and de-identification” for ContSys was rated as “easy” in Table 1. However, this is problematic if ContSys is used for FAIR data, as the implementation of ContSys could be different for everyone. As was stated in the results section, the SARI representations in OMOP and ContSys both took about eight hours to complete. However, for the ContSys representation this was with the help of OHDSI tools. Without the OHDSI tools, the ContSys representation would have taken much more time to complete.

Another difference between OMOP and ContSys is how they handle negative findings. ContSys has concepts to represent negative findings, such as “excluded condition” and “considered condition”. It could also be argued that “healthcare information” could represent negative findings: healthcare information is information that is relevant for a person’s healthcare, and in certain cases negative findings could be relevant for the healthcare of a person. OMOP, in theory, does not collect negative findings. Firstly, there is an ongoing discussion on whether negative findings should be collected in OMOP CDM at all (37). Queries potentially have to be doubled if negative findings are collected, because a query then needs to include those with a specific disease and needs to exclude those recorded as not having the disease. Secondly, if it is decided to include negative findings, it is unclear how this should be done. In case it is decided that negative findings add valuable information, the best solution seems to be to put them into OBSERVATION with the observation_concept_id 4132135 (Absent) and a SNOMED CT code for the pertinent negative as value_as_concept_id (38). For some data sources, such as the NICE registry, it is necessary to represent negative findings.
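The OBSERVATION-based representation of a negative finding could look as follows. This is a minimal sketch of the forum suggestion (38), not an established OMOP convention: the function names are hypothetical, rows are plain dictionaries rather than database records, and the observation type id is taken from Table 4.

```python
from datetime import date
from typing import Any, Dict, List

ABSENT_CONCEPT_ID = 4132135         # "Absent", as named in the text
EHR_OBSERVATION_TYPE_ID = 38000280  # "Observation recorded from EHR" (Table 4)


def negative_finding_row(
    person_id: int, finding_concept_id: int, observed_on: date
) -> Dict[str, Any]:
    """Build one OBSERVATION row stating that a finding is absent."""
    return {
        "person_id": person_id,
        "observation_concept_id": ABSENT_CONCEPT_ID,
        "value_as_concept_id": finding_concept_id,
        "observation_date": observed_on,
        "observation_type_concept_id": EHR_OBSERVATION_TYPE_ID,
    }


def persons_recorded_without(
    rows: List[Dict[str, Any]], finding_concept_id: int
) -> List[int]:
    """Person ids with an explicit 'Absent' record for the given finding."""
    return sorted(
        {
            r["person_id"]
            for r in rows
            if r["observation_concept_id"] == ABSENT_CONCEPT_ID
            and r["value_as_concept_id"] == finding_concept_id
        }
    )
```

The second function illustrates the doubled query effort: a cohort query for a disease would have to check CONDITION_OCCURRENCE for positives and additionally exclude the persons returned here.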

OMOP CDM and ContSys interoperability

Based on the results of the OMOP to ContSys mapping it can be concluded that the two models are interoperable. The interoperability is mostly possible due to the broad definitions of the ContSys concepts and the fact that ContSys does not tell its users how to implement it. The broad definitions of the ContSys concepts were somewhat problematic in the SARI representation in ContSys but proved to be very useful in the OMOP to ContSys mapping, although the decision-making process of choosing fitting concepts had already been completed in the SARI to ContSys phase. The fact that ContSys does not guide its users in how to implement the model should also benefit the interoperability of the two models. Users are free to determine how to implement the ContSys model, which means that they are also free to choose how they implement interoperability with other data models. This is useful only if two sites want to make their data interoperable. The downside is that this free interpretation of the implementation could lead to new problems and errors, especially when used to make data FAIR. The OMOP model does have guidelines for implementation, and these could provide a good starting point. For the OMOP to ContSys mapping it was convenient that all of the APACHE 4 diagnoses were already mapped to SNOMED CT codes. If that were not the case, the OMOP to ContSys mapping would require more work. However, Usagi from OHDSI could be used to remap the data values that were not mapped to a SNOMED CT code in the OMOP representation, which would speed up the mapping process.

Although the OMOP to ContSys mapping was done to research the interoperability from OMOP to ContSys, certain conclusions can also be drawn for a ContSys to OMOP mapping. The OMOP to ContSys mapping indicates no problems for a ContSys to OMOP mapping. The OMOP to ContSys mapping can also be read as a ContSys to OMOP mapping if read from right to left (Appendix F). For example, “gender_concept_id” from OMOP is mapped from the SARI column “gender”, and “gender” is mapped to “demographic element” in ContSys. This can also be read as: “demographic element” from ContSys is mapped from the SARI column “gender”, and “gender” is mapped to “gender_concept_id” in OMOP. This indicates that a ContSys to OMOP mapping is possible. The SNOMED CT codes from the ContSys representation could be left untouched or could be complemented with concepts from the OMOP vocabulary when mapping from ContSys to OMOP. This is especially interesting for unmapped source values in the ContSys representation.
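Reading the mapping in both directions can be sketched with the SARI column as the shared key. The “gender” entries are taken from the example in the text; the OMOP column for “date of birth” (`birth_datetime`) is an assumption added only to show that several OMOP columns can share one ContSys concept, as Table 3 indicates for “Demographic element”. All names in the code are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

SARI_TO_OMOP = {
    "gender": "gender_concept_id",
    "date of birth": "birth_datetime",  # assumed for illustration
}
SARI_TO_CONTSYS = {
    "gender": "demographic element",
    "date of birth": "demographic element",
}


def omop_to_contsys() -> Dict[str, str]:
    """Left-to-right reading: each OMOP column maps to one ContSys concept."""
    return {
        SARI_TO_OMOP[column]: SARI_TO_CONTSYS[column]
        for column in SARI_TO_OMOP
        if column in SARI_TO_CONTSYS
    }


def contsys_to_omop() -> Dict[str, List[str]]:
    """Right-to-left reading: one concept may map back to several columns."""
    reverse: Dict[str, List[str]] = defaultdict(list)
    for omop_column, concept in omop_to_contsys().items():
        reverse[concept].append(omop_column)
    return dict(reverse)
```

The reverse direction is one-to-many, so a ContSys to OMOP transformation needs the SARI column, or equivalent context, to select the right OMOP column.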

Discussion of the findings in relation to existing literature

Si et al. and Lima et al. applied the same procedure to map to OMOP CDM as was used in this study (39, 40). A small difference is that Lima et al. decided not to use Usagi but instead looked up the concept IDs for source terms in Athena, a website by OHDSI for searching concept IDs. That the same process is used throughout the OMOP community is to be expected given the clear guides OHDSI provides. Notable is that Lima et al. mapped electronic patient records to the OMOP CDM. This may imply that OMOP CDM is also suitable for patient records, and thus ICUs could consider using OMOP CDM instead of ContSys, taking into account the representation of negative findings in OMOP. A reason to do so is that OMOP CDM is a mature information model, which is also mentioned by Lima et al. If an ICU decides to use OMOP CDM as the quality registry does, no interoperability problems should arise when extracting data from the ICU to be transported to the quality registry. However, this study assumed that ContSys is a better information model for continuity of healthcare, and Lima et al. did not compare information models.

Liyanage et al. (41), Guo et al. (42) and Garza et al. (29) all compared CDMs and concluded that OMOP CDM is the best CDM for cohort studies and longitudinal community registries. Integration percentages in these studies vary from 88.7% to 98.8% for Guo et al. and are 100% for Garza et al. The SARI set scored 94.6% and the MDS scored 89.9%, so the integration rate for NICE in OMOP CDM is in line with the other studies. It is remarkable that the SARI set scored on the high end, as the SARI set has a significant number of complex source terms and a vocabulary filter was also applied.
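The two integration percentages for NICE follow directly from the counts reported in the results: 478 terms with 26 unmapped for the SARI set, and 347 values with 35 unmapped for the MDS. The helper below is only a worked check of that arithmetic; its name is hypothetical.

```python
def mapped_percentage(total_terms: int, unmapped_terms: int) -> float:
    """Percentage of terms that received a mapping, rounded to one decimal."""
    if total_terms <= 0 or not 0 <= unmapped_terms <= total_terms:
        raise ValueError("counts are out of range")
    return round(100 * (total_terms - unmapped_terms) / total_terms, 1)


SARI_PERCENTAGE = mapped_percentage(478, 26)  # 452 of 478 terms mapped
MDS_PERCENTAGE = mapped_percentage(347, 35)   # 312 of 347 values mapped
```

Both values fall inside the 88.7% to 100% range reported in the cited studies.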

Recommendations

NICE is recommended to use the OMOP CDM information model to make the NICE data FAIR. OMOP CDM, ContSys and CDISC SDTM were developed for specific purposes and their fit for other purposes depends on how closely the information model matches the intended use (29). The NICE registry is focused on observational data and the results show that both the integrity and integration are high. It is therefore likely that the OMOP CDM is the most suitable information model to make NICE data FAIR.

However, NICE receives data from ICUs, which are more focused on continuity of care as their data come from EHRs. This means that hospitals or ICUs would probably benefit most from using ContSys to make their data FAIR. Nevertheless, using ContSys to make data FAIR could be problematic, and according to Lima et al. OMOP can also be considered.

For the implementation process it is recommended to start with OMOP for NICE, as OHDSI provides tools and clear guides for the implementation of the OMOP CDM, which makes it a good starting point. If an ICU chooses ContSys, the ContSys implementation for that ICU should be started once the OMOP implementation for NICE is completed, so that it can be assessed whether certain decisions need to be made in the ContSys implementation to optimize the interoperability with the OMOP data source from NICE.

Strength and weaknesses

This study has several strengths. First, the choice of information models to map data was based on an evaluation of three information models. Secondly, two datasets were represented in the OMOP CDM, so conclusions about the usability of OMOP CDM are based on two examples, although these overlap. The study also has limitations. First, the APACHE 4 diagnoses were only mapped to SNOMED CT codes in the source to concept term mapping using Usagi. This most likely reduced the number of source terms that could be mapped. This could influence generalizability, as other data sources could decide to use different vocabularies in the OMOP CDM. On the other hand, for the MDS source to concept term mapping no vocabulary filter was applied. A second limitation is that the ContSys representation is open to interpretation. Because the descriptions of concepts in ContSys are so general, different people might make different mappings. In this respect it is noticeable that ContSys is not as mature as OMOP.

Future research questions

Further research is needed on how to optimally make a representation in ContSys. If different sites use ContSys differently, it may affect the interoperability with other information models. Therefore, more research is needed on how to optimally map using ContSys and how this influences the interoperability with OMOP CDM. The interoperability between OMOP CDM and other information models should also be researched further. Hospitals could decide that ContSys is not the optimal information model for them and use a different information model instead. In such cases it is unknown to what extent the chosen model is interoperable with OMOP.


Conclusion

Information models are essential for data sources to become FAIR. This research shows that OMOP CDM is the optimal choice among information models for national ICU quality registries that need to make their data FAIR. Hospitals could opt for ContSys or OMOP. Results from this research make apparent that the information models OMOP CDM and ContSys are interoperable.


References

1. Makadia R, Ryan PB. Transforming the Premier Perspective® Hospital Database into the Observational Medical Outcomes Partnership (OMOP) Common Data Model. Egems. 2014;2(1).

2. Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. Journal of the American Medical Informatics Association. 2011;19(1):54-60.

3. Reich C, Ryan PB, Stang PE, Rocca M. Evaluation of alternative standardized terminologies for medical conditions within a network of observational healthcare databases. Journal of biomedical informatics. 2012;45(4):689-96.

4. NICE. Introductie 2020 [cited 2020 10th of January]. Available from: https://www.stichting-nice.nl/index.jsp.

5. NICE. Wat we doen 2020 [cited 2020 10th of January]. Available from: https://www.stichting-nice.nl/watwedoen.jsp.

6. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data. 2016;3.

7. Hripcsak G, Duke JD, Shah NH, Reich CG, Huser V, Schuemie MJ, et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Studies in health technology and informatics. 2015;216:574.

8. Lee YT, editor Information modeling: From design to implementation. Proceedings of the second world manufacturing congress; 1999: International Computer Science Conventions Canada/Switzerland.

9. GO-FAIR. Personal Health Train 2020 [cited 2020 11th of June]. Available from: https://www.go-fair.org/implementation-networks/overview/personal-health-train/.

10. Beyan O, Choudhury A, van Soest J, Kohlbacher O, Zimmermann L, Stenzhorn H, et al. Distributed analytics on sensitive medical data: The Personal Health Train. Data Intelligence. 2020:96-107.

11. OHDSI. OMOP Common Data Model 2020 [cited 2020 29th of January]. Available from: https://www.ohdsi.org/data-standardization/the-common-data-model/.

12. CDISC. SDTM 2020 [cited 2020 29th of January]. Available from: https://www.cdisc.org/standards/foundational/sdtm.

13. O N. A system of concepts for the continuity of care 2019 [cited 2020 29th of January]. Available from: https://contsys.org/page/default.

14. OHDSI. Data standardization 2020 [cited 2020 10th of January]. Available from: https://www.ohdsi.org/data-standardization/.

15. OHDSI. The Book of OHDSI 2019 [cited 2020 10th of January]. Available from: https://ohdsi.github.io/TheBookOfOhdsi/.

16. OHDSI. OBSERVATION_PERIOD 2018 [updated 30 November 2018; cited 2020 25th May]. Available from: https://github.com/OHDSI/CommonDataModel/wiki/OBSERVATION_PERIOD.

17. CDISC. About CDISC 2020 [cited 2020 16th of January]. Available from: https://www.cdisc.org/about.

18. CDISC. Study Data Tabulation Model. 2019.

19. O N. Background of this system of concepts 2019 [cited 2020 20th of January]. Available from: https://contsys.org/page/Background.

20. NICE S. Our registries 2020 [cited 2020 14th of April]. Available from: https://www.stichting-nice.nl/dd/#modules.

21. ISO. ISO 13606-1:2019 Health informatics - Electronic health record communication - Part 1: Reference model 2019 [cited 2020 8th of June]. Available from:

22. ISO. ISO 12967-1:2009 Health informatics - Service architecture - Part 1: Enterprise viewpoint 2009 [cited 2020 11th of June]. Available from: https://www.iso.org/standard/50500.html.

23. Moody DL, Shanks GG. Improving the quality of data models: empirical validation of a quality management framework. Information systems. 2003;28(6):619-50.

24. OHDSI. CommonDataModel 2019 [cited 2020 3rd of March]. Available from: https://github.com/OHDSI/CommonDataModel/wiki.

25. CDISC. Study Data Tabulation Model Implementation Guide: Human Clinical Trials 2019 [cited 2020 10th of January].

26. OHDSI. WhiteRabbit 2020 [cited 2020 27th of January]. Available from: https://github.com/OHDSI/WhiteRabbit.

27. OHDSI. Usagi 2020 [cited 2020 27th of January]. Available from: https://github.com/OHDSI/Usagi.

28. EHDEN. EHDEN academy 2020 [cited 2020 27th May]. Available from: https://academy.ehden.eu/.

29. Garza M, Del Fiol G, Tenenbaum J, Walden A, Zozus MN. Evaluating common data models for use with a longitudinal community registry. Journal of biomedical informatics. 2016;64:333-41.

30. Kahn MG, Batson D, Schilling LM. Data model considerations for clinical effectiveness researchers. Medical care. 2012;50.

31. CDISC. CDISC Standards in the Clinical Research Process 2019 [cited 2019 12th of December]. Available from: https://www.cdisc.org/standards.

32. ISO. ISO 13940:2015(en) Health informatics - System of concepts to support continuity of care: ISO; 2015 [cited 2020 13th of February]. Available from: https://www.iso.org/obp/ui/#iso:std:iso:13940:ed-1:v1:en.

33. TC251 C. EN 13940-1: Health Informatics-System of Concepts to Support Continuity of Care-Part 1: Basic Concepts. European Committee for Standardization. 2006:105.

34. O N. Issues with the current model 2019 [cited 2020 20th of January]. Available from: https://contsys.org/page/ModelIssues.

35. Nictiz. CONTSYS 2016 [cited 2020 4th of February]. Available from: https://www.nictiz.nl/standaarden/contsys/.

36. Reich C. Conventions of Race and Ethnicity 2019 [updated 19th of August 2019; cited 2020 13th of February]. Available from: https://forums.ohdsi.org/t/conventions-of-race-and-ethnicity/7654.

37. Reich C. Negative information in OMOP CDM 2018 [cited 2020 9th of March]. Available from: https://forums.ohdsi.org/t/negative-information-in-omop-cdm/4923/8.

38. Sholle E. Negative information in OMOP CDM 2018 [cited 2020 9th of March]. Available from: https://forums.ohdsi.org/t/negative-information-in-omop-cdm/4923/10.

39. Si Y, Weng C. An OMOP CDM-based relational database of clinical research eligibility criteria. Studies in health technology and informatics. 2017;245:950.

40. Lima DM, Rodrigues-Jr JF, Traina AJ, Pires FA, Gutierrez MA. Transforming two decades of ePR data to OMOP CDM for clinical research. Stud Health Technol Inform. 2019;264:233-7.

41. Liyanage H, Liaw S-T, Jonnagaddala J, Hinton W, de Lusignan S, editors. Common Data Models (CDMs) to Enhance International Big Data Analytics: A Diabetes Use Case to Compare Three CDMs. EFMI-STC; 2018.

42. Guo GN, Jonnagaddala J, Farshid S, Huser V, Reich C, Liaw S-T. Comparison of the cohort selection performance of Australian Medicines Terminology to Anatomical Therapeutic Chemical mappings. Journal of the American Medical Informatics Association. 2019;26(11):1237-46.


Appendix

Appendix table 1

Eight out of seventeen characteristics in Appendix Table 1 were used in research from Garza et al (29) where they compared CDMs to determine which CDM is most suitable to share data from an Electronic Health Record (EHR)-based community registry.

Appendix Table 1 Characteristics and their definition(s).

Characteristic | Conceptual definition | Operational definition
Type of data | Type of data suitable for the information model |
Purpose of model | The purpose for which the information model is intended to be used |
Structure of model | Design of the model |
Strengths | Strengths of the information model in comparison with other information models |
Weaknesses | Weaknesses of the information model in comparison with other information models |
Number of domains / concepts | Absolute number of domains or concepts represented in the model |
Ontology/vocabulary | Which ontology/vocabulary is used in the information model |
Integrity (23) | The extent to which associations in the evaluated data model match the specific project needs | The percentage of associations in the data model held by the evaluated model
Extensibility / Scalability (30) | How much the information model can accommodate addition of new data elements |
Ease of querying the model (29) | The ease of querying the evaluated model for cohort identification | 1. Number of table joins required for each query; 2. Number of nested queries needed for each query; 3. Qualitative estimate (faster or slower) of the overall query performance over views of the data model based on the complexity of the query used for cohort identification. (These assume a relational database implementation)
Ease of anonymization and de-identification (29) | The ease of de-identification and anonymization of the data captured in the evaluated model | Qualitative estimate (easy, medium or difficult) of the complexity of the de-identification and anonymization process
Integration (23, 29, 30) | The extent to which the model supports controlled terminologies | Each model was evaluated on the integration and use of a controlled vocabulary. The terminologies supported by the models were compared.
Field experience (30) | The release year of the model, which indicates the number of years of experience. |
Stability (30) | The number of changes to the data model | Number of model updates in the last two years. Major update means a change in version number. A small update means that a current version of the model was updated.
Adoption (30) | The size of the community using and supporting the data model | The number of adopters for the model
Tooling | Is tooling available to map the data source to the information model |
Time spent on modeling | Time in hours spent on modeling the data to the information model |


Appendix A

*Yellow boxes have overlap with other model(s)

OMOP CDISC SDTM ISO13940 - ContSys

Standardized Vocabularies Special-Purpose Domains Healthcare actor

CONCEPT Comments (CO) healthcare actor

VOCABULARY Demographics (DM) healthcare emplyment

DOMAIN Subject Elements (SE) healthcare organization

CONCEPT_CLASS Subject Visits (SV) healthcare personnel

CONCEPT_RELATIONSHIP healthcare professional

RELATIONSHIP Interventions General Observation Class healthcare professional entitlement

CONCEPT_SYNONYM Concomitant Medications (CM) healthcare provider

CONCEPT_ANCESTOR Exposure as Collected (EC) healthcare supporting organization

SOURCE_TO_CONCEPT_MAP Exposure (EX) healthcare third party

DRUG_STRENGTH Substance Use (SU) next ofkin

Procedures (PR) organization role

Standardized Metadata other carer

CDM_SOURCE Events General Observation Class subject of care

METADATA Adverse Events (AE) subject of care proxy

Clinical Events (CE)

Standardized Clinical Data Tables Disposition (DS) Healthcare matter

PERSON Protocol Deviations (DV) clinical process interest

OBSERVATION_PERIOD Healthcare Encounters (HO) considered condition

VISIT_OCCURRENCE Medical History (MH) excluded condition

VISIT_DETAIL health condition

CONDITION_OCCURRENCE Findings General Observation Class health condition evolution

DEATH Drug Accountability (DA) health issue

DRUG_EXPOSURE Death Details (DD) health need

PROCEDURE_OCCURRENCE ECG Test Results (EG) health problem

DEVICE_EXPOSURE Inclusion/Exclusion Criterion Not Met (IE) health problem list

MEASUREMENT Immunogenicity Specimen Assessments (IS) health state

NOTE Laboratory Test Results (LB) health thread

NOTE_NLP Microbiology Specimen (MB) healthcare matter

SURVEY_CONDUCT Microscopic Findings (MI) input health state

OBSERVATION Morphology (MO) observed condition

SPECIMEN Microbiology Susceptibility Test (MS) output health state

FACT_RELATIONSHIP PK Concentrations (PC) potential health condition

PK Parameters (PP) professionally assessed condition

Standardized Health System Data Tables Physical Examination (PE) prognositc condition

LOCATION Questionnaires (QS) resultant condition

LOCATION_HISTORY Reproductive System Findings (RP) risk condition

CARE_SITE Disease Response (RS) target condition

PROVIDER Subject Characteristics (SC) working diagnosis

Subject Status (SS)

Standardized Health System Data Tables Tumor Identification (TU) Activity

PAYER_PLAN_PERIOD Tumor Results (TR) automated healthcare

COST Vital Signs (VS) automatic medical device

clinical process outcome evaluation

Standardized Derived Elements Findings About healthcare activity

DRUG_ERA Findings About (FA) healthcare activity directory

DOSE_ERA Skin Response (SR) healthcare activity element

CONDITION_ERA healthcare activity management

Trial Design Domains healthcare assessment

Results Schema Trial Arms (TA) healthcare communication

COHORT Trial Disease Assessment (TD) healthcare documenting

COHORT_DEFINITION Trial Elements (TE) healthcare evaluation

Trial Visits (TV) healthcare funds

Trial Inclusion/Exclusion Criteria (TI) healthcare investigation

Trial Summary (TS) healthcare needs assessment

healthcare planning

Relationship Datasets healthcare process evaluation

Supplemental Qualifiers (SUPP-- datasets) healthcare provider activity

Related Records (RELREC) healthcare resource

healthcare resource management healthcare third party activity healthcare treatment

prescribed third party activity self-care activity

Process

adverse event

adverse event management clinical process

healthcare administration healthcare process

healthcare quality management healthcare service

healthcare service directory

Healthcare planning

care plan clinical guideline clinical pathway core care plan health objective

healthcare activities bundle healthcare goal

multi-professional care plan needed healthcare activity protocol

uniprofessional care plan

Time

clinical process episode contact

contact period episode of care

episodes of care bundle health approach health condition delay health condition period health related period healthcare activity delay healthcare activity period

healthcare activity period element healthcare appointment

indirect healthcare activity period initial contact

mandated period of care resource delay

self-care period

subject of care preference delay

Responsibility

authorization by law care period mandate clinical process mandate consent competence

continuity facilitator mandate demand for care

demand for initial contact demand mandate

dissent

healthcare activity mandate healthcare commitment healthcare mandate informed consent

mandate to export personal information reason for demand for care

referral request

subject of care desire

certificate related to a healthcare matter clinical report

discharge report

electronic health record component electronic health record extract electronic patient summary health concern

health record

health record component health record extract

healthcare information for import healthcare information request medium (duplicate)

non-ratified healthcare information personal health record

professional health record sharable data repository

summarized healthcare information repository

EHRCOM Data value

attachment value boolean value coded simple value coded value CV

date time value date value duration value

instance identifier value integer value

physical quantity value point in time value real value

simple text value string value time value URI value

EHRCOM Reference model

attestation information audit information base component cluster composition content data value demographic cluster demographic element demographic entity demographic extract demographic folder demographic item element entry external link

extracted component set folder

item link section