
University of Amsterdam

Faculty of Science

Thesis Master Information Studies – Business Information Systems

Final version: August 21, 2015

Information Graphs modelling Diabetes Disease

Hayo Bart – 10254846

Supervisor: Mw. prof. dr. H. Afsarmanesh
Daily supervisor: Dhr. M. Shafahi MSc
First Examiner: Mw. prof. dr. H. Afsarmanesh
Second Examiner: Dhr. dr. M. W. van Someren
Signature: _______________________________


0 Abstract

The development of disease risk prediction models, commonly used by practitioners to assess someone's risk of developing a particular disease, requires the exploration of the vast body of (bio)medical knowledge. The continuous growth of this body of knowledge, as indicated by the exponential growth of the MEDLINE biomedical bibliographic database, however, poses challenges with respect to such knowledge exploration efforts. Numerous researchers have attempted to address this issue through the development of a variety of tools. Most of these tools, however, lack intuitiveness in their use and present only a limited amount of information, usually obtained from just one source, even though additional information is available from external sources. There is thus a need for a tool that is both intuitive and represents (disease) related information from multiple sources. This research aims to address this gap, and, as such, aid researchers in their knowledge exploration efforts, through the development of a dynamic model that represents (bio)medical knowledge, available from disparate sources across the Web, as a network of interrelated (bio)medical concepts. Additionally, the research aims to incorporate Semantic Web technologies into the model to deal with the large amounts of dynamic and heterogeneous information that are nowadays available. To this end, this thesis introduces BioMed Xplorer, a tool that enables (bio)medical researchers to explore the body of (bio)medical knowledge in a graph-like format through an intuitive and user-friendly interface. Additionally, BioMed Xplorer provides researchers with concept (disease) related information from a multitude of sources, as well as with the provenance data associated with the represented knowledge. The knowledge represented by BioMed Xplorer is provided by an RDF knowledge base, which was obtained by mapping SemMedDB, a relational SQL database representing relationships among biomedical concepts extracted from PubMed articles, to RDF. This mapping was conducted using a mapping file that was developed based on a core ontology introduced in this research. Both the developed ontology and BioMed Xplorer were validated by domain experts as well as through a comparison to prior work.


Table of Contents

0 Abstract ... 2

1 Introduction ... 5

2 Research Approach ... 7

3 State of the Art ... 9

3.1 Approach ... 9

3.2 Results ... 9

4 Data Source Characterization ... 13

4.1 Data Source Identification ... 13

4.2 Data Source Selection ... 16

5 Information Graph Silos – Data Gathering & Preprocessing ... 18

5.1 Data Gathering ... 18

5.2 Data Preprocessing ... 19

6 Linked Data – Interlinking & Data Fusion ... 28

6.1 Data Fusion ... 28

6.2 Data Interlinking ... 31

6.3 Experimental Infrastructure ... 33

7 Disease Model Visualization ... 34

7.1 Visualization Requirements ... 34

7.2 BioMed Xplorer System ... 34

8 Validation ... 44

8.1 Ontology Validation ... 44

8.2 BioMed Xplorer System Validation ... 46

9 Discussion ... 51

10 Conclusion ... 54

11 Acknowledgements ... 56

12 Bibliography ... 57

13 Appendix A: System Architecture ... 63

14 Appendix B: Systematic Literature Review... 65

14.1 Study Search ... 65

14.2 Study Selection ... 68

14.3 Study Quality Assessment ... 72

15 Appendix C: Ontology Datatype Properties ... 73


17 Appendix E: SPARQL Queries ... 79

17.1 Statements ... 79

17.2 Concept Summary & Concept Overview ... 81

17.3 Statement Summary ... 85

17.4 Statement Details ... 86

18 Appendix F: BioMed Xplorer Validation Form ... 88


1 Introduction

The (bio)medical field is vast and dynamic, with knowledge developing rapidly as a result of continuously ongoing research. Within this field, extensive research is conducted into identifying risk factors of diseases as well as assessing their effect on the presence and severity of a disease. The knowledge available from this research on risk factors enables researchers to develop risk prediction models that might be used by practitioners to assess someone's risk of developing a particular disease.

Conventional methods for developing such risk prediction models involve identifying the risk factors and their effects from the ever-evolving body of (bio)medical knowledge. Achieving this aim thus involves checking a wealth of scientific publications for relevant statements regarding factors that might affect a disease. This, however, is a cumbersome activity, especially considering that the U.S. National Library of Medicine's (NLM) bibliographic database MEDLINE, as of today, contains over 22 million citations, over 750,000 of which were added in 2014 (U.S. National Library of Medicine, 2015c), and that these numbers have grown exponentially over the last 20 years (Hunter & Cohen, 2006; U.S. National Library of Medicine, 2015a).

As a result of the sheer size and continuous growth of this body of (bio)medical knowledge, exploring it and finding the relevant knowledge for inclusion in risk prediction models becomes increasingly challenging for researchers, potentially causing an information overload (Hunter & Cohen, 2006; Lu, 2011).

Numerous researchers have recognized this problem and attempted to address it from different perspectives. One group of researchers addresses the issue from an information retrieval perspective through the development of alternative and enhanced Web tools (compared to PubMed) to retrieve (bio)medical publications, as extensively discussed by Lu (2011). A second group of researchers takes a knowledge extraction perspective through the development of text mining and information extraction tools, as discussed by Cohen and Hersh (2005). These tools can be employed to (automatically) examine the relationships between specific kinds of information within and between publications and, as such, ease researchers' cognitive loads. Finally, a third group of researchers takes these knowledge extraction tools even further by developing comprehensive visualizations that represent knowledge extracted from (bio)medical publications. Among these tools are AliBaba (Plake, Schiemann, Pankalla, Hakenberg, & Leser, 2006), EBIMed (Rebholz-Schuhmann et al., 2007), PGviewer (Tao, Friedman, & Lussier, 2005), Semantic MEDLINE (Kilicoglu et al., 2008), and the Semantic Navigator (Bodenreider, 2000).

Even though these knowledge representation and visualization tools possess great expressive power as a result of their visual nature, four common shortcomings can be identified among them. A major shortcoming shared by all these tools is the limited amount of information that is represented. All tools merely present names and identifiers of represented concepts and lack information such as descriptions or definitions, even though this information is available externally on the Web. A second shortcoming is the restricted scope, with most of the tools (AliBaba, EBIMed, PGviewer) focusing on just a particular subdomain within the (bio)medical field. Examination of the tools reveals that some of them (PGviewer and Semantic Navigator) are not particularly intuitive in their use and have a rather steep learning curve, which is considered a third shortcoming. Finally, some of the tools are no longer active (AliBaba, PGviewer), or do not work properly (EBIMed).

Considering the challenges in exploring and efficiently retrieving (bio)medical knowledge, the expressive power of knowledge representation and visualization tools, and the aforementioned shortcomings among these tools, it becomes clear that there is a need for a meaningful representation of the available (bio)medical knowledge that: (1) is intuitive, and (2) represents information from multiple sources. Due to the large amounts of heterogeneous and dynamic information that is nowadays available across a multitude of sources, relational databases are considered to be less than ideal for storing and instantiating such knowledge representations (Hendler, 2014). Linked data, on the other hand, provides a promising solution to this issue as it is able to cope with such large amounts of dynamic and heterogeneous information (Berners-Lee, Hendler, & Lassila, 2001).

This thesis presents the first steps in this direction through the development of a dynamic model representing disease related (bio)medical knowledge, available from disparate sources across the Web, as a network of interrelated (bio)medical concepts, while also incorporating Semantic Web technologies to deal with large amounts of dynamic and heterogeneous information. Such a model might aid researchers in their knowledge exploration efforts, easing the identification of relevant knowledge and, as such, assisting them in the development of risk prediction models and even providing them with recommendations for future work. In this research the Diabetes Mellitus Type 2 disease, from now on referred to as "Diabetes", is considered as a case study for the development of the model.

In this thesis, we aim at addressing the following research question:

“How can we develop a dynamic model representing information related to a disease (e.g. Diabetes)?”

In order to answer this research question, four major aspects have to be addressed. Development of the model, first of all, requires (bio)medical knowledge. Therefore, it is key to identify the sources that can provide such knowledge. Once the sources from which relevant information can be obtained are known, it is necessary to investigate how the data from these various sources can be interlinked and, as such, compose the model. In addition to composing the actual model, efficient (pre)processing, storage, and querying of the model are of great importance for the exploitation of the model at a later stage. Finally, it is imperative to visualize the model in an intuitive manner, enabling it to be explored to its full potential. These four major aspects provide us with the following four sub-questions that need to be addressed in this thesis:

1. From what sources can we (automatically) obtain disease related information?

2. How can the data obtained from these sources be interlinked and, as such, compose a (disease related information) model?

3. How can the model be efficiently (pre)processed, stored, and queried?

4. How can the model be visualized in an intuitive manner?

In Chapter 2, we will outline the research approach that was designed in order to answer these questions.


2 Research Approach

In order to address the research question and sub-questions posed in Chapter 1, we designed a research approach consisting of the following five phases: (1) State of the Art, (2) Data Source Characterization, (3) Data Gathering and Preprocessing, (4) Data Fusion and Interlinking, and (5) Model Visualization. Each of these phases has its own focus and emphasis and will be briefly outlined below.

As in any research, the first phase consists of assessing the state of the art by identifying relevant work that has been conducted, and is currently being conducted, with respect to the posed research question.

Following the assessment of the background, the second phase involves the identification of potential data sources that can provide us with relevant disease related information for incorporation into our model. Additionally, this phase covers the selection of the data source(s) that should serve as a base for the model we are planning to develop. Completion of this phase allows us to answer sub-question 1.

The third phase of the research focuses on gathering the data and preprocessing it. Data gathering involves obtaining the data from the selected data source(s), whereas preprocessing comprises the design of a standard data structure, in the form of an ontology, to which the data in our model should conform. Furthermore, this stage involves the development of a mapping that can be employed to convert the gathered data to this designed structure. These activities provide us with a partial answer to the second sub-question.

Once the data from the selected source(s) has been gathered and preprocessed, it can be fused with data from external resources to further enrich the model. In addition to fusing the data, the fourth phase also encompasses data interlinking. In this stage the gathered data is converted to the designed data structure, using the developed mapping, and as such composes the actual model of disease related information. The final stage in this phase consists of setting up a back-end for storage of the model, as well as for providing external applications (such as a visualization) with the ability to query and process (the data in) the model. Completion of this phase allows us to fully answer the second sub-question, while providing a partial answer to the third sub-question.

Finally, the fifth phase of the research involves the development of an interactive visualization representing the data stored in the model in an intuitive way. Visualizing the model in the front-end, first of all, requires querying the model in the back-end and subsequently processing the returned data in the front-end to achieve the appropriate data representation. Upon completion of this phase, sub-questions 3 and 4, as well as the overarching research question, can be answered.

Figure 1 – System architecture of BioMed Xplorer, showing the distinct components that together compose the developed system. Each of the model design phases discussed in this thesis has a designated component within the system. External sources, not directly part of the system, are shown in green, whereas the internal components are shown in grey.

Completion of these five stages delivers a system with an architecture that is shown in Figure 1. As one might notice, the architecture consists of four core modules, each of which corresponds to one of the major design and development stages that were discussed above. The components of these modules will be gradually defined in the corresponding chapters and sections of this thesis, as such fully describing the system architecture. A more elaborate description of the system architecture is available in Appendix A.

The remainder of this thesis is structured according to the five phases outlined above, with each chapter discussing a particular phase of the research in detail. Chapter 3 discusses the state of the art with respect to the posed research question. In Chapter 4, the identification and selection of (potential) data sources for inclusion in the model is described. This is followed by a discussion of the data gathering and preprocessing in Chapter 5, whereas Chapter 6 covers the fusion and interlinking of the data. The visualization of the model is subsequently discussed in Chapter 7, and the work is validated in Chapter 8. This is followed by a discussion of the limitations of this research and possible future work in Chapter 9. Finally, Chapter 10 concludes the thesis.


3 State of the Art

Before being able to start the development of the disease related information model, it is key to investigate the prior work that has already been conducted with respect to the research question, as this assists us in building an understanding of the subject and problem under study. In order to develop such a comprehensive overview of the prior work, relevant literature was reviewed by means of a Systematic Literature Review. This methodology aims to “identify, evaluate, and interpret all available research relevant to a particular research question, or topic area, or phenomenon of interest” in a thorough and fair manner (Kitchenham & Charters, 2007, p. 3).

In Section 3.1 we will provide an overview of the approach used to conduct this Systematic Literature Review, whereas Section 3.2 will discuss the main findings of the review by means of summarizing and analyzing the relevant studies.

3.1 Approach

As mentioned above, a Systematic Literature Review aims to identify the state of the art (academic work) with respect to the subject under study. The Systematic Literature Review was conducted according to the guidelines for performing Systematic Literature Reviews in Software Engineering which were outlined by Kitchenham and Charters (2007). According to these guidelines, any Systematic Literature Review consists of five stages, being (1) study search, (2) study selection, (3) study quality assessment, (4) data extraction, and (5) data synthesis.

The first phase, study search, involves retrieving as many primary studies as possible that are potentially relevant with respect to the posed research question(s). From the set of potentially relevant studies obtained in the first phase, the actually relevant studies need to be selected. This is the aim of the second, study selection, phase. In the third phase the selected studies are subsequently assessed for their quality, which might serve as a means for weighting the importance of individual studies in the remainder of the Systematic Literature Review. Upon completion of the quality assessment, the data that the researchers perceive as relevant needs to be extracted from each of the studies. This activity is the main focus of the fourth phase. Finally, the extracted data from all studies needs to be summarized and integrated in order to obtain a comprehensive overview of the prior research that has been conducted in the field of interest. These activities are conducted in the fifth, and final, data synthesis phase.

Even though Section 3.2 only covers the results of our Systematic Literature Review, corresponding to the fourth and fifth stage of a Systematic Literature Review, a detailed description of the study search (stage 1), study selection (stage 2), and study quality assessment (stage 3) is included in Appendix B.

3.2 Results

As indicated in Appendix B, the study search, selection, and quality assessment stages of the Systematic Literature Review resulted in 20 studies, ranked according to their perceived quality, to be included for review in the data extraction and synthesis stages. However, with the limited time of the research in mind, only the top 10 studies were reviewed. In this section we will provide a detailed analysis of these studies and, as such, provide a comprehensive overview of the most important prior work that has been conducted with respect to our topic of interest.

A close examination of the studies included in our review revealed two main patterns. First of all, the analysis showed that the majority (7) of the studies in the review discuss the development of either a model or a system that can be employed to achieve a particular goal, with the remainder (3) of the studies discussing the development of an approach that can be utilized to achieve the specified goal. Further analysis revealed three broad topics that could be identified from the studies under review, being (1) knowledge extraction, (2) knowledge representation, and (3) data federation, with each of the studies in the review covering at least one of these topics. Studies covering knowledge extraction apply techniques such as Natural Language Processing (NLP), text mining, and relation extraction to extract structured knowledge from unstructured data sources. Studies focusing on knowledge representation, on the other hand, cover the development of standardized and structured formats that can be used to represent knowledge, which usually take the form of ontologies. Finally, studies on data federation aim to aggregate data from multiple and disparate sources. Together, these identified topics exhibit close correspondence to the need, identified in Chapter 1, for an intuitive knowledge representation that represents data from multiple, heterogeneous and disparate sources. In the remainder of this section each of the reviewed studies will be analyzed individually. This analysis is based on the data that was extracted from the reviewed studies, including the aim, the data sources and tools that were used, and the outcomes of each of the studies. Furthermore, studies were categorized according to the aforementioned topics. All the data that was extracted is systematically documented in the Excel progress sheet, which is available for download from Google Drive.

Among the studies that focus on knowledge extraction is Wei's (2012) work, which discusses and demonstrates the effect of data fragmentation, across multiple dimensions, on High-Throughput Clinical Phenotyping (HTCP). The specific part of this work that focuses on knowledge extraction is the work of Wei, Tao, Jiang, and Chute (2010). In this study, the authors propose a novel automated approach for identifying and characterizing patients with a specific phenotype. This approach employs natural language processing techniques to automatically extract concepts from the SNOMED-CT ontology in unstructured patient clinical notes. The extracted concepts are subsequently used as features in a Support Vector Machine (SVM) classification algorithm to automatically identify new patients with a particular phenotype based on their clinical notes. Wei et al. (2010) conclude their work with an evaluation of this approach, which utilizes clinical notes, by comparing it to conventional methods using structured data for HTCP. This evaluation showed excellent performance of the proposed new approach.
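To make the general idea concrete, the sketch below shows, under stated assumptions, how extracted concepts can serve as SVM features; it is not the pipeline of Wei et al. (2010), and the concept identifiers, labels, and patient data are hypothetical placeholders.

```python
# Minimal sketch of concept-based phenotype classification in the spirit of the
# approach described above: concepts extracted from clinical notes become
# binary features for an SVM. Concept extraction is assumed to have been done
# already; all identifiers and labels below are hypothetical examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Each "document" is the list of concept identifiers extracted from one patient's notes.
train_concepts = [
    "C0011849 C0027051 C0020538",   # hypothetical patient 1
    "C0011849 C0038454",            # hypothetical patient 2
    "C0020538",                     # hypothetical patient 3
]
train_labels = [1, 1, 0]            # 1 = has the target phenotype

vectorizer = CountVectorizer(binary=True)   # one binary feature per concept
X_train = vectorizer.fit_transform(train_concepts)

classifier = SVC(kernel="linear")
classifier.fit(X_train, train_labels)

# Classify a new patient from the concepts found in their notes.
X_new = vectorizer.transform(["C0011849 C0020538"])
print(classifier.predict(X_new))
```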

The work of Roberts, Straif, McKay, Statter and Cunningham (2008) also purely focuses on knowledge extraction. Their work is centered on the LarKC (Large Knowledge Collider) system, which is "a platform for massive distributed incomplete reasoning that […] is used in several case prototypes" (p. ii). Among these use cases is the assistance of LarKC in carcinogen research. Their work outlines the requirements of LarKC in this use case, in which it can be used to assist in the production of standard reference publications, or so-called monographs, that evaluate potential carcinogens, or in Genome-Wide Association Studies (GWAS) that aim to examine the association between genes and cancer. Key to effective support by LarKC in this use case is the extraction of knowledge from unstructured text, such as biomedical publications. To this end, Roberts et al. (2008) propose the use of semantic annotation, which automatically extracts entities and relations from unstructured text and matches and links these with equivalent concepts present in semantic models, such as ontologies.

The topics of knowledge extraction and knowledge representation are combined in the work of Marir, Said, and AlAlami (2014). In their work, they aim to address the issue of early discovery of disease conditions through the application of text mining to free-form text on social networks of people sharing their experiences in being diagnosed with Diabetes. Through the application of text mining, Marir et al. aimed to build a knowledge base of symptoms of and factors contributing to Diabetes. This knowledge base was finally represented through the design and development of an ontology describing the Diabetes disease in terms of its types, diagnosis, general impact, and high-risk factors.

In addition to focusing on both knowledge extraction and representation, Ye (2011) also focuses on the federation of data. In her work, Ye discusses the development of a system that represents knowledge about environmental and behavioral factors involved in human diseases, as well as body parts that are affected by, and symptoms that are caused by, diseases. The represented knowledge is stored using a core ontology that was developed during the research and federates several taxonomically organized sources, including MeSH, OMIM, and the UMLS. Finally, Ye also developed a knowledge-gathering tool, specific to the biomedical domain, that employs pattern learning and constraint-based reasoning to extract knowledge, in the form of new relations between concepts, from unstructured text based on the designed ontology. These new relations can subsequently be incorporated into the knowledge base to enrich the system.

In contrast to Ye's (2011) work, which covered all three topics, Mohammad and Benlamri (2014) solely focus on knowledge representation, though in the context of the development of a differential diagnosis model that is extensively discussed in their work. The developed differential diagnosis model is able to assist practitioners in the diagnostic process by providing diagnostic recommendations. There are two cooperating core components to the differential diagnosis model, the first one being the evidence-based recommender component, which employs (dynamic) rules derived from flexible clinical pathways to provide a recommendation. The second core component is a proximity-based recommender component, which employs data mining techniques to provide clinicians with a diagnostic prediction, as well as to generate, from training datasets, new rules for inclusion in the evidence-based component. Both of these recommender components make extensive use of the Disease Symptom Ontology and the Patient Ontology. The Disease Symptom Ontology describes the relations between diseases and symptoms and was developed by linking the Disease Ontology and the Symptom Ontology. The Patient Ontology, on the other hand, describes concepts and attributes related to patients and is used by the recommender to navigate a variety of medical documents. The differential diagnosis model was validated through its application to several test cases, which showed promising results.

Whereas Mohammad and Benlamri (2014) discuss the development of a knowledge representation in the context of their differential diagnosis model, Rahimi, Liaw, Taggart, Ray, and Yu (2014) discuss the validation of a knowledge representation. In their work they provide a validation of an ontology-based approach to the diagnosis and management of patients with Diabetes. Their approach utilizes the Diabetes Mellitus Ontology (DMO), which, according to clinicians, is "a realistic model of the real world of Diabetes diagnosis and management" (Rahimi et al., 2014, p. 4), to automatically compose ontology-based semantic queries that can be used to diagnose and manage patients with Diabetes. The validation was conducted using real world Electronic Health Record (EHR) data from a general practice and involved an assessment of the sensitivity and specificity (accuracy) according to three different ontology attributes. The obtained accuracy values were compared with a gold standard, which was set according to a manual audit of the same data. Results suggest that the DMO-based algorithm is sufficiently accurate to support a semantic approach. It should, however, be noted that the accuracy can be negatively affected by incomplete or incorrect data.

The federation of data from different sources, in combination with the representation of knowledge, plays an important role in the work of Buckeridge et al. (2012). In their work, the authors describe a population health record that allows the monitoring and assessment of the health status of a population, in the form of timely population health indicators that are automatically calculated at a high geographical resolution. This is achieved through the integration of multiple, heterogeneous and disparate clinical and administrative data sources. Beyond describing this population health record, the authors discuss the requirements and architecture of the infrastructure of the system as well as its initial implementations. Within this infrastructure, an important role is played by the indicator ontology, representing the public health indicators of a population, which was developed by the authors. The key role of this ontology lies in the fact that it enables the integration of data and knowledge, facilitates knowledge discovery and exploration, and provides a base for data manipulation and analysis.

Just like Buckeridge et al. (2012), Machado (2014) also focuses on both the representation of knowledge and the federation of data, though from a different perspective than Buckeridge et al. Whereas knowledge representation takes on a supportive role in the work of Buckeridge et al., it is the main outcome of the work of Machado, who aims to develop a semantic model that represents and integrates domain knowledge characterizing a disease and its prognosis process. This model is developed using Semantic Web technologies and through the federation of existing ontologies, among which are SNOMED-CT, the NCI Thesaurus, and the Ontology for Clinical Research. Additionally, Machado improved the results of knowledge exploration methods obtained with translational medicine datasets through the development of a methodology utilizing the knowledge contained in existing ontologies. Together these two contributions aim to advance the current knowledge about a disease and form the first step in the creation of a disease analysis framework for assisting doctors in the diagnosis and prognosis process of a disease.

Finally, the works of Pathak, Kiefer, Bielinski, and Chute (2012) and of Pathak, Kiefer, and Chute (2013) solely focus on the federation of data available across disparate sources. Both studies demonstrate the use of Semantic Web and Linked Data technologies for representing patient data from Electronic Health Records (EHR) as RDF. The data is subsequently exposed by making use of the SPARQL protocol, which, in turn, enables querying and the federation of such private patient data with publicly available data from the Linked Open Data Cloud. Whereas Pathak et al. (2012) federate such private EHR data with other private phenotype data for identifying subjects genotyped with Diabetes Mellitus Type 2, Pathak et al. (2013) federate the private EHR data with public data from DrugBank to identify potential drug-drug interactions for widely prescribed cardiovascular and gastroenterology drugs. Resulting from both studies is a proof-of-concept system that represents patient clinical data and genotype data as RDF and exposes it via a SPARQL endpoint for access and querying, as such enabling the federation of the data with other available public data.

It becomes clear from the analysis of the reviewed studies that the scope of studies in the field of interest is relatively broad, with studies focusing on rather distinct subjects. Despite this wide scope, each of the studies covers at least one of three recurring topics, and covering multiple of these topics is not uncommon either. Even though there is a close correspondence between these identified recurring topics and the identified gap motivating this research, this gap is (largely) left unaddressed by the reviewed studies. This lack of coverage of the identified gap is an additional source of motivation for conducting the research at hand, aiming to further explore the identified gap.


4 Data Source Characterization

Central to the development of a model is the data that will eventually be represented in the model and thus needs to be utilized for building and populating it. Before the actual development of a model can start, it is thus imperative to identify data on which the model can be based, as well as to identify the sources from which the data can be obtained. With the research question in mind, the identification, and subsequent selection, of data sources that provide disease related information, pertaining to, for example, symptoms, inheritability, and genetics of a disease, thus are the first key activities in the development process of the disease related information model. These activities are central to this chapter.

In Section 4.1 we will outline the sources that were identified during the search process. One source will subsequently be selected to serve as a base for the development of the model. The selection of this source as well as a motivation for our choice will be provided in Section 4.2.

4.1 Data Source Identification

The search for data sources returned a wide variety of potentially relevant data repositories, consisting of structured sources on the one hand, and unstructured sources on the other. The structured sources usually take the form of databases and ontologies that present disease related information in a structured and (pre)defined format, as such facilitating the automated processing of data. Unstructured sources, on the other hand, generally come in the form of websites aimed at informing the general public about a particular disease, and include well-known biomedical portals such as Medline Plus2 and WebMD3. Such sources present the disease related information in an unstructured format as free-form text, which impedes the ability to automatically process the data provided by these sources. Considering the required effort to extract the data from such unstructured sources, in combination with the limited timespan of the research, we decided to solely focus on the incorporation of structured data sources into our disease related information model.

Three types of sources can be distinguished among these structured data sources representing disease related information, being (1) standardized terminologies or vocabularies, (2) ontologies, and (3) databases. In the remainder of this section we will discuss the identified data sources of each of these types individually.

4.1.1 Standardized Terminologies

A large part of the identified structured data sources are standardized terminologies or vocabularies that are widely used across the (bio)medical domain to provide standard definitions of the terms used in the field as well as their hierarchical relationships. Several well-known standardized terminologies that were identified are the International Classification of Diseases (ICD) versions 9 and 10, which are terminologies developed by the World Health Organization (WHO) to define (a hierarchy of) diseases, disorders, injuries and other health related conditions (World Health Organization, n.d.); Medical Subject Headings (MeSH), which is the U.S. National Library of Medicine's controlled thesaurus of (bio)medical terms that is used to index scientific articles in the MEDLINE and PubMed bibliographic databases (U.S. National Library of Medicine, 2015b); and the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT), which provides a standardized and machine-interpretable way to represent clinical terminology, supporting the development of high-quality content in health records (IHTSDO, n.d.). Beyond the standardization of the terms used and their taxonomy, these standardized vocabularies do not provide any disease related information. This, however, does not necessarily limit the potential use of such terminologies in our model, as the terms defined by a vocabulary can be re-used in our model to represent the different concepts, such as diseases, symptoms, or drugs.

2 For details see: http://www.nlm.nih.gov/medlineplus/
3 For details see: http://www.webmd.com/

In addition to the separate terminologies in the (bio)medical field, two structured data sources were identified that integrate the abundance of separate (bio)medical terminologies, being the Unified Medical Language System (UMLS) Metathesaurus and the National Cancer Institute (NCI) Metathesaurus. Whereas the UMLS Metathesaurus integrates 150 terminologies, representing the relations among almost 3.2 million concepts (U.S. National Library of Medicine, 2013; U.S. National Library of Medicine, 2015d), the NCI Metathesaurus represents 22 million relationships among 2 million concepts mapped from 75 terminologies (National Cancer Institute, n.d.). These two integrated terminologies both represent the concepts and the (hierarchical) relationships among the concepts of their source terminologies in an integrated fashion and are freely accessible4. Integration is achieved by clustering synonymous concepts across terminologies and through the inheritance of relations from source terminologies (Bodenreider, 2004). This integration subsequently enables the translation among terms and relations in the composing terminologies. While the UMLS Metathesaurus is mainly restricted to hierarchical parent-child relationships (U.S. National Library of Medicine, 2009b), the NCI Metathesaurus is not, and also represents other relationships among concepts where, for example, one concept (e.g. Neoplastic Cell) is an abnormal cell of another concept (e.g. Breast Carcinoma). The lack of such disease related information in the UMLS Metathesaurus restricts its usage solely to the reuse of its concepts to represent the concepts in the developed model. The NCI Metathesaurus, on the other hand, could be used as a comprehensive source that also provides relationships among concepts, as such providing the disease related information that is aimed to be represented (e.g. symptoms, drugs, etc.). Considering their integrative nature, terms from these metathesauri are preferred for inclusion in the developed model over terms from individual terminologies.

4.1.2 Ontologies

In addition to the standardized terminologies, numerous of the identified structured data sources take the form of ontologies, which semantically define the terms used in a particular domain as well as the relationships between these terms in a standardized Semantic Web (file) format, such as the Resource Description Framework (RDF) or the Web Ontology Language (OWL).

A well-known example of an ontology that was identified during our search, is the Disease Ontology5. This open-source ontology provides a unified hierarchical representation of the concept of disease through the semantic integration of disease and medical vocabularies including MeSH, ICD, NCI Thesaurus, SNOMED CT, and OMIM, in the form of a directed acyclic graph (Schriml et al., 2012). Furthermore, the disease ontology is commonly used for disease annotation, acts as a standard representation of human disease in biomedical ontologies, and provides a cross-mapping between resources (Schriml et al., 2012). Due to the sole representation of hierarchical relationships among diseases, and the lack of disease related information, the potential use of the Disease Ontology in our model is restricted to the re-use of the terms defined in the ontology.

Another frequently used ontology in the (bio)medical field that was identified during the search process is the Gene Ontology6. This ontology aims to provide a description of the roles of genes and gene products in any organism, in the form of a structured, precisely defined, common, and controlled vocabulary, as such enabling the annotation of gene and protein sequences (Ashburner et al., 2000). As of today, the Gene Ontology represents over 40,000 biomedical concepts (Gene Ontology Consortium, n.d.). Relationships described within the vocabulary have a hierarchical character, though they are not solely limited to parent-child relationships, as relations that, for example, indicate the regulation of one gene by another are also represented within the ontology. To this extent, the Gene Ontology thus is a potentially relevant source to be included in the developed model. The hierarchical nature of the represented relationships leads to the Gene Ontology taking on the form of a directed acyclic graph (Ashburner et al., 2000).

4 For details see https://ncim.nci.nih.gov/ncimbrowser/ (NCI Metathesaurus) and https://uts.nlm.nih.gov/home.html (UMLS Metathesaurus; requires a free UMLS License)
5 For details see http://www.disease-ontology.org
6 For details see http://geneontology.org/

Our search for structured data sources also returned the NCI Thesaurus7, which is a biomedical ontology defining terms used in clinical care, translational and basic research, and public information and administrative activities, as well as the relationships among these terms. The ontology makes use of a logical framework that supports reasoning and, as such, enables the inferencing of relationships among concepts. In 2007, the ontology included approximately 43,000 biomedical concepts, related to each other through 100 types of relationships, resulting in around 135,000 asserted and inherited relationships among concepts (Sioutos et al., 2007). These relationships are not restricted to hierarchical parent-child relationships, but also include other semantic relationships that, for example, relate diseases to a rich set of molecular, pharmaceutical, clinical, and biological concepts (Sioutos et al., 2007). The NCI Thesaurus thus provides potentially relevant disease related information, making it a rather useful source for inclusion in the developed model.

A final ontology that was retrieved during our search was the UMLS Semantic Network8. Whereas the other identified ontologies focus on defining a set of concepts and their interrelations, the Semantic Network aims to categorize the concepts that are represented in the UMLS Metathesaurus according to 134 so-called “semantic types” as well as defines 54 associative relationships that can exist among the different types of concepts, both of which are hierarchically organized (McCray, 2003). The Semantic Network thus provides “an overarching conceptual framework for all UMLS concepts” (McCray, 2003, p. 81). Among the relationships represented in the network are many non-hierarchical relationships (U.S. National Library of Medicine, 2009a), which could potentially be utilized to represent disease related information, such as “causes”, “predisposes”, and “affects”. With the standardization of the concepts and relationships that is provided by the Semantic Network in mind, this source is thus considered to be relevant for inclusion within our model. In order to take full advantage of its potential, the relationships in the Semantic Network, however, need to be instantiated by using the relationships in statements relating two concepts. To this end, the terms defined in the Semantic Network thus would need to be used in combination with a source that provides such relationships between concepts.

4.1.3 Databases

A third type of structured data sources are databases that provide disease related information in a relational format. Among the identified databases is OMIM9, or Online Mendelian Inheritance in Man, which is a freely accessible, comprehensive database on human genes and genetic disorders based on biomedical literature (Hamosh et al., 2002). OMIM extensively focuses on the representation of a wide range of disease related information, including information about the inheritance, pathogenesis, and diagnosis of diseases, which is provided through full-text summaries. OMIM represents its concepts in a flat structure, meaning that there are no direct relationships among the concepts in the database (Rappaport et al., 2013), even though relationships among a particular disease and other biomedical concepts might be encoded in the textual summaries in OMIM. As of today, the OMIM database contains over 23,000 entries (Johns Hopkins University, 2015), with the disease related information of each entry being elaborately documented.

7 For details see https://ncit.nci.nih.gov/ncitbrowser/
8 For details see https://uts.nlm.nih.gov/home.html (requires a free UMLS License)
9 For details see http://www.omim.org/

During the search for data sources, we also identified MalaCards, which is a freely accessible, integrated database of human diseases and their annotations (Rappaport et al., 2013). Just like OMIM, MalaCards also puts a heavy emphasis on the representation of disease related information, though MalaCards covers human disease in general, whereas OMIM solely focuses on genetic diseases. Among the disease related information that is represented in MalaCards is disease-specific information on drugs, clinical features, phenotypes, associated genes, and related diseases. Additionally, associations between diseases are represented in the form of a related disease network. At the time of writing, the MalaCards database contains nearly 19,000 diseases (Weizmann Institute of Science, 2015b), with their annotations being automatically retrieved by merging and mining a wide range of sources (Rappaport et al., 2013). This range of sources includes text-mined as well as unstructured sources, among which are the Disease and Gene Ontologies, the NCI Thesaurus, the standardized terminologies of MeSH, ICD, and SNOMED CT, and the UMLS and OMIM, all of which were discussed above (Weizmann Institute of Science, 2015a).

Finally, the search process also returned SemMedDB, which is a database purely focused on the representation of relationships (“semantic predications”) among biomedical concepts that are extracted from the titles and abstracts of articles in the PubMed (bio)medical bibliographic database (Kilicoglu, Shin, Fiszman, Rosemblat, & Rindflesch, 2012). As of today, SemMedDB contains about 70 million relationships, among 1.4 million concepts, extracted from nearly 24.5 million publications (Lister Hill National Center for Biomedical Communications, 2014b). Extraction of the relationships is conducted using SemRep, a tool that employs extensive Natural Language Processing techniques to extract subject-predicate-object triples from free form text (Rindflesch & Fiszman, 2003). SemRep maps subject and object pairs to UMLS Metathesaurus concepts and / or Entrez-Gene terms, assigning each concept one or more of the 134 semantic types defined in the UMLS Semantic Network. Furthermore, predicates are mapped to the 54 relationships defined in the UMLS Semantic Network (Rindflesch & Fiszman, 2003). SemMedDB thus applies the semantic types and relationships that were defined in the UMLS Semantic Network to represent actual disease related information, such as concepts that are a cause of a particular disease, or concepts that treat a particular disease. SemMedDB furthermore acts as the backbone of Semantic MEDLINE, a Web-application that visualizes the semantic relationships among biomedical concepts as represented in SemMedDB (Kilicoglu et al., 2008).
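To illustrate the kind of statement SemMedDB stores, the snippet below sketches the shape of a single semantic predication as described above. The field names and example values are hypothetical and chosen for readability; they are not SemMedDB's exact column names.

```python
# Illustrative shape of one SemMedDB semantic predication: a subject-predicate-
# object triple over UMLS concepts, with semantic types and provenance (PMID
# and source sentence). All field names and values here are hypothetical.
example_predication = {
    "subject_cui": "C0028754",      # e.g. Obesity
    "subject_semtype": "dsyn",      # e.g. disease or syndrome
    "predicate": "PREDISPOSES",     # one of the UMLS Semantic Network relations
    "object_cui": "C0011860",       # e.g. Diabetes Mellitus, Non-Insulin-Dependent
    "object_semtype": "dsyn",
    "pmid": "12345678",             # hypothetical PubMed identifier
    "sentence": "Obesity predisposes patients to type 2 diabetes.",
}
```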

Considering the strong emphasis of the aforementioned databases on the representation of disease related information, all three databases have the potential to be included within our model.

4.2 Data Source Selection

As became clear in Section 4.1, the search for data sources resulted in a set of data sources that provide disease related information and / or are potentially relevant to be included in our model. Before proceeding with model development, it is, however, key to designate one data source to form the base of the model.

The network-like representation of disease related information that is being developed consists of two components, namely concepts and relationships between these concepts. Concepts can be sourced from the identified standardized terminologies, such as ICD, MeSH, and SNOMED CT, or from the identified ontologies, such as the NCI Thesaurus, the Disease Ontology, and the Gene Ontology. Instead of sourcing concepts from one or multiple individual terminologies, another option is to source the concepts from the UMLS or NCI Metathesaurus, both of which integrate, among many others, the aforementioned sources into a single terminology. Using these metathesauri allows us to take advantage of the concepts represented in all the separate terminologies, broadening the scope of the terms that are covered and, as such, expanding the (potential) knowledge base of the model. Therefore, the use of either the UMLS or NCI Metathesaurus to define the concepts in the model is preferred over the use of the separate terminologies.

Relationships among the concepts, on the other hand, can be sourced from the data sources that contain disease related information, among which are OMIM, MalaCards, and SemMedDB. Additionally, the UMLS and NCI Metathesauri also contain relationships among concepts, which are inherited from their source terminologies, circumventing the need to source the relationships from the individual terminologies. It appears that disease related information is either presented in a structured format (as in SemMedDB and the metathesauri), clearly relating two (bio)medical concepts to each other through a relationship, or in an unstructured format using free-form text (as in OMIM), describing relationships among concepts that are not explicitly designated as such. With the aim of the research in mind, data sources providing this information in a structured format are preferred over data sources using an unstructured format. To this end, OMIM is thus not considered to serve as the primary source for the relationships in the developed model, though it might be used as a supplementary source at later stages in the model development. Even though part of the disease related information in MalaCards takes an unstructured format, the inclusion of information presented in a structured format does not lead to excluding MalaCards at this stage of the selection process.

Considering that the overarching aim of the research is to aid (bio)medical researchers in their knowledge exploration efforts, and that this (bio)medical knowledge originates from published scientific research, it is preferred that the relationships in the model are directly derived from the (bio)medical literature. SemMedDB is the only identified source that satisfies this auxiliary requirement, since the structured statements in its database are derived from the titles and abstracts of articles in PubMed. Additionally, provenance data, indicating from which article and sentence a statement is derived, is also stored in SemMedDB, enabling statements to be traced back to their source. Relationships in the UMLS and NCI Metathesauri, and in MalaCards, on the contrary, are aggregations of the relations represented in the source vocabularies, inhibiting the ability to trace relationships back to their origins. To this end, SemMedDB is selected as the primary source of disease related information for incorporation into the developed model. This choice is further motivated by the fact that SemMedDB (70 million statements) is considerably larger than the other identified sources containing disease related information (MalaCards: unknown, NCI Metathesaurus: 22 million relationships, UMLS Metathesaurus: 3.2 million concepts), making SemMedDB the most extensive source containing a wealth of disease related information. Finally, the wide scope, covering terms across the entire biomedical domain, as well as the ease of accessing and obtaining SemMedDB, with a dump of the database being readily available, also played a role in the choice for SemMedDB.


5 Information Graph Silos – Data Gathering & Preprocessing

Given that SemMedDB is the data source that was selected to form the base of the developed model, the next step in the development of the disease related information model involves gathering the data as well as preprocessing it, which comprises the design of a standard data structure for the model and a mapping to convert the data to this standardized structure. Through the completion of this preprocessing step, the data will be prepared for the actual population of the model. These data gathering and preprocessing steps are the focus of this chapter.

The data gathering step will be covered in Section 5.1. To facilitate preprocessing of the data in the subsequent step, an outline of the database structure will also be provided in Section 5.1. Section 5.2 will subsequently cover the data preprocessing, including the design of a standard data structure and the development of a mapping to convert the data into the desired structure.

5.1 Data Gathering

Data was gathered by downloading the SQL dump of SemMedDB, which is provided free of charge to holders of a UMLS license11, from the website of the Semantic Knowledge Representation project (Lister Hill National Center for Biomedical Communications, 2014b). Given the SQL format of the data in SemMedDB, MySQL Workbench (6.3) and MySQL Server (5.6) were used to locally import, host, and query the 49-gigabyte dump of SemMedDB.
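As a concrete illustration of this setup, the sketch below shows one way the locally imported dump could be accessed from Python, assuming a running MySQL server and a database named "semmeddb"; the database name and credentials are placeholders, not the configuration used in this research.

```python
# Minimal sketch of accessing a locally imported SemMedDB dump, assuming the
# dump was loaded into a MySQL database named "semmeddb" (name and credentials
# below are placeholders).
import mysql.connector

connection = mysql.connector.connect(
    host="localhost",
    user="semmed_user",     # placeholder credentials
    password="secret",
    database="semmeddb",
)
cursor = connection.cursor()

# Count the statements in the PREDICATION table as a quick sanity check.
cursor.execute("SELECT COUNT(*) FROM PREDICATION")
print("Number of predications:", cursor.fetchone()[0])

cursor.close()
connection.close()
```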

5.1.1 Database Structure

The design of the relational SQL database containing SemMedDB's data is shown in the Entity-Relationship diagram presented in Figure 2. Central to the database are the concepts, which are related to each other through predicates. The UMLS concepts used in the database are stored in the CONCEPT table. Each of these concepts can have one or multiple semantic types, as defined in the UMLS Semantic Network, with each version of a concept with a specific semantic type having its own entry in the CONCEPT_SEMTYPE table. These semantic type specific concept instances subsequently act as arguments in one or multiple statements, as shown in the PREDICATION_ARGUMENT table, with each statement having two arguments: a subject and an object. The statements that relate a subject to an object through a predicate are represented in the PREDICATION table. Each statement, in turn, is derived from one or multiple sentences, which are represented in both the SENTENCE_PREDICATION and SENTENCE tables, with the former representing the sentences in the context of the relationships in the PREDICATION table and the latter solely containing the sentences. Furthermore, each of the sentences is uniquely related to a publication; publications are represented in the CITATIONS table. Finally, the PREDICATION_AGGREGATE table aggregates the information from all these separate tables for easy access. This table represents all the statements, including their subject, predicate, and object, and has one entry for each of the sentences from which a particular statement is derived. An extensive description of the contents of each of the tables is provided in SemMedDB's online documentation, available from the Lister Hill National Center for Biomedical Communications (2014a).

11 A UMLS License can be requested free of charge from the U.S. National Library of Medicine at https://uts.nlm.nih.gov/
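The sketch below illustrates how the table structure described above might be queried for statements about a given disease, using the aggregated PREDICATION_AGGREGATE table. The column names (SUBJECT_NAME, PREDICATE, OBJECT_NAME, PMID) are assumptions based on the description in this section; the actual column names may differ between SemMedDB versions.

```python
# Sketch of querying the PREDICATION_AGGREGATE table for statements that
# mention diabetes. Column names are assumptions based on the description
# above and may need to be adjusted for the actual dump.

def diabetes_statements(cursor, limit=20):
    """Return (subject, predicate, object, pmid) rows mentioning diabetes."""
    cursor.execute(
        f"""
        SELECT SUBJECT_NAME, PREDICATE, OBJECT_NAME, PMID
        FROM PREDICATION_AGGREGATE
        WHERE SUBJECT_NAME LIKE %s OR OBJECT_NAME LIKE %s
        LIMIT {int(limit)}
        """,
        ("%Diabetes%", "%Diabetes%"),
    )
    return cursor.fetchall()

# Usage (with a cursor obtained as in the previous sketch):
# for subj, pred, obj, pmid in diabetes_statements(cursor):
#     print(subj, pred, obj, "PMID:", pmid)
```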

5.2 Data Preprocessing

As was introduced in Chapter 1, the large amounts of dynamic and heterogeneous information that are nowadays available across a wide range of sources call for the use of linked data. To this end, we aimed to develop our model using Semantic Web technologies. According to Berners-Lee et al. (2001) and Antoniou, Groth, Van Harmelen, and Hoekstra (2012), the Semantic Web consists of three main components, being (1) labeled graphs that encode meaning by representing concepts and the relations among them, and are usually expressed as (subject-predicate-object) triples in RDF; (2) Uniform Resource Identifiers (URIs) to uniquely identify the items in the datasets as well as to assert meaning, which is reflected in the design of RDF; and (3) ontologies to formally define the relations that can exist among data items. In order to develop our model using the Semantic Web, the existence of these three components in the model needed to be ensured. Processing the data in SemMedDB such that these three components exist was the main aim of the preprocessing stage.
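As a toy illustration of the first two components, the sketch below builds a one-triple labeled graph in which every resource is identified by a URI. The namespace and concept names are illustrative placeholders, not the identifiers used in the thesis.

```python
# Toy example of a labeled graph of subject-predicate-object triples with
# URI-identified resources. The namespace and concept names are placeholders.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/biomed/")

graph = Graph()
graph.add((EX["Obesity"], EX["predisposes"], EX["DiabetesMellitusType2"]))

print(graph.serialize(format="turtle"))
```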

5.2.1 Ontology Design

Before being able to generate the labeled graphs from the database and ensure the use of URIs, an ontology needed to be developed that represents the desired data structure of these graphs. This ontology should define the data-items, as well as the relations among them, that we want to represent.

Considering that the model should represent the statements in SemMedDB, and their provenance data, as an RDF graph, it is key for the ontology to closely resemble SemMedDB's database design. Fortunately, prior work has been conducted in this area by Tao, Zhang, Jiang, Bouamrane, and Chute (2012). In their work, Tao et al. (2012) aim to optimize the organization and representation of Semantic MEDLINE data (SemMedDB) for translational science studies by reducing redundancy through the application of Semantic Web technologies. More specifically, Tao et al. (2012) aim to achieve this by representing the concepts and associations in SemMedDB as RDF. Part of this work involved the development of a desired RDF data structure. An instance of the ontology that the authors developed is shown in Figure 3. In their ontology, Tao et al. (2012) represent each subject-predicate-object triple, also known as a statement, in SemMedDB as an instance of the association class. Subsequently, the subject, predicate, and object of a statement are related to the association using the object properties hass_name, haspredicate, and haso_name respectively. Finally, the provenance data pertaining to a statement is associated to the association class using the hasPMID property, which presumably is a datatype property considering the fact that it has an ID as value.

Figure 3 – Instantiated ontology of the data in SemMedDB as developed by Tao et al. (2012).
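To make this association pattern concrete, the sketch below instantiates a single hypothetical association along the lines just described. Only the property names (hass_name, haspredicate, haso_name, hasPMID) come from the description above; the namespace URI, class name casing, and all values are placeholders and may differ from Tao et al.'s actual ontology.

```python
# Hypothetical instantiation of the association pattern described above: one
# association instance links a subject, predicate, object, and a PMID.
# The namespace URI, the "association" class URI, and all values are
# placeholders; only the property names come from the description in the text.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

TAO = Namespace("http://example.org/tao2012/")

g = Graph()
association = TAO["association_1"]

g.add((association, RDF.type, TAO["association"]))
g.add((association, TAO["hass_name"], TAO["Obesity"]))
g.add((association, TAO["haspredicate"], TAO["PREDISPOSES"]))
g.add((association, TAO["haso_name"], TAO["DiabetesMellitusType2"]))
g.add((association, TAO["hasPMID"], Literal("12345678")))   # hypothetical PMID

print(g.serialize(format="turtle"))
```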

Despite successfully decreasing the redundancy of the information in SemMedDB, two shortcomings can be identified in the ontology that was developed by Tao et al. (2012). The first is the limited amount of information that is represented by the ontology, compared to the information that is available in SemMedDB. As an example, consider a subject or object entity, corresponding to an entry in the CONCEPT table of SemMedDB: whereas SemMedDB contains the name, an arbitrary ID, the UMLS unique concept identifier (CUI), the semantic types, and potential Genetics Home Reference (GHR) and OMIM identifiers of a concept, the only information represented in the ontology is the name of the concept. Even though this shortcoming might not directly affect the work conducted by Tao et al. (2012), it does limit our work, as the information presented in the ontology forms the foundation of the developed model. Additionally, it impedes the ability to incorporate external resources into the model, as retrieving the appropriate entities from these sources requires the use of unique identifiers.

A second shortcoming is the lack of reuse of terms defined in existing vocabularies, which is one of the founding principles of the Semantic Web (Shadbolt, Hall, and Berners-Lee, 2006). In their ontology, Tao et al. (2012) define all classes and terms themselves, whereas equivalent classes might already exist in the Web of Data. Such reuse facilitates the linking of data into a Web of Data, which is an overarching goal of the Semantic Web (Berners-Lee et al., 2001).

Despite the limitations in the work of Tao et al. (2012), their ontology was taken as a starting point, as well as an opportunity for improvement and extension. To this end, an ontology was developed that addresses the identified shortcomings by representing most of the information contained within SemMedDB, as well as by reusing terms from as many existing ontologies as possible. The ontology that was developed is shown in Figure 4, which shows that it takes on a structure similar to the ontology presented by Tao et al. (2012).

The ontology was developed in the Web Ontology Language (OWL2), which is a more expressive extension of RDF and RDFS that provides extensive reasoning support (Antoniou, Groth, Van Harmelen, & Hoekstra, 2012), using Protégé12. The ontology is published on a Persistent Uniform Resource Locator (PURL) domain, which allows the underlying Web address of a resource to change without affecting the availability of the systems that depend on this resource (PURL Home Page, n.d.). The ontology is published on its base URI: http://purl.org/net/fcnmed. It can either be downloaded or directly imported into Protégé from this URL. In the remainder of this section we outline the ontology structure, according to the need for RDF reification and the reuse of existing vocabularies.

5.2.1.1 RDF Reification

In the conventional use of RDF, two data items are related to each other via a predicate acting as a property, resulting in a subject-predicate-object triple (Schreiber & Raimond, 2014). In the case of SemMedDB, this results in triples that relate two concepts to each other through a relation: concept relation concept.

12 For details see http://protege.stanford.edu/

Figure 3 – Instantiated ontology of the data in SemMedDB as developed by Tao et al. (2012).

RDF, however, does not allow one to directly make a statement about such a statement, for example to indicate the source of the statement. The semantics of RDF, on the other hand, allow one to make such a statement indirectly through reification of the statement (Hayes & Patel-Schneider, 2014). To this end, one explicitly asserts a statement to be an instance of the rdf:Statement class, using the rdf:type property. The subject, predicate, and object can then be asserted to the statement using the rdf:subject, rdf:predicate, and rdf:object properties respectively, with the subject, predicate, and object being class instances. In addition to simply asserting the subject, predicate, and object to a statement, reification also enables the statement to be involved as an entity in any other triple, thereby enabling one to make metastatements: statements about statements.
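As a minimal sketch, reifying a single SemMedDB-style statement in Turtle could look as follows; the fcn: prefix (a hash-style abbreviation of our ontology base URI), the instance names, and the final metastatement are assumptions made for illustration only.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fcn: <http://purl.org/net/fcnmed#> .   # assumed hash-style namespace

# Four triples reify the statement "Obesity CAUSES Type2Diabetes".
fcn:statement_1  a              rdf:Statement ;
                 rdf:subject    fcn:Obesity ;
                 rdf:predicate  fcn:CAUSES ;
                 rdf:object     fcn:Type2Diabetes .

# The reified statement can now itself be the subject of further (meta)statements.
fcn:statement_1  fcn:confidence  "0.9" .   # hypothetical metastatement, not part of our ontology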

While reification is simple and effective, it lacks both efficiency, considering that the representation of one statement using reification requires four triples, and a formal semantics for connecting a statement and the resource describing it (Nguyen, Bodenreider, & Sheth, 2014). These issues are addressed by Nguyen et al. through an alternative approach to reification, which emphasizes the uniqueness of a relationship given a particular context. To this end, Nguyen et al. introduced the singleton property, which considers every property relating two data-items in a specific context to be unique and therefore represents it as a specific property instance of a generic property.

When, for example, making two statements that have the same predicate but different concepts, such as A affects B and C affects D, Nguyen et al. argue that the affects predicate is key to both statements considering their specific contexts. In order to make metastatements about these triples, traditional reification would require one to create an instance for the statement as a whole and subsequently assert to it the rdf:Statement type and a subject, predicate, and object. The singleton property, on the other hand, allows one to indicate that affects1 (connecting A and B) and affects2 (connecting C and D) are rdf:singletonPropertyOf the general affects property, thereby instantiating the property and allowing one to use this singleton property to make metastatements using only two triples. Despite reducing the number of triples required for making metastatements, as well as providing a formal semantics, the singleton property is not (yet) supported in ontology languages such as RDFS and OWL.
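By way of comparison, the sketch below renders the singleton property approach of Nguyen et al. in Turtle; note that rdf:singletonPropertyOf is the extension they propose rather than a term of the standard RDF vocabulary, and the ex: names are again illustrative.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .

# Each contextual use of "affects" becomes its own (singleton) property instance.
ex:A  ex:affects_1  ex:B .
ex:C  ex:affects_2  ex:D .

ex:affects_1  rdf:singletonPropertyOf  ex:affects .   # proposed term, not standardized
ex:affects_2  rdf:singletonPropertyOf  ex:affects .

# A metastatement now attaches directly to the singleton property.
ex:affects_1  ex:source  ex:article_123 .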

Considering that the provenance data in SemMedDB applies to statements as a whole, in combination with the lack of support for the singleton property in OWL, reification was thus necessary in order to represent this provenance data in the ontology. Tao et al. (2012) also recognized this need; however, they did not use the RDF Reification vocabulary as outlined by Hayes and Patel-Schneider (2014). Instead, they applied a custom vocabulary by using the Association class and the hass_name, haspredicate, and haso_name object properties. Our ontology, on the other hand, implements the RDF reification vocabulary.

As the statements contained in SemMedDB relate two UMLS concepts to each other, both the subject and object of an rdf:Statement are modelled as instances of a Concept class. The concepts are related to each other through one of 58 relationships that are identified by SemRep. The predicate of an rdf:Statement is therefore modelled as an instance of a Relation class, which contains 58 instances. This set of relationships consists of two disjoint subsets, with one subset containing 31 relationships directly derived from the UMLS Semantic Network, such as “causes” (Kilicoglu et al., 2012). The second subset consists of the remaining 27 relationships, which are negated versions of relationships in the first subset, such as “neg_causes”, referring to “does not cause” (Kilicoglu, Rosemblat, Fiszman, & Rindflesch, 2011). Relationships belonging to the negated set are prefixed with “NEG”, whereas all other relationships are considered to belong to the set of affirmed relationships. These two subsets of relations are represented in the ontology as two subclasses of the Relation class, being the AffirmedRelation and NegatedRelation classes respectively. A full list of descriptions and definitions of the relationships used is provided by Kilicoglu et al. (2011).
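A sketch of how this part of the class hierarchy could look in Turtle is given below; the fcn: prefix and the local names of the relation instances are assumptions for illustration, not the exact identifiers used in the published ontology.

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fcn:  <http://purl.org/net/fcnmed#> .   # assumed hash-style namespace

fcn:Relation          a owl:Class .
fcn:AffirmedRelation  a owl:Class ;  rdfs:subClassOf fcn:Relation .
fcn:NegatedRelation   a owl:Class ;  rdfs:subClassOf fcn:Relation .

# Two of the 58 SemRep relationships as instances of the appropriate subclass.
fcn:CAUSES      a fcn:AffirmedRelation .
fcn:NEG_CAUSES  a fcn:NegatedRelation .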

The provenance data in SemMedDB includes both the sentences from which a statement is derived as well as the publications in which these sentences occur. Reification of the statements enables the assertion of this provenance data to their respective statements. To this end, sentences are represented as instances of the Sentence class, which are related to the rdf:Statement class through a derivedFrom property. The articles in which these statements and sentences are contained, are represented as instances of an Articles, which are related to the rdf:Statement class through a source property. Furthermore sentences are related to articles through the partOf property, indicating that a sentence is part of an academic article.
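A minimal sketch of a reified statement with its provenance is given below, assuming illustrative instance identifiers and a hash-style fcn: namespace; the exact IRIs chosen for the derivedFrom, source, and partOf properties follow the vocabulary-reuse decisions discussed in Section 5.2.1.2.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fcn: <http://purl.org/net/fcnmed#> .   # assumed hash-style namespace

fcn:statement_1  a                rdf:Statement ;
                 rdf:subject      fcn:concept_1 ;
                 rdf:predicate    fcn:CAUSES ;
                 rdf:object       fcn:concept_2 ;
                 fcn:derivedFrom  fcn:sentence_42 ;    # sentence-level provenance
                 fcn:source       fcn:article_7 .      # article-level provenance

fcn:sentence_42  a           fcn:Sentence ;
                 fcn:partOf  fcn:article_7 .           # a sentence is part of an article

fcn:article_7    a           fcn:Article .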

In addition to the object properties discussed in this section, which relate classes to each other, a number of datatype properties, asserting data values (such as IDs) to class instances, are defined for each of the classes in the ontology as well. Collectively, these properties aim to represent as much information from SemMedDB in the ontology as possible. These datatype properties are described in Appendix C.

5.2.1.2 Vocabulary Reuse

With the Semantic Web’s emphasis on the reuse of existing vocabularies in mind, we aimed to reuse as many existing classes and properties in the developed ontology as possible (Shadbolt et al., 2006). To this end, all elements of the ontology, which include the classes and both the object and datatype properties, except elements from the RDF or RDFS namespaces, were checked for the presence of an already defined equivalent concept or property in existing ontologies. This was accomplished by making use of the online RDF vocabulary search and lookup tool vocab.cc13, which allows one to enter any term and returns the classes and properties that (partially) match it (Harth & Stadtmüller, n.d.). In general, the highest ranked term that corresponds to the role of the term in our ontology (e.g. class or property) was selected for reuse, unless specified otherwise. The search for existing terms in the end led to the incorporation of terms from three existing vocabularies, being (1) the Bibliographic Ontology, (2) the Dublin Core Metadata Terms, and (3) the Simple Knowledge Organization System. The terms included from each of these vocabularies are briefly outlined in the remainder of this section.

The Bibliographic Ontology (bibo) defines the concepts and properties used for describing publications and bibliographic references, such as quotes, books, and articles, on the Semantic Web (D’Arcus & Giasson, 2009). Among the terms from this vocabulary that are included in the developed ontology are the bibo:issn and bibo:pubmed properties, respectively defining the ISSN and PubMed ID of a publication. These properties were selected over equivalent properties in DBpedia due to the specific focus of the Bibliographic Ontology on bibliographical items. Additionally the bibo:AcademicArticle class was selected to represent the publications from which the statements in SemMedDB are sourced. This term was selected over the bibo:Article class due to its specificity and the fact that all citations referred to in SemMedDB are academic articles.
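As an impression of how these bibo terms describe a cited publication, consider the sketch below; the article identifier, PubMed ID, and ISSN values are made up for illustration.

@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix fcn:  <http://purl.org/net/fcnmed#> .   # assumed hash-style namespace

fcn:article_7  a            bibo:AcademicArticle ;
               bibo:pubmed  "12345678" ;        # dummy PubMed ID
               bibo:issn    "1234-5678" .       # dummy ISSN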

Whereas the Bibliographic Ontology focuses on defining bibliographic terms, the Dublin Core Metadata Terms (dc) aim to describe data about any type of data source (metadata) (Dublin Core Metadata Initiative, n.d.). Terms from this vocabulary that are incorporated into our ontology include dc:source, describing a resource (subject) that is derived from a related resource (object), and dc:isPartOf, describing a resource (subject) that is included in a related resource (object) (DCMI Usage Board, 2012). These terms were incorporated into the ontology because, respectively, the highest ranked term (sourceDb) did not semantically correspond to the relation we intended to describe (an article being the source of a statement), and the Dublin Core Metadata Terms are more specific than DBpedia.

Finally, the Simple Knowledge Organization System (SKOS) defines a standard way for representing knowledge organization systems, such as thesauri, subject heading systems, and taxonomies, in RDF, enabling these systems to be linked and shared via the Web (W3C, 2012). The only term from this vocabulary that was included in our ontology is the skos:Concept class, which is viewed as defining ideas or units of thought (Miles & Bechhofer, 2009).
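A minimal sketch of a SemMedDB concept typed as skos:Concept is shown below; the CUI-based local name and the use of skos:prefLabel for the concept name are illustrative assumptions rather than the exact modelling choices of our ontology.

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix fcn:  <http://purl.org/net/fcnmed#> .   # assumed hash-style namespace

fcn:concept_C0011849  a               skos:Concept ;
                      skos:prefLabel  "Diabetes Mellitus" .   # illustrative label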

5.2.1.3 Summary

In summary, a comprehensive ontology was developed to represent the statements relating (bio)medical concepts that are stored in SemMedDB. Additionally, the ontology models the provenance data, in terms of the scientific publications and sentences from which the statements were derived. The developed ontology builds on the work of Tao et al. (2012), improving their ontology by extending the information captured in it and by promoting the reuse of terms from existing vocabularies. An overview of the classes, and of the object properties between them, that are modelled in the ontology is provided in Table 1 and Table 2 respectively.

5.2.2 Data Mapping

The ontology developed in Section 5.2.1 defines the desired data structure for the developed model. Generating the labeled graphs from the relational data in SemMedDB, however, requires a mapping that specifies how the data in the database is matched and converted to the appropriate class instances, properties, and property values specified in the ontology.

5.2.2.1 D2RQ

Such a mapping can be developed using D2RQ, a declarative language for describing mappings between relational databases and RDF(S) and OWL ontologies (Bizer & Seaborne, 2004). D2RQ mapping files are RDF files, which are written in the Turtle syntax14. A mapping file enables RDF applications to access the content of a relational database as a virtual, read-only RDF graph.
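As an impression of what such a mapping looks like, the fragment below sketches a D2RQ class map and property bridge for the CONCEPT table; the connection details, column names, and URI pattern are assumptions for illustration and do not reproduce our actual mapping file.

@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix map:  <#> .

# Database connection (credentials are placeholders).
map:database  a d2rq:Database ;
    d2rq:jdbcDSN    "jdbc:mysql://localhost/semmeddb" ;
    d2rq:jdbcDriver "com.mysql.jdbc.Driver" ;
    d2rq:username   "user" ;
    d2rq:password   "password" .

# Each row of the CONCEPT table becomes an instance of skos:Concept.
map:Concept  a d2rq:ClassMap ;
    d2rq:dataStorage map:database ;
    d2rq:uriPattern  "http://purl.org/net/fcnmed#concept_@@CONCEPT.CUI@@" ;   # assumed URI pattern
    d2rq:class       skos:Concept .

# The concept's preferred name becomes an rdfs:label (column name assumed).
map:conceptLabel  a d2rq:PropertyBridge ;
    d2rq:belongsToClassMap map:Concept ;
    d2rq:property          rdfs:label ;
    d2rq:column            "CONCEPT.PREFERRED_NAME" .

Such a file can then be used by the D2RQ platform either to dump the database content to an RDF file or to expose it as a virtual RDF graph at query time.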
