
Applying concept networks for knowledge discovery

Wytze Jelle Vlietstra

January 30, 2014



Author:

Wytze J. Vlietstra, BSc.
Student number 5673593
wytze.vlietstra@gmail.com
MSc. student Medical Informatics
Academisch Medisch Centrum, Universiteit van Amsterdam

Mentors:

Erik M. van Mulligen, PhD.
Department of Medical Informatics, Erasmus Universitair Medisch Centrum

Erik A. Schultes, PhD.
Department of Human Genetics, Leids Universitair Medisch Centrum

Rein Vos, MD, PhD.
Department of Medical Informatics, Erasmus Universitair Medisch Centrum

Tutor:

Floris Wiesman, PhD.

Department of Medical Informatics, Academisch Medisch Centrum, Universiteit van Amsterdam

Location of Scientific Research project:

Department of Medical Informatics
Erasmus Universitair Medisch Centrum
Dr. Molewaterplein 50
3015 GE Rotterdam

Department of Human Genetics
Leids Universitair Medisch Centrum, Building 2
Einthovenweg 20
2333 ZC Leiden

Time period:


Contents

Samenvatting
Abstract

1. Introduction
   1.1 Problems
       1.1.1 Reasoning Problem
       1.1.2 Research problem
   1.2 Goal
   1.3 Research Questions
   1.4 Approach
   1.5 Outline

2. Existing implementations of Swanson’s ABC algorithm
   2.1 Methods
   2.2 Philosophical Background
   2.3 Text Source
       2.3.1 Corpus selection
       2.3.2 Titles
       2.3.3 MeSH terms
       2.3.4 Abstracts
   2.4 Analysis Method
       2.4.1 Matrix techniques
       2.4.2 Co-occurrence network
       2.4.3 Outlier detection
   2.5 Categorization and Ranking
       2.5.1 Categorization and Filtering
       2.5.2 Ranking
   2.6 Evaluation
       2.6.1 Recreating Swanson’s Results
       2.6.2 Prior-post analysis
   2.7 Future: Relationship mining
   2.8 Implications for own research
       2.8.1 Text source
       2.8.2 Analysis method
       2.8.3 Categorization and Ranking
       2.8.4 Evaluation

3. Knowledge discovery: Point-to-point
   3.1 Datasets
       3.1.1 EMC Knowledge Base
       3.1.2 Manually curated migraine dataset
   3.2 Methods
       3.2.1 Path retrieval
       3.2.2 PubMed Source analysis
       3.2.3 Time development analysis
       3.2.4 Implementation details
   3.3 Results
       3.3.1 Path properties
       3.3.2 PubMed data analysis
       3.3.3 Time development
   3.4 Discussion

4. Knowledge discovery: Migraine cloud
   4.1 Methods
       4.1.1 Network statistics
       4.1.2 Additional filters
   4.2 Results
   4.3 Discussion

5. Selection of output format
   5.1 Description and discussion of various output formats
       5.1.1 Lists of paths
       5.1.2 Time development plots
       5.1.3 Lists of compounds
       5.1.4 Graph
   5.2 Lessons learned
   5.3 Discussion

6. Discussion
   6.1 Main findings
       6.1.1 Swanson literature review
       6.1.2 Point-to-point
       6.1.3 Cloud approach
       6.1.4 Selection of output format
   6.2 Strengths and weaknesses
       6.2.1 Knowledge extraction
       6.2.2 Knowledge dissemination
       6.2.3 Knowledge base
   6.3 Relation to other work
   6.4 Recommendations for future research
       6.4.1 Knowledge Base
       6.4.2 Path analysis
       6.4.3 Time development
       6.4.4 Cloud approach

References

Appendices
   A. Preliminaries
       A.1 Supervised and unsupervised learning
       A.2 Concepts
       A.3 Text Mining
           A.3.1 Thesaurus
           A.3.2 Natural Language Processing
           A.3.3 Word Sense Disambiguation
       A.4 RDF
       A.5 Graph Databases
       A.6 Big Data
   B. Reverse Semantic Type Influence
   C. Connection and knowledge distribution in the knowledge base per concept


SAMENVATTING (SUMMARY)

Keywords: Literature-based knowledge discovery, Graph database, Concepts, Output format, Automated reasoning

Objectives: This thesis investigates how automated reasoning on structured and integrated knowledge can be applied for knowledge discovery in a network of concepts and their mutual relations developed at the Erasmus MC. A classic automated reasoning algorithm was developed by Swanson; it returns a list of associated terms that need to be ranked and filtered. For this purpose we wanted to find suitable algorithms and methods in the literature. In addition, we wanted to find out how Swanson’s algorithm, originally developed for literature-based knowledge discovery, can be applied to a concept network contained in a graph database. We also wanted to investigate which output format makes the results from this database as insightful and usable as possible for an end user.

Methods: In the literature review, implementations of Swanson’s algorithm were compared on various themes, such as text source, ranking algorithm, and evaluation technique. Both Swanson’s algorithm and a newly developed algorithm, which we have named our “cloud” approach, were applied for knowledge discovery. The cloud approach differs from Swanson’s algorithm in its use of aggregated, quantitative data and in its restrictions on which intermediate concepts are included. In addition, for the cloud approach not only the connection to migraine was relevant, but also the connections to selected concepts surrounding migraine. The results of the two algorithms were compared with an existing, manually curated set of biomarkers. For presenting the results we developed several formats and submitted them to an expert.

Results: The literature review yielded several ranking algorithms, but these proved not to be applicable to our database. With our first approach we were unable to find shared properties in the paths between the biomarkers and migraine. The cloud approach resulted in lists of candidates, in which we could retrieve 96 of the 99 biomarkers. When evaluating these lists with weighted ROC plots we achieved 75% of the score in the top 10% of the list, although this still amounted to more than 3600 candidates. For presenting the results we developed four formats.

Conclusions: From our literature study we conclude that the ranking algorithms described there are not applicable to our graph database. Because we were unable to find shared properties in the paths between the biomarkers and migraine, we are moreover forced to reject Swanson’s algorithm entirely as an automated reasoning algorithm for graph databases. The cloud approach, on the other hand, provides us with a complete result in the form of a rankable list of candidates, in which the aggregated quantitative data offer an intuitive and useful aid. The number of candidates it supplies remains high, however. As an output format, the list of candidates offers the highest data density and flexibility, with its aggregated quantitative data and its ability to let users rank and filter the candidates according to their own judgement.


ABSTRACT

Keywords: Literature based knowledge discovery, Graph database, Concepts, Output format, Automated reasoning

Objectives: This thesis investigates how automated reasoning can be applied for knowledge discovery on structured and integrated knowledge in a concept-based network developed at the Erasmus MC. A classic automated reasoning algorithm was developed by Swanson, and returns a list of associated terms which have to be ranked and filtered. We set out to investigate how Swanson’s algorithm, which was originally developed for literature based discovery, can be applied to a concept-based network encompassed in a graph database. Additionally we wished to learn suitable ranking algorithms and filtering methods from a literature review. We also wanted to examine which output format makes the results from this database as interpretable and usable as possible for end users.

Methods: In the literature review we compared implementations of Swanson’s algorithm on various themes, for example text source, ranking algorithm, and evaluation methodology. Both Swanson’s algorithm and a newly developed algorithm which we have named the “cloud” approach have been applied for knowledge discovery. The cloud approach differs from Swanson’s algorithm in its application of aggregated, quantitative data, and by its limitations on included intermediate concepts. Also, not only the connection to migraine was important for the cloud approach, but also the connections to selected migraine-adjacent concepts. The results of the two algorithms were compared to an existing manually curated set of migraine biomarker compounds. To present our results we developed multiple output formats which were offered to an expert.

Results: We learned of multiple ranking algorithms from the literature review, but found they were not applicable to our graph database. With our first approach we were unable to discern shared properties from the paths between the biomarker compounds and migraine. The cloud approach resulted in lists of candidate compounds, in which we retrieved 96 out of the 99 biomarker compounds. When evaluating these lists with weighted ROC plots we were able to achieve 75% of the score in the top 10% of the list, which nevertheless still contained more than 3600 candidates. We developed four output formats for presenting results.

Conclusions: Based on our literature review we conclude that the previously described ranking algorithms cannot be applied to our graph database. As we were unable to discern shared properties from the paths between the biomarker compounds and migraine, we have to reject Swanson’s algorithm as an automated reasoning algorithm for graph databases. Contrarily, the cloud approach results in a complete and rankable list of candidates, in which the aggregated quantitative data provide an intuitive and helpful tool. However, the number of candidates remains high. The list of compounds offers the highest data density and flexibility as an output format, with its aggregated quantitative data and its capabilities for users to rank and filter candidate compounds according to personal preference.


1. INTRODUCTION

Traditionally, biomedical knowledge has been disseminated in paper form. Disseminating knowledge in paper form requires researchers to read scientific articles, interpret them, and extract the knowledge they contain. Incorporating this new knowledge, the researcher develops an updated theory and performs experiments on a model to test the validity of this enhanced theory. If the updated theory proves to be valid, the researcher writes an article about this theory to disseminate the new knowledge.

However, the rate of publication and the number of articles available in PubMed, the main biomedical literature resource, keeps increasing steadily. With a growth from 17 million to over 21 million articles in only five years, and a growth rate that increases by 4 percent per year, it has become impossible for researchers to keep up with all developments, even in their respective fields [10, 26].

With the rise of information technology, a similar development can be observed with biomedical databases. Each database is employed by a wide variety of research and tailored to its own specific field and needs, requiring researchers who wish to find relevant knowledge to be familiar with the available databases as well as skilled in using them [61].

The human limitations in large-volume knowledge extraction have prompted the investigation of more efficient alternatives. The same advances in information technology as mentioned earlier have made automated reasoning an especially attractive alternative for knowledge discovery. Automated reasoning enables a computer to process structured data and reason according to pre-programmed logic. The benefits of automated reasoning are the ability to include more knowledge as well as the ability to create and analyse more complicated models than humans are capable of. One example of such an automated reasoning algorithm is Swanson’s ABC algorithm, which finds connections in texts between terms previously not associated with each other, by connecting them to each other through intermediate terms. For example, one article could state that protein A stimulates physiological process B, while another article states that physiological process B is responsible for disease C. Although there is no article connecting protein A to disease C, they are likely to be associated with each other. Swanson developed an algorithm to identify such previously unnoticed associations. In the process he described, called literature based discovery, associations are based on the co-occurrences of terms found in articles. However, such associations have to be pruned for relevance, which is dependent on the knowledge source and the desired application. To be able to perform such pruning, patterns between relevant associations have to be discerned.
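As a toy illustration of this transitive reasoning, the minimal sketch below infers a candidate A–C association from two explicitly stated co-occurrence pairs; the terms and pairs are invented placeholders, not data or code from this project.

```python
# Minimal sketch (hypothetical data): inferring an implicit A-C association
# from explicit A-B and B-C co-occurrences, in the spirit of ABC reasoning.

explicit = {
    ("protein A", "process B"),   # stated in one article
    ("process B", "disease C"),   # stated in another article
}

def implicit_triples(pairs):
    """Suggest (A, B, C) triples where A and C share an intermediate B
    but are never themselves paired explicitly."""
    suggestions = set()
    for a, b1 in pairs:
        for b2, c in pairs:
            if b1 == b2 and a != c and (a, c) not in pairs:
                suggestions.add((a, b1, c))
    return suggestions

print(implicit_triples(explicit))
# {('protein A', 'process B', 'disease C')}
```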

1.1 Problems

During the course of this project we will address two sets of problems. We will describe the broad questions of biomedical information management, and specifically investigate the preferred approach to automated reasoning. The results of this project will be used by an expert in his research into migraine, whereby he will be assessing their practical use for his research and possibilities for future applications.


1.1.1 Reasoning Problem

The current format in which the majority of knowledge is disseminated, i.e. articles containing free text, is ill suited for interpretation by computers. The solution lies in formalizing the knowledge contained in these articles. In the context of biomedical literature, formalizing knowledge means representing concepts such as diseases or proteins in a standardized structure. These concepts can then be connected to another standardized concept based on evidence found in literature or databases [5]. The result is an environment in which all knowledge about the concepts and their associations contained in the literature and databases is formalized in a standardized structure on which a computer is able to perform automated reasoning.

Large amounts of knowledge are stored in specific databases such as the Open Pharmacological Space, GO, and UNIPROT [73]. Currently, researchers must search for and combine this knowledge manually, in a process similar to the process of extracting knowledge from articles. The structures containing this knowledge are often incompatible, severely contributing to the fragmentation. As a result, automated reasoning cannot be performed reliably, as missing concepts or links between concepts are likely. This fragmentation issue subverts one of the primary goals of automated reasoning: the principle that all available knowledge can be included when creating models. Therefore, if automated reasoning is ever to be equivalent to human reasoning, this issue needs to be resolved. Graph databases have the capacity to contain enormous amounts of data, which does not have to conform to a pre-specified structure, and which they can access rapidly. As such, graph databases are capable of integrating information from a variety of sources, such as various databases or formalized texts. The integration of large amounts of knowledge from various databases and texts is a recent development in information science, and has been coined “Big Data”.

With their specific architecture, graph databases are highly suitable for automated reasoning, for which they have a large number of algorithms available. However, their capacities and their preferred approach for biomedical knowledge discovery using automated reasoning have not yet been developed nor tested. In this thesis we will test the automated reasoning and knowledge discovery capacities of an EMC-developed graph database containing knowledge from both texts and databases in a concept-based network.

1.1.2 Research problem

Besides addressing the problems defined by ourselves for this project, we also aim to address a broader question by helping a migraine expert with his research. The migraine expert has performed a manual literature review to identify a set of biomarker compounds associated with migraine. To obtain this set of biomarker compounds, several literature databases had to be searched and articles retrieved. In total, he read 1030 article abstracts and 122 full-text articles. To limit his search, he focussed on compounds measured in patients’ cerebral spinal fluid (CSF), excluding other biofluids. As can be imagined, this is a long and arduous process, requiring reading many thousands of pages of scientific literature. He was therefore interested in our capability to replicate his set of compounds through automated reasoning, and our capability in suggesting additional biomarker compounds. In addition to suggesting additional biomarker compounds, the migraine expert was also interested in our capability to confirm and quantify his selection of biomarker compounds in his manually curated migraine dataset. Should we prove capable, automated reasoning would provide him with a powerful aid for future research by saving many hours of manual labour and decreasing the chance of human error or oversight.


1.2 Goal

The goal of this project is to compare the results of a manual literature based discovery process, in the form of a manually curated migraine dataset, to an automated reasoning process performed on a concept-based network, using the EMC Big Data graph database. The capabilities of these automated reasoning processes, such as Swanson’s algorithm, to provide additional or novel knowledge to the end user will be tested. As Swanson’s algorithm is the traditional approach to knowledge discovery we will examine both its implementations and the methods of ranking and filtering associated terms described in the literature. Finally, to describe a suitable output format for such complex and voluminous data, we will discuss several possible formats.

1.3 Research Questions

1. Which methods for term ranking and filtering have been described for literature based knowledge discovery?

2. How can concept-based networks be applied for knowledge discovery?

3. What are the advantages and disadvantages of the various developed output formats for the data from concept networks?

1.4 Approach

1. To answer our first research question a literature review will be performed on Swanson’s classic literature based discovery algorithm. The literature review will investigate the various methods for ranking and filtering terms and examine their suitability for our approach by comparing their implementations to our knowledge base. Although our knowledge base follows a Big Data approach, which would make the inclusion of knowledge discovery from structured databases suitable, as of yet it primarily consists of literature based sources, having the UMLS semantic network at its basis, and with the majority of its relationships being text mined from PubMed.

2. To answer our second research question we will examine and test two automated reasoning algorithms for knowledge discovery on the EMC graph database. We test both Swanson’s classic algorithm, as well as a novel algorithm resulting from what we have named the cloud approach. To benchmark the automated reasoning capacities of these algorithms we compare our results to a manually curated migraine dataset developed at the LUMC.

3. To answer our third research question we will examine the possible output formats, discussing their various drawbacks and benefits. In addition to discussing the four output formats themselves, we will also discuss the response they elicited from the migraine expert.


1.5 Outline

Appendix A will provide background information on relevant techniques and tools. It is recommended you read this section if you are unfamiliar with the (literature based) knowledge discovery field. Chapter 2 will discuss the literature review, in which we examine the implementations of Swanson’s algorithm by comparing our graph database to these implementations, and we investigate their described methods for ranking intermediate and indirectly associated terms. Chapter 3 describes our first approach to knowledge discovery, in which we attempted to describe a pattern based on the paths connecting migraine to the compounds in the manually curated migraine dataset. Chapter 4 describes an alternative approach to knowledge discovery which employs network statistics and algorithms. In chapter 5 the various output formats which have been developed during this project will be discussed, after which a general discussion is presented in chapter 6. Additional figures and information are available in the appendices.


2. EXISTING IMPLEMENTATIONS OF SWANSON’S ABC ALGORITHM

Swanson was the first to discover that new associations between terms could be identified by applying an algorithm to biomedical literature, in a process later named literature based discovery (LBD) [52]. A librarian by trade, he first discovered the connection between Raynaud’s disease and fish oil by accident. By starting out with Raynaud’s disease, he found blood coagulation factors as intermediate terms, which led to fish oil as a possible treatment [48]. This discovery made him see the potential of text mining as a means to identify yet unnoticed associations between diseases and compounds, and led to another landmark discovery, the magnesium – migraine connection.

Most knowledge in texts is explicit, easily recognizable and extensively described, as it was the goal of the author to transfer it. However, there is also implicit knowledge, which is not explicitly stated in a single article, but can be logically inferred from a whole corpus of literature. Due to the vast volume of literature in the biomedical field, or simple oversights, this knowledge has not been inferred and subsequently made explicit. Two implicitly connected terms are never co-mentioned in the same article, but both are co-mentioned with the intermediate term. Swanson’s algorithm identifies logically connected, but not explicitly stated connections by examining the whole corpus of biomedical literature. His algorithm can also resurrect neglected discoveries, which might have been stated explicitly a small number of times, but have never been followed up [54].

In knowledge discovery, commonly used terms are hypothesis generating and hypothesis checking processes, or open and closed processes [67]. Hypothesis generating, or open, processes suggest connections between terms by applying an algorithm to a literature corpus. This process connects the term to new, previously unknown terms through intermediate terms. The resulting terms can then be filtered, ranked, and analysed in an attempt to describe new knowledge. The connection of these terms should then make logical sense to an expert, by which he discovers new knowledge. Hypothesis generation was Swanson’s original intended use [67].

With hypothesis checking, or closed, processes two terms are supplied for which the intermediate terms are mapped. Hypothesis checking is useful for quantifying the strength of a connection between two terms, providing an indication of the likelihood of the hypothesis connecting the unconnected terms. The intermediate terms can then be analysed both qualitatively and quantitatively to determine the strength of the connection between the two starting terms.

To formalize his manual approach, Swanson developed the ABC algorithm. In his ABC algorithm, the A and C-terms are the terms to be connected. These terms do not co-occur in the same text. The B-term(s) are the intermediate terms. The general idea is that the B-terms co-occur with both the A and C-terms, thereby connecting the A and C-terms (Figure 2.1). The number of B-terms can vary, but does not always indicate the strength of the connection. A large number of B-terms of course indicates a strong connection between A and C, but if they are very general their biological relevance will be limited. A small number of B-terms does not necessarily indicate a weak connection. If the B-terms make biological sense, it might be that the A and C-terms are connected.
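To make the open and closed procedures concrete, the sketch below runs both over a tiny, invented term-to-article index; the terms, article identifiers, and co-occurrence structure are placeholders, not results or code from this project.

```python
# Illustrative sketch of open (hypothesis-generating) and closed
# (hypothesis-checking) ABC discovery over a toy co-occurrence index.
from collections import defaultdict

# toy index: term -> set of article identifiers mentioning it
index = {
    "Raynaud disease":   {1, 2},
    "blood viscosity":   {2, 3},
    "platelet function": {1, 4},
    "fish oil":          {3, 4, 5},
}

def cooccurs(t1, t2):
    """Two terms co-occur if at least one article mentions both."""
    return bool(index[t1] & index[t2])

def open_discovery(a):
    """From A, collect B-terms co-occurring with A, then C-terms that
    co-occur with some B but never with A itself."""
    b_terms = {t for t in index if t != a and cooccurs(a, t)}
    candidates = defaultdict(set)
    for c in index:
        if c == a or c in b_terms or cooccurs(a, c):
            continue
        for b in b_terms:
            if cooccurs(b, c):
                candidates[c].add(b)
    return candidates  # C-term -> supporting B-terms

def closed_discovery(a, c):
    """Given A and C, list the intermediate B-terms linking them."""
    return {b for b in index
            if b not in (a, c) and cooccurs(a, b) and cooccurs(b, c)}

print(dict(open_discovery("Raynaud disease")))
# 'fish oil' is suggested, supported by 'blood viscosity' and 'platelet function'
print(closed_discovery("Raynaud disease", "fish oil"))
# {'blood viscosity', 'platelet function'}
```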


Fig. 2.1: Swanson’s ABC structure represented as a Venn diagram. In this figure the A and C-terms are connected by multiple B-terms.

Literature based knowledge discovery searches for undiscovered transitive relationships, in which A is logically connected to C through a biologically valid B-term. The principle of transitivity stems from mathematics and is intuitive, but its logic is complex when applying it to texts.

Swanson’s original discoveries were accidental, based on a manual process in which he exhaustively read the relevant literature from which he ultimately was able to describe these implicit connections [48, 51]. He recognized automation would considerably reduce the amount of manual labour and increase efficiency, and automated his algorithm in a program named Arrowsmith several years later [44, 55]. He called his process “more fortuitous than systematic” [50] and even after it was implemented in Arrowsmith “more or less systematic” [53]. Swanson explicitly stated that his process should be supervised, stating his algorithm should stimulate research, not codify it [55]. Even though the process has not yet been completely codified, a large number of steps that initially required human intervention have been automated.

The goal of this literature review is to examine the various automated implementations of Swanson’s algorithm and compare their approaches to knowledge discovery, especially investigating algorithms for ranking and filtering terms. The themes of this chapter’s sections have been chosen to best represent the various steps in the ABC algorithm’s process. Each section has been built up historically, to describe the evolution in the field.

Before these steps are discussed however, the philosophical backgrounds to literature based discovery are examined. Swanson himself extensively treated the implications and background of his algorithm, as described in section 2.2. However, he chose not to integrate these implications into his implementation of his algorithm, in which he was very pragmatic. Several other authors applied divergent philosophies in more complex implementations, which will also be described.

The aforementioned steps start with text source selection, discussed in section 2.3. After the optional literature corpus selection, one or more of their text fields has to be selected. The majority of the biomedical literature has multiple fields of text, and the utilization of each as a text source has its separate benefits and drawbacks, which will be discussed. Second, the analysis methods will be discussed in section 2.4. Techniques are broadly divisible between co-occurrence and matrix techniques. In this section several alternatives to the standard co-occurrence technique will be analysed. Third, post processing of the resulting terms by ranking and filtering them will be discussed in section 2.5. Many terms are rare and not relevant. Several algorithms have been investigated for ranking the terms in such a way that the relevant ones would be at the top for the researcher. Fourth, the evaluation methods will be discussed in section 2.6, while the implications for our own research will be discussed in section 2.8. Finally, we will reflect on the literature review, discussing broader observations which did not fit other sections.

In this chapter the following terminology will be used. The term “term” will be used in the context of the ABC-terms in Swanson’s algorithm. A term can therefore refer to a concept, a MeSH term, or a literal word or phrase. A single word or a multi-word unit used as a term in the ABC algorithm will be referred to as a word or a phrase, respectively.


2.1 Methods

Articles authored by Don Swanson containing “b term” and “implicit associations” were searched on Google Scholar and Web of Science (date: 01-07-2013). Articles citing the returned articles were again selected for containing “b term” and “implicit associations”. All returned articles were included.

The literature search returned 65 results, containing 28 unique peer reviewed articles. Seven excerpts from books, five theses, and five articles which were locked behind a paywall and therefore unavailable were excluded. Other entries like drafts and course materials were excluded as well. Seventeen additional peer reviewed articles were added by analysing the reference lists of the selected articles.

Table 2.1 will provide an overview of all named implementations. Some systems, such as BITOLA, Manjal, and Iridescent were not named when initially described, and only received their names in later publications. Also, due to the literature selection criteria, the overview is not complete. For an alternative overview see Weeber et al. [68].

In addition to literature based knowledge discovery, it is also possible to perform knowledge discovery on structured databases, which is a whole field in itself. However, as this chapter is based on Swanson’s algorithm, and Swanson performed knowledge discovery exclusively on literature sources, knowledge discovery from structured databases will not be discussed.

2.2 Philosophical Background

In this section the most prominent and interesting philosophical backgrounds and choices surrounding Swanson’s algorithm will be discussed. Some authors have prominently applied philosophical choices in their implementations. Other authors took a very pragmatic approach, with minimal philosophical backgrounds. Nonetheless, it was necessary to make a selection of the philosophical backgrounds and choices to discuss. Therefore, this section should not be considered an exhaustive treatment of all philosophical backgrounds, but a general overview of the most prominent and important ones.

Bruza et al. note that according to the definitions of the philosopher C.S. Peirce, Swanson’s algorithm is a manifestation of abduction [8]. Induction is merely the determination of a value from a known formula, while deduction is an evolution of the consequences of a hypothesis. Contrary to these, abduction is defined as a logical operation which induces a new idea. Therefore, the process of literature based discovery should be labelled abduction. The definition of this process is important due to the subtle difference in implications. The term “undiscovered public knowledge” implies that the knowledge is already there, just waiting to be found, independent of the searcher. It can be impartially induced [49]. However, when defining it as abduction, one stresses the novelty of the new ideas. Abduction does not allow the reasoning process to be separated from the person doing the reasoning; it is subjective. The relevance of subjectivity for Swanson’s algorithm will be further discussed below. Therefore abduction is the term that will be used for the remainder of this chapter to describe this process.

Swanson described his theories and thoughts about the background of the process in his article “Undiscovered public knowledge” [49]. Applying Popper’s philosophies about knowledge, Swanson discusses two aspects of undiscovered public knowledge: first, the critical approach to science, and second the distinction between objective and subjective knowledge.

The classical idea about science was that if one used only facts and observations, an objective and value-free theory could be formulated. This ideology was named positivism, and placed an unreasonable expectation on scientists. Popper rejected this idea in 1934. He stated that the order was reversed: a subjective theory was developed, and subsequently tweaked and adapted until it makes logical sense and fits the observations. If the consequences of a theory go against facts or observations, the theory has to be adapted or rejected. Other scientists can perform their own experiments to test this theory, and as long as it cannot be disproven, it will remain valid. An implication of this is the requirement of theories to be falsifiable.

A resulting theory is not necessarily true. It might contain hidden faults and errors which have not yet been found, and therefore remains conjectural. So what separates a scientific theory from other theories? Its conscious and inherent fallibility, an invitation to criticism, and a willingness to change itself in the light of new observations. Nonetheless the fundamental origin of a theory still remains. It has started as a subjective notion, not as an objective reading of nature. This subjective origin permeates into the manner in which scientific knowledge is communicated. Scientific articles are the result of a subjective process, no matter how hard one tries to write them objectively. This places a heavy burden on scientific communication, as ultimately theories are only as good as the observations that have been published and interpreted. Often a theory might go against facts in another scientific domain. However, as long as these domains are not aware of each other, the faulty theory remains unchallenged [49]. Swanson’s algorithm explicitly attempts to bridge domains, and thus should be able to illustrate these contrary ideas and observations. It should therefore be a method to increase cross domain knowledge and discussion, allowing researchers to incorporate additional knowledge from other domains in their theories.

Swanson formulated two conditions for the literatures used when applying his algorithm. In the context of literature based discovery, a literature is defined as a set of articles in which either the A, B, or C-term is found. As a first condition, these literatures should be complementary. Only by combining the knowledge held within the separate literatures can new knowledge be inferred. One should not be able to infer the new knowledge from either separate literature. The second condition is the isolation between the A and C-term literatures. Two literatures can be considered to be isolated, or non-interactive, if they are not co-mentioned in another article, and if they do not cross-reference each other. Both these conditions are central to literature based discovery [50]. Due to the increasing degree of specialization in biomedical research, ranging from patient studies to bio-molecular studies, inter-domain research currently is a scientific blind spot. According to Swanson, if researchers do not co-mention or cross reference their research it is safe to assume that they are not aware of the state of knowledge in the other domain [50].
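As a small illustration of how the second condition could be checked computationally, the sketch below tests isolation (non-interactivity) on invented article sets and citation links; it is not part of Swanson’s tooling or of this project.

```python
# Hedged sketch of the isolation condition on toy data: two literatures are
# "non-interactive" if no article mentions both terms and neither literature
# cites an article from the other. All identifiers below are made up.

articles_with = {          # term -> articles mentioning it
    "A": {10, 11},
    "C": {20, 21},
}
references = {             # article -> articles it cites
    10: {30}, 11: set(), 20: {40}, 21: set(),
}

def non_interactive(a, c):
    lit_a, lit_c = articles_with[a], articles_with[c]
    co_mentioned = bool(lit_a & lit_c)
    cross_cited = (any(references[x] & lit_c for x in lit_a)
                   or any(references[x] & lit_a for x in lit_c))
    return not co_mentioned and not cross_cited

print(non_interactive("A", "C"))   # True: the two literatures are isolated
```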

Awareness of other domains is an active process. Swanson describes how Popper posed the analogy that the human mind is more like a searchlight than a bucket [49]. Perceptions, connections, and conjectures are formed based on the interests and experiences of the person. Simply adding new knowledge to the mind does not automatically lead to new insights. Insights have to be actively pursued. This analogy is fitting for the whole of scientific literature as well. Simply because a new piece of knowledge is published does not mean that every logical implication of the addition of this piece of knowledge is automatically known as well. Each will have to be investigated and described on its own in order to be made explicit and added to our collective knowledge. To be able to perform these investigations will either require a huge number of cross-domain experts, or automated research aids able to process huge amounts of structured information.

Swanson describes multiple categories of implicit relationships which could connect terms. Although these relationships are not extracted from the texts, they can be deduced by a researcher when terms are placed next to each other. Swanson describes the implicit relationships connecting terms through similarity, influence, or “locale”. Similarity can be considered an “is a” or “is like a” connection, while influence connections are far more complex, such as “stimulates” or “causes”. Lastly, the relationship could indicate a locale, specifying for example a component in a body or cell [56]. The type of implicit relationship is relevant for the interpretation of the connection. To be able to infer such implicit relationships between the literatures requires expert knowledge as well as logic [52].

Even with the three classifications of the categories of implicit relationships described above it remains impossible to identify all the “correct” connections. This problem is described by Yetsigen-Yildiz and Pratt [78]. They divided the intermediate terms, or B-terms, into a relevant set and a mixed set. The relevant set consists of B-terms that have been confirmed as biologically valid, and can therefore be considered truly correct. The mixed set consists of three subsets: the terms that another expert would mark as relevant, false negatives which are biologically valid but are not yet known, and true negatives which are not biologically valid. This both accounts for and illustrates the subjectivity of knowledge. The possibility that results are incomplete or subject to change is explicitly taken into account. This will also play a role with evaluation, as term lists were compared to the relevant list, while there might be relevant terms in the mixed sets.

Swanson always remained practical in his implementation of the ABC algorithm, never translating abstract principles to programming designs. Only fairly recently two groups formalized their own thoughts about implementations of the ABC algorithm, which we discuss below.

Several articles from a group in Slovenia describe an approach in which outliers were searched. Outliers are terms and articles which are at the crossroads of two literature domains. They base their philosophy on context-crossing associations, named bisociation by Koestler, who describes a bisociation as a new association between concepts from two different domains [36]. Combining Koestler’s ideas about bisociation with Mednick’s theory of associative creativity, these researchers define creativity as the ability of combining distinctive concepts. Therefore, according to them, “creative” hypotheses can only be generated in cross domain research. Cross domain research is assumed to be described in outlier articles, which they aim to identify. This is an interesting approach, as the conditions for both Koestler’s and Swanson’s approaches are similar. Only whereas Swanson’s approach stems from knowledge, Koestler’s approach stems from creativity. A disadvantage is that the literatures currently have to be pre-selected, and as such the Slovenians’ approach is currently only suitable for hypothesis checking.

Miyanishi et al. took an explicitly different approach [29]. They were interested in using infrequent terms for literature based discovery. However, in order to be able to find new and relevant connections, this has to be done within a domain. The most relevant hypotheses according to them were the ones which were not made explicit, but were easiest to abduct due to their closeness. This stands diametrically against the approach from the Slovenian researchers described above.

2.3 Text Source

To be able to perform literature based discovery, the source material must first be selected. This section has been divided into four subsections, one for the corpus selection, and one for each article field that is commonly used for literature based discovery. Each subsection has been divided further to treat the term processing methods, where relevant.

2.3.1 Corpus selection

The ABC algorithm was first implemented as a manual process [48, 51]. Hence, several efficiency steps were included at its conception.


Initially Swanson performed a pre-selection of the literature to be examined, limiting the number of articles that had to be processed. This pre-selection remained even when Swanson automated his algorithm in Arrowsmith. However, other authors quickly switched their text selection to the full Medline database.

The Medline database consists of the abstracts and other metadata, such as titles, MeSH terms, authors, and journal, of the articles contained in PubMed. There is, however, a subtle difference between Medline and PubMed in the citations they include, but in general they can broadly be considered equal. Using Medline for text mining is the most practical solution for several reasons. Medline is structured, separating the items listed above, allowing for easy processing by an algorithm. Moreover, the abstracts and metadata contained within it are freely available and are not as severely protected by copyright as the full text articles. Some authors imposed restrictions on what types of publications were used from the Medline database, excluding for example lectures, dictionaries and biographies, or including only chemically oriented publications [2, 77].

The glaring disadvantage of incorporating the Medline database is the incompatibility with other literature resources. An advantage of manual corpus selection, which is for example necessary with Arrowsmith, is the ability to incorporate other literature databases such as Embase and Scisearch [56].

The pre-selection of starting literatures has seen a revival in a recent project as part of an implementation of Koestler’s philosophy of creativity by a Slovenian group [23]. The philosophy behind this implementation was further examined in section 2.2, while the resulting implementation itself is discussed in section 2.4.3.

Once the corpus has been selected, one or more text fields have to be chosen. Every article is divided into one or more fields. In modern biomedical articles, these fields are at least a title, an abstract, the main body of the text, and the citations. The National Library of Medicine (NLM) also assigns MeSH terms to every article in Medline as an additional field. It is worth noting that the addition of an abstract as a field is a fairly recent development; abstracts are therefore not always available for pre-1980 articles.

Other (scientific) domains may adhere to different divisions of fields, but the description above is relevant for Medline, the main resource of biomedical literature. Since we are mainly utilizing the UMLS semantic network, with relationships which were text-mined from Medline, we consider these fields as standard.

No author has developed an automated literature based discovery system that processes full texts, even though this was part of Swanson’s original algorithm [48]. After he selected his literature based on article titles and MeSH headings, Swanson himself read all the relevant articles in order to form his conclusion as part of his first, manual implementation of his algorithm. However, as we only take into account automated implementations of Swanson’s algorithm, full text approaches are not further discussed. Applications that process full text articles do exist, but due to constraints their use is limited to specific and selected articles [65]. Wren et al. performed a small comparison of terms retrieved with abstracts and full text [75]. They found that 138 of the 141 terms which could be processed and which were contained in the full text were also represented in the abstract. This suggests that using abstracts does not dramatically decrease recall compared to full text. One constraint is the limited speed of extracting relevant information from full text. Another constraint is licensing. More often than not, the full texts of articles are copyrighted. As described above, both drawbacks do not apply to the article metadata stored in the Medline database.


2.3.2 Titles

Swanson listed the advantages of titles as having biological meaning, being easy to interpret and providing a constrained context for the co-occurrences [55]. They are short, specific and highly information dense. Disadvantages of titles are their brevity, which prevents them from comprehensively covering the article’s content, and their requirement to make syntactic sense.

In Arrowsmith, the first publicly available and automated implementation of his algorithm, Swanson only processed the article titles of the selected articles [55]. This appears to be a regression from the manual process Swanson adhered to before, in which he used both titles and MeSH headings [16]. We cannot know this for sure, as in his article describing his fish oil – Raynaud discovery Swanson never specifically mentioned the fields he used. Only when describing his magnesium – migraine discovery did Swanson mention he used article titles [51]. One has to keep in mind though that when he performed his search, around November 1985, article abstracts were less common [48]. Therefore it is likely that he simply used all fields available to him when performing his search.

Words and phrases

In Arrowsmith, Swanson used words and short phrases (2–3 words) as terms [55]. This approach requires several pre-processing steps. A so-called stop list is used to exclude ubiquitous terms such as “protein” and “patient”. This stop list was made by hand, without a structured methodology. Stop lists are only required when using words as terms, although Swanson notes that with more modern techniques, such as using MeSH terms or conceptualization, this step is no longer necessary but can still be useful [6, 13]. Once the stop terms are removed, the remaining terms can be processed further. These remaining terms have to be normalized, converting them to the same format, for example from plural to singular. Even then, problems remain with synonymy, which heavily influences the resulting term lists [25].
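The sketch below illustrates this kind of word-level pre-processing with an invented stop list, a made-up title, and a deliberately naive normalization step; it is not Swanson’s actual stop list or procedure.

```python
# Hedged sketch: stop-list filtering plus crude normalization of title words.
# Stop terms and the example title are illustrative placeholders only.

STOP_TERMS = {"protein", "patient", "study", "effect", "and", "in", "of", "the"}

def normalize(word):
    word = word.lower().strip(".,;:()")
    if word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]            # naive plural-to-singular step
    return word

def title_terms(title):
    words = [normalize(w) for w in title.split()]
    return [w for w in words if w and w not in STOP_TERMS]

print(title_terms("Dietary fish oils and platelet function in patients"))
# ['dietary', 'fish', 'oil', 'platelet', 'function']
```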

UMLS concepts

Fifteen years after the fish oil – Raynaud discovery, a direct comparison between using words and phrases or concepts was made by Blake and Pratt [6]. Blake used both literal words and words mapped to UMLS concepts from article titles as terms, to compare their performance when recreating Swanson’s results. Even though Blake still used a stop word list, several manual curation steps were no longer required. They noted that conceptualization of title words improved results. When words-as-terms processing was compared to terms conceptualized with the UMLS thesaurus in an experiment recreating Swanson’s magnesium – migraine discovery, precision increased from 8.3% to 9.8%, while recall increased from 22.7% to 30.1%. When results were filtered further by categorizing the concepts, precision more than doubled to 22.3%, but recall decreased to 19.4% [6]. These results convincingly show the benefits of conceptualizing terms instead of using words as terms.

Miyanishi et al. also mined UMLS concepts from article titles, but only used terms which could be mapped with a high degree of certainty [29]. To measure similarity, they adapted Seco’s Information Content, which defines similarity by finding the common ancestor of two MeSH terms and then counting the number of MeSH terms below that ancestor in the hierarchy. Therefore, the less specific or further removed two MeSH terms are, the higher up their common ancestor MeSH term will be, and the higher the number of MeSH terms lower in the hierarchy. They tested three versions of this implementation. All MeSH terms from articles containing the co-occurrence could be pooled and the average could be taken. Or the two most similar MeSH terms could be taken, thereby calculating the highest possible score. Or the TF-IDF scores of the MeSH terms could be calculated to weigh them. They introduced this as a new statistic named Hypothesis Reasonability. These statistics were compared to the co-occurrence frequency statistic.
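A heavily hedged reconstruction of this hierarchy-based similarity idea is sketched below: the fewer descendants the lowest common ancestor of two terms has, the more specific, and thus the more similar, the pair is taken to be. The toy tree, the Seco-style intrinsic information content formula, and all term names are illustrative assumptions, not Miyanishi’s exact setup.

```python
# Hedged sketch: similarity of two terms via the information content of their
# lowest common ancestor in a tiny, made-up hierarchy.
import math

children = {
    "Diseases": ["Nervous System Diseases", "Cardiovascular Diseases"],
    "Nervous System Diseases": ["Headache Disorders"],
    "Headache Disorders": ["Migraine Disorders"],
    "Cardiovascular Diseases": ["Vascular Diseases"],
    "Vascular Diseases": ["Raynaud Disease"],
    "Migraine Disorders": [], "Raynaud Disease": [],
}
parent = {c: p for p, cs in children.items() for c in cs}
N = len(children)                      # number of terms in the toy tree

def descendants(term):
    out = set()
    for c in children[term]:
        out |= {c} | descendants(c)
    return out

def ic(term):                          # Seco-style intrinsic information content
    return 1 - math.log(len(descendants(term)) + 1) / math.log(N)

def ancestors(term):
    chain = [term]
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain

def similarity(t1, t2):
    """Information content of the lowest common ancestor of t1 and t2."""
    a1 = ancestors(t1)
    lca = next(a for a in ancestors(t2) if a in a1)
    return ic(lca)

print(similarity("Migraine Disorders", "Raynaud Disease"))     # low: LCA is the root
print(similarity("Migraine Disorders", "Headache Disorders"))  # higher: nearby terms
```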

They evaluated their approach by recreating both Swanson’s fish oil and migraine discoveries, representing results as a percentage rank of the total number of possible ranks. Because of their idiosyncratic way of presenting their results, no conclusions can be drawn with regard to performance.

2.3.3 MeSH terms

MeSH is a hierarchic, controlled vocabulary meant to categorize and represent the contents of articles with a small number of terms [77]. The National Library of Medicine (NLM) assigns on average 12 MeSH terms to the articles in their Medline database. An article’s MeSH terms are divided into major and minor headings, with the major headings (which are marked with an asterisk) describing the article’s main topics. Disadvantages of MeSH terms are their low number per article, which might sometimes fail to capture all content of an article, their lack of specificity for genes and proteins, and their sole availability in the Medline database, which makes them unusable in combination with other literature databases for literature based discovery.
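For illustration, the sketch below separates major from minor headings in a MEDLINE-formatted record, where an asterisk on the descriptor or on one of its qualifiers marks a major topic; the record itself is fabricated.

```python
# Sketch: extract MeSH headings from a MEDLINE-format record and split them
# into major and minor headings. The citation below is a made-up example.

record = """\
PMID- 0000000
TI  - Dietary fish oils and platelet aggregation.
MH  - *Fish Oils/pharmacology
MH  - Blood Platelets/drug effects
MH  - Platelet Aggregation/*drug effects
MH  - Humans
"""

def mesh_headings(medline_text):
    major, minor = [], []
    for line in medline_text.splitlines():
        if not line.startswith("MH  - "):
            continue
        heading = line[len("MH  - "):]
        descriptor = heading.split("/")[0].lstrip("*")
        (major if "*" in heading else minor).append(descriptor)
    return major, minor

print(mesh_headings(record))
# (['Fish Oils', 'Platelet Aggregation'], ['Blood Platelets', 'Humans'])
```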

Assignment of MeSH terms is done by human experts, and is therefore sensitive to bias. Annotation has been found to differ between experts, especially with more specialized terms [14]. However, these disadvantages are at the same level or lower when compared to the analysis of article titles. What makes MeSH suitable for literature based discovery is its hierarchic, controlled vocabulary, and its reflection of the most important aspects of an article with a limited number of terms. One could state that using MeSH terms is using pre-applied expert knowledge. A major disadvantage of the manual assignment of MeSH terms by experts is the delay between the publication of an article, and the assignment of its MeSH terms. Therefore, there is always a delay of several months between the latest published knowledge available in the Medline database and the articles’ MeSH terms [47]. Selecting MeSH terms as a field therefore means not working with the most recent knowledge set.

The first to base their implementation solely on the MeSH field were Hristovski et al. [20]. They only included co-occurring major headings for their association rules. They reported about 95.5% sensitivity, but only a 4% specificity when testing their implementation, later named BITOLA, with multiple sclerosis [20].
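The sketch below shows the general shape of such co-occurrence association rules, using support and confidence over shared major headings on invented articles; it does not reproduce BITOLA’s actual measures or thresholds.

```python
# Hedged sketch: association rule X -> Y over co-occurring major headings.
# Support counts articles assigned both headings; confidence estimates P(Y|X).
from itertools import combinations
from collections import Counter

articles = [                              # major headings per toy article
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis", "Interferon-beta", "T-Lymphocytes"},
    {"Multiple Sclerosis", "Magnetic Resonance Imaging"},
    {"Interferon-beta", "Hepatitis C"},
]

term_count = Counter(t for art in articles for t in art)
pair_count = Counter(frozenset(p) for art in articles
                     for p in combinations(sorted(art), 2))

def rule(x, y):
    support = pair_count[frozenset((x, y))]
    confidence = support / term_count[x]
    return support, confidence

print(rule("Multiple Sclerosis", "Interferon-beta"))   # (2, 0.666...)
```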

Demaine et al. took a slightly different approach [13]. They boasted to have developed the first fully automated system, no longer requiring any user guidance. This lack of user guidance was thought to increase the reliability of the process, as it was no longer dependent on the skill and expertise of the user. A variation from Hristovski’s approach was their inclusion of only the MeSH terms found in the top 100 records for each term, with the underlying assumption that these would be most relevant. Moreover, they did not process the B-terms, focussing only upon the resulting C-terms. They did however use a stop list of 87 terms to remove the most abundant terms, and included only the “Drugs & Chemicals” and “Disease” semantic types. Semantic types are term categories, which will be elaborated upon in section 2.5.1. Evaluation was performed with a prior-post analysis, as explained in section 2.6, to test their system. It is interesting to note that Demaine did not cite Hristovski’s work, indicating that even within specific domains, researchers might be unaware of all developments.

In 2004 Srinivasan et al. published their article on using MeSH terms for literature based discovery [47]. Srinivasan’s system, later named Manjal, also employed MeSH terms, although she processed them in a different manner than Demaine et al. She created a sort of concept profile of MeSH terms, stratified by semantic type. Weighting of the term vectors was based on the TF-IDF algorithm, and was normalized for the highest weight per semantic type. For the hypothesis generation procedure, weights of the individual C-terms were summed across the individual B-terms. It is interesting to note that in this case the categorization is applied not as a pre- or post-processing step, but during the process. She tested her system by recreating Swanson’s Raynaud, migraine, and other less well known discoveries and was able to retrieve most of the relevant terms within reasonable ranks. She noted she preferred to use the top 20 results per semantic type herself. When only the top 10 were used, some relevant terms were not retrieved, while including more than 20 did not provide any additional benefit. When knowing exactly what kind of semantic types to use, the set of terms could be reduced by as much as 91%. When solely removing the obviously irrelevant semantic types, the number of terms was reduced by an average of 31%. The ability to separate by semantic type also proved to be useful, with almost all terms searched for in the top 10 of their respective semantic types. Again it was interesting to note that she did not cite Demaine et al. A severe drawback of her approach was its recreation of Swanson’s discoveries as an evaluation methodology instead of using a prior-post evaluation methodology like her predecessors, preventing direct comparison between them.
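A hedged sketch of this aggregation step is given below: per B-term, candidate C-terms carry TF-IDF-style weights that are normalized by the highest weight within each semantic type and then summed across B-terms. All terms, semantic types, and weights are invented, and Manjal’s exact weighting is not reproduced.

```python
# Hedged sketch: normalize per-B-term C-term weights by the maximum weight
# within each semantic type, then sum across B-terms to rank candidates.
from collections import defaultdict

# B-term -> {(C-term, semantic type): tf-idf-like weight}
b_profiles = {
    "Calcium Channel Blockers": {("Nifedipine", "Drug"): 4.0,
                                 ("Migraine", "Disease"): 1.0},
    "Vascular Reactivity":      {("Nifedipine", "Drug"): 2.0,
                                 ("Magnesium", "Drug"): 3.0},
}

def rank_c_terms(profiles):
    scores = defaultdict(float)
    for weighted_terms in profiles.values():
        max_per_type = defaultdict(float)
        for (term, stype), w in weighted_terms.items():
            max_per_type[stype] = max(max_per_type[stype], w)
        for (term, stype), w in weighted_terms.items():
            scores[(term, stype)] += w / max_per_type[stype]   # normalize, then sum
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for (term, stype), score in rank_c_terms(b_profiles):
    print(f"{stype:8s} {term:12s} {score:.2f}")
```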

Ultimately even Swanson himself incorporated the use of MeSH headings for knowledge discovery, although with a slightly different approach than the other researchers. A yet unmentioned aspect of MeSH terms is their subheadings, which can range from chemical to entomological properties. The subheadings can provide additional information about an article’s MeSH terms, further specifying them. Following his own suggestion from 1990 for a role for MeSH terms, especially the ones expressing a causal relationship, Swanson identified rigorous exercise as a neglected cause of atrial fibrillation, making clever use of MeSH terms’ subheadings [52, 53]. He subsequently used Arrowsmith to check this hypothesis and identify potential agents [57].

In their system Litlinker, Yetsigen-Yildiz and Pratt also used MeSH terms, as they found conceptualizing free text as part of their approach too computationally expensive [77]. It differed from other implementations by using a statistical approach to identify correlated terms, and by using a knowledge base to remove uninteresting connections. The connections considered uninteresting were either too general, too closely related to the starting term, or nonsensical. Filtering the too general and too closely related terms was done by excluding the two upper levels of the MeSH hierarchy and all siblings of the starting term. The nonsensical terms were processed by assigning them to their semantic types and groups, which the user could then later process himself. Once only the relevant connections between the terms remained, target terms were ranked based on the number of intermediate terms connecting them. They used a prior-post analysis to check their results. Their post-period was only 21 months, however, which was quite short and, as they admit, could have influenced results [78].
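The sketch below combines these two ideas under stated assumptions: overly general terms are filtered out by their depth in the MeSH tree (tree numbers gain one dot-separated segment per level), and the remaining target terms are ranked by the number of distinct intermediate terms supporting them. The tree numbers and term links are illustrative only, not real MeSH assignments or Litlinker output.

```python
# Hedged sketch: depth-based filtering of general MeSH terms plus ranking of
# target terms by their number of supporting intermediates.

tree_numbers = {                              # invented tree numbers
    "Nervous System Diseases": "C10",          # top level: too general
    "Headache Disorders":      "C10.228",      # second level: too general
    "Migraine Disorders":      "C10.228.140",  # specific enough
    "Magnesium Deficiency":    "C18.452.394",
}

# candidate target term -> set of intermediate B-terms linking it to the start term
support = {
    "Migraine Disorders":   {"Serotonin", "Vascular Reactivity", "Platelet Aggregation"},
    "Magnesium Deficiency": {"Serotonin"},
    "Headache Disorders":   {"Serotonin", "Vascular Reactivity"},
}

def depth(term):
    return tree_numbers[term].count(".") + 1

ranked = sorted(
    ((term, len(bs)) for term, bs in support.items() if depth(term) > 2),
    key=lambda kv: kv[1], reverse=True)

print(ranked)
# [('Migraine Disorders', 3), ('Magnesium Deficiency', 1)]
```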

Baker and Hemminger also applied MeSH terms in their implementation, Chemotext, but limited terms to specific categories: They only used the “Chemical” and “Disease Effect” types [2]. Only the article’s subject chemicals were processed, which were identified by a ranking of subheadings. If the ranking proved inconclusive, all chemicals were considered subject chemicals. By only including articles with a chemical subject, identified as described above, they were able to reduce their number of included articles from the Medline database by two thirds.

2.3.4 Abstracts

Abstracts give a general summary of an article. They are considered to be the most [...] been shown to contain an order of magnitude more information than MeSH terms or titles [24]. Because abstracts are longer and contain more diverse information, not every two terms found in abstracts can be considered to be a relevant co-occurrence [76]. A parameter has to be set to limit co-occurrences. This parameter can be a window size, such as Weeber described, or a restriction of co-occurrences to the sentence level [70].
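As an illustration of the sentence-level restriction, the sketch below only counts a term pair as a co-occurrence when both terms appear in the same sentence of an abstract; the sentence splitter, the term list, and the abstract text are naive placeholders.

```python
# Sketch: sentence-level co-occurrence extraction from an abstract.
import re
from itertools import combinations

terms = {"fish oil", "blood viscosity", "raynaud"}

abstract = ("Fish oil reduces blood viscosity. "
            "Raynaud phenomenon involves vasoconstriction of the digits.")

def sentence_cooccurrences(text, terms):
    pairs = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
        present = {t for t in terms if t in sentence}
        pairs |= {frozenset(p) for p in combinations(sorted(present), 2)}
    return pairs

print(sentence_cooccurrences(abstract, terms))
# {frozenset({'fish oil', 'blood viscosity'})} -- 'raynaud' never shares a sentence
```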

Terms and Phrases

In 1996 Gordon and Lindsay were the first to recreate Swanson's fish oil – Raynaud discovery in an automated implementation [16]. They used every field of interest, including article abstracts of their pre-selected corpus [16, 25]. As terms they processed single words and two to three word phrases. Co-occurrence was limited to the sentence level by incorporating boundaries such as periods and other punctuation marks. As they did not conceptualize terms, a large amount of manual pruning was required when the terms were examined: assigning synonyms, eliminating candidates, and setting arbitrary cutoffs.

Another implementation using terms and phrases from abstracts is Transminer, created by Narayanasamy et al. [31]. This implementation will be further discussed in section 2.4.2.

UMLS concepts

Weeber et al. were the first to mine concepts from titles and abstracts, mapping them to the UMLS thesaurus with the MetaMap concept miner [67]. Concept recognition is not an infallible process: not every word is mapped correctly to a concept, or even recognized, something explicitly taken into account by Miyanishi et al. [29]. Co-occurrence was limited to the sentence level. Weeber listed several advantages of conceptualization. A stop list is no longer necessary, and all concepts are biomedically relevant. Homonyms and synonyms are automatically detected, which shortens the term lists considerably. An additional advantage of using the UMLS are its semantic types. As each concept belongs to one or more semantic types, the concepts are pre-categorized. Semantic types will be elaborated upon in section 2.5.1. Ultimately they were successful in simulating both of Swanson's discoveries, even though the magnesium – migraine results had to be adjusted because the MetaMap concept miner could not differentiate between mg as an abbreviation for milligram and for magnesium. Both fish oil and magnesium could be found at reasonable positions in the candidate lists, which were ranked solely on term frequency. Later, they applied their system for a new discovery, suggesting thalidomide as a treatment for chronic hepatitis C, among others [71].

An extension of concepts, concept profiles as described in section A.2, were also applied for knowledge discovery. One implementation, named ANNI 2.0, was developed at the Erasmus MC. ANNI 2.0 differs from other literature based discovery systems by not performing the text mining itself. It utilizes another system, Peregrine, to recognize concepts in text, in particular in abstracts and MeSH terms in the Medline database. One important distinction from other implementations is that Peregrine also attempts to disambiguate terms and phrases. Once the concept profiles have been created, they can be used for reasoning. This is done by matching concept profiles and scoring the indirect association with an Uncertainty Coefficient. The system offers the user a number of tools, such as clustering of terms, a heatmap, and the ability to manipulate the categories of the terms searched for.

Other

Wren et al. developed their own term database for their implementation, which was later named Iridescent, as they found they could not apply MeSH terms to abstracts [76]. They also found the level of specification in MeSH too low for some categories, for example genes. This database was based on the OMIM, MeSH, LocusLink, and HGNC datasets and contained the synonyms and spelling variants of its terms.

Wren differentiated between co-occurrence at the sentence and at the abstract level and investigated the difference between them. Relationships were scored based on these co-occurrence types and on their frequencies.

The chance that a relationship was biologically valid was determined by its veracity score, using a method named fuzzy reasoning. For the fuzzy reasoning, the prior probability that a relationship identified by co-occurrence is biologically relevant was determined. A veracity score, which is a probability, is calculated by the formula P(related) = 1 − r^n, where n is the number of co-occurrences and r is the error rate. The error rate differs for sentences (17%) and abstracts (42%). They found that using only co-occurrences in sentences would miss 43% of the biologically relevant relationships. The strength of the relationship between two terms is calculated as S = Σ C · P(related), where C is the number of co-mentions, summed over both abstracts and sentences to represent the connection. For implicit relationships the lowest of the two scores was selected.
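Read this way, the scoring can be sketched as below; the co-mention counts are invented, and only the two error rates come from Wren et al.

```python
# Error rates reported by Wren et al. for co-occurrences in sentences and abstracts.
ERROR_RATES = {"sentence": 0.17, "abstract": 0.42}

def veracity(n_cooccurrences, error_rate):
    """P(related) = 1 - r^n: the probability that a co-occurrence based
    relationship is biologically valid, given error rate r and n co-occurrences."""
    return 1 - error_rate ** n_cooccurrences

def strength(comention_counts):
    """S = sum(C * P(related)), summed over the sentence and abstract contexts."""
    return sum(count * veracity(count, ERROR_RATES[context])
               for context, count in comention_counts.items())

# A hypothetical relationship co-mentioned 3 times in sentences and 5 times in abstracts.
print(strength({"sentence": 3, "abstract": 5}))
```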

When they tested their approach by applying it to cardiac hypertrophy, the connection they found was confirmed in a wet lab.

Another interesting, somewhat different implementation is the Alibaba network [37]. Alibaba uses a dictionary to recognize terms and relationships and differs from other implementations by being a graph shell over PubMed, showing co-occurrences of terms in abstracts. It also has a limited capacity to include relationships between terms in its graph. As it is a shell over PubMed, it can only incorporate information found on PubMed and a small number of other selected sources [34]. The fact that it operates as a shell over a website also constrains its performance. However, it is unclear whether Alibaba is designed to be a literature based discovery system, as its knowledge discovery capabilities have never been evaluated. It will therefore not be further compared to the other implementations.

2.4 Analysis Method

Broadly speaking, there are two text analysis methods relevant for Swanson’s algorithm. The method described up until now is based on simple co-occurrence of terms, which has a binary output: either terms co-occur in an article, or they do not. Subsequently the term frequency can be counted, but the basic premise remains the same.

A class of more complex implementations creates matrices of document properties, for example all the terms in the literature, and calculates the distance between the terms. These will be called matrix techniques. In this example the distance between the terms in the text is considered to be inversely proportional to the strength of the connection between the terms, providing an inherent quantification of the co-occurrence. This approach is thought to resemble reality more closely. However, it is also computationally expensive and rather complex, and therefore less popular than the simpler co-occurrence techniques. Besides the matrix techniques, a network technique, which is a special implementation of the co-occurrence technique, and the outlier approach, which is unlike all other implementations, will also be discussed. As co-occurrence is the most common approach, and the approach used in our database as well, it is regarded as the standard in this chapter. Nonetheless, the so-called matrix and other techniques offer interesting capabilities and deserve to be discussed.


2.4.1 Matrix techniques

The first implementation of a matrix technique for literature based knowledge discovery was tested by Gordon and Dumais, who employed Latent Semantic Indexing (LSI) [15]. LSI can be viewed as a context oriented technique. The basic premise works as follows. A term–document matrix is created, listing all unique terms contained in the documents and their respective frequencies within them. The matrix is then separated into three matrices in a process named Singular Value Decomposition: a term–term matrix, a document–document matrix, and a factor matrix.

The document–document and term–term matrices can now be used to score the distance between two documents or terms, respectively. The factor matrix is only used to recompose the term–document matrix if so required.

Using these matrices, associations between terms, for example term similarity, can be found based on context. If two different terms are often used in common context, they can be assumed to be similar or even synonymous, which will be represented by their score in the matrix. Similarly, if two documents are highly similar based on their common terms, even though key terms identifying them as such are missing, this similarity will be represented in the matrix as well. Therefore LSI can be considered to weigh similarity of both documents and terms in an approach broadly similar to clustering. A more in depth explanation is beyond the scope of this thesis.
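A minimal sketch of the LSI decomposition on a toy term–document matrix, using a generic SVD routine; the terms, counts, and number of latent factors are arbitrary and serve only to illustrate how term similarities emerge from the reduced space.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
terms = ["fish oil", "blood viscosity", "raynaud", "platelet"]
counts = np.array([[2, 0, 1, 0],
                   [1, 2, 0, 1],
                   [0, 1, 2, 1],
                   [1, 1, 0, 2]], dtype=float)

# Singular Value Decomposition, truncated to k latent factors.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
term_space = U[:, :k] * s[:k]       # terms represented in the latent factor space
doc_space = Vt[:k, :].T * s[:k]     # documents represented in the same space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Term-term similarity inferred from shared contexts rather than direct co-occurrence.
print(cosine(term_space[terms.index("fish oil")], term_space[terms.index("raynaud")]))
```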

Gordon and Dumais applied LSI to recreate Swanson's discoveries. Ultimately they wanted to improve on the results of their previous approach, described in section 2.3.4, in which they used simple co-occurrence [16]. They therefore compared the results of the LSI approach to the results of Swanson's experiment. When comparing the ranks of the terms found, they found a strong correlation between the results of the two methodologies, with many common terms at roughly equal ranks.

Next, they tried something they dubbed "identifying discovery documents". Using blood viscosity, originally an intermediate term in the fish oil – Raynaud discovery, as a starting term, they attempted to connect Raynaud's disease with fish oil. The results were disappointing, however, with fish oil found at almost the 600th position removed from Raynaud's disease.

A drawback of Latent Semantic Indexing is its high computational cost. The matrix also has to be re-created every time a new document is added to the set.

An alternative called Reflective Random Indexing, developed by Cohen et al., is a computationally more efficient way to perform latent semantic analysis [11]. Cohen et al. take an interesting approach, in that they include a random aspect in their algorithm. Once again a detailed explanation is beyond the scope of this thesis, but evaluation was done by recreating Swanson's magnesium – migraine and fish oil – Raynaud discoveries. As a consequence of the random aspect of their implementation there can be variability in the outcome. Therefore results were represented as means with their standard deviations. This aspect diminishes the usability for scientific applications, as results are more difficult to reproduce.

An example of a more straightforward matrix technique used for literature based knowledge discovery is described by Bruza et al. [8]. As a dataset they created an n × n matrix of a corpus with n unique article title terms, ignoring sentences and punctuation. The maximum window size could be set according to user preference, and in this case was set at 50 words. It should be noted that setting a window size is a science in its own right [70]. Every pair of terms within the window is considered to co-occur, with a strength inversely proportional to their distance.
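A sketch of such a distance-weighted window, assuming a simple 1/distance weighting; the exact weighting function used by Bruza et al. may differ.

```python
from collections import defaultdict

def window_cooccurrence(tokens, window=50):
    """Build a term-term matrix in which every pair of terms within the window
    contributes a strength inversely proportional to their distance."""
    matrix = defaultdict(float)
    for i, term_a in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            term_b = tokens[j]
            if term_a != term_b:
                pair = tuple(sorted((term_a, term_b)))
                matrix[pair] += 1.0 / (j - i)
    return matrix

tokens = "raynaud disease linked to increased blood viscosity and platelet aggregation".split()
print(window_cooccurrence(tokens, window=5))
```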

For every concept, an n dimensional vector can be created, with regard to the various contexts in which the concept is mentioned. They mention a concept such as "apple" as an example, which has various properties such as shape, taste, and colour. These properties become relevant depending on the context. In some articles the apple's taste will be the main topic, while in others it will be the apple's colour. This technique is similar to the concept profiles. However, instead of concept-centric, this technique is word-centric. It does not try to disambiguate a word, but instead emphasizes different properties of a word, adjusting its meaning based on the context.

The benefit of a distance matrix is its ability to account for all aspects of a concept, and therefore its ability to infer context. This ability to infer properties of terms based on context could add another dimension to Swanson's ABC algorithm: depending on the context, relevant properties of terms would become more prominent, offering additional information. However, to do this properly, background knowledge is necessary, something that simple co-occurrence techniques, in which such properties remain implicit, are unable to incorporate.

In their context-centric approach, Bruza et al. note that categorization of terms is something of a double edged sword. They describe it as a strategy to compensate for limited information and time, but also as very fallible. Categorization does not take poetic license or associative liberty into account, aspects that are necessary when the information is incomplete. Humans, contrary to computers, are able to perform something called information inference [8]. They feel their matrix technique is much better equipped to incorporate the context in which terms are used, and will therefore produce more relevant results.

However, their results left room for improvement. When their approach was tested on the fish oil – Raynaud dataset, they were not able to improve on the co-occurrence techniques, with two of the three relevant terms ranked near the 500th position and the best one (liver) at rank 50; one further term was not found at all.

2.4.2 Co-occurrence network

Narayanasamy et al. created a variation of the co-occurrence technique by using a network based approach [31]. Applying the ABC algorithm to the complex field of cancer genetics, they were not satisfied with a simple one dimensional line between A and C, but instead wanted to represent the intricacy of cancer genetics, a disease in which clusters of genes work together, in a network. Genetics is a complex field, where relationships between terms often indicate either a functional or an evolutionary relationship. To account for this complexity they developed their own tool, in which a list of terms can be specified by the user. Once specified, the program searches the Medline database for co-occurrences. Documents were processed by applying a vector space model and TF-IDF to weight terms.

Their system was tested by recreating Swanson's magnesium – migraine experiments. Because of the architecture of their implementation, they applied a different work-flow. As mentioned above, their system requires a set of terms instead of a single one. As a set, they used the terms stress, magnesium, migraine, platelet, depression, and calcium. One could therefore argue that their work-flow requires even more expert knowledge to create the set of terms used as input. They succeeded in finding relevant terms, recovering 8 out of the 11 B-terms identified by Swanson, and with magnesium as one of their five indirectly connected terms. However, due to their divergent architecture and work-flow, a direct comparison with the results of other researchers cannot be made.
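The network construction itself can be sketched as follows, assuming the networkx library and a hypothetical set of per-document term occurrences; the actual Transminer weighting (vector space model with TF-IDF) is not reproduced here.

```python
import networkx as nx
from itertools import combinations

# Hypothetical user-specified term set and per-document term occurrences.
seed_terms = {"stress", "magnesium", "migraine", "platelet", "depression", "calcium"}
documents = [
    {"stress", "magnesium", "calcium"},
    {"magnesium", "migraine"},
    {"platelet", "migraine", "depression"},
]

# Build an undirected network; edge weights count the co-occurring documents.
graph = nx.Graph()
for doc_terms in documents:
    for a, b in combinations(sorted(doc_terms & seed_terms), 2):
        if graph.has_edge(a, b):
            graph[a][b]["weight"] += 1
        else:
            graph.add_edge(a, b, weight=1)

# Indirectly connected A and C terms appear as paths of length two,
# with the intermediate nodes acting as B-terms.
print(list(nx.all_simple_paths(graph, "stress", "migraine", cutoff=2)))
```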


2.4.3 Outlier detection

As mentioned in section 2.2, a group from Slovenia described a method in which outlier articles are sought [36]. In most domains outliers are considered measurement errors; in this approach, however, they are considered highly interesting, because they are thought to suggest undiscovered knowledge. Initially they searched for outlier articles, but in a follow-up they also attempted to identify specific intermediate terms with their system named CrossBee [23].

In their hypothesis checking process, two literatures have to be supplied. All articles in these separate literatures are subsequently processed, their contents normalized, tokenized, and thrown into a "bag of words". These bags of words are then analysed. Based on their vocabularies, articles are scored on their similarity with TF-IDF. As can be expected, most of the articles from the two literatures cluster together. But there are always outlier articles, whose vocabulary more closely resembles the vocabulary of the other literature. These outlier articles, found where the two literatures overlap, are the articles the researchers are interested in [36]. This approach is highly interesting, but is still under development [42]. They also developed an approach to distil intermediate terms from these data [23]. To distil these B-terms from the outliers, they tested frequency based, TF-IDF based, similarity based, and outlier based statistics. These statistics were tested both separately and combined into an ensemble statistic. The ensemble statistic ultimately proved to be superior for identifying relevant B-terms [23].
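The basic idea of scoring outlier-like articles can be sketched with a TF-IDF similarity to literature centroids, assuming scikit-learn; the real CrossBee pipeline uses more elaborate preprocessing and an ensemble of heuristics, and the toy documents below may well produce no outliers at all.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two tiny, hypothetical literatures; real input would be full sets of abstracts.
literature_a = ["migraine attacks and cortical spreading depression",
                "serotonin levels during migraine attacks"]
literature_b = ["magnesium deficiency affects calcium channels",
                "dietary magnesium intake and vascular tone"]

docs = literature_a + literature_b
labels = np.array([0] * len(literature_a) + [1] * len(literature_b))

# TF-IDF vectors and the centroid of each literature.
vectors = TfidfVectorizer().fit_transform(docs).toarray()
centroids = np.vstack([vectors[labels == c].mean(axis=0) for c in (0, 1)])

# An article is outlier-like when its vocabulary is more similar to the
# centroid of the other literature than to the centroid of its own.
similarities = cosine_similarity(vectors, centroids)
outliers = [docs[i] for i in range(len(docs))
            if similarities[i, 1 - labels[i]] > similarities[i, labels[i]]]
print(outliers)
```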

Their system does require more extensive testing, however, as they only recreated Swanson's magnesium – migraine discovery and developed a new theory about autism. Another helpful statistic which could have been included in their approach is the literature cohesion score, or coh. The coh statistic will be explained further in section 2.5.2. Nonetheless, the Slovenian group offers an exciting new paradigm on literature based discovery.
