Computational candidate gene prioritization by genomic data fusion


Stein Aerts 1,2,#,*, Bert Coessens 2,#, Diether Lambrechts 3,#, Peter Van Loo 4,2,#, Robert Vlietinck 5,6, Peter Carmeliet 3, Bart De Moor 2, Peter Marynen 4, Bassem Hassan 1, Yves Moreau 2

1 Laboratory of Neurogenetics, Flanders Interuniversity Institute for Biotechnology, University of Leuven, Belgium;
2 Bioinformatics group, Department of Electrical Engineering (ESAT-SCD), University of Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium;
3 The Center for Transgene Technology and Gene Therapy, Flanders Interuniversity Institute for Biotechnology, University of Leuven, Belgium;
4 Human Genome Laboratory, Department of Human Genetics, Flanders Interuniversity Institute for Biotechnology, University of Leuven, Belgium;
5 Department of Human Genetics, Flanders Interuniversity Institute for Biotechnology, University of Leuven, Belgium;
6 Department of Population Genetics, Genomics and Bioinformatics, University of Maastricht, the Netherlands.

# These authors contributed equally to this work.

* Corresponding author: Stein Aerts, Laboratory of Neurogenetics, Herestraat 49, bus 602, 3000 Leuven; Tel: +32 16 34 71 50; Fax: +32 16 34 62 18; Email: stein.aerts@esat.kuleuven.ac.be


Abstract

Researchers are increasingly dependent on data mining methods to analyze the growing quantity of high-throughput data and the information reported in the biomedical literature.

We present here a novel approach for the prioritization of any list of candidate ‘test’ genes, on the basis of multiple information sources and given a set of related ‘training’ genes representing a biological case. This general approach allows us to identify disease genes in human genetics, but also to rank and identify members of biological pathways, or genes with a particular function, in molecular biology. The prioritization is done by comparing summarized information from the test genes with that of the training genes. The information sources that are currently consulted are textual descriptions from MEDLINE abstracts and LocusLink records, Gene Ontology annotations, InterPro protein domains, KEGG pathways, protein interactions from BIND, EST-based expression data, microarray expression data from multiple experiments, sequence similarities, transcription factor binding sites, and cis-regulatory modules.

We describe the summarization methods for the prioritization from individual information sources and the order statistics used to integrate these prioritizations into a global ranking.

The prioritization of disease genes, on the one hand, is first validated by a large-scale cross-validation on training sets for 29 diseases from the Online Mendelian Inheritance in Man (OMIM) database. Moreover, several chromosomal regions linked to monogenic and polygenic diseases are prioritized, and we demonstrate that the method yields biologically meaningful rankings resulting in a high enrichment of real disease genes. The prioritization of pathway members, on the other hand, is validated by a cross-validation on three pathway training sets, and by a case study on the discovery of transcriptional target genes, accompanied by experimental verification.

In conclusion, we propose a fast and highly flexible way to prioritize large lists of genes, which, in contrast to pre-existing methods, is based on multiple and heterogeneous information sources. Our method provides the first statistically coherent framework for high-throughput gene prioritization and is freely available.


Supplementary data

The supplementary data file contains the following information:

- A list of the OMIM entries that were used in the OMIM cross-validation
- Figure S1: boxplots showing the variation in submodel performances
- Figure S2: Spearman correlations between the different submodel rankings
- Figure S3: screenshot of the ENDEAVOUR software tool
- Note S1: description of how the diseases and disease genes for the case studies have been selected
- Tables S1, S2, S3: for each of the three heart diseases (congenital heart defects (CHD), arrhythmias (AR), and cardiomyopathies (CM)), the tables with the training genes (a), short summarizations of each trained submodel (b), the top-scoring genes of chromosome 3 after prioritization (c), and specificity measurements (d)
- Table S4: training set of genes that are up-regulated during HL-60 cell differentiation, used for the discovery of a new cis-regulatory module, and also used as training set for the prioritization of putative module target genes
- Tables S5-S6: top 100 scoring genes for the new cis-regulatory module, before (S5) and after (S6) prioritization
- Table S7: primers used for the qPCR validation of putative module target genes


Introduction

High-throughput post-genomic technologies often fail to deliver immediate biological clues that lead to an improved understanding of biological systems. Instead, researchers are often confronted with large lists of candidate genes or proteins that require further filtering and validation in the lab. Some examples of research approaches where this can be the case are (1) microarray and proteomics experiments that compare two conditions generating lists of differentially expressed genes; (2) microarray and proteomics experiments with multiple conditions (time-series analyses for instance) generating clusters of co-expressed genes; (3) ChIP-chip experiments and cis-regulatory sequence analyses that involve genome-wide searches of putative target genes 1-3 .

Likewise in genetics, linkage mapping and association studies often result in relatively large chromosomal regions containing tens to hundreds of genes as putative candidate disease genes 4 . The positional candidate gene approach, which has been quite successful in pinning down the gene that is responsible for the linkage, selects a few of the most promising candidates by using information on the function, homology, and expression pattern of all the genes in the mapped region. The validation is then often carried out by expression studies in model organisms or by fine mapping and association studies. The manual prioritization process of choosing which genes to validate lies in the hands of the expert and depends mostly on the combination of several lines of biological evidence and publications. Acknowledging that the multitude of biological information cannot be captured in the mind of a single biologist, there has been some recent effort in assisting the process of gene prioritization using computational methods. Given that multiple, sometimes contradictory sources of information need to be reconciled and that their information must be balanced, we borrow the term “data fusion” from engineering to characterize this process.

Three categories of methods can be recognized. The first category can be described as ab initio methods that predict the disease (or process) susceptibility of a gene, or that prioritize genes based on location (does it lie within a region of linkage), sequence, sequence phylogeny, annotation, or expression of the gene, or combinations thereof.

Hauser et al. combined a statistical measure of differential gene expression (measured with SAGE) between diseased and control individuals with location (linkage) data to select the best candidate genes for a particular disease, as illustrated for Parkinson’s disease, and called this procedure genomic convergence 5 . Franke et al. combined linkage and association maps with microarray expression data in a tool called TEAM 6 . Turner et al. prioritized candidate disease genes based on the statistical over-representation of Gene Ontology and InterPro annotations 7 .

A second category of methods can be defined as pure classification methods, based on logistic regression, discriminant analysis, artificial neural networks, support vector machines, etc. To our knowledge these have not yet been applied to classify genes to particular diseases, most probably because of the relatively large training sets that are required, and because of the difficulty of defining negative training samples (i.e., genes guaranteed not to be involved in the process). However, classification methods have been applied to predict disease probability per se; that is, high-scoring genes are likely to be a disease gene, but there is no information as to which disease they are linked.

These methods generally use all known disease genes as training set. There are two very similar reports on general disease probability prediction, both using decision trees as classification method: Lopez-Bigas and Ouzounis based their measure of disease probability purely on a gene’s sequence and its evolutionary trace 8 , and Adie et al. similarly used gene features (e.g., gene length) and phylogenetic features in their software tool Prospectr 9 .

A third category can be defined as prioritization methods that measure the similarity of a gene with a particular disease, or the similarity with genes causing the disease, and prioritize a gene list based on this similarity. Similarity-based prioritization has several advantages: it relies on the existing knowledge of a disease, small training sets can be used, and negative samples are not necessarily needed. The first systematic study using this type of gene prioritization was done by Perez-Iratxeta et al 10 . In a text-mining approach they used MEDLINE abstracts, MeSH vocabularies, and Gene Ontology annotations to prioritize genes based on the similarity of these descriptions to those of a disease.

Freudenberg and Propping measured the similarity of a gene with disease genes of related diseases using GO annotation data 11 . A more basic example of similarity matching is implemented by van Driel et al., who integrated location data, expression data, and phenotypic information in the GeneSeeker web application to filter genes based on the similarity with, or rather the co-occurrence of, user-defined query terms 12 .

The method we propose in this work is a member of the third category. It is based on the assumption that a new candidate disease or pathway gene has properties similar to a set of other genes, for which the most obvious choice is the set of all currently known disease susceptibility genes or pathway members, respectively. However, if none or only a few genes are known to be involved in a certain disease or pathway, genes involved in phenotypically related diseases or pathways could also be used, as is done by Freudenberg et al. 11 . The properties that we currently use are: Gene Ontology (GO) annotation, textual descriptions from MEDLINE abstracts, microarray gene expression data, EST-based anatomical expression data, KEGG pathway membership, protein domain annotation of InterPro, cis-regulatory elements, and BIND protein interaction data.

The similarity between the properties of two genes is measured using either correlation measures (for vector-based data) or meta-analysis (for attribute-based data). First, all candidate genes are ranked according to the similarities of each property separately. Second, the resulting rankings are combined into an overall rank and p-value using order statistics. Given that information from numerous genomic sources is fused to prioritize the candidate genes, we call this genomic data fusion.


Results

Computational gene prioritization

In this section we present our approach to the prioritization of candidate genes with respect to a set of training genes, using multiple heterogeneous information sources. Figure 1 gives an overview of the different steps in the analysis.

In the first step, a training set TRAIN is compiled and all information from the following sources is gathered and prepared: textual descriptions from MEDLINE abstracts and LocusLink records, Gene Ontology annotations, Interpro protein domains, KEGG pathways, protein interactions from BIND, EST-based expression data, microarray expression data from multiple experiments, sequence similarities, transcription factor binding sites and cis-regulatory modules. In the case of Gene Ontology annotations, KEGG pathway membership, EST-based expression data, and InterPro protein domains, we determine which attributes are statistically over-represented. For the textual information and the microarray gene expression data, we take the average profile of all individual gene profiles. The BIND data is stored separately for each gene in TRAIN. The transcription factor binding site information of all training genes is compiled into one large vector. Also, the best combination of clustered transcription factor binding sites within human-mouse conserved non-coding sequences in the upstream sequences of the genes is recorded.

For the sequence similarity, we create a local BLAST database consisting of all coding sequences of the genes in TRAIN. These data-retrieval and summarization procedures are described in more detail in the “Data and Methods” section. We call all gathered data from one information source a submodel for this source and all submodels together form a model for TRAIN.
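The over-representation tests used in this first step are detailed in the Data and Methods section; a common choice for attribute data of this kind (GO terms, KEGG pathways, InterPro domains) is the hypergeometric tail probability, and the sketch below illustrates that assumption with toy counts, not the actual implementation:

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): probability of drawing at least k annotated genes
    when sampling n genes from a genome of N genes, K of which carry
    the attribute (e.g., a GO term or an InterPro domain)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy example: genome of 10 genes, 4 annotated with the attribute;
# a training set of 5 genes contains 3 of them.
p = hypergeom_sf(3, N=10, K=4, n=5)   # 66/252, about 0.26
```

An attribute is kept as representative for the training set when this tail probability falls below a chosen significance threshold.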

In the second step, any set of candidate genes “TEST”, containing potential disease genes or potential pathway members, can be prioritized according to their similarity with TRAIN.

All genes in TEST are scored separately against each submodel by retrieving the necessary information from the proper submodel database and by calculating the similarity between the test gene and the TRAIN set. The similarity scoring functions are described in the Data and Methods section. Briefly, vector-based data are scored by the Pearson correlation between the test vector and the training average, while attribute-based data are scored by a Fisher’s omnibus meta-analysis, which combines the over-representation p-values of those attributes within the training set that also overlap with the test attributes. All scores together result in a matrix of scores, one for each gene and each submodel. Each list of scores (i.e., each column in the matrix, corresponding to one submodel) is then ranked independently from the other lists.
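Fisher’s omnibus procedure combines k individual p-values into a single statistic, chi2 = -2 * sum(ln p_i), with 2k degrees of freedom. A minimal dependency-free sketch (our illustration, not the production code), exploiting the closed-form chi-square tail for even degrees of freedom:

```python
from math import exp, log

def fisher_omnibus(pvalues):
    """Combine p-values with Fisher's method: chi2 = -2 * sum(ln p_i),
    df = 2k. For even df the chi-square survival function has the
    closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!."""
    k = len(pvalues)
    x = -2.0 * sum(log(p) for p in pvalues)
    term, total = 1.0, 1.0
    for i in range(1, k):          # accumulate (x/2)^i / i! terms
        term *= (x / 2) / i
        total += term
    return exp(-x / 2) * total

# A single p-value passes through unchanged; two moderate p-values
# combine into a stronger one.
fisher_omnibus([0.05])          # 0.05
fisher_omnibus([0.05, 0.05])    # ~0.017
```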

In the third step, we combine all the separate submodel rankings into one global ranking. This is done by applying the order-statistics formula (see Data and Methods) for each gene separately. The formula takes as input the N rank ratios (i.e., the rank divided by the number of genes that have data available for this submodel) and outputs a Q statistic. This Q statistic represents the probability of finding this gene ranked at the observed positions by chance. The Q statistic is then transformed into a global p-value using either beta or gamma distributions (depending on the number of submodels, see Data and Methods), for which the parameters were estimated by random sampling. Finally, all TEST genes are ranked according to this global p-value, which results in the final prioritization.
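The Q statistic itself is given in Data and Methods; a standard recursive evaluation of the joint cumulative distribution of N uniform order statistics (the form used in comparable rank-fusion work, assumed here purely for illustration) can be sketched in a few lines, with made-up rank ratios:

```python
from math import factorial

def q_statistic(rank_ratios):
    """Probability of observing rank ratios at least this good by
    chance for N independent uniform ranks: Q = N! * V_N with
    V_k = sum_{i=1..k} (-1)^(i-1) * V_{k-i} * r_{N-k+1}^i / i!,
    where r_1 <= ... <= r_N are the sorted rank ratios."""
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0] + [0.0] * n
    for k in range(1, n + 1):
        v[k] = sum((-1) ** (i - 1) * v[k - i] * r[n - k] ** i / factorial(i)
                   for i in range(1, k + 1))
    return factorial(n) * v[n]

# A gene ranked in the top 10% by two submodels and mid-list by a
# third gets a much smaller Q value than uniformly mid-list ranks.
q_statistic([0.1, 0.1, 0.5])   # small
q_statistic([0.5, 0.5, 0.5])   # larger
```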

Large-scale cross validation

To test our approach for the prioritization of candidate disease genes on the one hand, and for the prioritization of candidate members of biological pathways on the other hand, we set up a large-scale validation. For the disease validation, a list of 29 Online Mendelian Inheritance in Man (OMIM) diseases was compiled for which at least nine contributing genes were known. Automated HUGO-to-Ensembl mapping, however, reduced the number of genes for a few diseases. The smallest gene set was the one for amyotrophic lateral sclerosis (ALS), with only 4 Ensembl genes, and the largest one was the leukemia gene set, with 113 genes. In total we had 627 disease-associated genes with an Ensembl identifier, and a disease gene set contained on average 19 genes. For the pathway validation, three lists of genes were compiled from Gene Ontology annotations: one with Wnt pathway members (GO:0016055: Wnt receptor signaling pathway), one with Notch pathway members (GO:0007219: Notch signaling pathway), and one with EGF pathway members (GO:0007173: epidermal growth factor receptor signaling pathway). The latter two GO categories contained only a limited number of associated human genes, so we decided to use the Drosophila gene associations and then selected the human orthologous genes of the fly pathway members, using Ensembl’s orthologue mappings.

Figure 2 describes the different steps in our cross-validation procedure. For each gene set, a leave-one-out cross-validation is performed: at each run, one gene is left out and the remaining genes are used as training set. Each of the available data types (see Data and Methods) is used to train a submodel. For the OMIM study we did not include the text model; indeed, the performance of that model is artificially high (93%, see below) because the disease associations are explicitly present in the abstracts of the text model. For the pathway study we excluded the GO model, the KEGG model, and the text model for similar reasons.

After training, the left-out gene and 99 random genes are used as test set (also 9 and 49 random genes were tested and gave satisfactory results; data not shown). For each size we constructed 100 random sets, out of which we randomly selected one set for each left-out gene tested. The rankings for each separate submodel, as well as the combined ranking for all submodels based on the order statistics, are recorded.
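The cross-validation loop itself can be sketched as follows; `score` is a generic stand-in for a trained prioritization model, and all names are illustrative rather than the actual implementation:

```python
import random

def leave_one_out_ranks(gene_set, background, score, test_size=100, seed=0):
    """For each gene in gene_set: train on the remaining genes, add
    (test_size - 1) random background genes as decoys, prioritize the
    combined test set, and record the 1-based rank of the left-out
    gene in the resulting list."""
    rng = random.Random(seed)
    ranks = []
    for left_out in gene_set:
        train = [g for g in gene_set if g != left_out]
        decoys = rng.sample([g for g in background if g not in gene_set],
                            test_size - 1)
        test = decoys + [left_out]
        # higher score = more similar to the training set
        ordered = sorted(test, key=lambda g: score(g, train), reverse=True)
        ranks.append(ordered.index(left_out) + 1)
    return ranks
```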

Performance of the combined model

We assessed the performance of the method, in terms of the specificity and sensitivity with which a known disease gene (or known pathway member) can be found back in the top of the prioritized list, by recording the final ranking of each left-out gene in a rank ROC curve (see Data and Methods). The area under the curve (AUC) measures the global performance of the model. In our case the AUC would equal 100% if every left-out gene were ranked first. Figure 3 shows the rank ROC curves for the rankings of all leave-one-out cross-validations for the OMIM and GO-pathway studies. In the same figure, the rank ROC curve of the same leave-one-out cross-validation using random training sets is also plotted. Our validation experiment results in a biologically meaningful prioritization that is significantly better than random prioritizations. Overall, the left-out gene ranks among the top 50% of the test genes in 85% of the cases in the OMIM study and in 95% of the cases in the GO study. In about 50% of the cases (60% for the pathways), the left-out gene is found among the top 10% of the test genes.
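The rank ROC curve and its AUC can be reproduced from the recorded ranks alone; the following sketch is our paraphrase of that construction (sensitivity as a step function of the rank-ratio cutoff), not the original code:

```python
def rank_roc_auc(ranks, test_size):
    """Rank ROC: for each rank-ratio cutoff t, the sensitivity is the
    fraction of left-out genes ranked within the top fraction t of
    their test set; the AUC is the area under this step curve
    (1.0 only if every left-out gene ranks first)."""
    ratios = sorted(r / test_size for r in ranks)
    n = len(ratios)
    auc = 0.0
    for i, t in enumerate(ratios):
        # from cutoff t up to the next observed ratio, sensitivity is (i+1)/n
        nxt = ratios[i + 1] if i + 1 < n else 1.0
        auc += (i + 1) / n * (nxt - t)
    return auc

rank_roc_auc([1, 1, 1], test_size=100)   # 0.99: every gene ranked first
rank_roc_auc([50, 50], test_size=100)    # 0.5: mid-list, random-like
```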


Performance of individual submodels

Figure 4.A shows the AUC values of all submodels individually, using test sets of 100 genes (99 random genes plus one left-out gene). Every single model performs better on real data than on randomized data. The best performing model for OMIM is the text model (93%), because of explicit co-occurrences of a gene and a disease in the same abstract; this percentage is therefore artificially high (even higher than the overall rank performance). In a real disease prioritization case, the text model can, in some cases, still capture the knowledge that is building up towards the discovery of the disease association (see further). The text performance is clearly lower in the pathway study, indicating that the pathway itself is mentioned far less explicitly when the function of a pathway member is described. Expression-based data (both EST and microarray) generally perform well, both for diseases and for pathways. This, however, was only measured with general microarray data (normal human tissues), and it is expected that the microarray performance increases when pathway-specific (e.g., developmental time course) expression data are used. Protein domains (InterPro) and sequence similarities (BLAST) are reasonably useful for diseases, but even more so for pathways; the latter is caused by the high number of paralogous pathway genes. Notably, the motif models (both single motifs and cis-regulatory modules) also perform better for the pathways. Members of the same pathway are indeed expected to be more tightly co-regulated than genes that are linked to the same disease.

The poor performance of the KEGG model is mainly attributable to the high number of missing values. In fact, if missing values are not taken into account in the performance calculations, the performance rises from 20.43% to 89.53%, meaning that when data is present, the prioritization is good. The poor performance of the BIND model (small difference with the random sets) could be caused by high levels of noise in protein interaction data (e.g., from yeast-2-hybrid experiments). Models like KEGG and BIND are typically expected to improve as better annotation and better high-throughput interaction data become available in the future.

When combining some of the submodels with low absolute performances (KEGG, BIND, Motif, CRM, InterPro), the AUC of the cross-validation is 77.1% (data not shown). In other words, even when submodels with poor individual performances are used, the performance of the combined ranking can still be significantly better than random. It is therefore useful to include all models in the prioritization process. There can be several reasons why a submodel has a low AUC in this large-scale study, but still contributes significantly to the combined ranking. One reason is that the AUCs are averages across 29 OMIM diseases, some of which may be modeled well by certain submodels but less well by others (see Supplementary Figure 1). For example, a cross-validation on Alzheimer’s disease alone yields an AUC of 76.3% for the Atlas microarray data, which is much higher than the average AUC of all diseases. Furthermore, when we compare the performance of the submodels between the test sets and random training sets (see Figure 4), the AUCs are always much higher for the former.

Finally, we calculated the pairwise dependencies between submodels. We found a correlation between each submodel and the overall rank, but almost no pairwise correlations between the different submodels. Only between related submodels (microarray and EST expression data; InterPro and BLAST; CRM and motif) is there a small correlation (see Supplementary Figure 2). This lack of correlation between submodels is also important for the application of the order-statistics formula, which requires independent rank ratios, at least if the p-values are used to apply a significance threshold on the prioritized list.

Bias towards known genes

When scientists select candidate disease genes manually, there is a distinct bias towards well-characterized genes 6 . We expected that our approach should at least partly alleviate this bias, and should allow for unknown or less known genes to be ranked highly, because (1) genes are prioritized based on multiple information sources instead of one or a few; and (2) not only functional information (for instance, GO, text, pathways) is used, but also data sources that are equally valid for unknown genes, namely microarray gene expression data, EST-based gene expression data, protein domain predictions from InterPro, protein interactions, sequence similarities, and cis-regulatory data.

To establish the magnitude of the bias towards well-characterized genes, we first investigated the influence of the number of information models for which a certain gene has data available on the possibility for this gene to get a good rank. Figure 5.A shows, for each number of available information models, the percentage of genes that is ranked between 0 and 10, 10 and 20, and so on, in a list of 50 random test genes that are prioritized according to our 29 disease models. We observe only a slight trend of higher rankings for genes with more available information. A second indicator of how well a gene is characterized can be the presence of a HUGO gene symbol. Therefore we plotted the same information in Figure 5.B for genes with and without a HUGO gene symbol. Again only a slight bias towards known genes can be seen. It is apparent that even genes with very little information (from three submodels onwards) can be ranked in the top 10 in a test set of 50 genes. In particular, this can be an advantage for the prioritization of disease susceptibility genes, as unknown genes are not easily considered as valuable disease candidates when they are prioritized manually.

Case studies

Additional validation of the text submodel

The artificially high performance observed in the OMIM cross-validation above is caused by the usage of explicit literature information in the text submodel, and led to the exclusion of the text submodel from that cross-validation. We suspected, however, that the text submodel can discover implicit linkages between a gene and a disease as well, making it a valuable submodel for prioritization even when based on literature information pre-dating the report that the gene indeed causes the disease. To prove this point, we selected 8 Mendelian diseases for which a disease-causing gene had only recently been identified, and we rolled back our literature corpus to the year before the mutations for each gene were published (Table 1; an overview of how these genes and their training sets were selected is given in Supplementary Note S1). After prioritization by the text submodel alone, a number of genes (CACNA1C, CRELD1, GBA and CAV3) still received a very high ranking, respectively on positions three, one, two, and one, confirming our hypothesis that literature information helps in the prioritization process that builds up towards the ultimate gene identification. Note that in these cases, the addition of all other submodels did not substantially influence this high ranking, except perhaps for GBA, which was ranked at position 30 after addition of the other submodels (Table 1). In those cases where the textual model could not detect any evidence for similarity with the training set (i.e., in the case of DCTN1, VG5Q, ABCC9, and DNM2), the addition of the other submodels led to a much higher ranking of the disease gene (72±18 with text alone versus 11±6 with all submodels). Again, the addition of the (poorly performing) text submodel did not substantially obscure the (good) ranking obtained by the other models (10±6). This further illustrates that the prioritization based on the integration of multiple, heterogeneous data types is robust.

Disease gene hunting on cardiac abnormalities

To provide the reader with further insight into how these prioritizations were conducted, we describe here the prioritization of three related disease genes, each of which affects cardiac function: (i) an atrioventricular septal defect, which is a congenital heart defect (CHD) arising from insufficient atrioventricular septation during heart development 13 ; (ii) a novel disorder characterized by multiorgan dysfunction and lethal arrhythmias (AR) 14 ; and (iii) a familial form of hypertrophic cardiomyopathy (CM) 15 . The causative genes for these abnormalities on chromosome 3 have only recently been identified, namely CRELD1, CACNA1C, and CAV3.

We first trained the model with the different training sets (Tables S1a, S2a and S3a), which produced a list of representative attributes for each trained submodel (Tables S1b, S2b and S3b). Some examples of such key attributes are the following. T-box and Zn-finger transcription factor domains are over-represented InterPro domains in the CHD gene set. ‘Transcription’, ‘heart development’, and ‘DNA binding’ are over-represented textual terms in the CHD set. The EST expression model shows that each of the sets has a significantly over-represented expression in the heart. ‘Regulation of heart rate’, ‘muscle contraction’, ‘circulation’, and ‘cation channel activity’ are over-represented GO terms in the AR set. Myogenin binding sites, which are required for differentiation of cardiac muscle 16 , are part of the cis-regulatory module found in the CM genes. Certain sequence motifs, such as the binding sites for nuclear-factor-of-activated-T-cells (NFATs) and MEF2 transcription factors, involved in cardiomyocyte differentiation and cardiac hypertrophy 17 , were present within the CM training sets. Finally, Atlas microarray data for the CM set identify heart, smooth muscle, skeletal muscle, and cardiac myocytes as common expression locations. Overall, the summarized information for the three training sets was in close agreement with the current understanding of the genetic pathways involved in these diseases, indicating that the data summarizations are able to extract biologically relevant information for a set of disease-causing genes.

A set of 200 test genes was then selected for each disease by taking the most recently identified gene together with its 199 flanking genes on chromosome 3. These test sets were then prioritized based on their similarity with the three training sets, which did not include the actual disease-causing genes. Surprisingly, CRELD1, CACNA1C, and CAV3 ranked extremely well in the CHD, AR, and CM-based prioritizations respectively, namely third, fourth, and second best. The p-values associated with these positions were always significant (i.e., p=0.02, p=0.006, and p=0.01). When excluding, in the text submodel, all citations published after the identification of these three genes (keeping in the case of CRELD1 only publications before 2003, and in the case of CAV3 and CACNA1C only publications before 2004), the ranking positions for CRELD1, CACNA1C, and CAV3 were not changed (Table 1). This again illustrates that the text submodel is able to extract useful literature information even before the actual disease-gene linkage is published (i.e., reflecting a real-life situation). Without the text submodel, the rankings decreased slightly: CRELD1, CACNA1C, and CAV3 were respectively on positions six, eight, and four, indicating that the text submodel alone is not sufficient for, but rather complementary to, the overall performance of the prioritization model.

By applying a similar prioritization approach to the five other Mendelian-inherited disease genes (Table 1), an average ranking of 10±4 (AUC value of 95%; Table 1) was obtained. This means that with 200 genes included in the test set, the average rank of a disease gene was within the top 5% of all prioritized genes. Our prioritization method thus turns out to be highly efficient for the prediction of Mendelian disease genes. As a negative control, we randomly attributed each training set to a test set from another disease. These prioritizations yielded an overall ranking of 100 (AUC=50%), indicating that the prioritizations generated by our methodology are also specific for each training set, and therefore specific for each disease. Encouraged by the promising outcome of these findings, we will continue to use the methodology described here to identify novel modifier genes in other genomic regions linked to other diseases.

Prioritization of large chromosomal regions

To show that a prioritization of a larger number of test genes is also feasible, we ranked a test set consisting of all 1048 genes from chromosome 3 using the CHD, AR, and CM training sets. In these analyses, the disease-causing genes (i.e., CRELD1, CACNA1C, and CAV3) were included in the training set and they ranked first (in fact a positive control). Many other genes with a highly ranked position were relevant for the disease as well. For example, SHOX2 and ZIC1 cause respectively Turner syndrome and Dandy-Walker malformations, which are syndromic disorders frequently associated with CHDs 18,19 . RARB is a receptor for retinoic acid, and either a deficit or an excess of retinoic acid may result in congenital birth defects 20 . Other highly ranked genes give rise to an identical disease phenotype when mutated in mice; for example, deficiency of FOX1P results in cardiac outflow tract and endocardial cushion defects in mice, whereas Cav1.3-knockout mice suffer from disturbed atrioventricular conduction and abnormal contractile function 21 . This is a particularly striking observation, as all the information used, except perhaps that from the GO submodel, originates from experiments performed in a human experimental setting. In some instances, genes received a high rank because they scored very high in one of the submodels, for example because they belong to the same protein family or contain the same InterPro domain as the training genes. SCN10A, SCN11A and SCN5A belong to the type alpha subunit sodium channel family (ENSF00000000129), whereas EOMES, TBX-1, and TBX-5 are all transcription factors characterized by the presence of a T-box transcription factor domain (IPR001699). Also genes homologous to one of the training genes, such as semaphorin 3B (SEMA3B), which is homologous to the SEMA3E training gene of the CHD set, received a high ranking in the BLAST submodel.

Remarkably, plexin-A1, a receptor for many of the semaphorin family members, also received a high rank in the CHD chromosome 3 list. A recent study even reported that plexin-A1 is essential for cardiac morphogenesis during chick development 22. Many of the high-ranking CHD genes, such as EVI1 and the semaphorins, are also involved in neural crest guidance 23. Defects in the migration of neural crest cells, which orchestrate the complex remodeling of the cardiac outflow tract, are well known to lead to a typical spectrum of cardiovascular and facial abnormalities 24. In conclusion, our methodology was not only able to assign high priority to disease-causing genes that did not belong to the training set; it was also able to identify genes interacting with one of the training genes or involved in the same biological processes. In this regard, the chromosomal prioritizations also seem to confirm the pathway cross-validation. In a few instances, however, it was not clear why certain genes received such a high position (e.g., OSR1, NKTR, RPL32, MYD88). This does not necessarily imply that these genes are false positives. Interestingly, among the high-ranking genes there were also some for which only an Ensembl gene ID, but no gene name or any other reference in the MEDLINE citation database, was available. This indicates that, despite the presence of a slight bias towards known genes, our methodology can be used for the discovery of novel putative disease-causing genes.

Disease gene hunting in complex diseases

We already described the successful prioritization of chromosomal regions mapped to Mendelian forms of cardiac abnormalities. In many cases, however, human disease is not monogenic in nature, but results from complex interactions between heterogeneous modifier genes. We therefore prioritized four recently identified susceptibility genes for four complex disorders (Table 2; Note S1). The average rank for these genes was 43±12.

This is obviously a much weaker overall performance than that obtained for monogenic disease genes, and most probably relates to the more heterogeneous nature of complex disorders. It should be stressed, however, that our prioritizations were performed on relatively large chromosomal regions of 200 genes. TNFSF4 and LRRK2 were both identified by a direct candidate-gene approach 25, 26, whereas the disease region for the Crohn's disease gene OCTN1 was confined to only five genes 27. The PTPN22 region, which was linked to chromosome 1, consisted of only 140 genes 28. When, for instance, the prioritization for PTPN22 was performed on the chromosomal region initially linked with RA (i.e., 1p13.3-1p12), PTPN22 ranked third. Other high-ranking genes within the RA-prioritized list included cytokines, such as CSF, and surface antigens, such as CD2, CD53, and CD58. These genes are relevant for RA and at least some of them have already been implicated in the disease. We therefore believe that our approach will also prove useful for gene hunting in complex diseases.

In vitro case study on pathway gene prioritization

In addition to the disease gene prioritization case studies described above, we performed a final case study to validate the usefulness of the prioritization method, this time to find new genes involved in a biological process. We used in vitro experiments to provide complementary evidence to the in silico cross-validation of the three GO pathways described above. This experiment originates from the observation that many high-throughput genomic experiments generate lists of candidate genes that may contain high numbers of false positives. In such cases, it would be useful to prioritize the list prior to wet-lab experiments, based on the likelihood that the candidates are involved in the process under study, thereby removing false-positive candidates. Based on this idea, we prioritized a list of computationally predicted target genes of a de novo discovered cis-regulatory module (CRM), which is predicted to be a node in the gene regulatory network 29 that controls myeloid differentiation.

The new CRM model was found within a set of 18 genes up-regulated during myeloid differentiation, by applying the ModuleSearcher algorithm with different sets of parameters and selecting the model with the highest specificity in a genome-wide search. This CRM model contains the TRANSFAC 30 position weight matrices M00108-V$NRF2_01 (nuclear respiratory factor 2), M00131-V$HNF3B_01 (hepatocyte nuclear factor-3beta), M00644-V$LBP1_Q6 (leader binding protein-1), M00722-V$COREBINDINGFACTOR_Q6 (core binding factor), and M00925-V$AP1_Q6_01 (activator protein 1), for which the predicted binding sites cluster within 200 bp. Using this CRM model, the ModuleScanner algorithm 3 within TOUCAN 31 was used to predict novel target genes in the genome. The top 100 candidate target genes were prioritized for their similarity with the training set (i.e., the same set of 18 up-regulated genes). The submodels used were GO, EST expression, KEGG, InterPro, Text mining, and three different sources of microarray data 32-34.

We used real-time quantitative PCR to measure the expression of candidate target genes before and after the prioritization (see Figure 6). Before prioritization, 42% of the genes were upregulated, on average by a factor of 3.30 (geometric mean); after prioritization, 58% of the genes were upregulated, on average by a factor of 5.52.

Software availability

We have developed a software tool called ENDEAVOUR that implements the described methodology in a fully automated way. All described submodels can be automatically trained on a user-defined training set. Disease- or process-specific microarray data, from in-house experiments or downloaded from public repositories, can be added as submodels by the user, as can rankings obtained independently by other programs. As an example of the latter, two kinds of disease probabilities can be included in the prioritization by default, namely those obtained by Lopez-Bigas et al. 8 and the Prospectr scores 9.

The tool can be launched from: http://www.esat.kuleuven.ac.be/endeavour.


Discussion

Given the growing number of publicly available post-genomic databases containing information about human genes and proteins, we demonstrate how to integrate all this information and prioritize a set of genes according to their similarity to a group of other genes. Such prioritization is particularly useful for gene hunting in complex human diseases: a set of test genes is ranked by its similarity to training genes known to be involved in a certain disease. Our strategy integrates any number of heterogeneous data types, consisting of either attribute-based or vector-based annotations. For a set of training genes, attribute-based annotations (for instance, Gene Ontology terms) were transformed into statistical over-representation values, relative to their average genomic occurrence, while vector-based annotations (for instance, microarray expression profiles) were averaged within the set itself. Genes under study, such as candidate disease genes, are then ranked according to these transformed annotations, and all individual rankings are combined into an overall ranking using order statistics.

The cross-validation yielded performances of around 85%. Such encouragingly high numbers should stimulate the use of this kind of prioritization method in a laboratory environment. Indeed, if all test genes are eventually validated in the lab, starting from the top-scoring one and continuing until the hit is found, then the prioritization does not confer any risk at all; instead, it increases the chance of finding the hit several fold, thereby significantly reducing the required resources. Compared with prioritizations done by experts, the computational prioritization is also orders of magnitude faster (e.g., one hour versus several weeks), and much more information can be included in the analysis.

Another advantage, probably as important as the previous one, is the statistical nature of the result. Far too often, the identification of (disease-causing) genes remains based on a serendipitous fishing expedition in a large pool of candidate genes, whereby each gene is assessed as a potential disease gene, tested against a wide variety of data sources, and fitted into a multitude of hypothetical mechanisms underlying the disease. Besides being time-consuming, this manual prioritization soon becomes complex and chaotic, frequently resulting in genes being selected on minimal biological evidence. The method presented here offers a rigorous statistical solution for gene prioritization: it ranks genes based on biological similarity to a training set, yet provides a quantitative order. The automatic system can also guide the expert to investigate certain annotations or properties when used in parallel with his or her own prioritization methodology. Moreover, it is also possible for a geneticist to integrate his or her own data sources. The modular design of our software and the extensive usage of SOAP web services 35 make this possible at two levels: (1) at the data level, if a web service can be built to return either data vectors (microarray-like) or p-values (GO-like) for a set of genes; and (2) at the rank level, if a text file with the geneticist's own ranking of the test genes can be provided (which can be integrated directly into the order statistics formula). It would, for example, be interesting to incorporate phylogenetic information into the prioritization. A straightforward way to include this data type is to use the 'disease probabilities' of Lopez-Bigas and Ouzounis 8, or the Prospectr values obtained by Adie et al. (ref). These are calculated from sequence and phylogeny information and form a good summarized measure of a gene's phylogeny.

We have shown that all the information sources used contribute to the overall ranking, and that the correlation between individual submodels is minimal. This near-independence justifies selecting putatively relevant genes based on the p-value of the order statistics, using a significance level. The equal weighting of all submodels did not cause any problems at this stage, but automatic (or manual) weighting of individual submodels could be an interesting direction for future work (as could the automatic exclusion of submodels that do not appear sufficiently informative). This would, however, require rethinking the order statistics technique, because the current formula does not allow the different sources to be weighted differently. The observed bias against unknown or less-characterized genes was kept to acceptable levels, and this bias will decrease further as new and better high-throughput data become available and as genome annotation and curation mature. The results of our case studies show that the prioritization protocol provides a high enrichment of real disease genes for monogenic disorders, and also for complex disorders. In fact, the true disease-causing genes from these case studies always received a rank within the top 4% of the genes tested. In addition, we have shown how the prioritization approach handles very large chromosomal regions and assigns high priority to genes interacting with one of the training genes or involved in the same biological processes.

In conclusion, we have presented for the first time a bioinformatics method for fast and automatic gene prioritization that integrates many data types of different origins. The performance exceeded our expectations, and the strategy seems ready for use in real gene-hunting cases. Our software tool, ENDEAVOUR, is therefore freely available, and we are using it ourselves in further collaborative projects with human geneticists.


Data and Methods

Text-mining: LocusLink and MEDLINE

TXTGate 36 is a text-mining application for analyzing the textual coherence of groups of genes. We use TXTGate's textual profiles of all genes in LocusLink, based on a Gene Ontology (GO) vocabulary (for a detailed description of the indexing we refer to Glenisson et al. 36). A textual profile contains, for each (stemmed) term of the GO vocabulary, a weight describing the relative importance of this term for this gene. The textual data used to calculate these weights consist of all the MEDLINE abstracts linked to this gene via the PubMed identifiers recorded in the LocusLink record. Gene prioritization based on textual data is done by calculating the Pearson correlation between a test gene's textual profile and the average textual profile of the training genes. A high similarity between a test gene and the training genes means that the literature abstracts describing both share many terms, and thus discuss, in a general sense, the same subject, regardless of the detailed message in each abstract. That is, textual profiles with contrasting statements that use the same words will still be similar. For example, if an abstract on gene x states that 'protein X stabilizes tau plaques' and an abstract on gene y states that 'protein Y solubilizes tau plaques', the textual profiles of gene x and gene y could be similar because of the common occurrence of the phrase 'tau plaques'. In the context of this approach (prioritization based on general similarity to a heterogeneous set of disease genes), this general measure of similarity may be an asset rather than a drawback.
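As an illustration, the vector comparison described above can be sketched in a few lines of Python. The function names are hypothetical, and real textual profiles contain one weight per vocabulary term:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length weight vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def text_score(test_profile, training_profiles):
    """Correlate a test gene's textual profile with the average training profile."""
    n = len(training_profiles)
    mean_profile = [sum(col) / n for col in zip(*training_profiles)]
    return pearson(test_profile, mean_profile)
```

A test gene whose profile matches the averaged training profile term for term scores 1.0; unrelated profiles score near zero.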

Functional annotation: GO and KEGG

Where the textual profiles may suffer from noise in the literature data, the manually curated Gene Ontology (GO) provides a remedy. GO is a curated vocabulary used for the functional annotation of genes 37 and is structured as a directed acyclic graph (DAG). Prioritization is done by comparing the GO annotation of a test gene with the statistically over-represented GO terms in the training set (in fact, the set of actually annotated terms is extended to incorporate all parents of the annotated terms up to the root of the tree). For example, if the proteins of most genes in a training set are anchored to the cytoskeleton at the cell membrane, then GO terms like 'cytoskeleton' (GO:0005856), 'cytoskeletal anchoring' (GO:0007016), and so on, could be over-represented. If one of the test genes is annotated with any of these terms, it will get a high ranking according to the GO data. The KEGG database 38 is an even more structured source of functional annotation: it contains the members of known biological pathways. As for GO, we calculate whether certain pathways are over-represented in the training set and give a good score (i.e., a low rank) to those test genes that are involved in one of the pathways that is important for the training set.
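The over-representation test used for such attributes can be sketched with a simple one-sided binomial model, as described later in the Scoring functions section; `overrepresentation_p` is an illustrative name:

```python
from math import comb

def overrepresentation_p(k, n, genome_freq):
    """One-sided binomial p-value: probability that k or more of n training
    genes carry an annotation whose genome-wide frequency is genome_freq."""
    return sum(comb(n, i) * genome_freq ** i * (1 - genome_freq) ** (n - i)
               for i in range(k, n + 1))
```

For instance, seeing a term of 5% genome-wide frequency in five of ten training genes is far less likely by chance than seeing it once, so it yields a much smaller p-value.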

Protein information: InterPro and BIND

InterPro is a database of protein families, domains and functional sites 39. For each training gene the InterPro attributes are retrieved from the Ensembl Mart database (currently the ensembl_mart_22_1 database hosted at ensembldb.ensembl.org). An example of an InterPro attribute is IPR000418 (name: Ets domain), which nineteen human proteins are known to carry. Scoring of test genes using InterPro protein domains is done by meta-analysis (see below). If a certain protein domain is over-represented in the training set as compared to the full genome, and a test gene also carries this particular domain, then that gene will get a good ranking according to the InterPro data.

Another interesting data type that we use to score test genes is protein interaction data, for which we take data from the Biomolecular INteraction Database (BIND) 40 . BIND contains interaction data from high-throughput experiments (for example, yeast two-hybrid assays) and from hand-curated information gathered from the scientific literature. The idea behind using protein interaction data for gene prioritization is that one can expect a test gene to be more related to the training set if its protein directly interacts with one of the proteins of the training genes, or if it has a common interaction partner with one of them. In practice, all the proteins of the training genes and all their interaction partners are collected and the overlap between this set and the set containing a protein (encoded by a test gene) and its interaction partners is used to calculate the similarity score.
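The neighbourhood-overlap idea can be sketched as follows. The normalization by the size of the test gene's neighbourhood is an assumption for illustration, since the text only specifies that the overlap between the two sets is used to compute the similarity score:

```python
def interaction_score(test_gene, training_genes, partners):
    """Overlap between the test gene's neighbourhood (itself plus its
    interaction partners) and the pooled neighbourhood of the training genes.
    `partners` maps each gene to the set of its interaction partners."""
    train_set = set(training_genes)
    for g in training_genes:
        train_set |= partners.get(g, set())
    test_set = {test_gene} | partners.get(test_gene, set())
    return len(test_set & train_set) / len(test_set)
```

A test gene that shares an interaction partner with a training gene receives a non-zero score even if it does not interact with the training gene directly.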

Gene expression: microarray data and ESTs

Microarray data have been used previously for gene prioritization 5, 6. We use the Atlas gene expression data of 47 normal human tissues 32. However, disease- or process-specific microarray data are obviously more informative for particular training and test sets. For example, if a geneticist has performed his or her own microarray experiment measuring gene expression in healthy versus diseased individuals (or if such data are available in public repositories, such as ArrayExpress or GEO), then a prioritization based on these data is more likely to give good performance. Our modular implementation allows easy inclusion of any microarray data set into the prioritization methodology (see also the Software availability section). Next to microarray-based gene expression data, we reasoned that the large repositories of EST-based anatomical expression in the human body also contain valuable information for gene prioritization. We use the EST-based expression data available via the Ensembl Mart database. As for GO, model training consists of calculating a p-value for each anatomical site that measures its statistical over-representation within the training set. Scoring a test gene with this EST-based model is done by meta-analysis (see below).

Cis-regulatory elements

In the prioritization process we currently use cis-regulatory information in two ways. First, we compare all (offline) predicted instances of a library of transcription factor binding models (position weight matrices, or PWMs) in all human-mouse conserved non-coding sequences (CNSs) upstream of a test gene (10 kilobases) with the averaged instances of the training set. More information on this data set can be found in 3, 31. The predicted binding sites of all available transcription factors are recorded in a vector (for instance, of length 400 if there are 400 PWMs), where each element represents the best score of that PWM across all human-mouse conserved sequence blocks upstream of the gene. Comparison with the training vector is done by calculating the Pearson correlation.

Second, the best combination of five transcription factors within maximally 200 bp in the set of human-mouse CNSs of the training set is searched using the genetic algorithm version of our ModuleSearcher algorithm 3, 31, run for 100 generations. Scoring of a test gene is done by our ModuleScanner algorithm 3, 31, which essentially sums up, over all CNSs of the test gene, the best scores of the five PWMs of the trained model.


Sequence similarity: BLAST

Some diseases can be caused by different members of the same protein family, for example Presenilin 1 and Presenilin 2 in Alzheimer's disease. The e-value of a BLAST between the (longest) coding sequences of these two genes is 10^-133; they are thus highly similar. One can imagine a researcher performing a BLAST search with a number of test genes and using his or her expert knowledge to judge whether the hits make sense. In the same spirit, we use a BLAST search to score test genes against a set of training genes. Judging whether a hit is relevant is done automatically by restricting the BLAST search to an ad hoc BLAST database consisting of all coding sequences of the training set. Test genes with a significant (low) BLAST e-value are similar to one of the training genes and will get a good (low) rank.

Scoring functions

For information sources summarized as vectors (microarray gene expression data, textual information, and transcription factor binding sites), we use the Pearson correlation.

For attribute-based information sources (GO, KEGG, EST, and InterPro), we use the following meta-analysis to calculate a similarity score for a test gene compared to a set of training genes. For each gene in the training set, all relevant attributes are collected (see above). Next, for each attribute a p-value is calculated using a binomial statistic that represents the statistical over-representation of this attribute within the training set. Coherent training sets will contain statistically significant p-values. When a group of test genes is scored using these data types, the p-values corresponding to the annotated attributes of a test gene are combined using Fisher's method (−2 Σ_i log p_i), generating a new p-value from the χ²-distribution. The test genes are then ranked according to this new p-value.
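Because the number of degrees of freedom in Fisher's method is 2k for k p-values (always even), the chi-square survival function has a closed form, so the combination can be sketched without external libraries; `fisher_combine` is an illustrative name:

```python
from math import exp, log, factorial

def fisher_combine(pvalues):
    """Combine k independent p-values with Fisher's method:
    X = -2 * sum(log p_i) follows a chi-square distribution with 2k degrees
    of freedom under the null; for even df the survival function is
    exp(-X/2) * sum_{j<k} (X/2)^j / j!."""
    k = len(pvalues)
    x = -2.0 * sum(log(p) for p in pvalues)
    half = x / 2.0
    return exp(-half) * sum(half ** j / factorial(j) for j in range(k))
```

Combining a single p-value returns it unchanged, and a set of small p-values yields a combined p-value smaller than any obtained from uninformative inputs.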

Order statistics

Given the heterogeneity of the scoring results of the individual information models (correlations, p-values, and counts), a meta(-meta)-analysis is not trivial, since it would require a p-value for each submodel, and p-values cannot be straightforwardly derived from correlation measures. However, scoring with each individual model results in N different rankings r_1, r_2, ..., r_N, one for each of the N data types used.

Instead of directly combining the results of each submodel, we combine the rankings of the submodels using order statistics. The ranks are divided by the total number of ranked genes (excluding genes with no rank because of missing values), and a Q statistic is calculated that represents the probability of obtaining the observed rank ratios by chance. This Q statistic is calculated using the joint cumulative distribution of an N-dimensional order statistic, as was also done by Stuart et al. 27 (see http://www.math.uah.edu/stat/urn/urn4.xml for a description):

Q(r_1, r_2, ..., r_N) = N! ∫_0^{r_1} ∫_{s_1}^{r_2} ... ∫_{s_{N-1}}^{r_N} ds_N ds_{N-1} ... ds_1

They propose a recursive formula to compute the above integral, where r_i is the rank ratio for submodel i, N is the number of submodels used, and r_0 = 0. We noticed, however, that this recursion is highly inefficient for moderate values of N, and even intractable for N > 12, because its complexity is O(N!). We therefore propose a much faster alternative with complexity O(N^2):

Q(r_1, r_2, ..., r_N) = N! V_N,  with V_0 = 1 and V_k = Σ_{i=1}^{k} (−1)^{i+1} (r_{N−k+1}^i / i!) V_{k−i},

where r_i is the rank ratio for submodel i. Since the Q statistics calculated this way are not uniformly distributed, we have

to fit a distribution for every possible number of ranks and use this distribution to estimate a p-value. We found that the Q statistics for N ≤ 5 randomly and uniformly drawn ranks are approximately distributed according to a Beta distribution; for N > 5 the distributions can be modeled by a Gamma distribution. The cumulative distribution function of these distributions provides us with a p-value for every Q statistic from the order statistics. Next to the original N rankings, we thus obtain an (N + 1)th, combined ranking resulting from our genomic data fusion.
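The O(N^2) recursion for the Q statistic can be sketched in Python; `q_statistic` is an illustrative name, and the rank ratios are sorted in ascending order before applying the recursion, as the order statistic requires:

```python
from math import factorial

def q_statistic(rank_ratios):
    """Probability of observing rank ratios at least this good by chance,
    computed as Q = N! * V_N with the O(N^2) recursion
    V_k = sum_{i=1..k} (-1)^(i+1) * r_{N-k+1}^i / i! * V_{k-i},  V_0 = 1."""
    r = sorted(rank_ratios)          # order statistics assume r_1 <= ... <= r_N
    n = len(r)
    v = [1.0]                        # V_0
    for k in range(1, n + 1):
        rk = r[n - k]                # r_{N-k+1} in 1-based notation
        v.append(sum((-1) ** (i + 1) * rk ** i / factorial(i) * v[k - i]
                     for i in range(1, k + 1)))
    return factorial(n) * v[n]
```

As a sanity check, for two rank ratios of 0.5 the statistic equals 0.25, the chance that both of two uniform draws fall below 0.5, and ratios of 1.0 give probability 1.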


Handling missing values

A major issue for gene prioritization is missing values: how can one judge the similarity between a test gene and a training set if data are missing? The order statistics provide a solution, because they are based on rank ratios rather than absolute ranks, and because each probability can be computed with its particular value of N equal to the number of genes with information for a certain submodel. For genes without information for a certain submodel, no comparison can be made and thus no rank can be given; hence they are not taken into account when applying the order statistics. Genes from TEST for which data are available, but which show no similarity at all with the training set, also have to be ranked with caution. Such genes have the highest (i.e., worst) possible score for a particular submodel (for instance, 1.0 for the GO submodel) and are very dissimilar to TRAIN according to this submodel. Imagine a case where all test genes have this same extreme score of 1.0. Since they all have the same score, they would all get the same (best) rank of 1, and the order statistics would put them high in the overall ranking. To avoid this problem, all genes with maximal dissimilarity get the maximum rank (equal to the number of genes in TEST).
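The handling of missing and maximally dissimilar scores can be sketched as follows. The function name and the convention that lower scores mean higher similarity (as for the p-value-based submodels) are illustrative assumptions:

```python
def assign_ranks(scores, worst_score=None):
    """Rank genes within one submodel. Genes with missing scores (None) get
    no rank and are excluded from the order statistics; genes at maximal
    dissimilarity (worst_score) are forced to the worst rank, equal to the
    number of scored test genes."""
    scored = [(gene, s) for gene, s in scores.items() if s is not None]
    n = len(scored)
    ranks = {}
    for pos, (gene, s) in enumerate(sorted(scored, key=lambda gs: gs[1]), start=1):
        ranks[gene] = n if (worst_score is not None and s == worst_score) else pos
    return ranks
```

A gene with no data simply never appears in this submodel's ranking, while a gene scoring 1.0 in, say, the GO submodel is pushed to the bottom rather than tying for first.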

Rank ROC curves

The results of the cross-validation can be visualized in a Rank ROC (receiver operating characteristic) curve, where the y-axis represents the sensitivity (i.e., the proportion of true positives) and the x-axis represents one minus the specificity (the specificity being the proportion of true negatives):

sensitivity = TP / (TP + FN)

specificity = TN / (TN + FP)

We call this a Rank ROC because we do not perform a traditional classification with one model, but multiple prioritizations with different models. The values in the above formulas are calculated from all iterations of all diseases, with test sets consisting of one left-out disease gene plus random genes. Their interpretation is the following: (1) the number of true positives (TP) is the number of times that the left-out gene is ranked above the cutoff; (2) the false positives (FP) are all the random genes that are ranked above the cutoff (these can be thought of as being retained for further evaluation although they are probably not disease associated); (3) the true negatives (TN) are the random genes that are ranked below the cutoff; and (4) the number of false negatives (FN) is the number of times that the left-out gene is ranked below the cutoff (in these cases the real disease-associated gene is not retained for potential further analysis). In a Rank ROC curve, as in Figure 3, the sensitivity and (1 − specificity) are plotted for each possible cutoff value. Such curves can be used both to choose a cutoff value (giving a desirable balance between FP and FN) and to compare different kinds of prioritizations. The area under the curve (AUC) is a measure of performance, which in our case would equal one if every left-out gene (i.e., the wanted disease gene in each test set) were ranked first for all tested diseases, and 0.5 if the prioritization were no better than ranking the genes randomly. Although the Rank ROC is not an ROC per se, it is an appropriate measure of the proportion of genes correctly (or incorrectly) included in (or left out of) a list of follow-up genes, as a function of the length of such a list.
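The AUC of such a curve, swept over all cutoffs, can equivalently be computed as the probability that a left-out disease gene outranks a random gene (with ties counting one half); `rank_roc_auc` is an illustrative name for this rank-comparison sketch:

```python
def rank_roc_auc(positive_ranks, negative_ranks):
    """AUC via pairwise rank comparison: the fraction of (positive, negative)
    pairs in which the positive (left-out disease gene) has the better,
    i.e. lower, rank; ties count one half."""
    wins = 0.0
    for p in positive_ranks:
        for q in negative_ranks:
            wins += 1.0 if p < q else 0.5 if p == q else 0.0
    return wins / (len(positive_ranks) * len(negative_ranks))
```

If every left-out gene ranks first, the AUC is 1.0; if positives and negatives are interleaved at random, it approaches 0.5.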

Cell culture, RNA isolation and RT-PCR

HL-60 cells were grown in RPMI 1640, supplemented with 10% fetal calf serum. Differentiation was induced with 10 nM phorbol 12-myristate 13-acetate (PMA) when cells had grown to a density of 7×10^5 cells/mL. Prior to induction and 24 h post induction, cells were harvested by centrifugation; RNA was isolated using the TRIzol reagent (Invitrogen) and subsequently treated with Turbo DNA-free DNase (Ambion). First-strand cDNA was synthesized using SuperScript II reverse transcriptase (Invitrogen).

Real-time quantitative PCR

Real-time quantitative PCR was performed using the qPCR core kit for SYBR Green (Eurogentec) on an ABI PRISM 7700 SDS (Applied Biosystems). The mRNA levels were normalized to the geometric mean of four housekeeping genes (Vandesompele et al., 2002): ACTB, GAPDH, UBC and HPRT1. All primers used are listed in Supplementary table 4.

Acknowledgements

We wish to thank all groups and consortia that made their data freely available: Ensembl, NCBI (LocusLink and MEDLINE), Gene Ontology, BIND, KEGG, Atlas, InterPro, BioBase (public release of TRANSFAC at www.gene-regulation.com), and the Disease Probabilities from Lopez-Bigas and Ouzounis 8 and Adie et al (ref). We also thank Patrick Glenisson for help with the text-mining component, and Joke Allemeersch and Gert Thijs for their advice on the order statistics. SA is a postdoctoral researcher of the K.U.Leuven. DL and PVL are sponsored by the Fund for Scientific Research Flanders (FWO). This work is partially supported by the Instituut voor de aanmoediging van Innovatie door Wetenschap en Technologie in Vlaanderen (IWT) (STWW-00162, STWW-Genprom, GBOU-SQUAD-20160), the Research Council KULeuven (GOA Mefisto-666, GOA-Ambiorics, IDO genetic networks), FWO (G.0115.01 and G.0413.03), and IUAP V-22 to SA, BC, YM, BD; and by FWO grants (3.0269.97, G.0383.03), the Dutch Diabetes Fund and the CARIM, NUTRIM, GROW and CAPHRI Research Institutes of the University of Maastricht to RV.


Table 1: Prioritizations of eight recently identified monogenic disease genes. The genetic variation in these genes was always a mutation inherited in a Mendelian fashion. For each gene, the name of the disease, the gene name, Ensembl ID and publication date of the gene-to-disease article are given, together with the rank at which the gene was prioritized by the rolled-back Text submodel, by all submodels, and by all submodels without Text. The average rank (mean±SEM) for each prioritization is indicated.

Disease | Gene | Ensembl ID | Publication date | Rolled-back Text | All | All, no Text
--------|------|------------|------------------|------------------|-----|-------------
Arrhythmia | Ca(V)1.2 | ENSG00000151067 | October 2004, 14 | 3 | 4 | 4
Congenital heart disease | CRELD1 | ENSG00000163703 | April 2003, 13 | 1 | 3 | 6
Parkinson's disease | GBA | ENSG00000177628 | November 2004, 41 | 2 | 30 | 81
Cardiomyopathy 1 | CAV3 | ENSG00000182533 | January 2004, 15 | 1 | 2 | 8
Charcot-Marie-Tooth | DNM2 | ENSG00000079805 | March 2005, 42 | 100 | 14 | 12
Amyotrophic lateral sclerosis | DCTN1 | ENSG00000135406 | August 2004, 43 | 97 | 27 | 23
Klippel-Trenaunay disease | VG5Q | ENSG00000164252 | February 2004, 44 | 39 | 3 | 3
Cardiomyopathy 2 | ABCC9 | ENSG00000069431 | April 2004, 45 | 51 | 1 | 1
Average rank (AUC value) | | | | 37±16 (77%) | 11±4 (91%) | 17±9 (87%)


Table 2: Prioritizations of four recently identified complex disease genes. The genetic variation in these genes was always a polymorphism, typically inherited as a risk factor for the respective disease. For each gene, the name of the complex disease, the gene name, Ensembl ID and publication date of the report are given, together with the rank at which the gene was prioritized by all submodels without the Text submodel, and by all submodels including it. The average rank (mean±SEM) for each prioritization is indicated.

Disease | Gene | Ensembl ID | Publication date | All, no Text | All
--------|------|------------|------------------|--------------|----
Atherosclerosis | TNFSF4 | ENSG00000117586 | April 2005, 25 | 111 | 54
Crohn's disease | OCTN | ENSG00000197208 | May 2004, 26 | 87 | 71
Parkinson's disease | LRRK2 | ENSG00000188906 | November 2004, 27 | 42 | 34
Rheumatoid arthritis | PTPN22 | ENSG00000134242 | August 2004, 28 | 22 | 11
Average rank (AUC value) | | | | 66±18 | 43±12
