
literature in gene clustering: representations and methods

Peter Antal, Geert Fannes, Tamas Meszaros (#), Patrick Glenisson, Bart De Moor, Yves Moreau, Tom Boonefaes (+), Pieter Rottiers (+), Johan Grooten (+)

Abstract

Due to the complexity of the underlying processes, the scarcity of the data and the high level of noise in the measurements, understanding large-scale gene interactions from expression data demands methodologies that incorporate domain literature with gene clustering. These methodologies should increasingly comply with the circular nature of the overall clustering process, in which the human expertise iterates between the measured statistical data and the domain literature. In this paper we propose Gene*, a genetic literature data structure representing the relevant literature for each gene based on its GeneCard and Medline abstracts. The Gene* representation is appropriate for information retrieval and for constructing textual profiles of a gene cluster. Based on this, we describe a system providing a textual profile for a gene cluster and supporting the corresponding information retrieval and revision steps according to the cyclic nature of the clustering process. We present results about clustering of genes based solely on their Gene* literature representation and on the dual usage of this literature-based clustering with gene clustering based on expression data.

1 Introduction

The dissemination of microarray technology, providing the possibility of simultaneous measurement of the activity of large numbers of genes, gave new stimuli towards a broader understanding of their functional roles and their interactive networking. The complexity of these fundamental biological questions, though, demands methodologies that allow a flexible organization of human expertise, the measured statistical data and the domain literature. One of the main research directions in the post-genomic age is the clustering of genes to understand the regulatory patterns in their interactions, the pathways they belong to or the genetic network they constitute (for an overview, we refer to [8]). To date, the interpretation and validation of gene expression clusters remains a time-consuming manual task based on browsing the genetic literature and hypothesizing about the relationships within the clusters based on the data. While it is feasible for a small set of genes, this practice does not scale up to a medium or genome-wide set.

However, the availability of large amounts of electronic domain literature has changed the traditional way of data analysis. Nowadays a huge amount of domain knowledge is available in either unstructured or structured electronic format (e.g., natural language papers and structured documents, databases or knowledge bases respectively). Many efforts in the bioinformatics community are currently directed towards the development of machine-friendly representations, standardization of the state-of-the-art biological information resources and the establishment of cross-linked or combined information sources. Despite these efforts, much of the domain-specific

( ) Dept. of Electrical Eng., Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee (Leuven), Belgium
(+) Dept. of Molecular Biology, Ghent University and Flanders Interuniversity Institute for Biotechnology, K. L. Ledeganckstraat 35, B-9000 Ghent, Belgium
(#) Dept. of Measurement and Inf. Systems, Budapest Univ. of Tech.


rapid development of the domain.

The expert's environment can be characterized as split up into the data world, which contains the high-throughput data and statistical analysis methods, and the knowledge world, which contains the domain knowledge dominantly present in free-text form. Within this terminology, the data analysis can be seen as a deep interaction of human expertise with those two worlds. To increase the efficiency of this interaction, post-genomic intelligent systems should overcome the current artificial separation of the two worlds (i.e., separation between tools for data analysis and those for information retrieval) by following integrated approaches. Gradual steps in removing the barriers between the expression data analysis and literature-oriented methods can be the following:

1. Intelligent support. More customized literature-based methods (e.g., information retrieval methods using contextual information from the user).

2. Semi-automatic support. Semi-automatic literature-based methods for interpreting the results of data analysis (e.g., automatic derivation of a textual profile for a gene cluster coming from the data analysis).

3. Integrated approach. Integration of methods based on the analysis of expression data and domain literature (e.g., comparison of clusterings of genes based on the measured data and the corresponding literature).

4. Literature-data approach. Methods using the domain literature in the same way as the expression data, possibly after complex transformation of the textual domain knowledge into a more numerical, statistical format (e.g., clustering of genes based on a dual vector representation for each gene, composed of relevance factors for domain terms and the expression levels in measurements).

This article addresses the first three issues from an integrated point of view on the data and domain knowledge, maintaining flexibility towards both the rapidly evolving field and expert interventions.

The paper is organized as follows: Section 2 summarizes the existing literature-based approaches for supporting the expression data analysis. In Section 3, an iterative, cyclic model of the overall clustering process is shown. Section 4 presents Gene*, the so-called 'genetic literature data' representation of the domain literature, and our semi-automatic system based on textual profiling of gene clusters to support this circular, iterative nature of the clustering process. Section 5 contains results about clustering the derived 'genetic literature data' and its relation to clustering expression data. Finally, Section 6 concludes and describes our ongoing work on this subject.

2 Literature-based microarray data analysis

The dominant strategies to support the expression data analysis by literature rely on the fact that the statistical data analysis and the biological interpretation of the results are separated (i.e., the usage of statistical packages versus database or information retrieval tools). Consequently, most of the efforts are devoted to the compilation of the available genetic literature into cross-linked databases to support the structured and unstructured information retrieval. For the biomedical sciences in general, the US National Library of Medicine's Medline is a common bibliographic source of citations and abstracts ranging from 1966 to present. More recently created and focused on the human genome, the GeneCard system [6] offers a web-based, comprehensive and constantly updated knowledge repository for each human gene. These curated databases form basic tools for the manual interpretation of expression data.

A first step to automate the manual investigation of various domain knowledge repositories is serializing frequent subsequent steps as reported in [4]. It involves the extension of queries with terms coming from the GeneCards repository before submitting them to the Medline search engine. Following the four groupings above, such methods can be classified as intelligent support for literature-based expression data analysis.

Another information retrieval-based approach is the use of domain-specific concepts and term hierarchies (or ontologies). Through their structure, ontologies can represent the semantics of the domain. In [2] for example, the Gene Ontology (GO) Consortium demonstrates the potential of GO for interpreting high-throughput genomic data. Masys et al. [3] designed an interface that combines the Medline query engine with both the MeSH keyword list and the UMLS ontology to infer the function of a group of genes. The MeSH (Medical Subject


concepts described in its controlled vocabularies (more details available at http://www.nlm.nih.gov/databases/). The interface in [3] reports the quantitative significance of each result and provides links to different databases allowing further browsing. Such methods can be interpreted as semi-automatic support for literature-based expression data analysis.

The most ambitious methods, the integrated approaches, try to directly infer functional relationships between genes and their products from textual information. We can distinguish two important approaches. Rindflesch et al. [7] follow the deep-linguistic approach where Natural Language Processing forms the basis of the knowledge discovery process. An alternative way is to adopt the shallow statistical stance which basically discards the linguistic structural information and uses only statistics of individual terms in documents and the corpus. Our focus will be directed towards this approach. Shatkay et al. [9] for example derive a gene-to-gene similarity measure by selecting the N most relevant documents from a corpus of Medline abstracts for each gene. PubGene [5] on the other hand is the first on-line system that generates gene networks based on the full set of Medline abstracts, all indexed in a simplified, but scalable way. Finally, Stapley et al. [10] report on a similar, but prototypical application that is based on a binary gene-by-document matrix. It outputs the similarities between the genes that are related to a user-specified MeSH term.

3 An integrated approach for using data and literature in gene clustering

As the previous survey demonstrates, one of the main current trends is the increasingly close incorporation of the literature in the gene clustering process.

Designing efficient intelligent, semi-automated, or integrated methods requires the identification of a simple, but powerful model of the clustering process. Fig. 1 illustrates the overall cyclic process involving (1) clustering based on the data, (2) interpreting the clusters, (3) evaluating cluster members and potential non-members and finally (4) modifying the experimental setup, the parameterization of the clustering algorithm, and so on to return to the first step.

Figure 1: The interaction loop between the statistical data world and the literature world

The research into clustering strategies for high-throughput data, most notably microarray data, is a top priority in the current post-genomic stage. The algorithms and their evaluation are increasingly taking into account concepts such as statistical cluster quality, a priori confidence levels and predefined causal relations (cfr. [8]). However, attaching biological context to the computed clusters remains a major bottleneck, as does ruling out biological discrepancies, peculiarities or experimental errors. Although current efforts to collect related genetic information (or links to it) into centralized repositories have proved extremely valuable, these repositories (being closed systems) lack the interactive component needed to fine-tune the retrieval or discovery process. To our knowledge, the existing intelligent and semi-automated systems operate only unidirectionally, lacking the possibility to support the circular, iterative nature of the clustering process. As a continuation, the rationale behind our approach to


automatic derivation of a textual profile for a set of genes and clustering of genes based on this literature representation. The 'textual profile' of a gene cluster consists of relevant terms with relevance and stability factors corresponding to this cluster (see later).

The potential of these functionalities in the iterative process can be demonstrated on the following paradigmatic examples:

1. Investigation of a gene cluster of interest to the researcher

2. Investigation of a gene cluster identified by an expression-based clustering algorithm

3. Simultaneous investigation of multiple gene clusters identified by an expression-based clustering algorithm

Firstly, analyzing a cluster of genes for exploration aims at retrieving highly relevant keywords for the set of genes of interest to the researcher. The returned profile identifies either the most important or most stable terms for the gene cluster from a selected vocabulary. Since these can be an excellent starting point for further literature exploration, the gene cluster profiling is directly integrated with an information retrieval system capable of accepting queries based on the resulting gene cluster profile. The subsequent browsing of the returned documents of the information retrieval system drives the user to a better understanding of the clusters.

Secondly, analyzing a cluster of genes directly obtained from expression analysis could aim at providing feedback for the clustering algorithm. While in the previous part genes were not necessarily connected to the data world (i.e., no expression information was considered), here the set of genes is defined by a clustering algorithm. As the number of members per cluster typically lies around the one-hundred range, the scientist will rather skim the reported profiles than thoroughly check the literature for each gene. This global inspection of textual profiles makes it possible to detect outliers: genes assigned to a cluster in the data world, but excluded from it by the literature. Such an outlying gene can suggest a flaw in the parameterization of the cluster algorithm or can represent a new discovery. Depending on the outcome, the user can iterate back to the data world to repeat the analysis or check certain findings and, if necessary, the loop can be entered again.

Thirdly, analyzing several clusters in general aims at performing a 'dual' clustering: to compare a clustering based on the expression data versus domain knowledge. Till now, the domain knowledge-based clustering was implicit and based on human expertise. Our goal at this point is to provide an explicit, literature-based clustering of genes; based on the comparison of the clusterings derived from data and literature, clusters can be adapted and alternative cluster strategies or parameterizations can be considered (see also Fig. 1).
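One simple way to quantify the agreement between a data-based and a literature-based clustering is a pair-counting index. The sketch below computes the Rand index, i.e., the fraction of gene pairs that both clusterings treat consistently. The choice of this index, the function name rand_index and the toy gene labels are assumptions made for illustration; no specific comparison measure is prescribed here.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of gene pairs on which two clusterings agree.

    labels_a, labels_b: dicts mapping gene name -> cluster label.
    Only genes present in both clusterings are compared.
    """
    genes = sorted(set(labels_a) & set(labels_b))
    pairs = list(combinations(genes, 2))
    if not pairs:
        raise ValueError("need at least two shared genes")
    agree = 0
    for g, h in pairs:
        same_a = labels_a[g] == labels_a[h]
        same_b = labels_b[g] == labels_b[h]
        # A pair agrees if it is together in both or apart in both.
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical cluster assignments, for illustration only.
expression_clusters = {"geneA": 0, "geneB": 0, "geneC": 1, "geneD": 1}
literature_clusters = {"geneA": 0, "geneB": 0, "geneC": 0, "geneD": 1}
score = rand_index(expression_clusters, literature_clusters)
```

A value of 1 means the two clusterings are identical up to relabeling. Pairs on which they disagree point at genes worth inspecting in the next pass through the loop.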

4 Automatic textual analysis of gene clusters

In this section we will briefly summarize the main features of our experimental system based on the integrated approach for the usage of expression data and genetic literature.

4.1 A 'genetic literature data' representation of genetic literature: Gene*

Our long-term purpose is to exploit the genetic literature inside various statistical methods, hereby giving it an equal status as experimental data. Therefore we call representations suitable to this end genetic literature data. The most difficult question in deriving genetic literature data is how to collect, for a given gene, the most relevant documents from a large collection of related documents to construct its literature representation. We used a kernel-based methodology, a simplified version of [9], where for each gene its corresponding GeneCard card was chosen as the literature kernel. This gene kernel functions as a query document for selecting the most relevant Medline abstracts.

In each phase, the documents (either Medline abstracts or GeneCards) are represented by the relevance factors of 70,000 terms stemming from controlled vocabularies including the GO [2] and the MeSH ontologies. The relevance factors are computed by the TF-IDF weighting scheme [1] and the cosine similarity measure was used


the N most related Medline abstracts. Based on this representation, similarity matrices were generated which were used as input for two well-known clustering algorithms: hierarchical and k-medoids.
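The kernel-based selection step can be illustrated with a minimal sketch: documents are mapped to TF-IDF vectors over a controlled vocabulary, and the abstracts are ranked by cosine similarity to the gene kernel. The exact TF-IDF variant is an assumption (raw term frequency times log(N/df)), as are the function names and the toy corpus; the sketch is not the system's actual implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Map each document (a list of tokens) to a sparse TF-IDF dict.

    Assumed variant: raw term frequency times log(N / document frequency).
    Tokens outside the controlled vocabulary are ignored.
    """
    vocab = set(vocabulary)
    counts = [Counter(t for t in doc if t in vocab) for doc in docs]
    n_docs = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
            for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_abstracts(kernel_vec, abstract_vecs, n):
    """Indices of the n abstracts most similar to the gene kernel."""
    ranked = sorted(range(len(abstract_vecs)),
                    key=lambda i: cosine(kernel_vec, abstract_vecs[i]),
                    reverse=True)
    return ranked[:n]


# Toy corpus: document 0 plays the role of the gene kernel (GeneCard),
# the remaining documents play the role of Medline abstracts.
vocabulary = ["hypoxia", "apoptosis", "kinase", "oxygen"]
docs = [
    ["hypoxia", "oxygen", "hypoxia"],
    ["hypoxia", "oxygen", "the"],
    ["kinase", "apoptosis"],
    ["apoptosis", "oxygen"],
]
vectors = tfidf_vectors(docs, vocabulary)
kernel, abstracts = vectors[0], vectors[1:]
best = top_n_abstracts(kernel, abstracts, n=2)
```

Sparse dictionaries are used because, with a vocabulary of tens of thousands of terms, most relevance factors of a single document are zero.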

Table 1 summarizes possible alternatives and parameterizations for the representation that are actually present in Gene*. Almost all of the shown parameterizations can be combined with each other.

Representation   Top N   Similarity             Clustering
binary           0       matching coefficient   hierarchical
TF-IDF           5       cosine                 k-medoids
                 20

Table 1: Parameterizations for the generation of Gene*, a genetic literature data representation. Listed are the specific vector representation, the number of incorporated Medline documents, the similarity measure used to select the most relevant Medline documents and to compute the gene-to-gene similarity matrix, and the clustering method applied to the similarity matrix.
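For the binary representation, Table 1 lists a matching coefficient as the similarity measure. A minimal sketch follows, assuming that this refers to the simple matching coefficient, i.e., the fraction of vocabulary terms on which two binary vectors agree; the function name and the toy vectors are illustrative only.

```python
def simple_matching(u, v):
    """Fraction of positions where two equal-length binary vectors agree."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    if not u:
        raise ValueError("vectors must be non-empty")
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)


# Toy binary term-presence vectors over a five-term vocabulary.
gene_a = [1, 0, 1, 0, 0]
gene_b = [1, 1, 1, 0, 0]
similarity = simple_matching(gene_a, gene_b)
```

Note that this coefficient also counts joint absences as agreement, which can dominate when the vocabulary is large and the vectors are sparse.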

Since it supports both the generation of textual profiles for gene clusters and the generation of clusters purely on the literature, the Gene* representation can be used optimally for an integrated usage of data and literature in the gene clustering process.

4.2 An iterative scheme based on textual profiles of gene clusters

The first application of Gene* in the overall gene clustering process is to generate profiles for gene clusters as described in Section 3. A more elaborate iterative scheme is presented in Fig. 2, which follows the circular nature of this process shown previously in Fig. 1.

Figure 2: Gene*-based interaction between statistical data world and literature world

Based on Gene*, the system computes the mean and variance of the relevance factors of the terms corresponding to the representations of the genes in the given cluster. Taking into account either the average or the variability of the relevance factors of the terms, the user can select a subset of keywords to query the Medline collection, the GeneCards or, again, the Gene*. Two advantages of using a non-binary representation here are its ability to generate more detailed term statistics and its flexibility to encapsulate the binary approach. However, the system remains open in the sense that the user can select the underlying genetic literature data representation from the alternatives presented in Table 1.
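The profile computation can be sketched as follows: for every term, the mean and the variance of its relevance factor are taken over the genes of the cluster, and keywords are then ranked by either statistic. Treating low variance as 'stability', using the population variance, and the names cluster_profile and top_terms are assumptions of this sketch; the toy relevance factors are invented for illustration.

```python
from statistics import mean, pvariance


def cluster_profile(gene_vectors, cluster):
    """Per-term mean and variance of relevance factors within a cluster.

    gene_vectors: dict gene -> dict term -> relevance factor.
    cluster: iterable of gene names. Terms missing from a gene's
    vector are treated as having relevance 0.
    """
    genes = list(cluster)
    terms = sorted({t for g in genes for t in gene_vectors[g]})
    profile = {}
    for term in terms:
        values = [gene_vectors[g].get(term, 0.0) for g in genes]
        profile[term] = (mean(values), pvariance(values))
    return profile


def top_terms(profile, k, by="mean"):
    """Select k keywords: highest mean, or lowest variance ('stable')."""
    if by == "mean":
        ranked = sorted(profile, key=lambda t: profile[t][0], reverse=True)
    elif by == "variance":
        ranked = sorted(profile, key=lambda t: profile[t][1])
    else:
        raise ValueError("by must be 'mean' or 'variance'")
    return ranked[:k]


# Hypothetical relevance factors for a two-gene cluster.
gene_vectors = {
    "geneA": {"hypoxia": 0.8, "oxygen": 0.4},
    "geneB": {"hypoxia": 0.6, "kinase": 0.2},
}
profile = cluster_profile(gene_vectors, ["geneA", "geneB"])
keywords = top_terms(profile, 2, by="mean")
```

The selected keywords can then be submitted as a query to the document collection. In practice a variance ranking would probably be restricted to terms whose mean exceeds some threshold, since a rarely used term is trivially stable.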

Preliminary, qualitative results about the usefulness of this system are encouraging. For example, in an experiment, four highly relevant genes for hypoxia (oxygen deprivation) were selected (being HIF1A, VHL, VEGF


a thorough quantitative evaluation, so it is not covered further in this paper. We continue with the evaluation of the applicability of Gene* for generating gene clusters based on the literature.

5 Towards full integration: clustering the 'genetic literature data'

Besides the full GeneCard database, we used a collection of 30,000 Medline abstracts dated from 01/1992 to 11/2000. The abstracts come from ten journals selected according to their impact factor and their relevance as assessed by a biologist in the field of Molecular Oncology. The rationale behind using a restricted subset of journals, but widening the temporal scope, was both to keep our prototype system in the medium-scale range and to eliminate noise expected from unrelated information sources by keeping the literature more domain-specific.

We looked into the performance of clustering methods based on various Gene* parameterizations, as a first attempt to answer questions such as "which Gene* representation provides an optimal similarity matrix for gene clustering?". Though this question needs further investigation, we found the Gene* representation incorporating the 20 most relevant Medline abstracts to have the best performance. Therefore the subsequent reported results are computed based on this representation, denoted as Gene*_20. The similarity matrices derived from Gene* representations were used in a hierarchical and a k-medoids clustering algorithm. In Fig. 3 we depict a typical result after application of hierarchical clustering.

Figure 3: Hierarchical clustering of genes based on Gene*_20.
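To make the k-medoids step concrete, the following sketch clusters items directly from a square dissimilarity matrix (for a similarity matrix s, one option is to use 1 - s). It is a simplified alternating variant with a deterministic start, chosen here for brevity; the function name, the initialization with the first k items and the toy matrix are assumptions of this sketch and not a description of the implementation used for the reported results.

```python
def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a square dissimilarity matrix.

    dist: list of lists, dist[i][j] is the dissimilarity of items i and j.
    Returns (medoids, labels), where labels[i] indexes into medoids.
    Initial medoids are simply the first k items (an arbitrary choice).
    """
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")

    def assign(medoids):
        # Attach every item to its closest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]])
                for i in range(n)]

    medoids = list(range(k))
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[c])
                continue
            # New medoid: member with the smallest total dissimilarity
            # to all members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Toy dissimilarities: items 0-2 are mutually close, as are items 3-4.
dist = [
    [0, 1, 2, 9, 8],
    [1, 0, 1, 8, 9],
    [2, 1, 0, 7, 7],
    [9, 8, 7, 0, 1],
    [8, 9, 7, 1, 0],
]
medoids, labels = k_medoids(dist, k=2)
```

Because medoids are actual items, each cluster comes with a representative gene, which is convenient when the cluster is subsequently inspected through its literature. Like other alternating schemes, this variant can stop in a local optimum that depends on the initial medoids.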

To evaluate the biological usefulness of a Gene*-based literature clustering, we compared the resulting clusters to sets of genes for which the functional associations are well-established. The reason not to start immediately from expression-based gene clusters is that real data-based clusters themselves can be unreliable, so the obtained results are hard to interpret. Consequently, in a first approximation we simulated the behaviour of an 'ideal' expression data cluster algorithm by having a molecular biologist construct two disjoint groups of genes: the first group ('hypoxia group') consisted of 54 genes known to be regulated by hypoxia (oxygen shortage) or to be involved in the hypoxic signal transduction; a second group ('TNF group') included 58 genes mediating tumor necrosis factor signaling. From a biological point of view, the two processes under consideration are not entirely independent, as both hypoxia and TNF are able to induce apoptosis (programmed cell death), even utilizing