[
literature in gene clustering: representations and methods
Peter Antal
Geert Fannes
Tamas Meszaros #
Patrick Glenisson
Bart De Moor
Yves Moreau
Tom Boonefaes +
Pieter Rottiers +
Johan Grooten +
]
Abstract
Due to the complexity of the underlying processes, the scarcity of the data and the high level of
noise in the measurements, understanding large-scale gene interactions from expression data demands
methodologies that integrate domain literature into gene clustering. These methodologies should
increasingly accommodate the circular nature of the overall clustering process, in which the human
expert iterates between the measured statistical data and the domain literature. In this paper
we propose Gene*, a genetic literature data structure representing the relevant literature for each
gene based on its GeneCard and Medline abstracts. The Gene* representation is appropriate for
information retrieval and for constructing textual profiles of a gene cluster. Based on this, we describe
a system providing a textual profile for a gene cluster and supporting the corresponding information
retrieval and revision steps according to the cyclic nature of the clustering process. We present results
about clustering of genes based solely on their Gene* literature representation and on the dual usage
of this literature-based clustering with gene clustering based on expression data.
1 Introduction
The dissemination of microarray technology, providing the possibility of simultaneous measurement of the
activity of large numbers of genes, gave new stimuli towards a broader understanding of their functional
roles and their interactive networking. The complexity of these fundamental biological questions, however,
demands methodologies that allow a flexible organization of human expertise, the measured statistical data
and the domain literature. One of the main research directions in the post-genomic age is the clustering of
genes to understand the regulatory patterns in their interactions, the pathways they belong to or the
genetic network they constitute (for an overview, we refer to [8]). To date, the interpretation and
validation of gene expression clusters remains a time-consuming manual task based on browsing the genetic
literature and hypothesizing about the relationships within the clusters based on the data. While it is
feasible for a small set of genes, this practice does not scale up to a medium or genome-wide set.
However, the availability of large amounts of electronic domain literature has changed the traditional way of
data analysis. Nowadays a huge amount of domain knowledge is available in either unstructured or structured
electronic format (e.g., natural language papers and structured documents, databases or knowledge bases
respectively). Many efforts in the bioinformatics community are currently directed towards the development of
machine-friendly representations, standardization of the state-of-the-art biological information resources
and the establishment of cross-linked or combined information sources. Despite these efforts, much of the
domain-specific
( ) Dept. of Electrical Eng., Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee (Leuven), Belgium
(+) Dept. of Molecular Biology, Ghent University and Flanders Interuniversity Institute for Biotechnology, K. L. Ledeganckstraat 35, B-9000 Ghent, Belgium
(#) Dept. of Measurement and Inf. Systems, Budapest Univ. of Tech.
rapid development of the domain.
The expert's environment can be characterized as split into a data world, which contains the high-throughput
data and statistical analysis methods, and a knowledge world, which contains the domain knowledge that is
predominantly present in free-text form. Within this terminology, the data analysis can be seen as a deep
interaction of human expertise with those two worlds. To increase the efficiency of this interaction,
post-genomic intelligent systems should overcome the current artificial separation of the two worlds (i.e.,
separation between tools for data analysis and those for information retrieval) by following integrated
approaches. Gradual steps in removing the barriers between expression data analysis and literature-oriented
methods can be the following:
1. Intelligent support. More customized literature-based methods (e.g., information retrieval methods using
contextual information from the user).
2. Semi-automatic support. Semi-automatic literature-based methods for interpreting the results of data
analysis (e.g., automatic derivation of a textual profile for a gene cluster coming from the data analysis).
3. Integrated approach. Integration of methods based on the analysis of expression data and domain literature
(e.g., comparison of clusterings of genes based on the measured data and the corresponding literature).
4. Literature-data approach. Methods using the domain literature in the same way as the expression data,
possibly after complex transformation of the textual domain knowledge into a more numerical, statistical
format (e.g., clustering of genes based on a dual vector representation for each gene, composed of relevance
factors for domain terms and the expression levels in measurements).
This article addresses the first three issues from an integrated point of view on the data and domain knowledge,
maintaining flexibility towards both the rapidly evolving field and expert interventions.
The paper is organized as follows: Section 2 summarizes the existing literature-based approaches for supporting
the expression data analysis. In Section 3, an iterative, cyclic model of the overall clustering process is shown.
Section 4 presents Gene*, the so-called 'genetic literature data' representation of the domain literature and our
semi-automatic system based on textual profiling of gene clusters to support this circular, iterative nature of the
clustering process. Section 5 contains results about clustering the derived 'genetic literature data' and its relation
to clustering expression data. Finally, Section 6 concludes and describes our ongoing work on this subject.
2 Literature-based microarray data analysis
The dominant strategies to support the expression data analysis by literature rely on the fact that the statistical
data analysis and the biological interpretation of the results are separated (i.e., the usage of statistical packages
versus database or information retrieval tools). Consequently, most of the efforts are devoted to the compilation
of the available genetic literature into cross-linked databases to support the structured and unstructured
information retrieval. For the biomedical sciences in general, the US National Library of Medicine's Medline is
a common bibliographic source of citations and abstracts ranging from 1966 to present. More recently created
and focused on the human genome, the GeneCard system [6] offers a web-based, comprehensive and constantly
updated knowledge repository for each human gene. These curated databases form basic tools for the manual
interpretation of expression data.
A first step towards automating the manual investigation of various domain knowledge repositories is to chain
frequently repeated consecutive steps, as reported in [4]. It involves the extension of queries with terms coming
from the GeneCards repository before submitting them to the Medline search engine. Following the four groupings
above, such methods can be classified as intelligent support for literature-based expression data analysis.
Another information retrieval-based approach is the use of domain-specific concepts and term hierarchies (or
ontologies). Through their structure, ontologies can represent the semantics of the domain. In [2] for example,
the Gene Ontology (GO) Consortium demonstrates the potential of GO for interpreting high-throughput genomic
data. Masys et al. [3] designed an interface that combines the Medline query engine with both the MeSH
keyword list and the UMLS ontology to infer the function of a group of genes. The MeSH (Medical Subject
concepts described in its controlled vocabularies (more details available at http://www.nlm.nih.gov/databases/).
The interface in [3] reports the quantitative significance of each result and provides links to different databases
allowing further browsing. Such methods can be interpreted as semi-automatic support for literature-based
expression data analysis.
The most ambitious methods, the integrated approaches, try to directly infer functional relationships between
genes and their products from textual information. We can distinguish two important approaches. Rindflesch et
al. [7] follow the deep-linguistic approach where Natural Language Processing forms the basis of the knowledge
discovery process. An alternative way is to adopt the shallow statistical stance, which basically discards the
linguistic structural information and uses only statistics of individual terms in documents and the corpus. Our
focus will be directed towards this approach. Shatkay et al. [9] for example derive a gene-to-gene similarity
measure by selecting the N most relevant documents from a corpus of Medline abstracts for each gene. PubGene
[5] on the other hand is the first on-line system that generates gene networks based on the full set of Medline
abstracts, all indexed in a simplified, but scalable way. Finally, Stapley et al. [10] report on a similar, but
prototypical application that is based on a binary gene-by-document matrix. It outputs the similarities between
the genes that are related to a user-specified MeSH term.
3 An integrated approach for using data and literature in gene clustering
As the previous survey demonstrates, one of the main current trends is the increasingly close incorporation of
the literature in the gene clustering process.
Designing efficient intelligent, semi-automated, or integrated methods requires the identification of a simple, but
powerful model of the clustering process. Fig. 1 illustrates the overall cyclic process involving (1) clustering
based on the data, (2) interpreting the clusters, (3) evaluating cluster members and potential non-members and
finally (4) modifying the experimental setup, the parameterization of the clustering algorithm, and so on to
return to the first step.
Figure 1: The interaction loop between the statistical data world and the literature world
The research into clustering strategies for high-throughput data, most notably microarray data, is a top priority
in the current post-genomic stage. The algorithms and their evaluation are increasingly taking into account
concepts such as statistical cluster quality, a priori confidence levels and predefined causal relations (cfr. [8]).
However, attaching biological context to the computed clusters remains a major bottleneck, as does ruling out
biological discrepancies, peculiarities or experimental errors. Although current efforts to collect related genetic
information (or links to it) into centralized repositories have proved extremely valuable, these repositories (being
closed systems) lack the interactive component needed to fine-tune the retrieval or discovery process. To our
knowledge, the existing intelligent and semi-automated systems operate only unidirectionally, lacking the
possibility to support the circular, iterative nature of the clustering process. As a continuation, the rationale
behind our approach to
automatic derivation of a textual profile for a set of genes and clustering of genes based on this literature
representation. The 'textual profile' of a gene cluster consists of relevant terms with relevance and stability
factors corresponding to this cluster (see later).
The potential of these functionalities in the iterative process can be demonstrated on the following paradigmatic
examples:
1. Investigation of a gene cluster of interest to the researcher
2. Investigation of a gene cluster identified by an expression-based clustering algorithm
3. Simultaneous investigation of multiple gene clusters identified by an expression-based clustering algorithm
Firstly, analyzing a cluster of genes for exploration aims at retrieving highly relevant keywords for the set of genes
of interest to the researcher. The returned profile identifies either the most important or most stable terms for
the gene cluster from a selected vocabulary. Since these can be an excellent starting point for further literature
exploration, the gene cluster profiling is directly integrated with an information retrieval system capable of
accepting queries based on the resulting gene cluster profile. The subsequent browsing of the returned documents
of the information retrieval system drives the user to a better understanding of the clusters.
Secondly, analyzing a cluster of genes directly obtained from expression analysis could aim at providing feedback
for the clustering algorithm. While in the previous case genes were not necessarily connected to the data world
(i.e., no expression information was considered), here the set of genes is defined by a clustering algorithm. As
the number of members per cluster typically lies around the one-hundred range, the scientist will rather skim the
reported profiles than thoroughly check the literature for each gene. This global inspection of textual profiles
makes it possible to detect outliers: genes assigned to a cluster in the data world, but excluded from it by the
literature. Such an outlying gene can suggest a flaw in the parameterization of the clustering algorithm or can
represent a new discovery. Depending on the outcome, the user can iterate back to the data world to repeat the
analysis or check certain findings and, if necessary, the loop can be entered again.
Thirdly, analyzing several clusters in general aims at performing a 'dual' clustering: comparing a clustering
based on the expression data with one based on domain knowledge. Until now, the domain knowledge-based
clustering was implicit and based on human expertise. Our goal at this point is to provide an explicit,
literature-based clustering of genes; based on the comparison of the clusterings derived from data and literature,
clusters can be adapted and alternative clustering strategies or parameterizations can be considered (see also Fig. 1).
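One simple way to quantify such a comparison between a data-based and a literature-based clustering is a
pair-counting agreement score. The sketch below uses the Rand index; this choice of measure, the function name
rand_index and the toy gene labels are assumptions made for illustration only and are not taken from the system
described in this paper.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of gene pairs on which two clusterings agree.

    labels_a, labels_b: dicts mapping gene name -> cluster label.
    A pair "agrees" if both clusterings put the two genes together,
    or both put them in different clusters.
    """
    genes = sorted(labels_a.keys() & labels_b.keys())
    pairs = list(combinations(genes, 2))
    if not pairs:
        raise ValueError("need at least two genes present in both clusterings")
    agree = sum(
        (labels_a[g] == labels_a[h]) == (labels_b[g] == labels_b[h])
        for g, h in pairs
    )
    return agree / len(pairs)


# Hypothetical labels: one clustering from expression data, one from literature.
expression = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
literature = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 1}
score = rand_index(expression, literature)
```

A score of 1 means both clusterings group every pair of genes identically; lower values point to genes, such as
g3 in the toy labels, whose assignment is worth re-examining in either world.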
4 Automatic textual analysis of gene clusters
In this section we will briefly summarize the main features of our experimental system based on the integrated
approach for the usage of expression data and genetic literature.
4.1 A 'genetic literature data' representation of genetic literature: Gene*
Our long-term purpose is to exploit the genetic literature inside various statistical methods, thereby giving it
the same status as experimental data. Therefore we call representations suitable to this end genetic literature data.
The most difficult question in deriving genetic literature data is how to collect, for a given gene, the most relevant
documents from a large collection of related documents to construct its literature representation. We used a
kernel-based methodology, a simplified version of [9], where for each gene its corresponding GeneCard was
chosen as the literature kernel. This gene kernel functions as a query document for selecting the most relevant
Medline abstracts.
In each phase, the documents (either Medline abstracts or GeneCards) are represented by the relevance factors
of 70,000 terms stemming from controlled vocabularies including the GO [2] and the MeSH ontologies. The
relevance factors are computed by the TF-IDF weighting scheme [1] and the cosine similarity measure was used
the N most related Medline abstracts. Based on this representation, similarity matrices were generated which
were used as input for two well-known clustering algorithms: hierarchical and k-medoids.
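The kernel-based selection step can be sketched as follows. This is a minimal illustration, assuming a plain
tf x log(N/df) weighting (the exact TF-IDF variant of [1] may differ), already tokenized documents and a toy
corpus; the function names tfidf_vectors, cosine and top_n_abstracts are hypothetical and not part of the
described system.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} dict."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is all-zero)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_n_abstracts(kernel_vec, abstract_vecs, n):
    """Indices of the n abstracts most similar to the gene kernel."""
    ranked = sorted(
        range(len(abstract_vecs)),
        key=lambda i: cosine(kernel_vec, abstract_vecs[i]),
        reverse=True,
    )
    return ranked[:n]


# Toy corpus: document 0 plays the role of a GeneCard kernel,
# documents 1..3 play the role of Medline abstracts.
docs = [
    ["hypoxia", "angiogenesis", "vegf"],
    ["hypoxia", "vegf", "tumor"],
    ["apoptosis", "tnf", "signaling"],
    ["angiogenesis", "oxygen"],
]
vecs = tfidf_vectors(docs)
selected = top_n_abstracts(vecs[0], vecs[1:], n=2)
```

In this toy setting the kernel selects the two abstracts that share vocabulary terms with it; the same cosine
function can then be applied to pairs of gene representations to fill a gene-to-gene similarity matrix.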
Table 1 summarizes possible alternatives and parameterizations for the representation that are currently available
in Gene*. Almost all of the shown parameterizations can be combined with one another.
Representation   Top N   Similarity             Clustering
binary           0       matching coefficient   hierarchical
TF-IDF           5       cosine                 k-medoids
                 20
Table 1: Parameterizations for the generation of Gene*, a genetic literature data. Listed are the specific vector
representation, the number of incorporated Medline documents, the similarity measure used to select the most
relevant Medline documents and to compute the gene-to-gene similarity matrix and the clustering method applied
on the similarity matrix.
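For the binary representation, Table 1 lists a matching coefficient as similarity measure. A minimal sketch,
assuming that this refers to the simple matching coefficient (the fraction of vocabulary positions on which two
binary term vectors agree); the function name and the example vectors are illustrative only.

```python
def matching_coefficient(u, v):
    """Simple matching coefficient of two equal-length binary vectors."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    # Count positions where both vectors are 1 or both are 0.
    return sum(a == b for a, b in zip(u, v)) / len(u)


# Hypothetical presence/absence of four vocabulary terms for two genes.
gene_a = [1, 0, 1, 0]
gene_b = [1, 1, 1, 0]
similarity = matching_coefficient(gene_a, gene_b)
```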
Since it supports both the generation of textual profiles for gene clusters and the generation of clusters based
purely on the literature, the Gene* representation is well suited for an integrated usage of data and literature
in the gene clustering process.
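To illustrate how clusters can be generated from a gene-to-gene similarity matrix alone, the following sketch
runs a basic alternating k-medoids procedure on distances defined as 1 - similarity. The conversion to
distances, the choice of initial medoids, the function name k_medoids and the toy matrix are illustrative
assumptions; the implementation used in our system may differ.

```python
def k_medoids(dist, medoids, max_iter=100):
    """Basic alternating k-medoids on a square distance matrix.

    dist: list of lists with dist[i][i] == 0.
    medoids: initial medoid indices (distinct).
    Returns a dict mapping each medoid index to its member indices.
    """
    medoids = list(medoids)
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(len(dist)):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member with the
        # smallest total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return clusters


# Hypothetical similarity matrix for four genes.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
dist = [[1.0 - s for s in row] for row in sim]
result = k_medoids(dist, medoids=[0, 2])
```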
4.2 An iterative scheme based on textual profiles of gene clusters
The first application of Gene* in the overall gene clustering process is to generate profiles for gene clusters as
described in Section 3. A more elaborate iterative scheme is presented in Fig. 2, which follows the circular
nature of this process shown previously in Fig. 1.
Figure 2: Gene*-based interaction between statistical data world and literature world
Based on Gene*, the system computes the mean and variance of the relevance factors of the terms corresponding
to the representations of the genes in the given cluster. Taking into account either the average or the variability
of the relevance factors of the terms, the user can select a subset of keywords to query the Medline collection,
the GeneCards or, again, Gene*. Two advantages of using a non-binary representation here are its ability
to generate more detailed term statistics and its flexibility to encapsulate the binary approach. However, the
system remains open in the sense that the user can select the underlying genetic literature data representation
from the alternatives presented in Table 1.
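The profile computation described above can be sketched in a few lines. This is a minimal illustration, assuming
that each gene is a sparse {term: relevance factor} dict, that absent terms count as zero, and that 'stable' terms
are those with low variance among terms with non-zero mean; these assumptions, the function names and the toy
values are ours for illustration and do not reproduce the actual system.

```python
from statistics import mean, pvariance


def cluster_profile(gene_vectors, vocabulary):
    """Per-term (mean, variance) of relevance factors over a gene cluster."""
    profile = {}
    for term in vocabulary:
        weights = [vec.get(term, 0.0) for vec in gene_vectors]
        profile[term] = (mean(weights), pvariance(weights))
    return profile


def top_terms(profile, k, by="mean"):
    """Select k keywords: highest mean, or lowest variance ('stable')."""
    if by == "mean":
        candidates = list(profile)
        key = lambda t: -profile[t][0]
    else:
        # Ignore terms that never occur in the cluster.
        candidates = [t for t in profile if profile[t][0] > 0]
        key = lambda t: profile[t][1]
    return sorted(candidates, key=key)[:k]


# Hypothetical relevance factors for a cluster of three genes.
cluster = [
    {"hypoxia": 0.9, "angiogenesis": 0.4},
    {"hypoxia": 0.7, "apoptosis": 0.2},
    {"hypoxia": 0.8, "angiogenesis": 0.6},
]
vocabulary = ["hypoxia", "angiogenesis", "apoptosis"]
profile = cluster_profile(cluster, vocabulary)
important = top_terms(profile, 2)
stable = top_terms(profile, 2, by="stable")
```

The selected keywords could then be passed as a query to whichever retrieval backend is in use.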
Preliminary, qualitative results about the usefulness of this system are encouraging. For example, in an
experiment, four highly relevant genes for hypoxia (oxygen deprivation) were selected (being HIF1A, VHL, VEGF
a thorough quantitative evaluation, so it is not covered further in this paper. We continue with the evaluation
of the applicability of Gene* for generating gene clusters based on the literature.
5 Towards full integration: clustering the 'genetic literature data'
Besides the full GeneCard database, we used a collection of 30,000 Medline abstracts dated from 01/1992 to
11/2000. The abstracts come from ten journals selected according to their impact factor and their relevance
as assessed by a biologist in the field of Molecular Oncology. The rationale behind using a restricted subset of
journals, but widening the temporal scope, was both to keep our prototype system in the medium-scale range and
to eliminate noise expected from unrelated information sources by keeping the literature more domain-specific.
We looked into the performance of clustering methods based on various Gene* parameterizations, as a first
attempt to answer questions such as "which Gene* representation provides an optimal similarity matrix for gene
clustering?". Though this question needs further investigation, we found the Gene* representation incorporating
the 20 most relevant Medline abstracts to have the best performance. Therefore the subsequently reported results
are computed based on this representation, denoted as Gene*_20. The similarity matrices derived from Gene*
representations were used in a hierarchical and a k-medoids clustering algorithm. In Fig. 3 we depict a typical
result after application of hierarchical clustering.
Figure 3: Hierarchical clustering of genes based on Gene*_20.
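As an illustration of how a hierarchical clustering can be obtained from a similarity matrix, the sketch below
merges clusters bottom-up using average linkage until a requested number of clusters remains. The linkage
criterion is an assumption (it is not specified here), and the function name and the toy matrix are hypothetical.

```python
def average_linkage(sim, n_clusters):
    """Agglomerative clustering on a symmetric similarity matrix.

    Repeatedly merges the two clusters with the highest average
    pairwise similarity until n_clusters remain.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average similarity over all cross-cluster pairs.
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical gene-to-gene similarity matrix for four genes.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
groups = average_linkage(sim, n_clusters=2)
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the tree
structure that a dendrogram displays.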
To evaluate the biological usefulness of a Gene*-based literature clustering, we compared the resulting clusters to
sets of genes for which the functional associations are well-established. The reason not to start immediately
from expression-based gene clusters is that real data-based clusters themselves can be unreliable, so the obtained
results are hard to interpret. Consequently, in a first approximation we simulated the behaviour of an 'ideal'
expression data clustering algorithm by having a molecular biologist construct two disjoint groups of genes. The
first group ('hypoxia group') consisted of 54 genes known to be regulated by hypoxia (oxygen shortage) or to be
involved in the hypoxic signal transduction; a second group ('TNF group') included 58 genes mediating tumor
necrosis factor signaling. From a biological point of view, the two processes under consideration are not entirely
independent, as both hypoxia and TNF are able to induce apoptosis (programmed cell death), even utilizing