literature in gene clustering: representations and methods

Peter Antal*, Geert Fannes*, Tamas Meszaros#, Patrick Glenisson*, Bart De Moor*, Yves Moreau*, Tom Boonefaes+, Pieter Rottiers+, Johan Grooten+

Abstract

Due to the complexity of the underlying processes, the scarcity of the data and the high level of noise on the measurements, understanding large-scale gene interactions from expression data demands methodologies that incorporate domain literature with gene clustering. These methodologies should increasingly comply with the circular nature of the overall clustering process, in which the human expertise iterates between the measured statistical data and the domain literature. In this paper we propose Gene*, a genetic literature data structure representing the relevant literature for each gene based on its GeneCard and Medline abstracts. The Gene* representation is appropriate for information retrieval and for constructing textual profiles of a gene cluster. Based on this, we describe a system providing a textual profile for a gene cluster and supporting the corresponding information retrieval and revision steps according to the cyclic nature of the clustering process. We present results about clustering of genes based solely on their Gene* literature representation and on the dual usage of this literature-based clustering with gene clustering based on expression data.

1 Introduction

The dissemination of microarray technology, providing the possibility of simultaneous measurement of the activity of large numbers of genes, gave new stimuli towards a broader understanding of their functional roles and their interactive networking. The complexity of these fundamental biological questions, though, demands methodologies that allow a flexible organization of human expertise, the measured statistical data and the domain literature. One of the main research directions in the post-genomic age is the clustering of genes to understand the regulatory patterns in their interactions, the pathways they belong to or the genetic network they constitute (for an overview, we refer to [8]). To date, the interpretation and validation of gene expression clusters remains a time-consuming manual task based on browsing the genetic literature and hypothesizing about the relationships within the clusters based on the data. While it is feasible for a small set of genes, this practice does not scale up to a medium or genome-wide set.

However, the availability of large amounts of electronic domain literature has changed the traditional way of data analysis. Nowadays a huge amount of domain knowledge is available in either unstructured or structured electronic format (e.g., natural language papers and structured documents, databases or knowledge bases respectively). Many efforts in the bioinformatics community are currently directed towards the development of machine-friendly representations, standardization of the state-of-the-art biological information resources and the establishment of cross-linked or combined information sources. Despite these efforts, much of the domain-specific [...] rapid development of the domain.

(*) Dept. of Electrical Eng., Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee (Leuven), Belgium, (+) Dept. of Molecular Biology, Ghent University and Flanders Interuniversity Institute for Biotechnology, K. L. Ledeganckstraat 35, B-9000 Ghent, Belgium, (#) Dept. of Measurement and Inf. Systems, Budapest Univ. of Tech.

The expert's environment can be characterized as split up into the data world, which contains the high-throughput data and statistical analysis methods, and the knowledge world, which contains the domain knowledge dominantly present in free-text form. Within this terminology, the data analysis can be seen as a deep interaction of human expertise with those two worlds. To increase the efficiency of this interaction, post-genomic intelligent systems should overcome the current artificial separation of the two worlds (i.e., separation between tools for data analysis and those for information retrieval) by following integrated approaches. Gradual steps in removing the barriers between the expression data analysis and literature-oriented methods can be the following:

1. Intelligent support. More customized literature-based methods (e.g., information retrieval methods using contextual information from the user).

2. Semi-automatic support. Semi-automatic literature-based methods for interpreting the results of data analysis (e.g., automatic derivation of a textual profile for a gene cluster coming from the data analysis).

3. Integrated approach. Integration of methods based on the analysis of expression data and domain literature (e.g., comparison of clusterings of genes based on the measured data and the corresponding literature).

4. Literature-data approach. Methods using the domain literature in the same way as the expression data, possibly after complex transformation of the textual domain knowledge into a more numerical, statistical format (e.g., clustering of genes based on a dual vector representation for each gene, composed of relevance factors for domain terms and the expression levels in measurements).
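The dual vector representation mentioned in item 4 can be illustrated with a minimal sketch. The function name dual_vector, the literature_weight parameter and the toy values are hypothetical and not taken from the paper, which does not specify how the two parts would be balanced.

```python
def dual_vector(term_weights, expression_levels, vocabulary, literature_weight=0.5):
    """Concatenate literature relevance factors and expression levels for one gene.

    literature_weight is an assumed balancing factor between the two parts;
    the source text does not prescribe any particular scaling.
    """
    # Literature part: one relevance factor per vocabulary term (0 if absent).
    literature_part = [literature_weight * term_weights.get(term, 0.0)
                       for term in vocabulary]
    # Data part: one expression level per measurement.
    expression_part = [(1.0 - literature_weight) * level
                       for level in expression_levels]
    return literature_part + expression_part


# Toy example with a two-term vocabulary and two measurements.
vector = dual_vector({"hypoxia": 0.8}, [2.0, -1.0], ["hypoxia", "tnf"])
```

Any standard vector-based clustering method could then be run on such vectors, which is what treating literature "in the same way as the expression data" would amount to under this assumption.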

This article addresses the first three issues from an integrated point of view on the data and domain knowledge, maintaining flexibility towards both the rapidly evolving field and expert interventions.

The paper is organized as follows: Section 2 summarizes the existing literature-based approaches for supporting the expression data analysis. In Section 3, an iterative, cyclic model of the overall clustering process is shown. Section 4 presents Gene*, the so-called 'genetic literature data' representation of the domain literature and our semi-automatic system based on textual profiling of gene clusters to support this circular, iterative nature of the clustering process. Section 5 contains results about clustering the derived 'genetic literature data' and its relation to clustering expression data. Finally, Section 6 concludes and describes our ongoing work on this subject.

2 Literature-based microarray data analysis

The dominant strategies to support the expression data analysis by literature rely on the fact that the statistical data analysis and the biological interpretation of the results are separated (i.e., the usage of statistical packages versus database or information retrieval tools). Consequently, most of the efforts are devoted to the compilation of the available genetic literature into cross-linked databases to support the structured and unstructured information retrieval. For the biomedical sciences in general, the US National Library of Medicine's Medline is a common bibliographic source of citations and abstracts ranging from 1966 to present. More recently created and focused on the human genome, the GeneCard system [6] offers a web-based, comprehensive and constantly updated knowledge repository for each human gene. These curated databases form basic tools for the manual interpretation of expression data.

A first step to automate the manual investigation of various domain knowledge repositories is serializing frequent subsequent steps as reported in [4]. It involves the extension of queries with terms coming from the GeneCards repository before submitting them to the Medline search engine. Following the four groupings above, such methods can be classified as intelligent support for literature-based expression data analysis.

Another information retrieval-based approach is the use of domain-specific concepts and term hierarchies (or ontologies). Through their structure, ontologies can represent the semantics of the domain. In [2] for example, the Gene Ontology (GO) Consortium demonstrates the potential of GO for interpreting high-throughput genomic data. Masys et al. [3] designed an interface that combines the Medline query engine with both the MeSH keyword list and the UMLS ontology to infer the function of a group of genes. The MeSH (Medical Subject [...] concepts described in its controlled vocabularies (more details available at http://www.nlm.nih.gov/databases/). The interface in [3] reports the quantitative significance of each result and provides links to different databases allowing further browsing. Such methods can be interpreted as semi-automatic support for literature-based expression data analysis.

The most ambitious methods, the integrated approaches, try to directly infer functional relationships between genes and their products from textual information. We can distinguish two important approaches. Rindflesch et al. [7] follow the deep-linguistic approach where Natural Language Processing forms the basis of the knowledge discovery process. An alternative way is to adopt the shallow statistical stance, which basically discards the linguistic structural information and uses only statistics of individual terms in documents and the corpus. Our focus will be directed towards this approach. Shatkay et al. [9] for example derive a gene-to-gene similarity measure by selecting the N most relevant documents from a corpus of Medline abstracts for each gene. PubGene [5] on the other hand is the first on-line system that generates gene networks based on the full set of Medline abstracts, all indexed in a simplified, but scalable way. Finally, Stapley et al. [10] report on a similar, but prototypical application that is based on a binary gene-by-document matrix. It outputs the similarities between the genes that are related to a user-specified MeSH term.

3 An integrated approach for using data and literature in gene clustering

As the previous survey demonstrates, one of the main current trends is the increasingly close incorporation of the literature in the gene clustering process.

Designing efficient intelligent, semi-automated, or integrated methods requires the identification of a simple but powerful model of the clustering process. Fig. 1 illustrates the overall cyclic process involving (1) clustering based on the data, (2) interpreting the clusters, (3) evaluating cluster members and potential non-members and finally (4) modifying the experimental setup, the parameterization of the clustering algorithm, and so on to return to the first step.

Figure 1: The interaction loop between the statistical data world and the literature world

The research into clustering strategies for high-throughput data, most notably microarray data, is a top priority in the current post-genomic stage. The algorithms and their evaluation are increasingly taking into account concepts such as statistical cluster quality, a priori confidence levels and predefined causal relations (cf. [8]). However, attaching biological context to the computed clusters, for instance to rule out biological discrepancies, peculiarities or experimental errors, remains a major bottleneck. Although current efforts to collect related genetic information (or links to it) into centralized repositories have proved extremely valuable, these repositories (being closed systems) lack the interactive component needed to fine-tune the retrieval or discovery process. To our knowledge, the existing intelligent and semi-automated systems operate only unidirectionally, lacking the possibility to support the circular, iterative nature of the clustering process. As a continuation, the rationale behind our approach to [...] automatic derivation of a textual profile for a set of genes and clustering of genes based on this literature representation. The 'textual profile' of a gene cluster consists of relevant terms with relevance and stability factors corresponding to this cluster (see later).

The potential of these functionalities in the iterative process can be demonstrated on the following paradigmatic examples:

1. Investigation of a gene cluster of interest to the researcher

2. Investigation of a gene cluster identified by an expression-based clustering algorithm

3. Simultaneous investigation of multiple gene clusters identified by an expression-based clustering algorithm

Firstly, analyzing a cluster of genes for exploration aims at retrieving highly relevant keywords for the set of genes of interest to the researcher. The returned profile identifies either the most important or most stable terms for the gene cluster from a selected vocabulary. Since these can be an excellent starting point for further literature exploration, the gene cluster profiling is directly integrated with an information retrieval system capable of accepting queries based on the resulting gene cluster profile. The subsequent browsing of the documents returned by the information retrieval system drives the user to a better understanding of the clusters.

Secondly, analyzing a cluster of genes directly obtained from expression analysis could aim at providing feedback for the clustering algorithm. While in the previous part genes were not necessarily connected to the data world (i.e., no expression information was considered), here the set of genes is defined by a clustering algorithm. As the number of members per cluster typically lies around the one-hundred range, the scientist will rather skim the reported profiles than thoroughly check the literature for each gene. This global inspection of textual profiles makes it possible to detect outliers: genes assigned to a cluster in the data world, but excluded from it by the literature. Such an outlying gene can suggest a flaw in the parameterization of the cluster algorithm or can represent a new discovery. Depending on the outcome, the user can iterate back to the data world to repeat the analysis or check certain findings and, if necessary, the loop can be entered again.
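One way such literature-based outliers could be flagged automatically is sketched below. This is an illustration under stated assumptions, not the system described here: the mean-cosine criterion, the threshold, the function literature_outliers, the placeholder gene GENE_X and all term weights are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def literature_outliers(cluster, threshold):
    """Return genes whose mean literature similarity to the other members is below threshold."""
    flagged = []
    for gene, vector in cluster.items():
        similarities = [cosine(vector, other)
                        for name, other in cluster.items() if name != gene]
        if similarities and sum(similarities) / len(similarities) < threshold:
            flagged.append(gene)
    return flagged


# Toy expression-derived cluster with invented literature vectors.
cluster = {
    "HIF1A": {"hypoxia": 1.0, "oxygen": 0.5},
    "VHL": {"hypoxia": 0.8, "oxygen": 0.6},
    "GENE_X": {"apoptosis": 1.0},  # hypothetical gene with unrelated literature
}
outliers = literature_outliers(cluster, threshold=0.3)
```

A flagged gene is only a candidate for inspection; as the text notes, it may point to a parameterization problem or to a genuinely new finding.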

Thirdly, analyzing several clusters in general aims at performing a 'dual' clustering: to compare a clustering based on the expression data versus domain knowledge. Till now, the domain knowledge-based clustering was implicit and based on human expertise. Our goal at this point is to provide an explicit, literature-based clustering of genes; based on the comparison of the clusterings derived from data and literature, clusters can be adapted and alternative cluster strategies or parameterizations can be considered (see also Fig. 1).
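The text does not name a measure for comparing the data-based and literature-based clusterings. As one possible choice, the sketch below uses the Rand index, a standard pairwise agreement score; its use here is an assumption, and the label lists are toy data.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of gene pairs treated the same way (together or apart) by both clusterings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same genes")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy cluster labels for five genes from the two 'worlds'.
expression_labels = [0, 0, 1, 1, 1]
literature_labels = [0, 0, 0, 1, 1]
agreement = rand_index(expression_labels, literature_labels)
```

A low score would suggest revisiting the clustering strategy or its parameters, while the disagreeing pairs themselves indicate which genes deserve a closer look.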

4 Automatic textual analysis of gene clusters

In this section we will briefly summarize the main features of our experimental system based on the integrated approach for the usage of expression data and genetic literature.

4.1 A 'genetic literature data' representation of genetic literature: Gene*

Our long-term purpose is to exploit the genetic literature inside various statistical methods, thereby giving it equal status to experimental data. Therefore we call representations suitable to this end genetic literature data. The most difficult question in deriving genetic literature data is how to collect, for a given gene, the most relevant documents from a large collection of related documents to construct its literature representation. We used a kernel-based methodology, a simplified version of [9], where for each gene its corresponding GeneCard card was chosen as the literature kernel. This gene kernel functions as a query document for selecting the most relevant Medline abstracts.

In each phase, the documents (either Medline abstracts or GeneCards) are represented by the relevance factors of 70,000 terms stemming from controlled vocabularies including the GO [2] and the MeSH ontologies. The relevance factors are computed by the TF-IDF weighting scheme [1] and the cosine similarity measure was used [...] the N most related Medline abstracts. Based on this representation, similarity matrices were generated which were used as input for two well-known clustering algorithms: hierarchical and k-medoids.
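The kernel-based selection step can be sketched as follows, assuming documents have already been reduced to lists of controlled-vocabulary terms. The TF-IDF variant (raw term count times log inverse document frequency), the function names and the toy documents are assumptions made for illustration; the exact weighting used for Gene* is not spelled out in this text.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} dictionary."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({t: term_freq[t] * math.log(n_docs / doc_freq[t])
                        for t in term_freq})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_n_abstracts(kernel, abstracts, n):
    """Return the indices of the n abstracts most similar to the kernel document."""
    vectors = tfidf_vectors([kernel] + abstracts)
    kernel_vec, abstract_vecs = vectors[0], vectors[1:]
    ranked = sorted(range(len(abstracts)),
                    key=lambda i: cosine(kernel_vec, abstract_vecs[i]),
                    reverse=True)
    return ranked[:n]


# Toy data: a gene kernel (standing in for a GeneCard) and three abstracts.
kernel = ["hypoxia", "transcription", "oxygen"]
abstracts = [
    ["apoptosis", "tnf", "signaling"],
    ["hypoxia", "oxygen", "angiogenesis"],
    ["transcription", "kinase"],
]
best = top_n_abstracts(kernel, abstracts, n=2)
```

Here the second abstract shares two terms with the kernel and ranks first, the third shares one term and ranks second, and the first shares none.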

Table 1 summarizes possible alternatives and parameterizations for the representation that are actually present in Gene*. Almost all of the parameterizations shown can be combined with each other.

Representation   Top N   Similarity             Clustering
binary           0       matching coefficient   hierarchical
TF-IDF           5       cosine                 k-medoids
                 20

Table 1: Parameterizations for the generation of Gene*, a genetic literature data. Listed are the specific vector representation, the number of incorporated Medline documents, the similarity measure used to select the most relevant Medline documents and to compute the gene-to-gene similarity matrix and the clustering method applied on the similarity matrix.
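The matching coefficient for the binary representation is not defined in this text. One common reading, assumed here, is the simple matching coefficient: the fraction of vocabulary positions on which two binary term vectors agree. The function name and the toy vectors are hypothetical.

```python
def matching_coefficient(u, v):
    """Simple matching coefficient: share of positions where two binary vectors agree."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    agreements = sum(1 for a, b in zip(u, v) if a == b)
    return agreements / len(u)


# Toy binary term-presence vectors over a four-term vocabulary.
similarity = matching_coefficient([1, 0, 1, 0], [1, 1, 1, 0])
```

With a large, sparse vocabulary this measure is dominated by shared absences, which is one reason a weighted representation with cosine similarity may be preferred.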

Since it supports both the generation of textual profiles for gene clusters and the generation of clusters purely on the literature, the Gene* representation can be used optimally for an integrated usage of data and literature in the gene clustering process.
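Clustering purely on the literature starts from a gene-to-gene similarity matrix. The sketch below shows a basic alternating k-medoids procedure on such a matrix. The conversion from similarity to distance (1 minus similarity), the farthest-first initialization and the toy matrix are assumptions made for illustration, not details taken from the system described here.

```python
def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a symmetric distance matrix (list of lists)."""
    n = len(dist)
    # Farthest-first initialization (an assumed, deterministic choice).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # The new medoid minimizes the total distance to its cluster members.
            new_medoids.append(min(members,
                                   key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy gene-to-gene similarity matrix: genes 0 and 1 are alike, as are 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
dist = [[1.0 - s for s in row] for row in sim]
medoids, labels = k_medoids(dist, k=2)
```

Because k-medoids only needs pairwise distances, it can be applied directly to any of the similarity measures listed in Table 1.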

4.2 An iterative scheme based on textual profiles of gene clusters

The first application of Gene* in the overall gene clustering process is to generate profiles for gene clusters as described in Section 3. A more elaborate iterative scheme is presented in Fig. 2, which follows the circular nature of this process shown previously in Fig. 1.

Figure 2: Gene*-based interaction between statistical data world and literature world

Based on Gene*, the system computes the mean and variance of the relevance factors of the terms corresponding to the representations of the genes in the given cluster. Taking into account either the average or the variability of the relevance factors of the terms, the user can select a subset of keywords to query the Medline collection, the GeneCards or, again, the Gene*. Two advantages of using a non-binary representation here are its ability to generate more detailed term statistics and its flexibility to encapsulate the binary approach. However, the system remains open in the sense that the user can select the underlying genetic literature data representation from the alternatives presented in Table 1.
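The per-term statistics behind a cluster profile can be sketched as follows. The ranking rules (highest mean for importance, lowest variance among terms that occur at all for stability), the function names and the toy relevance factors are assumptions for illustration; the text only states that mean and variance are computed and offered to the user.

```python
from statistics import mean, pvariance


def cluster_profile(gene_vectors, vocabulary):
    """Return {term: (mean, variance)} of the relevance factors over the genes of a cluster."""
    profile = {}
    for term in vocabulary:
        weights = [vector.get(term, 0.0) for vector in gene_vectors]
        profile[term] = (mean(weights), pvariance(weights))
    return profile


def top_terms(profile, k, by="mean"):
    """Rank terms by highest mean, or by lowest variance among terms present in the cluster."""
    present = [t for t in profile if profile[t][0] > 0.0]
    if by == "mean":
        ranked = sorted(present, key=lambda t: profile[t][0], reverse=True)
    elif by == "variance":
        ranked = sorted(present, key=lambda t: profile[t][1])
    else:
        raise ValueError("by must be 'mean' or 'variance'")
    return ranked[:k]


# Toy cluster of three genes with invented relevance factors.
genes = [
    {"hypoxia": 0.8, "oxygen": 0.4},
    {"hypoxia": 0.6, "apoptosis": 0.5},
    {"hypoxia": 0.7, "oxygen": 0.4},
]
vocabulary = ["hypoxia", "oxygen", "apoptosis", "kinase"]
profile = cluster_profile(genes, vocabulary)
keywords = top_terms(profile, k=2)
```

The selected keywords could then be passed on as a query to the Medline collection or the GeneCards, closing the loop shown in Fig. 2.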

Preliminary, qualitative results about the usefulness of this system are encouraging. For example, in an experiment, four highly relevant genes for hypoxia (oxygen deprivation) were selected (being HIF1A, VHL, VEGF [...] a thorough quantitative evaluation, so it is not covered further in this paper. We continue with the evaluation of the applicability of Gene* for generating gene clusters based on the literature.

5 Towards full integration: clustering the 'genetic literature data'

Besides the full GeneCard database, we used a collection of 30,000 Medline abstracts dated from 01/1992 to 11/2000. The abstracts come from ten journals selected according to their impact factor and their relevance as assessed by a biologist in the field of Molecular Oncology. The rationale behind using a restricted subset of journals, but widening the temporal scope, was both to keep our prototype system in the medium-scale range and to eliminate noise expected from unrelated information sources by keeping the literature more domain-specific.

We looked into the performance of clustering methods based on various Gene* parameterizations, as a first attempt to answer questions such as "which Gene* representation provides an optimal similarity matrix for gene clustering?". Though this question needs further investigation, we found the Gene* representation incorporating the 20 most relevant Medline abstracts to have the best performance. Therefore the subsequent reported results are computed based on this representation, denoted as Gene*_20. The similarity matrices derived from Gene* representations were used in a hierarchical and k-medoids clustering algorithm. In Fig. 3 we depict a typical result after application of hierarchical clustering.

Figure 3: Hierarchical clustering of genes based on Gene*_20.

To evaluate the biological usefulness of a Gene*-based literature clustering, we compared the resulting clusters to the sets of genes for which the functional associations are well-established. The reason not to start immediately from expression-based gene clusters is that real data-based clusters themselves can be unreliable, so the obtained results are hard to interpret. Consequently, in a first approximation we simulated the behaviour of an 'ideal' expression data cluster algorithm by having a molecular biologist construct two disjoint groups of genes. The first group ('hypoxia group') consisted of 54 genes known to be regulated by hypoxia (oxygen shortage) or to be involved in the hypoxic signal transduction; a second group ('TNF group') included 58 genes mediating tumor necrosis factor signaling. From a biological point of view, the two processes under consideration are not entirely independent, as both hypoxia and TNF are able to induce apoptosis (programmed cell death), even utilizing
