
in Text-Based Gene Clustering

P. Glenisson, P. Antal, J. Mathys, Y. Moreau, B. De Moor

Department of Electrical Engineering, ESAT-SISTA,
Kasteelpark Arenberg 10,
B-3001 Leuven, Belgium

Thanks to its increasing availability, electronic literature can now be a major source of information when developing complex statistical models where data is scarce or contains much noise. This raises the question of how to deeply integrate information from domain literature with experimental data. Evaluating what kind of statistical text representations can integrate literature knowledge in clustering remains an insufficiently explored topic. In this work we discuss how the bag-of-words representation can be used successfully to represent genetic annotation and free-text information coming from different databases. We demonstrate the effect of various weighting schemes and information sources in a functional clustering setup. As a quantitative evaluation, we contrast, for different parameter settings, the functional groupings obtained from text with those obtained from expert assessments and link each of the results to a biological discussion.

1 Introduction

More and more, a successful understanding of complex genetic mechanisms (such as regulation, functional understanding, ...) critically depends on the interaction between statistical analysis and various knowledge sources, such as annotation databases, specialized literature, and curated cross-links between them (Baxevanis^1). Despite these efforts, the current interaction between the experimental (data) analysis and text-based information requires extensive user intervention. Gene expression experiments, which measure large-scale genetic activity under a variety of biological conditions, are excellent examples of environments that rely strongly on this interaction. Indeed, as (1) the cost of data collection is high, (2) measurements are often noisy or unreliable, and (3) established relationships in the transcriptome are fragmentary at best, a deeper integration between data and text-based information will benefit the knowledge discovery process.

The present strategies for knowledge-based expression data analysis rely on the premise that statistical data analysis and biological knowledge can complement each other by linking two independently constructed sources that contain conceptually related records (Masys^2 and Vidal^3).

In yeast, for example, interpreting cluster patterns involves the consultation of curated functional databases such as the Saccharomyces Genome Database (SGD)^a, which offers concise functional annotations and a variety of cross-references to other repositories. For more elaborate information, researchers can resort to MEDLINE, an online bibliographic source of citations and abstracts in biomedical research dating from 1966 till present. While the use of a controlled and curated index, like MeSH^b, is already common in automatically associating gene functions (see for example Jenssen^4, Masys^5, Kankar^6), we additionally tested the use of free text as a potentially more informative, and in the future possibly more dominant, information source (see also Stapley^7, Stephens^8, Renner^9, Iliopoulos^10, Raychaudhuri^11).

In this work, we explore how representations borrowed from the field of information retrieval can be adopted for clustering genes based on their associated literature. We encode text-based information from various sources in a typical bag-of-words representation following the vector space model, a workhorse in information retrieval research. We investigate the effect of pooling and expanding these sources, together with the question of which type of representation is more appropriate. To evaluate the biological usefulness of literature clustering, we formulate a clustering problem with gene sets from Saccharomyces cerevisiae for which the functional associations are well-established and biologically distinct. The reason not to start immediately from expression-based gene clusters is that these data-based clusters are often biologically complex and cannot provide a gold standard to interpret and quantify the correspondence between various data mining methods. Additionally, we seek to identify some inherent biases of the vector-space model by testing and quantifying its performance on a fairly simple biological problem. To compare different versions of the representation with respect to clustering performance, we use both external and internal scores for cluster validation (see Section 2). The aim of these evaluations is to establish a powerful statistical text representation as a foundation for knowledge-based gene expression clustering.

2 Methods

2.1 Compilation of Information Sources

We collect and compile (as of September 2001) several sources for textual annotations of the genes. Firstly, we retrieve the gene descriptions from the Saccharomyces Genome Database (SGD)^c. Secondly, we use SWISS-PROT (SP)^d, a curated protein sequence database. We pool the SGD and SP information

^a http://genome-www.stanford.edu/Saccharomyces/
^b http://www.nlm.nih.gov/mesh/meshhome.html
^c http://genome-www.stanford.edu/Saccharomyces/
^d http://www.expasy.org/sprot/

textual resource for yeast genes. Finally, as a source for more detailed information, we use a collection of 493,923 yeast-related MEDLINE abstracts dated between January 1982 and November 2000. They were selected by retaining those abstracts coming from a list of 59 journals that was composed according to both impact factor^e and relevance. The aim of this trimming is to retain a more domain-specific subset of abstracts, which is still diverse enough to hold essential genetic information. We evaluate how these sources influence text-based gene clustering and, more specifically, we investigate how the expansion of the SGD and YeastCard annotations with MEDLINE abstract information (see Section 2.3) affects clustering performance.

2.2 Text Representation

The representation called the vector space model encodes a document in a k-dimensional space where each component $v_{ij}$ represents the weight of term $t_j$ in document $d_i$. The grammatical structure of the text is neglected, and the model is therefore also referred to as a bag-of-words representation. As a basic index for each document in the collection, we construct a vocabulary consisting of 26,420 (possibly multi-word) terms extracted from the Gene Ontology^f Term field. The Porter stemmer is used to canonize the words. Based on the Term field in GO and the Synonym field in SWISS-PROT, we process candidate phrases and replace known synonyms. In this work we use the following common indexing schemes (Baeza-Yates^12 and Korfhage^13):

• $v^{\mathrm{bool}}_{ij} = 1$ if $t_j \in d_i$, $0$ otherwise

• $v^{\mathrm{freq}}_{ij} = \dfrac{f_{ij}}{\max_{\forall j}(f_{ij})}$, where $f_{ij}$ is the number of occurrences of $t_j$ in $d_i$

• $v^{\mathrm{tf.idf}}_{ij} = f_{ij} \log\left(\dfrac{N}{n_j}\right)$, where $N$ is the total number of documents and $n_j$ is the number of documents containing term $t_j$ in the collection
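As a rough illustration of the three weighting schemes, the following minimal Python sketch computes boolean, normalized-frequency, and tf.idf weights for a toy corpus. The function name `index_documents`, the toy tokens, the sparse dictionary layout, and the use of the natural logarithm are assumptions made for this example; stemming, synonym replacement, and the Gene Ontology vocabulary are not modelled.

```python
import math
from collections import Counter


def index_documents(docs):
    """Return (bool, freq, tfidf) weights, one sparse dict per document.

    docs is a list of token lists; all names here are illustrative only.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for c in counts for term in c)

    v_bool, v_freq, v_tfidf = [], [], []
    for c in counts:
        max_f = max(c.values()) if c else 1
        # Boolean index: 1 if the term occurs in the document.
        v_bool.append({t: 1 for t in c})
        # Frequency index: occurrences divided by the document's maximum count.
        v_freq.append({t: f / max_f for t, f in c.items()})
        # tf.idf index: occurrences times log(N / document frequency).
        v_tfidf.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in c.items()}
        )
    return v_bool, v_freq, v_tfidf


# Toy corpus of already-stemmed tokens (hypothetical data).
docs = [
    ["vacuolar", "atpas", "vacuolar"],
    ["transport", "amino", "acid"],
    ["vacuolar", "transport"],
]
v_bool, v_freq, v_tfidf = index_documents(docs)
```

Terms that occur in every document receive a tf.idf weight of zero, which is one reason this scheme tends to suppress uninformative words in longer texts.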

Additionally, we define another type of index called the reference representation (see Shatkay^14). When a document contains references to other documents in the same or another repository, we can encode this as follows:

• $v^{\mathrm{ref}}_{ij} = 1$ if annotation $i$ contains a reference to document $j$, $0$ otherwise

^e http://jcrweb.com/
^f http://www.geneontology.org

We express similarity between pairs of documents $d_i$ and $d_j$, or between a text document $d_i$ and a query document $d_j$, by the cosine of the angle between the corresponding normalized vector representations:

$$\mathrm{sim}(d_i, d_j) = \cos(d_i, d_j).$$
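The cosine similarity can be sketched in a few lines of Python. The sparse dictionary representation (term mapped to weight) and the function name `cosine_similarity` are assumptions of this example, and returning 0 for an empty vector is a convention chosen here rather than something stated in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty documents (assumption)
    return dot / (norm_u * norm_v)


# Hypothetical term-weight profiles for two annotations.
gene_a = {"vacuolar": 0.54, "atpas": 0.40}
gene_b = {"vacuolar": 0.54, "snare": 0.36}
similarity = cosine_similarity(gene_a, gene_b)
```

Because the cosine only depends on the direction of the vectors, two documents of very different length can still be judged highly similar.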

The underlying hypothesis states that high similarity equals strong relevance. Further, the method termed pseudo-relevance feedback is geared towards expanding a query document with the $n$ most similar documents in a collection and aims at refining the search or clustering process by a recalculation of the term weights (Baeza-Yates^12). We denote the annotations $A$ expanded with $n$ documents from collection $C$ by $A$-$C_n$. A related application of pseudo-relevance feedback in combination with the reference representation can be found in Shatkay^14.
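One possible reading of this expansion step is sketched below in Python: the annotation is used as a query, the collection is ranked by cosine similarity on raw term counts, and the tokens of the $n$ best matches are pooled with the annotation so that the result can be re-indexed. The pooling strategy, the ranking on raw counts, and the names `expand_annotation` and `cosine` are assumptions of this sketch and are not taken from the text.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two Counter-based term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm > 0 else 0.0


def expand_annotation(annotation, collection, n):
    """Pool an annotation with its n most similar documents.

    Returns the expanded token list and the indices of the selected
    documents; the expanded list is meant to be re-indexed afterwards.
    """
    query = Counter(annotation)
    ranked = sorted(
        range(len(collection)),
        key=lambda i: cosine(query, Counter(collection[i])),
        reverse=True,
    )
    top = ranked[:n]
    expanded = list(annotation)
    for i in top:
        expanded.extend(collection[i])
    return expanded, top


# Hypothetical annotation and abstract collection (stemmed tokens).
annotation = ["vacuolar", "atpas"]
abstracts = [
    ["vacuolar", "membran", "atpas"],
    ["ribosom", "translat"],
    ["vacuolar", "acidif"],
]
expanded, top = expand_annotation(annotation, abstracts, n=2)
```

In the notation of the text, `expanded` would correspond to an annotation of type $A$-$C_2$ before its term weights are recalculated.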

2.4 Cluster Algorithm

As divisive clustering algorithm we used the K-medoids algorithm (Rousseeuw^15), which minimizes the objective function

$$\sum_{k=1}^{K} \sum_{j \in C_k} d(x_j, m_k)$$

over multiple partitionings $C = \{C_1, \ldots, C_K\}$, with $\{m_1, \ldots, m_K\}$ the corresponding representative points (called medoids) of each cluster. The parameter $K$ denotes the number of clusters and is fixed in advance. One advantage of this algorithm over centroid-based methods, such as K-means, is that each medoid constitutes a robust representative data point for each cluster.
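To make the objective concrete, here is a minimal Python sketch of a K-medoids procedure that alternates between assigning points to the nearest medoid and re-selecting each cluster's medoid. This simple alternating variant is an assumption of the example; it is not necessarily the exact procedure of the cited reference, and the function name `k_medoids` and the explicit initial medoids are illustrative choices.

```python
def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternating K-medoids on a precomputed n x n dissimilarity matrix.

    Returns (medoid indices, cluster label per point, objective value).
    """
    n = len(dist)
    medoids = list(initial_medoids)

    def assign(current):
        # Each point joins the cluster of its nearest medoid.
        return [
            min(range(len(current)), key=lambda k: dist[j][current[k]])
            for j in range(n)
        ]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for k in range(len(medoids)):
            members = [j for j in range(n) if labels[j] == k]
            if not members:  # keep the old medoid for an empty cluster
                new_medoids.append(medoids[k])
                continue
            # New medoid: the member with the smallest total dissimilarity
            # to all other members of the cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(dist[j][c] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids

    labels = assign(medoids)
    cost = sum(dist[j][medoids[labels[j]]] for j in range(n))
    return medoids, labels, cost


# Toy one-dimensional data with two obvious groups (hypothetical).
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels, cost = k_medoids(dist, initial_medoids=[0, 1])
```

Since only a dissimilarity matrix is required, the same routine could be applied to $1 - \cos$ dissimilarities between text profiles; like most such heuristics it only reaches a local minimum that depends on the initial medoids.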

2.5 Cluster Quality

To measure the performance and quality of the clustering we define three scores: the silhouette coefficient, the performance of the clustering as a classifier, and the Rand index. The first two are termed internal scores since they rely on statistical properties of the clustered data; the last one is called external because it involves a comparison with a known, external labeling.

Silhouette Coefficient

As a first internal score for cluster quality we use the silhouette coefficient per cluster $S_k = \frac{1}{n_k} \sum_{i=1}^{n_k} s_{ik}$ and the overall silhouette coefficient $S = \frac{1}{n} \sum_{k} \sum_{i=1}^{n_k} s_{ik}$, with

$$s_{ik} = \frac{b(i) - a(i)}{\max(a(i), b(i))},$$

where $a(i)$ is the average dissimilarity of member $i$ to all other members of its cluster and $b(i)$ the dissimilarity of member $i$ to the nearest member of the nearest cluster. It is a metric-independent measure designed to describe the ratio between cluster coherence and separation and to assist in choosing which clustering is preferable according to the data (Rousseeuw^15).
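A small Python sketch of the silhouette computation follows. It assumes the commonly used definition in which $b(i)$ is the mean dissimilarity of point $i$ to the members of the nearest other cluster; taking the minimum instead of the mean in the marked line would give the "nearest member" reading. The function name `silhouette`, the zero score for singleton clusters, and the requirement of at least two clusters are further assumptions of this example.

```python
def silhouette(dist, labels):
    """Silhouette values from a dissimilarity matrix and cluster labels.

    Returns (per-point values, per-cluster means, overall mean).
    Requires at least two clusters.
    """
    n = len(labels)
    clusters = {}
    for i, k in enumerate(labels):
        clusters.setdefault(k, []).append(i)

    s = [0.0] * n
    for i in range(n):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            continue  # singleton cluster: score 0 by convention (assumption)
        a = sum(dist[i][j] for j in own) / len(own)
        # Mean dissimilarity to the nearest other cluster (assumed definition).
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for k, members in clusters.items()
            if k != labels[i]
        )
        if max(a, b) > 0:
            s[i] = (b - a) / max(a, b)

    per_cluster = {
        k: sum(s[i] for i in members) / len(members)
        for k, members in clusters.items()
    }
    overall = sum(s) / n
    return s, per_cluster, overall


# Toy data: two well-separated groups on a line (hypothetical).
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
s, per_cluster, overall = silhouette(dist, [0, 0, 1, 1])
```

Values close to 1 indicate points that sit well inside their cluster, while values near 0 or below point to overlapping or poorly separated clusters.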

k-NN Learnability

For the second measure of internal cluster quality, we look upon the problem as being semi-supervised. Using the clustering result as a labeling for all the points, we assess the performance of a given classifier on a class (or cluster) in a cross-validated leave-one-out setup. Following Pavlidis^16, we use a k-NN classifier jointly with the $(1 - \cos(\cdot,\cdot))$ distance measure to compute a misclassification score for each class. The statistical significance of this score $m$ is expressed by a p-value derived from a binomial $B(m, p_{misclass})$, with $p_{misclass}$ the prior chance of misclassification, which can be computed analytically in case of a k-NN classifier (details can be found in Pavlidis^16).
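The leave-one-out part of this score can be sketched in Python as follows. The example counts, per class, how many points are misclassified by a k-NN majority vote under the $1 - \cos$ distance; the significance computation is left out because the analytic form of $p_{misclass}$ is not given in the text. The function names, the dense list vectors, and the default `k=3` are assumptions of this sketch.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """1 - cos(u, v) for two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def loo_knn_misclassifications(vectors, labels, k=3):
    """Leave-one-out k-NN: number of misclassified points per class."""
    errors = Counter()
    for i, x in enumerate(vectors):
        # Rank all other points by distance and keep the k nearest.
        neighbours = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: cosine_distance(x, vectors[j]),
        )[:k]
        votes = Counter(labels[j] for j in neighbours)
        predicted = votes.most_common(1)[0][0]
        if predicted != labels[i]:
            errors[labels[i]] += 1
    return {c: errors[c] for c in sorted(set(labels))}


# Hypothetical profiles: the last "A" point lies among the "B" points.
vectors = [
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.2],
    [0.0, 1.0], [0.1, 0.9], [0.2, 1.0],
    [0.1, 1.0],
]
labels = ["A", "A", "A", "B", "B", "B", "A"]
misclassified = loo_knn_misclassifications(vectors, labels, k=3)
```

A cluster whose members are rarely misclassified in this setup is "learnable" in the sense used above, meaning its label can be recovered from the neighbourhood structure alone.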

Rand Index

As an external measure for cluster validity we use the adjusted Rand index^17. Given a set of $n$ points, an external partition $P = \{P_1, \ldots, P_k\}$, and a clustering $C = \{C_1, \ldots, C_l\}$, define $a$ as the number of pairs of points that co-occur in a group in the partitioning $P$ as well as in the clustering $C$, $d$ as the number of pairs of points that are in different groups in $P$ as well as in $C$, and $b$ and $c$ as the number of pairs of points that co-occur in a group in $P$ but not in $C$, or vice versa. The Rand index is then defined by

$$R = \frac{a + d}{a + b + c + d}.$$

The correction for random partitioning is $R_{adj} = \frac{R - E(R)}{\max(R) - E(R)}$, where a hypergeometric baseline distribution is used to compute the expected values. In a comparative study^17, the adjusted Rand index is recommended as the external measure of choice.
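Both indices can be computed directly from two label lists, as in the Python sketch below. The plain Rand index follows the pair counts $a$, $b$, $c$, $d$ defined above; for the adjusted index the sketch assumes the usual contingency-table form of the hypergeometric correction (often attributed to Hubert and Arabie). Function names and toy labels are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(p, c):
    """Rand index R = (a + d) / (a + b + c + d) from pair counts."""
    agree = total = 0
    for i, j in combinations(range(len(p)), 2):
        same_p = p[i] == p[j]
        same_c = c[i] == c[j]
        agree += same_p == same_c  # pairs counted in a or in d
        total += 1
    return agree / total


def adjusted_rand_index(p, c):
    """Adjusted Rand index with a hypergeometric baseline."""
    n = len(p)
    # Pairs co-occurring in both partitions (the count a).
    sum_both = sum(comb(v, 2) for v in Counter(zip(p, c)).values())
    sum_p = sum(comb(v, 2) for v in Counter(p).values())
    sum_c = sum(comb(v, 2) for v in Counter(c).values())
    expected = sum_p * sum_c / comb(n, 2)
    maximum = (sum_p + sum_c) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every point in its own group
    return (sum_both - expected) / (maximum - expected)


# Hypothetical expert partition and clustering result.
expert = [0, 0, 1, 1]
clustering = [0, 0, 1, 2]
r = rand_index(expert, clustering)
r_adj = adjusted_rand_index(expert, clustering)
```

For this toy case one of the six pairs is grouped by the expert partition but split by the clustering, so the plain index stays high while the adjusted index is noticeably lower, illustrating why the chance correction matters when comparing representations.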


3.1 Construction of Test Set

We construct a set of genes for which the functional associations are well-established. From the MIPS catalogue^1, we select three biologically distinct functional groups consisting of 116 genes in total. For all genes we select their corresponding SGD and YC annotations (see Section 2.1) and proceed with the 105 genes that have entries in both databases^g. The first group holds 63 genes that encode lysosomal proteins. The second group consists of 30 genes involved in translational control and the third contains 23 genes related to amino acid transport.

3.2 Cluster Performance

Following the strategies outlined in Section 2, all gene annotations are represented by various indices and subsequently expanded with the 20 best matching MEDLINE abstracts. More specifically, we perform the expansion by re-indexing the enriched annotations, again following various indexing schemes. Table 1 summarizes the impact of these settings on cluster performance, expressed by means of the Rand index $R_{adj}$. We first discuss the effect of the information source, followed by the results on the indexing schemes.

Table 1: $R_{adj}$ scores for clustering the three groups using various representations. Note that some results are duplicated along the blocks to facilitate discussion.

           Representation  Weight  R_adj   #(t_i)
Source     SP Keywords     bool    0.1767  3
           SGD             tf      0.4050
           YeastCard       tfidf   0.4617
Index      SGD             bool    0.3386  8
           SGD             tf      0.4050
           YeastCard       bool    0.3323
           YeastCard       tf      0.4028
           YeastCard       tfidf   0.4617  26
           YC-ML20         bool    0.3726
           YC-ML20         tf      0.2953
           YC-ML20         tfidf   0.7344  396
           YC-ML20         ref     0.2354  20
Expansion  SGD-ML20        tfidf   0.5920
           YC-ML20         tfidf   0.7344

Effect of Indexing Scheme. In the second block of Table 1 we report the performance of the boolean (bool), frequency (tf) and tfidf index on typical

^g ftp.esat.kuleuven.ac.be/sista/glenisson/reports/webSuppl TR02121/yeastcardTable.htm

MEDLINE abstracts. For very brief keyword-based descriptions (fewer than 8 words) the boolean representation is found to be the best one. If all fields from SGD are used, tf (0.4050) improves on bool (0.3386). For the YC database, typically containing 30 to 50 terms per entry, tfidf (0.4617) slightly outperforms bool (0.3323) and tf (0.4028).

In the expansion step, we collect and re-index the 20 best matching MEDLINE abstracts for each gene. This operation provides a profile for each gene with the number of terms typically ranging between 200 and 400. Among the indexing options for this set of abstracts, tfidf (0.7344) scores considerably higher than bool (0.3726) and tf (0.2953), even after stop word removal.

Based on the same 20 top-scoring abstracts, we also evaluate the performance of the reference representation ref, which characterizes a gene in document space instead of term space. It has an $R_{adj}$ value of 0.2354, indicating that it is a less descriptive representation. This can be explained by the fact that ref is probably more dependent on the retrieval of highly relevant abstracts (see also Shatkay^14).

Effect of Information Source. In the first block of Table 1, we see that for the gene groups considered, the keywords field in SWISS-PROT does not provide sufficient information for an acceptable clustering result (0.1767). For instance, the SWISS-PROT keyword list only provides an average of 2 to 3 meaningful keywords for 86 out of 105 genes. The remaining genes are described with no or irrelevant keywords such as hypothetical protein, which will not allow for correct classification. Using the GO entries and especially the description line of SGD improves the results and raises the Rand score to 0.4050. Only two genes have no meaningful representation, YKL002w and YLR309c, whereas the others are now described by 7 to 8 biologically relevant terms. When resorting to our pooled information source YC (see Section 2.1), we obtain a score of 0.4617, misclassifying 21 out of 105 genes. Although the clustering itself is not dramatically influenced by the expansion with YC, for most of the genes the textual representation is greatly improved (e.g., the weights of specific terms are increased and additional specific terms are incorporated). For instance, Table 2 shows the text profiles of the medoids of the vacuolar cluster for various representations.

In the clustering based on SWISS-PROT keywords, the vacuolar cluster itself is not found. Instead, the algorithm identifies a cluster of ATP-binding proteins that contains the vacuolar ATPases but also a number of ATP-binding proteins involved in translational control. The SGD representation ensures the grouping of vacuolar proteins solely based on one relevant term, vacuolar. Both

Table 2: Text profiles of the medoids of the vacuolar cluster for various representations.

SP keywords     SGD             YC                     YC-ML20
ATP(0.45)       vacuolar(0.38)  vacuolar(0.54)         vacuolar(0.54)
ATPbind(0.45)   vps41(0.38)     ATPas(0.4)             vacuol(0.45)
bind(0.45)                      vacuolarmembran(0.32)  snare(0.36)
                                vma13(0.21)            vacuolarmembran(0.18)
                                subunit(0.2)           Tsnare(0.17)
                                associ(0.17)           syntaxin(0.16)
                                organel(0.16)          vacuolarassembli(0.12)
                                vacuolaracidif(0.16)   Golgi(0.1)
                                acidif(0.15)           carboxypeptidas(0.1)
                                sector(0.14)           vam3(0.09)
                                hydrogen(0.13)         pep12(0.09)
                                membran(0.1)           Vsnare(0.08)

Table 3: Text profiles of gene YPL029w based on the SGD and YeastCard representations.

SGD                     YC
ATP(0.27)               helicas(0.57)
ATPdependhelicas(0.27)  mitochondri(0.36)
depend(0.27)            ATPdependhelicas(0.29)
helicas(0.53)           suv3(0.29)
RNA(0.27)               ATP(0.23)
RNAhelicas(0.27)        depend(0.2)
suv3(0.27)              RNA(0.2)
                        RNAhelicas(0.19)
                        post(0.19)
                        ATPdependRNAhelicas(0.18)
                        elem(0.16)
                        translat(0.13)
                        control(0.13)
                        interact(0.11)
                        transcript(0.09)

Table 4: Text profiles of genes YLL048c and YPL149w based on the YeastCard representation and the corresponding expansion to MEDLINE (only the top-scoring terms are shown).

YLL048c                                           YPL149w
YC                       YC-ML20                  YC               YC-ML20
bile(0.68)               bile(0.92)               autophagi(0.89)  autophagi(0.87)
transport(0.46)          bileacidtransport(0.28)  apg5(0.43)       apg5(0.17)
bileacidtransport(0.25)  bileacid(0.22)                            conjug(0.15)
ybt1(0.25)               hepatocyt(0.06)                           apg1(0.13)
ATP(0.20)                transport(0.06)                           cAMP(0.13)
abc(0.15)                abc(0.05)                                 starvat(0.11)
ATPbind(0.14)            ATP(0.05)                                 kinas(0.11)
integrmembran(0.14)      ATPas(0.03)                               phosphatidylinositol(0.08)
integr(0.13)             apic(0.03)                                vacuol(0.08)
membran(0.11)            vesicl(0.03)                              apoptosis(0.08)
acid(0.1)                cotransport(0.03)                         hepatocyt(0.07)
similar(0.1)             sister(0.03)                              antagonist(0.06)
depend(0.09)             voltag(0.03)                              ubiquitin(0.06)
bind(0.07)               glycoprotein(0.02)                        apg12(0.06)
famili(0.03)             triphosph(0.02)                           aminopeptidas(0.04)

result in a large cluster containing most of the vacuolar proteins. The text profiles of the corresponding medoids confirm the success of the MEDLINE expansion and the feasibility of our approach to identify relevant terms that characterize individual genes or groups of genes. For the other two groups a similar improvement is observed.

Table 3 shows two examples of text profiles of individual genes that were misclassified when the SGD representation was used, whereas the YC representation assigned the genes to the correct cluster. For the RNA helicase YPL029w, terms like mitochondri and translat are added to the text profile. The clustering of YBR024w in the group of translation-related proteins is based on terms such as mitochondri, inner, and membrane.

Expansion to MEDLINE improves the text profiles of almost all of the genes and even the clustering of a few genes such as YLL048c, a lysosomal bile transporter, and the genes that encode autophagy-related proteins. In the clustering based on the YC representation, YLL048c was wrongly assigned to the group of amino acid transporters. However, the expansion strongly decreased the weight of the term transport and introduced the term ATPase in the text profile, resulting in a correct classification of the gene. For the autophagy-related genes, retrieval of the term vacuol ensures correct grouping after MEDLINE expansion, as shown in Table 4. However, some of the genes are incorrectly clustered no matter what representation or weighting scheme was used. For instance, Group 1 and Group 3 include several proteins that regulate transcription, a process that is closely related to translation and shares many of its keywords. The proteins YLR025w (Group 1), YLR375w (Group 3) and YDL048c (Group 3) are therefore persistently misclassified into Group 2. One gene, YLR309c (Group 1), is consistently assigned to the wrong cluster because it lacks proper annotation. The only terms that characterize YLR309c are vague, nonspecific words such as gene product and the name of the gene, imh1. This information is insufficient for successful expansion with MEDLINE. A manual search via the PUBMED engine did not reveal much information on imh1 (YLR309c) either.

3.3 Cluster Quality

Because of the absence of a gold standard or prior knowledge in regular clustering problems, internal measures of quality are used to evaluate a cluster result (Jain^17). They are based on various statistical properties of the grouped data and provide clues to choose between different parameterizations of a single algorithm (such as the number of clusters) or even between various clus-
