We demonstrate the effect of various weighting schemes and information sources in a functional clustering setup


in Text-Based Gene Clustering

P. Glenisson, P. Antal, J. Mathys, Y. Moreau, B. De Moor

Department of Electrical Engineering, ESAT-SISTA,

Kasteelpark Arenberg 10,

B-3001 Leuven, Belgium

Thanks to its increasing availability, electronic literature can now be a major source of information when developing complex statistical models where data is scarce or contains much noise. This raises the question of how to deeply integrate information from domain literature with experimental data. Evaluating what kind of statistical text representations can integrate literature knowledge in clustering still remains an insufficiently explored topic. In this work we discuss how the bag-of-words representation can be used successfully to represent genetic annotation and free-text information coming from different databases. We demonstrate the effect of various weighting schemes and information sources in a functional clustering setup. As a quantitative evaluation, we contrast for different parameter settings the functional groupings obtained from text with those obtained from expert assessments and link each of the results to a biological discussion.

1 Introduction

More and more, a successful understanding of complex genetic mechanisms (such as regulation, functional understanding, ...) critically depends on the interaction between statistical analysis and various knowledge sources, such as annotation databases, specialized literature, and curated cross-links between them (Baxevanis [1]). Despite these efforts, the current interaction between the experimental (data) analysis and text-based information requires extensive user intervention. Gene expression experiments, which measure large-scale genetic activity under a variety of biological conditions, are excellent examples of environments that rely strongly on this interaction. Indeed, as (1) the cost of data collection is high, (2) measurements are often noisy or unreliable, and (3) established relationships in the transcriptome are fragmentary at best, a deeper integration between data and text-based information will benefit the knowledge discovery process.

The present strategies for knowledge-based expression data analysis rely on the premise that statistical data analysis and biological knowledge can complement each other by linking two independently constructed sources that contain conceptually related records (Masys [2] and Vidal [3]).

In yeast for example, interpreting cluster patterns involves the consultation of curated functional databases such as the Saccharomyces Genome Database (SGD) [a], which offers concise functional annotations and a variety of cross-references to other repositories. For more elaborate information, researchers can resort to MEDLINE, an online bibliographic source of citations and abstracts in biomedical research dating from 1966 till present. While the use of a controlled and curated index, like MeSH [b], is already common in automatically associating gene functions (see for example Jenssen [4], Masys [5], Kankar [6]), we tested additionally the use of free text as a potentially more informative, and in the future possibly more dominant, information source (see also Stapley [7], Stephens [8], Renner [9], Iliopoulos [10], Raychaudhuri [11]).

In this work, we explore how representations borrowed from the field of information retrieval can be adopted for clustering genes based on their associated literature. We encode text-based information from various sources in a typical bag-of-words representation following the vector space model, a workhorse in information retrieval research. We investigate the effect of pooling and expanding these sources, together with the question of which type of representation is more appropriate. To evaluate the biological usefulness of literature clustering, we formulate a clustering problem with gene sets from Saccharomyces cerevisiae for which the functional associations are well-established and biologically distinct. The reason not to start immediately from expression-based gene clusters is that these data-based clusters are often biologically complex and cannot provide a gold standard to interpret and quantify the correspondence between various data mining methods. Additionally, we seek to identify some inherent biases of the vector space model by testing and quantifying its performance on a fairly simple biological problem. To compare different versions of the representation with respect to clustering performance, we use both external and internal scores for cluster validation (see Section 2). The aim of these evaluations is to establish a powerful statistical text representation as a foundation for knowledge-based gene expression clustering.

2 Methods

2.1 Compilation of Information Sources

We collect and compile (as of September 2001) several sources for textual annotations of the genes. Firstly, we retrieve the gene descriptions from the Saccharomyces Genome Database (SGD) [c]. Secondly, we use SWISS-PROT (SP) [d], a curated protein sequence database. We pool the SGD and SP information

[a] http://genome-www.stanford.edu/Saccharomyces/

[b] http://www.nlm.nih.gov/mesh/meshhome.html

[c] http://genome-www.stanford.edu/Saccharomyces/

[d] http://www.expasy.org/sprot/

into YeastCard (YC), a textual resource for yeast genes. Finally, as a source for more detailed information, we use a collection of 493,923 yeast-related MEDLINE abstracts dated between January 1982 and November 2000. They were selected by retaining those abstracts coming from a list of 59 journals that was composed according to both impact factor [e] and relevance. The aim of this trimming is to retain a more domain-specific subset of abstracts, which is still diverse enough to hold essential genetic information. We evaluate how these sources influence text-based gene clustering and, more specifically, we investigate how the expansion of the SGD and YeastCard annotations with MEDLINE abstract information (see Section 2.3) affects clustering performance.

2.2 Text Representation

The representation called the vector space model encodes a document in a $k$-dimensional space where each component $v_{ij}$ represents the weight of term $t_j$ in document $d_i$. The grammatical structure of the text is neglected and therefore it is also referred to as a bag-of-words representation. As a basic index for each document in the collection, we construct a vocabulary consisting of 26,420 (possibly multi-word) terms extracted from the Gene Ontology [f] Term field. The Porter stemmer is used to canonize the words. Based on the Term field in GO and the Synonym field in SWISS-PROT, we process candidate phrases and replace known synonyms. In this work we used the following commonly used indexing schemes (Baeza-Yates [12] and Korfhage [13]):

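$v^{\mathrm{bool}}_{ij} = 1$ if $t_j \in d_i$, $0$ otherwise

$v^{\mathrm{freq}}_{ij} = \frac{f_{ij}}{\max_{j}(f_{ij})}$, where $f_{ij}$ is the number of occurrences of $t_j$ in $d_i$

$v^{\mathrm{tfidf}}_{ij} = f_{ij} \log\left(\frac{N}{n_j}\right)$, where $N$ is the total number of documents and $n_j$ is the number of documents containing term $t_j$ in the collection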

Additionally, we define another type of index called the reference representation (see Shatkay [14]). When a document contains references to other documents in the same or another repository, we can encode this as follows:

$v^{\mathrm{ref}}_{ij} = 1$ if annotation $i$ contains a reference to document $j$, $0$ otherwise
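As an illustration of the bool, freq and tf-idf weights defined in this section, the following minimal Python sketch indexes a toy collection. The function name `index_documents`, the token-list input format and the toy documents are assumptions made for this example only; they are not the implementation used in this work, and stemming and synonym replacement are assumed to have been applied beforehand.

```python
import math
from collections import Counter


def index_documents(docs, vocabulary):
    """Compute bool, freq and tf-idf weights for tokenised documents.

    docs: list of token lists, one per document d_i (hypothetical input format).
    vocabulary: list of terms t_j; its order fixes the vector components.
    """
    n_docs = len(docs)
    counts = [Counter(tokens) for tokens in docs]
    # n_j: number of documents in the collection that contain term t_j
    doc_freq = [sum(1 for c in counts if c[t] > 0) for t in vocabulary]

    v_bool, v_freq, v_tfidf = [], [], []
    for c in counts:
        f = [c[t] for t in vocabulary]      # f_ij: occurrences of t_j in d_i
        f_max = max(f) if any(f) else 1     # guard against empty profiles
        v_bool.append([1 if x > 0 else 0 for x in f])
        v_freq.append([x / f_max for x in f])
        v_tfidf.append([x * math.log(n_docs / n_j) if n_j else 0.0
                        for x, n_j in zip(f, doc_freq)])
    return v_bool, v_freq, v_tfidf


# Toy collection of already stemmed tokens (illustrative only)
docs = [["vacuolar", "membran", "vacuolar"], ["helicas", "rna"], ["vacuolar", "atp"]]
vocabulary = ["vacuolar", "membran", "helicas", "rna", "atp"]
v_bool, v_freq, v_tfidf = index_documents(docs, vocabulary)
```

In this toy collection the term vacuolar occurs in two of the three documents, so its tf-idf weight is damped by $\log(3/2)$, whereas a term that occurs in a single document is scaled by $\log 3$.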

[e] http://jcrweb.com/

[f] http://www.geneontology.org

We express similarity between pairs of documents $d_i$ and $d_j$, or between a text document $d_i$ and a query document $d_j$, by the cosine of the angle between the corresponding normalized vector representations:

\[ \mathrm{sim}(d_i, d_j) = \cos(d_i, d_j). \]
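The cosine measure can be sketched in a few lines of Python. The function name `cosine_similarity` and the convention of returning 0 for an all-zero vector are assumptions of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an empty profile matches nothing
    return dot / (norm_u * norm_v)


# Two profiles that share one of two terms
sim = cosine_similarity([1, 0], [1, 1])
```

Because the vectors are normalized by their lengths, a long and a short document with the same relative term weights obtain a similarity of 1.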

The underlying hypothesis states that high similarity equals strong relevance. Further, the method termed pseudo-relevance feedback is geared towards expanding a query document with the $n$ most similar documents in a collection and aims at refining the search or clustering process by a recalculation of the term weights (Baeza-Yates [12]). We denote the annotations $A$ expanded with $n$ documents from collection $C$ by $A$-$C_n$. A related application of pseudo-relevance feedback in combination with the reference representation can be found in Shatkay [14].
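The expansion step can be sketched as follows, under the assumption that documents are ranked by the cosine similarity of their raw term counts and that the expanded annotation is the concatenation of the original tokens with those of the $n$ best matches, ready to be re-indexed. The names `expand_annotation` and `_cosine` and the toy data are hypothetical.

```python
import math
from collections import Counter


def _cosine(c1, c2):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(c1[t] * c2[t] for t in c1 if t in c2)
    n1 = math.sqrt(sum(x * x for x in c1.values()))
    n2 = math.sqrt(sum(x * x for x in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def expand_annotation(annotation, collection, n):
    """Append the tokens of the n most similar documents to an annotation.

    Returns the expanded token list and the indices of the selected documents.
    """
    query = Counter(annotation)
    ranked = sorted(range(len(collection)),
                    key=lambda i: _cosine(query, Counter(collection[i])),
                    reverse=True)
    top = ranked[:n]
    expanded = list(annotation)
    for i in top:
        expanded.extend(collection[i])
    return expanded, top


# Toy annotation A and collection C (illustrative only), giving A-C_2
annotation = ["vacuolar", "atpas"]
collection = [["helicas", "rna"],
              ["vacuolar", "membran", "atpas"],
              ["vacuolar", "snare"]]
expanded, top = expand_annotation(annotation, collection, n=2)
```

Re-indexing the expanded token list with one of the weighting schemes of Section 2.2 then yields the recalculated term weights.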

2.4 Cluster Algorithm

As divisive clustering algorithm we used the K-medoids algorithm (Rousseeuw [15]), which minimizes the objective function

\[ \sum_{k=1}^{K} \sum_{j \in C_k} d(x_j, m_k) \]

over multiple partitionings $C = \{C_1, \ldots, C_K\}$ with $\{m_1, \ldots, m_K\}$ the corresponding representative points (called medoids) of each cluster. The parameter $K$ denotes the number of clusters and is fixed in advance. One advantage of this algorithm over centroid-based methods, such as K-means, is that each medoid constitutes a robust representative data point for each cluster.
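To make the objective concrete, the sketch below alternates between assigning points to their nearest medoid and re-selecting the medoid of each cluster. This simple alternating scheme is an assumption chosen for brevity; it is not necessarily the swap-based procedure described by Rousseeuw [15], and the function name `k_medoids` and the toy data are hypothetical.

```python
def k_medoids(dist, medoids, max_iter=100):
    """Alternating K-medoids on a precomputed n x n dissimilarity matrix.

    dist: list of lists with dist[i][j] = d(x_i, x_j).
    medoids: initial medoid indices; their number fixes K.
    Returns (medoid indices, cluster label per point, objective value).
    """
    n = len(dist)
    medoids = list(medoids)

    def assign(current):
        # Each point joins the cluster of its nearest medoid
        return [min(range(len(current)), key=lambda k: dist[j][current[k]])
                for j in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for k, old in enumerate(medoids):
            members = [j for j in range(n) if labels[j] == k]
            if not members:              # keep the old medoid for an empty cluster
                new_medoids.append(old)
                continue
            # New medoid: the member with the smallest total dissimilarity
            new_medoids.append(min(members,
                                   key=lambda c: sum(dist[j][c] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids

    labels = assign(medoids)
    cost = sum(dist[j][medoids[labels[j]]] for j in range(n))
    return medoids, labels, cost


# Six points on a line forming two obvious groups (illustrative only)
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels, cost = k_medoids(dist, medoids=[0, 1])
```

Since only pairwise dissimilarities are needed, the same routine can be applied directly to the $1 - \cos$ dissimilarity between text profiles.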

2.5 Cluster Quality

To measure the performance and quality of the clustering we define three scores: the silhouette coefficient, the performance of the clustering as a classifier, and the Rand index. The first two are termed internal scores since they rely on statistical properties of the clustered data; the last one is called external because it involves a comparison with a known, external labeling.

Silhouette Coefficient

As a first internal score for cluster quality we use the silhouette coefficient per cluster $S_k = \frac{1}{n_k} \sum_{i=1}^{n_k} s_{ik}$ and the overall silhouette coefficient $S = \frac{1}{n} \sum_{k} \sum_{i=1}^{n_k} s_{ik}$, with

\[ s_{ik} = \frac{b(i) - a(i)}{\max(a(i), b(i))}, \]

where $a(i)$ is the average dissimilarity of member $i$ to all other members of its cluster and $b(i)$ the dissimilarity of member $i$ to the nearest member of the nearest cluster. It is a metric-independent measure designed to describe the ratio between cluster coherence and separation and to assist in choosing which clustering is preferable according to the data (Rousseeuw [15]).
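A minimal sketch of the silhouette computation is given below. Note the assumption: $b(i)$ is taken here as the average dissimilarity of member $i$ to the members of the nearest other cluster, which is the usual textbook form of the silhouette and may differ from the wording above. The function name `silhouette`, the value 0 for singleton clusters, and the toy data are likewise assumptions of this example.

```python
def silhouette(dist, labels):
    """Per-cluster and overall silhouette coefficients.

    dist: n x n dissimilarity matrix; labels: cluster label per point.
    Requires at least two distinct clusters.
    """
    n = len(labels)
    clusters = sorted(set(labels))
    members = {c: [j for j in range(n) if labels[j] == c] for c in clusters}
    scores = []
    for i in range(n):
        own = [j for j in members[labels[i]] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        # Assumption: b(i) is the smallest average dissimilarity to another cluster
        b = min(sum(dist[i][j] for j in members[c]) / len(members[c])
                for c in clusters if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    per_cluster = {c: sum(scores[j] for j in members[c]) / len(members[c])
                   for c in clusters}
    overall = sum(scores) / n
    return per_cluster, overall


# Two well separated groups on a line (illustrative only)
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
per_cluster, overall = silhouette(dist, [0, 0, 0, 1, 1, 1])
```

Values close to 1 indicate compact, well separated clusters; values near 0 or below indicate points that lie between clusters or are assigned to the wrong one.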

k-NN Learnability

For the second measure of internal cluster quality, we look upon the problem as being semi-supervised. Using the clustering result as a labeling for all the points, we assess the performance of a given classifier on a class (or cluster) in a cross-validated leave-one-out setup. Following Pavlidis [16], we use a k-NN classifier jointly with the $(1 - \cos(\cdot, \cdot))$ distance measure to compute a misclassification score for each class. The statistical significance of this score $m$ is expressed by a p-value derived from a binomial $B(m, p_{\mathrm{misclass}})$ with $p_{\mathrm{misclass}}$ the prior chance of misclassification, which can be computed analytically in case of a k-NN classifier (details can be found in Pavlidis [16]).
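The leave-one-out misclassification count can be sketched as below. The sketch covers only the score $m$ per class, not the p-value computation; the function names, the majority-vote tie handling and the toy vectors are assumptions of this example rather than the procedure of Pavlidis [16].

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """The (1 - cos) dissimilarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def loo_knn_misclassifications(vectors, labels, k=3):
    """Leave-one-out k-NN errors per class, using cluster labels as classes."""
    n = len(vectors)
    errors = Counter()
    for i in range(n):
        # Hold out point i and rank all other points by distance to it
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine_distance(vectors[i], vectors[j]))
        votes = Counter(labels[j] for j in others[:k])
        # Ties are resolved in favour of the label encountered first
        predicted = votes.most_common(1)[0][0]
        if predicted != labels[i]:
            errors[labels[i]] += 1
    return {c: errors[c] for c in sorted(set(labels))}


# Two directions in term space plus one point labelled against its neighbours
vectors = [[1, 0], [0.9, 0.1], [1, 0.1], [0.05, 1],
           [0, 1], [0.1, 0.9], [0.1, 1]]
labels = ["A", "A", "A", "A", "B", "B", "B"]
m = loo_knn_misclassifications(vectors, labels, k=3)
```

In this toy example the fourth point carries label A but lies among the B profiles, so it is the only point that the held-out classifier gets wrong.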

Rand Index

As an external measure for cluster validity we use the adjusted Rand index [17]. Given a set of $n$ points, an external partition $P = \{P_1, \ldots, P_k\}$, and a clustering $C = \{C_1, \ldots, C_l\}$, define $a$ as the number of pairs of points that co-occur in a group in the partitioning $P$ as well as in the clustering $C$, $d$ the number of pairs of points that are in different groups in $P$ as well as in $C$, and $b$ and $c$ as the number of pairs of points that co-occur in a group in $P$, but not in $C$ or vice-versa. The Rand index is then defined by

\[ R = \frac{a + d}{a + b + c + d}. \]

The correction for random partitioning is $R_{\mathrm{adj}} = \frac{R - E(R)}{\max(R) - E(R)}$, where a hypergeometric baseline distribution is used to compute the expected values. In a comparative study [17], the adjusted Rand index is recommended as the external measure of choice.
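The pair counts $a$, $b$, $c$, $d$ and both indices can be computed from the contingency table of the two labelings, as in the sketch below. The adjusted index is written in its common pair-count form under the hypergeometric baseline, which is algebraically equivalent to the correction above; the function name `rand_indices`, the return value of 1.0 in the degenerate case and the toy labelings are assumptions of this example. `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from math import comb


def rand_indices(partition, clustering):
    """Rand index and adjusted Rand index of two labelings of the same points."""
    n = len(partition)
    pairs = comb(n, 2)
    # Pairs that co-occur in both labelings (a), in P (a + b), and in C (a + c)
    a = sum(comb(v, 2) for v in Counter(zip(partition, clustering)).values())
    same_p = sum(comb(v, 2) for v in Counter(partition).values())
    same_c = sum(comb(v, 2) for v in Counter(clustering).values())
    b = same_p - a
    c = same_c - a
    d = pairs - a - b - c
    rand = (a + d) / pairs
    # Expected and maximal value of a under the hypergeometric baseline
    expected = same_p * same_c / pairs
    maximum = (same_p + same_c) / 2
    if maximum == expected:
        return rand, 1.0  # assumed convention when both labelings are trivial
    return rand, (a - expected) / (maximum - expected)


# One point of the first external group is placed in the second cluster
rand, adjusted = rand_indices([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

For this example $a = 4$, $b = 2$, $c = 3$ and $d = 6$, so $R = 10/15$, while the chance-corrected value is noticeably lower at $1.2/3.7$.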


3.1 Construction of Test Set

We construct a set of genes for which the functional associations are well-established. From the MIPS catalogue [1], we select three biologically distinct functional groups consisting of 116 genes in total. For all genes we select their corresponding SGD and YC annotations (see Section 2.1) and proceed with the 105 genes that have entries in both databases [g]. The first group holds 63 genes that encode lysosomal proteins. The second group consists of 30 genes involved in translational control and the third contains 23 genes related to amino acid transport.

3.2 Cluster Performance

Following the strategies outlined in Section 2, all gene annotations are represented by various indices and subsequently expanded with the 20 best matching MEDLINE abstracts. More specifically, we perform the expansion by re-indexing the enriched annotations, again following various indexing schemes. Table 1 summarizes the impact of these settings on cluster performance, expressed by means of the Rand index $R_{\mathrm{adj}}$. Firstly we discuss the effect of information source; afterwards follow the results on the indexing schemes.

Table 1: $R_{\mathrm{adj}}$ scores for clustering the three groups using various representations. Note that some results are duplicated along the blocks to facilitate discussion.

            Representation   Weight   R_adj    #(t_i)
Source      SP Keywords      bool     0.1767   3
            SGD              tf       0.4050
            YeastCard        tfidf    0.4617
Index       SGD              bool     0.3386   8
            SGD              tf       0.4050
            YeastCard        bool     0.3323
            YeastCard        tf       0.4028
            YeastCard        tfidf    0.4617   26
            YC-ML20          bool     0.3726
            YC-ML20          tf       0.2953
            YC-ML20          tfidf    0.7344   396
            YC-ML20          ref      0.2354   20
Expansion   SGD-ML20         tfidf    0.5920
            YC-ML20          tfidf    0.7344

Effect of Indexing Scheme. In the second block of Table 1 we report the performance of the boolean (bool), frequency (tf) and tfidf index on typical

[g] ftp.esat.kuleuven.ac.be/sista/glenisson/reports/webSuppl TR02121/yeastcardTable.htm

MEDLINE abstracts. For very brief keyword-based descriptions (fewer than 8 words) the boolean representation is found to be the best one. If all fields from SGD are used, tf (0.4050) improves on bool (0.3386). For the YC database, typically containing 30 to 50 terms per entry, tfidf (0.4617) outperforms bool (0.3323) and tf (0.4028) slightly.

In the expansion step, we collect and re-index the 20 best matching MEDLINE abstracts for each gene. This operation provides a profile for each gene with the number of terms ranging typically between 200 and 400. Among the indexing options for this set of abstracts, tfidf (0.7344) scores considerably higher than bool (0.3726) and tf (0.2953), even after stopword removal.

Basing ourselves on the same 20 top-scoring abstracts we also evaluate the performance of the reference representation ref, which characterizes a gene in document space instead of term space. It has an $R_{\mathrm{adj}}$ value of 0.2354, indicating that it is a less descriptive representation. This can be explained by the fact that ref is probably more dependent on the retrieval of highly relevant abstracts (see also Shatkay [14]).

Effect of Information Source. In the first block of Table 1, we see that for the gene groups considered, the keywords field in SWISS-PROT does not provide sufficient information for an acceptable clustering result (0.1767). For instance, the SWISS-PROT keyword list only provides an average of 2 to 3 meaningful keywords for 86 out of 105 genes. The remaining genes are described with no or irrelevant keywords such as hypothetical protein, which will not allow for correct classification. Using the GO entries and especially the description line of SGD improves the results and raises the Rand score to 0.4050. Only two genes have no meaningful representation, YKL002w and YLR309c, whereas the others are now described by 7 to 8 biologically relevant terms. When resorting to our pooled information source YC (see Section 2.1), we obtain a score of 0.4617, misclassifying 21 out of 105 genes. Although the clustering itself is not dramatically influenced by the expansion with YC, for most of the genes, the textual representation is greatly improved (e.g., the weights of specific terms are increased and additional specific terms are incorporated). For instance, Table 2 shows the text profiles of the medoids of the vacuolar cluster for various representations.

In the clustering based on SWISS-PROT keywords, the vacuolar cluster itself is not found. Instead, the algorithm identifies a cluster of ATP-binding proteins that contains the vacuolar ATPases but also a number of ATP-binding proteins involved in translational control. The SGD representation ensures the grouping of vacuolar proteins solely based on one relevant term, vacuolar. Both

Table 2: Text profiles of the medoids of the vacuolar cluster for various representations.

SP keywords     SGD              YC                      YC-ML20
ATP(0.45)       vacuolar(0.38)   vacuolar(0.54)          vacuolar(0.54)
ATPbind(0.45)   vps41(0.38)      ATPas(0.4)              vacuol(0.45)
bind(0.45)                       vacuolarmembran(0.32)   snare(0.36)
                                 vma13(0.21)             vacuolarmembran(0.18)
                                 subunit(0.2)            Tsnare(0.17)
                                 associ(0.17)            syntaxin(0.16)
                                 organel(0.16)           vacuolarassembli(0.12)
                                 vacuolaracidif(0.16)    Golgi(0.1)
                                 acidif(0.15)            carboxypeptidas(0.1)
                                 sector(0.14)            vam3(0.09)
                                 hydrogen(0.13)          pep12(0.09)
                                 membran(0.1)            Vsnare(0.08)

Table 3: Text profiles of gene YPL029w based on the SGD and YeastCard representations.

SGD                      YC
ATP(0.27)                helicas(0.57)
ATPdependhelicas(0.27)   mitochondri(0.36)
depend(0.27)             ATPdependhelicas(0.29)
helicas(0.53)            suv3(0.29)
RNA(0.27)                ATP(0.23)
RNAhelicas(0.27)         depend(0.2)
suv3(0.27)               RNA(0.2)
                         RNAhelicas(0.19)
                         post(0.19)
                         ATPdependRNAhelicas(0.18)
                         elem(0.16)
                         translat(0.13)
                         control(0.13)
                         interact(0.11)
                         transcript(0.09)

Table 4: Text profiles of genes YLL048c and YPL149w based on the YeastCard representation and the corresponding expansion to MEDLINE (only the top-scoring terms are shown).

YLL048c                                              YPL149w
YC                        YC-ML20                    YC                YC-ML20
bile(0.68)                bile(0.92)                 autophagi(0.89)   autophagi(0.87)
transport(0.46)           bileacidtransport(0.28)    apg5(0.43)        apg5(0.17)
bileacidtransport(0.25)   bileacid(0.22)                               conjug(0.15)
ybt1(0.25)                hepatocyt(0.06)                              apg1(0.13)
ATP(0.20)                 transport(0.06)                              cAMP(0.13)
abc(0.15)                 abc(0.05)                                    starvat(0.11)
ATPbind(0.14)             ATP(0.05)                                    kinas(0.11)
integrmembran(0.14)       ATPas(0.03)                                  phosphatidylinositol(0.08)
integr(0.13)              apic(0.03)                                   vacuol(0.08)
membran(0.11)             vesicl(0.03)                                 apoptosis(0.08)
acid(0.1)                 cotransport(0.03)                            hepatocyt(0.07)
similar(0.1)              sister(0.03)                                 antagonist(0.06)
depend(0.09)              voltag(0.03)                                 ubiquitin(0.06)
bind(0.07)                glycoprotein(0.02)                           apg12(0.06)
famili(0.03)              triphosph(0.02)                              aminopeptidas(0.04)

result in a large cluster containing most of the vacuolar proteins. The text profiles of the corresponding medoids confirm the success of the MEDLINE expansion and the feasibility of our approach to identify relevant terms that characterize individual genes or groups of genes. For the other two groups a similar improvement is observed.

Table 3 shows two examples of text profiles of individual genes that were misclassified when the SGD representation was used whereas the YC representation assigned the genes to the correct cluster. For the RNA helicase, YPL029w, terms like mitochondri and translat are added to the text profile. The clustering of YBR024w in the group of translation-related proteins is based on terms such as mitochondri, inner, and membrane.

Expansion to MEDLINE improves the text profiles of almost all of the genes and even the clustering of a few genes such as YLL048c, a lysosomal bile transporter, and the genes that encode autophagy-related proteins. In the clustering based on the YC representation, YLL048c was wrongly assigned to the group of amino acid transporters. However, the expansion strongly decreased the weight of the term transport and introduced the term ATPase in the text profile, resulting in a correct classification of the gene. For the autophagy-related genes, retrieval of the term vacuol ensures correct grouping after MEDLINE expansion as shown in Table 4. However, some of the genes are incorrectly clustered no matter what representation or weighting scheme was used. For instance, Group 1 and Group 3 include several proteins that regulate transcription, a process that is closely related to translation and shares many of its keywords. The proteins YLR025w (Group 1), YLR375w (Group 3) and YDL048c (Group 3) are therefore persistently misclassified into Group 2. One gene, YLR309c (Group 1), is consistently assigned to the wrong cluster because it lacks proper annotation. The only terms that characterize YLR309c are vague, aspecific words such as gene product and the name of the gene imh1. This information is insufficient for successful expansion with MEDLINE. A manual search via the PUBMED engine did not reveal much information on imh1 (YLR309c) either.

3.3 Cluster Quality

Because of the absence of a gold standard or prior knowledge in regular clustering problems, internal measures of quality are used to evaluate a cluster result (Jain [17]). They are based on various statistical properties of the grouped data and provide clues to choose between different parameterizations of a single algorithm (such as the number of clusters) or even between various clus-
