in Text-Based Gene Clustering
P. Glenisson, P. Antal, J. Mathys, Y. Moreau, B. De Moor
Department of Electrical Engineering, ESAT-SISTA,
Kasteelpark Arenberg 10,
B-3001 Leuven, Belgium
Thanks to its increasing availability, electronic literature can now be a
major source of information when developing complex statistical models where
data is scarce or contains much noise. This raises the question of how to
deeply integrate information from domain literature with experimental data.
Evaluating what kind of statistical text representations can integrate
literature knowledge in clustering still remains an insufficiently explored
topic. In this work we discuss how the bag-of-words representation can be
used successfully to represent genetic annotation and free-text information
coming from different databases. We demonstrate the effect of various
weighting schemes and information sources in a functional clustering setup.
As a quantitative evaluation, we contrast for different parameter settings
the functional groupings obtained from text with those obtained from expert
assessments and link each of the results to a biological discussion.
1 Introduction
More and more, a successful understanding of complex genetic mechanisms
(such as regulation, functional understanding, ...) critically depends on the
interaction between statistical analysis and various knowledge sources, such
as annotation databases, specialized literature, and curated cross-links
between them (Baxevanis [1]). Despite these efforts, the current interaction
between the experimental (data) analysis and text-based information requires
extensive user intervention. Gene expression experiments, which measure
large-scale genetic activity under a variety of biological conditions, are
excellent examples of environments that rely strongly on this interaction.
Indeed, as (1) the cost of data collection is high, (2) measurements are
often noisy or unreliable, and (3) established relationships in the
transcriptome are fragmentary at best, a deeper integration between data and
text-based information will benefit the knowledge discovery process.
The present strategies for knowledge-based expression data analysis rely on
the premise that statistical data analysis and biological knowledge can
complement each other by linking two independently constructed sources that
contain conceptually related records (Masys [2] and Vidal [3]).
In yeast for example, interpreting cluster patterns involves the consultation
of curated functional databases such as the Saccharomyces Genome Database
(SGD), which offers concise functional annotations and a variety of
cross-references to other repositories. For more elaborate information,
researchers can resort to MEDLINE, an online bibliographic source of
citations and abstracts in biomedical research dating from 1966 to the
present. While the use of a controlled and curated index, like MeSH [b], is
already common in automatically associating gene functions (see for example
Jenssen [4], Masys [5], Kankar [6]), we additionally tested the use of free
text as a potentially more informative, and in the future possibly more
dominant, information source (see also Stapley [7], Stephens [8], Renner [9],
Iliopoulos [10], Raychaudhuri [11]).
In this work, we explore how representations borrowed from the field of
information retrieval can be adopted for clustering genes based on their
associated literature. We encode text-based information from various sources
in a typical bag-of-words representation following the vector space model, a
workhorse in information retrieval research. We investigate the effect of
pooling and expanding these sources, together with the question of which type
of representation is more appropriate. To evaluate the biological usefulness
of literature clustering, we formulate a clustering problem with gene sets
from Saccharomyces cerevisiae for which the functional associations are
well-established and biologically distinct. The reason not to start
immediately from expression-based gene clusters is that these data-based
clusters are often biologically complex and cannot provide a gold standard to
interpret and quantify the correspondence between various data mining
methods. Additionally, we seek to identify some inherent biases of the
vector space model by testing and quantifying its performance on a fairly
simple biological problem. To compare different versions of the
representation with respect to clustering performance, we use both external
and internal scores for cluster validation (see Section 2). The aim of these
evaluations is to establish a powerful statistical text representation as a
foundation for knowledge-based gene expression clustering.
2 Methods
2.1 Compilation of Information Sources
We collect and compile (as of September 2001) several sources for textual
annotations of the genes. Firstly we retrieve the gene descriptions from the
Saccharomyces Genome Database (SGD) [c]. Secondly, we use SWISS-PROT (SP)
[d], a curated protein sequence database. We pool the SGD and SP information
[...] textual resource for yeast genes. Finally, as a source for more
detailed information, we use a collection of 493,923 yeast-related MEDLINE
abstracts dated between January 1982 and November 2000. They were selected
by retaining those abstracts coming from a list of 59 journals that was
composed according to both impact factor [e] and relevance. The aim of this
trimming is to retain a more domain-specific subset of abstracts, which is
still diverse enough to hold essential genetic information. We evaluate how
these sources influence text-based gene clustering and, more specifically, we
investigate how the expansion of the SGD and YeastCard annotations with
MEDLINE abstract information (see Section 2.3) affects clustering
performance.

[a] http://genome-www.stanford.edu/Saccharomyces/
[b] http://www.nlm.nih.gov/mesh/meshhome.html
[c] http://genome-www.stanford.edu/Saccharomyces/
[d] http://www.expasy.org/sprot/
2.2 Text Representation
The representation called the vector space model encodes a document in a
k-dimensional space where each component $v_{ij}$ represents the weight of
term $t_j$ in document $d_i$. The grammatical structure of the text is
neglected and therefore it is also referred to as a bag-of-words
representation. As a basic index for each document in the collection, we
construct a vocabulary consisting of 26,420 (possibly multi-word) terms
extracted from the Gene Ontology [f] Term field. The Porter stemmer is used
to canonize the words. Based on the Term field in GO and the Synonym field in
SWISS-PROT, we process candidate phrases and replace known synonyms. In this
work we used the following commonly used indexing schemes (Baeza-Yates [12]
and Korfhage [13]):
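$$v_{ij}^{\mathrm{bool}} = \begin{cases} 1 & \text{if } t_j \in d_i \\ 0 & \text{otherwise} \end{cases}$$

$$v_{ij}^{\mathrm{freq}} = \frac{f_{ij}}{\max_{\forall j}(f_{ij})},$$
where $f_{ij}$ is the number of occurrences of $t_j$ in $d_i$

$$v_{ij}^{\mathrm{tf.idf}} = f_{ij} \log\left(\frac{N}{n_j}\right),$$
where $N$ is the total number of documents and $n_j$ is the number of
documents containing term $t_j$ in the collection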
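
The three weighting schemes can be illustrated with a minimal sketch. It
assumes that documents are already tokenized and stemmed and that the
vocabulary is given as a list of terms; the function name `index_documents`
and the handling of terms that occur in no document (weight 0) are
assumptions made for illustration, not part of the original setup.

```python
import math
from collections import Counter


def index_documents(docs, vocabulary):
    """Compute bool, freq and tf.idf vectors for a list of tokenized documents."""
    vocab = list(vocabulary)
    counts = [Counter(doc) for doc in docs]      # raw term counts f_ij
    n_docs = len(docs)                           # N, total number of documents
    # Document frequency: number of documents that contain each term.
    doc_freq = {t: sum(1 for c in counts if c[t] > 0) for t in vocab}

    weights = {"bool": [], "freq": [], "tfidf": []}
    for c in counts:
        max_f = max((c[t] for t in vocab), default=0)
        weights["bool"].append([1 if c[t] > 0 else 0 for t in vocab])
        # Assumption: a document without any vocabulary term gets all zeros.
        weights["freq"].append([c[t] / max_f if max_f else 0.0 for t in vocab])
        # Assumption: a term absent from the whole collection gets weight 0.
        weights["tfidf"].append([
            c[t] * math.log(n_docs / doc_freq[t]) if doc_freq[t] else 0.0
            for t in vocab
        ])
    return weights
```

Note that a term occurring in every document receives a tf.idf weight of
zero, since log(N/N) = 0.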
Additionally, we define another type of index called the reference
representation (see Shatkay [14]). When a document contains references to
other documents in the same or another repository, we can encode this as
follows:

$$v_{ij}^{\mathrm{ref}} = \begin{cases} 1 & \text{if annotation } i \text{ contains a reference to document } j \\ 0 & \text{otherwise} \end{cases}$$
[e] http://jcrweb.com/
[f] http://www.geneontology.org
We express similarity between pairs of documents $d_i$ and $d_j$, or between
a text document $d_i$ and a query document $d_j$, by the cosine of the angle
between the corresponding normalized vector representations:

$$\mathrm{sim}(d_i, d_j) = \cos(d_i, d_j).$$

The underlying hypothesis states that high similarity equals strong
relevance. Further, the method termed pseudo-relevance feedback is geared
towards expanding a query document with the n most similar documents in a
collection and aims at refining the search or clustering process by a
recalculation of the term weights (Baeza-Yates [12]). We denote the
annotations A expanded with n documents from collection C by $A$-$C_n$. A
related application of pseudo-relevance feedback in combination with the
reference representation can be found in Shatkay [14].
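
As an illustration of the similarity measure and of the selection step in
pseudo-relevance feedback, the following sketch ranks a collection by cosine
similarity to an annotation vector and returns the n best matches. The
function names `cosine` and `expand`, and the convention that an all-zero
vector has similarity 0, are assumptions; the subsequent re-indexing of the
enriched annotation is not shown.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(annotation, collection, n):
    """Indices of the n documents in `collection` most similar to `annotation`."""
    ranked = sorted(
        range(len(collection)),
        key=lambda j: cosine(annotation, collection[j]),
        reverse=True,
    )
    return ranked[:n]
```

With n = 20 and MEDLINE as the collection, the selected abstracts would be
the ones pooled with a YC annotation to form the representation denoted
YC-ML20 in the results.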
2.4 Cluster Algorithm
As a divisive clustering algorithm we used the K-medoids algorithm
(Rousseeuw [15]), which minimizes the objective function

$$\sum_{k=1}^{K} \sum_{j \in C_k} d(x_j, m_k)$$

over multiple partitionings $C = \{C_1, \ldots, C_K\}$ with
$\{m_1, \ldots, m_K\}$ the corresponding representative points (called
medoids) of each cluster. The parameter K denotes the number of clusters and
is fixed in advance. One advantage of this algorithm over centroid-based
methods, such as K-means, is that each medoid constitutes a robust
representative data point for each cluster.
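
A minimal sketch of how this objective can be minimized is given below. It
uses a simple alternating scheme (assign each point to its nearest medoid,
then re-select each medoid within its cluster) on a precomputed dissimilarity
matrix; this is an assumption made for illustration and not necessarily the
procedure of the cited reference. The names `assign` and `k_medoids` and the
requirement that initial medoids are supplied by the caller are likewise
hypothetical.

```python
def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, init, max_iter=100):
    """Alternate assignment and medoid update on a square dissimilarity matrix.

    dist: list of lists with dist[i][j] = d(x_i, x_j).
    init: K distinct point indices used as initial medoids.
    Returns (medoids, labels, value of the objective function).
    """
    medoids = list(init)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            # The new medoid minimizes the summed dissimilarity to its cluster;
            # an empty cluster keeps its previous medoid.
            best = (
                min(members, key=lambda m: sum(dist[j][m] for j in members))
                if members else old
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    labels = assign(dist, medoids)
    cost = sum(dist[i][medoids[lab]] for i, lab in enumerate(labels))
    return medoids, labels, cost
```

Because every medoid is an actual data point, its text profile can be
inspected directly, which is how the medoid profiles in the results section
are read.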
2.5 Cluster Quality
To measure the performance and quality of the clustering we define three
scores: the silhouette coefficient, the performance of the clustering as a
classifier, and the Rand index. The first two are termed internal scores
since they rely on statistical properties of the clustered data; the last
one is called external because it involves a comparison with a known,
external labeling.
Silhouette Coefficient
As a first internal score for cluster quality we use the silhouette
coefficient per cluster

$$S_k = \frac{1}{n_k} \sum_{i=1}^{n_k} s_{ik}$$

and the overall silhouette coefficient

$$S = \frac{1}{n} \sum_{k} \sum_{i=1}^{n_k} s_{ik}, \qquad s_{ik} = \frac{b(i) - a(i)}{\max(a(i), b(i))},$$

where $a(i)$ is the average dissimilarity of member i to all other members of
its cluster and $b(i)$ the dissimilarity of member i to the nearest member of
the nearest cluster. It is a metric-independent measure designed to describe
the ratio between cluster coherence and separation and to assist in choosing
which clustering is preferable according to the data (Rousseeuw [15]).
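
The silhouette scores can be sketched as follows, taking the definitions of
a(i) and b(i) literally as stated in the text above (b(i) as the
dissimilarity to the nearest point outside the own cluster). The function
name `silhouette` and the convention that a point in a singleton cluster
scores 0 are assumptions.

```python
def silhouette(dist, labels):
    """Return ({cluster: S_k}, overall S) from a dissimilarity matrix and labels."""
    n = len(labels)
    s = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        other = [j for j in range(n) if labels[j] != labels[i]]
        if not own or not other:
            s.append(0.0)  # assumption: undefined cases score 0
            continue
        a = sum(dist[i][j] for j in own) / len(own)   # coherence
        b = min(dist[i][j] for j in other)            # separation
        denom = max(a, b)
        s.append((b - a) / denom if denom else 0.0)
    per_cluster = {}
    for c in set(labels):
        vals = [s[i] for i in range(n) if labels[i] == c]
        per_cluster[c] = sum(vals) / len(vals)
    return per_cluster, sum(s) / n
```

Values close to 1 indicate compact, well-separated clusters, while negative
values indicate points that lie closer to another cluster than to their own.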
k-NN Learnability
For the second measure of internal cluster quality, we look upon the problem
as being semi-supervised. Using the clustering result as a labeling for all
the points, we assess the performance of a given classifier on a class (or
cluster) in a cross-validated leave-one-out setup. Following Pavlidis [16],
we use a k-NN classifier jointly with the $(1 - \cos(\cdot,\cdot))$ distance
measure to compute a misclassification score for each class. The statistical
significance of this score m is expressed by a p-value derived from a
binomial $B(m, p_{\mathrm{misclass}})$ with $p_{\mathrm{misclass}}$ the prior
chance of misclassification, which can be computed analytically in case of a
k-NN classifier (details can be found in Pavlidis [16]).
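
The leave-one-out step can be sketched as below. The sketch only counts
misclassifications per class; the binomial significance test is left out
because its details are given in the cited reference rather than here. The
function names, the default k = 3, and the majority vote with ties resolved
in favour of the closest neighbours are assumptions.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """1 - cos(u, v); an all-zero vector is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)


def loo_knn_misclassified(vectors, labels, k=3):
    """Per class, count points whose leave-one-out k-NN vote differs from their label."""
    errors = Counter()
    for i, x in enumerate(vectors):
        # Leave point i out and rank all remaining points by distance to it.
        neighbours = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: cosine_distance(x, vectors[j]),
        )[:k]
        vote = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        if vote != labels[i]:
            errors[labels[i]] += 1
    return {c: errors[c] for c in set(labels)}
```

A cluster whose members are rarely misclassified in this setup is one that a
simple classifier can learn from the text representation alone.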
Rand Index
As an external measure for cluster validity we use the adjusted Rand index
[17]. Given a set of n points, an external partition
$P = \{P_1, \ldots, P_k\}$, and a clustering $C = \{C_1, \ldots, C_l\}$,
define a as the number of pairs of points that co-occur in a group in the
partitioning P as well as in the clustering C, d the number of pairs of
points that are in different groups in P as well as in C, and b and c as the
number of pairs of points that co-occur in a group in P, but not in C or
vice versa. The Rand index is then defined by

$$R = \frac{a + d}{a + b + c + d}.$$

The correction for random partitioning is

$$R_{\mathrm{adj}} = \frac{R - E(R)}{\max(R) - E(R)},$$

where a hypergeometric baseline distribution is used to compute the expected
values. In a comparative study [17], the adjusted Rand index is recommended
as the external measure of choice.
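
Both indices can be computed from two label vectors, as in the sketch below.
The pair counts a, b, c, d follow the definitions in the text; for the
adjusted index the sketch uses the commonly used closed form under the
hypergeometric baseline, written in terms of pair counts from the
contingency table. That this closed form matches the exact computation used
for the reported scores is an assumption, as are the function names. At
least two points are required.

```python
from collections import Counter
from itertools import combinations
from math import comb


def pair_counts(partition, clustering):
    """Count point pairs by agreement: a (same/same), b, c, d (different/different)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(partition)), 2):
        same_p = partition[i] == partition[j]
        same_c = clustering[i] == clustering[j]
        if same_p and same_c:
            a += 1
        elif same_p:
            b += 1
        elif same_c:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(partition, clustering):
    a, b, c, d = pair_counts(partition, clustering)
    return (a + d) / (a + b + c + d)


def adjusted_rand_index(partition, clustering):
    """Chance-corrected agreement; 1.0 for identical groupings, about 0 for random ones."""
    n = len(partition)
    same_both = sum(comb(v, 2) for v in Counter(zip(partition, clustering)).values())
    same_p = sum(comb(v, 2) for v in Counter(partition).values())
    same_c = sum(comb(v, 2) for v in Counter(clustering).values())
    expected = same_p * same_c / comb(n, 2)   # expectation under the baseline
    maximum = (same_p + same_c) / 2           # value reached by perfect agreement
    if maximum == expected:
        return 1.0                            # degenerate case: trivial groupings
    return (same_both - expected) / (maximum - expected)
```

Unlike the plain Rand index, the adjusted index is not inflated by the large
number of pairs that fall in different groups in both labelings.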
3.1 Construction of Test Set
We construct a set of genes for which the functional associations are
well-established. From the MIPS catalogue [1], we select three biologically
distinct functional groups consisting of 116 genes in total. For all genes we
select their corresponding SGD and YC annotations (see Section 2.1) and
proceed with the 105 genes that have entries in both databases [g]. The first
group holds 63 genes that encode lysosomal proteins. The second group
consists of 30 genes involved in translational control and the third contains
23 genes related to amino acid transport.
3.2 Cluster Performance
Following the strategies outlined in Section 2, all gene annotations are
represented by various indices and subsequently expanded with the 20 best
matching MEDLINE abstracts. More specifically, we perform the expansion by
re-indexing the enriched annotations, again following various indexing
schemes. Table 1 summarizes the impact of these settings on cluster
performance, expressed by means of the adjusted Rand index
$R_{\mathrm{adj}}$. We discuss the effect of the information source as well
as the results on the indexing schemes.
Table 1: R_adj scores for clustering the three groups using various
representations. Note that some results are duplicated along the blocks to
facilitate discussion.

            Representation   Weight   R_adj    #(t_i)
---------   --------------   ------   ------   ------
Source      SP Keywords      bool     0.1767        3
            SGD              tf       0.4050
            YeastCard        tfidf    0.4617
Index       SGD              bool     0.3386        8
            SGD              tf       0.4050
            YeastCard        bool     0.3323
            YeastCard        tf       0.4028
            YeastCard        tfidf    0.4617       26
            YC-ML20          bool     0.3726
            YC-ML20          tf       0.2953
            YC-ML20          tfidf    0.7344      396
            YC-ML20          ref      0.2354       20
Expansion   SGD-ML20         tfidf    0.5920
            YC-ML20          tfidf    0.7344
Effect of Indexing Scheme  In the second block of Table 1 we report the
performance of the boolean (bool), frequency (tf) and tfidf index on typical
MEDLINE abstracts. For very brief keyword-based descriptions (fewer than 8
words) the boolean representation is found to be the best one. If all fields
from SGD are used, tf (0.4050) improves on bool (0.3386). For the YC
database, typically containing 30 to 50 terms per entry, tfidf (0.4617)
outperforms bool (0.3323) and tf (0.4028) slightly.

[g] ftp.esat.kuleuven.ac.be/sista/glenisson/reports/webSuppl TR02121/yeastcardTable.htm
In the expansion step, we collect and re-index the 20 best matching MEDLINE
abstracts for each gene. This operation provides a profile for each gene
with the number of terms ranging typically between 200 and 400. Among the
indexing options for this set of abstracts, tfidf (0.7344) scores
considerably higher than bool (0.3726) and tf (0.2953), even after stop word
removal.
Basing ourselves on the same 20 top-scoring abstracts we also evaluate the
performance of the reference representation ref, which characterizes a gene
in document space instead of term space. It has an R_adj value of 0.2354,
indicating that it is a less descriptive representation. This can be
explained by the fact that ref is probably more dependent on the retrieval of
highly relevant abstracts (see also Shatkay [14]).
Effect of Information Source  In the first block of Table 1, we see that for
the gene groups considered, the keywords field in SWISS-PROT does not provide
sufficient information for an acceptable clustering result (0.1767). For
instance, the SWISS-PROT keyword list only provides an average of 2 to 3
meaningful keywords for 86 out of 105 genes. The remaining genes are
described with no or irrelevant keywords such as hypothetical protein, which
will not allow for correct classification. Using the GO entries and
especially the description line of SGD improves the results and raises the
Rand score to 0.4050. Only two genes have no meaningful representation,
YKL002w and YLR309c, whereas the others are now described by 7 to 8
biologically relevant terms. When resorting to our pooled information source
YC (see Section 2.1), we obtain a score of 0.4617, misclassifying 21 out of
105 genes. Although the clustering itself is not dramatically influenced by
the expansion with YC, for most of the genes, the textual representation is
greatly improved (e.g., the weights of specific terms are increased and
additional specific terms are incorporated). For instance, Table 2 shows the
text profiles of the medoids of the vacuolar cluster for various
representations.
In the clustering based on SWISS-PROT keywords, the vacuolar cluster itself
is not found. Instead, the algorithm identifies a cluster of ATP-binding
proteins that contains the vacuolar ATPases but also a number of ATP-binding
proteins involved in translational control. The SGD representation ensures
the grouping of vacuolar proteins solely based on one relevant term,
vacuolar. Both result in a large cluster containing most of the vacuolar
proteins. The text profiles of the corresponding medoids confirm the success
of the MEDLINE expansion and the feasibility of our approach to identify
relevant terms that characterize individual genes or groups of genes. For
the other two groups a similar improvement is observed.

Table 2: Text profiles of the medoids of the vacuolar cluster for various
representations.

SP keywords      SGD              YC                       YC-ML20
--------------   --------------   ----------------------   -----------------------
ATP(0.45)        vacuolar(0.38)   vacuolar(0.54)           vacuolar(0.54)
ATP bind(0.45)   vps41(0.38)      ATPas(0.4)               vacuol(0.45)
bind(0.45)                        vacuolar membran(0.32)   snare(0.36)
                                  vma13(0.21)              vacuolar membran(0.18)
                                  subunit(0.2)             Tsnare(0.17)
                                  associ(0.17)             syntaxin(0.16)
                                  organel(0.16)            vacuolar assembli(0.12)
                                  vacuolar acidif(0.16)    Golgi(0.1)
                                  acidif(0.15)             carboxypeptidas(0.1)
                                  sector(0.14)             vam3(0.09)
                                  hydrogen(0.13)           pep12(0.09)
                                  membran(0.1)             Vsnare(0.08)

Table 3: Text profiles of gene YPL029w based on the SGD and YeastCard
representations.

SGD                        YC
------------------------   -----------------------------
ATP(0.27)                  helicas(0.57)
ATP depend helicas(0.27)   mitochondri(0.36)
depend(0.27)               ATP depend helicas(0.29)
helicas(0.53)              suv3(0.29)
RNA(0.27)                  ATP(0.23)
RNA helicas(0.27)          depend(0.2)
suv3(0.27)                 RNA(0.2)
                           RNA helicas(0.19)
                           post(0.19)
                           ATP depend RNA helicas(0.18)
                           elem(0.16)
                           translat(0.13)
                           control(0.13)
                           interact(0.11)
                           transcript(0.09)

Table 4: Text profiles of genes YLL048c and YPL149w based on the YeastCard
representation and the corresponding expansion to MEDLINE (only the
top-scoring terms are shown).

YLL048c                                                  YPL149w
YC                          YC-ML20                      YC                YC-ML20
-------------------------   -------------------------    ---------------   --------------------------
bile(0.68)                  bile(0.92)                   autophagi(0.89)   autophagi(0.87)
transport(0.46)             bile acid transport(0.28)    apg5(0.43)        apg5(0.17)
bile acid transport(0.25)   bile acid(0.22)                                conjug(0.15)
ybt1(0.25)                  hepatocyt(0.06)                                apg1(0.13)
ATP(0.20)                   transport(0.06)                                cAMP(0.13)
abc(0.15)                   abc(0.05)                                      starvat(0.11)
ATP bind(0.14)              ATP(0.05)                                      kinas(0.11)
integr membran(0.14)        ATPas(0.03)                                    phosphatidylinositol(0.08)
integr(0.13)                apic(0.03)                                     vacuol(0.08)
membran(0.11)               vesicl(0.03)                                   apoptosis(0.08)
acid(0.1)                   cotransport(0.03)                              hepatocyt(0.07)
similar(0.1)                sister(0.03)                                   antagonist(0.06)
depend(0.09)                voltag(0.03)                                   ubiquitin(0.06)
bind(0.07)                  glycoprotein(0.02)                             apg12(0.06)
famili(0.03)                triphosph(0.02)                                aminopeptidas(0.04)
Table 3 shows two examples of text profiles of individual genes that were
misclassified when the SGD representation was used whereas the YC
representation assigned the genes to the correct cluster. For the RNA
helicase, YPL029w, terms like mitochondri and translat are added to the text
profile. The clustering of YBR024w in the group of translation-related
proteins is based on terms such as mitochondri, inner, and membrane.
Expansion to MEDLINE improves the text profiles of almost all of the genes
and even the clustering of a few genes such as YLL048c, a lysosomal bile
transporter, and the genes that encode autophagy-related proteins. In the
clustering based on the YC representation, YLL048c was wrongly assigned to
the group of amino acid transporters. However, the expansion strongly
decreased the weight of the term transport and introduced the term ATPase in
the text profile, resulting in a correct classification of the gene. For the
autophagy-related genes, retrieval of the term vacuol ensures correct
grouping after MEDLINE expansion as shown in Table 4. However, some of the
genes are incorrectly clustered no matter what representation or weighting
scheme was used. For instance, Group 1 and Group 3 include several proteins
that regulate transcription, a process that is closely related to
translation and shares many of its keywords. The proteins YLR025w (Group 1),
YLR375w (Group 3) and YDL048c (Group 3) are therefore persistently
misclassified into Group 2. One gene, YLR309c (Group 1), is consistently
assigned to the wrong cluster because it lacks proper annotation. The only
terms that characterize YLR309c are vague, aspecific words such as gene
product and the name of the gene imh1. This information is insufficient for
successful expansion with MEDLINE. A manual search via the PUBMED engine did
not reveal much information on imh1 (YLR309c) either.
3.3 Cluster Quality
Because of the absence of a gold standard or prior knowledge in regular
clustering problems, internal measures of quality are used to evaluate a
cluster result (Jain [17]). They are based on various statistical properties
of the grouped data and provide clues to choose between different
parameterizations of a single algorithm (such as the number of clusters) or
even between various clus-