Arjen P. de Vries and Djoerd Hiemstra
{arjen,hiemstra}@ctit.utwente.nl
CTIT, University of Twente
The Netherlands

Abstract

The database group at University of Twente participates in TREC-8 using the Mirror DBMS, a prototype database system especially designed for multimedia and web retrieval. From a database perspective, the purpose has been to check whether we can get sufficient performance, and to prepare for the very large corpus track in which we plan to participate next year. From an IR perspective, the experiments have been designed to learn more about the effect of the global statistics on the ranking.

1 Introduction

The Mirror DBMS [dV99] combines content management and data management in a single system. The main advantage of such integration is the facility to combine IR with traditional data retrieval. Furthermore, IR researchers can experiment more easily with new retrieval models, using and combining various sources of information. This is an important benefit for advanced IR research; web retrieval, speech retrieval, and cross-language retrieval each require the use of several representations of content, which is hard to handle in the traditional file-based approach, and becomes too slow in traditional database systems.

In the Mirror DBMS, the IR retrieval model is completely integrated in the database architecture, emphasizing efficient set-oriented query processing. The support for information retrieval in our system is presented in detail in [dV98] and [dVW99]. It supports other types of media as well, which has been demonstrated in the image retrieval system prototype described in [dVvDBA99]. The main goal of our participation in TREC is to test if our system can handle larger data sets without too many problems. Also, we wanted to find out the effect of global statistics on the ranking.

This paper is organized as follows. Sections 2 and 3 review the design of the Mirror DBMS and its support for IR, and discuss its use for TREC processing. Section 4 explains the experimental setup and interprets our results. Section 5 discusses our experience with using the Mirror DBMS for TREC, followed by conclusions.

[Figure 1 diagram: three layered architectures, each built from a query language, a logical algebra, a physical algebra and a storage layer, with Extension 1 ... Extension n (or Enhanced ADT 1 ... Enhanced ADT n) attached.]
Figure 1: The multi-model DBMS architecture next to the extended relational and E-ADT DBMS architectures (from left to right).

2 Design

A complete overview and motivation of all aspects of the design of the Mirror DBMS is presented in [dV99]. Although following a traditional three-schema architecture, it uses different data models at different levels: we therefore classify its design as multi-model DBMS architecture. The crucial architectural difference from other extensible database systems is that query processing at the logical layer uses only operators that are provided by the physical layer (see also Figure 1), and domain-specific query processing (such as an IR extension) is defined at the logical level primarily. This choice enforces a system-wide physical data model and algebra spanning all extensions. Of course, the physical algebra can also be extended if necessary, i.e. when logical operations cannot be expressed efficiently in the physical algebra. The strict separation between the logical and physical levels allows using algebraic query optimization techniques, a key property of relational database management systems but hardly ever used in non-business application areas like content management.

The multi-model architecture provides the query processor with transparency through the layers. Put informally, query evaluation can `look down' from the original request through all layers of the architecture. This should enable set-oriented query evaluation for almost every request, and allow maximal exploitation of parallelization and pipelining. In contrast, the black-box ADTs of `object-relational' database systems restrict the DBMS in the possible manipulations of the query plans. This makes it more complicated to distribute and parallelize the query plans, or change the buffer strategy for iterative query processing as proposed in [JFS98]. Another alternative, the enhanced ADTs proposed by Seshadri [Ses98], provides little opportunity for optimizations that [...] architectures schematically.

3 Implementation

The prototype implementation of the Mirror DBMS uses Moa at the logical level, and Monet at the physical level. Monet is a parallel main-memory database system under development at the CWI in Amsterdam [BK95, BMK99], that is targeted as a backend system for various (query-intensive) application domains, such as GIS and data mining.¹ Moa is an object algebra studied in the database group at University of Twente, that is extensible with domain-specific structures. The Moa tools transform expressions in this algebra into sequences of operations in MIL, an algebra for the binary relational data model supported by Monet.

For the support of IR, we extended Moa with new structures at the logical level to handle document representation, ranking, and the computation of co-occurrence statistics. In combination with Moa's kernel support for collections and tuples, these structures can model a wide variety of IR retrieval models: the current prototype supports the well-known Okapi ranking scheme, InQuery's inference network retrieval model, as well as the linguistically motivated retrieval model (LMM, presented in Section 4.3). To illustrate, the following Moa expression ranks a collection of documents:

map[sum(THIS)](
  map[getBL(THIS, query, stats)]( docs )
);

The first map operation computes term probabilities for the query terms occurring in the document, using the global statistics specified in structure stats. The subsequent map combines these probabilities using a sum operation. Although this particular expression may not seem very interesting, the IR ranking operators can be combined with other operators such as select, resulting in a powerful query language.
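
The nested map can be read as an ordinary two-step computation. The Python sketch below mirrors only its shape: the helper get_bl, the toy statistics, and the belief formula inside it are illustrative assumptions, not the Moa implementation.

```python
import math

def get_bl(doc_terms, query, stats):
    """Return one belief per query term that occurs in the document.

    Hypothetical stand-in for getBL: the tf/df ratio used here is an
    arbitrary choice that only serves to make the sketch runnable.
    """
    beliefs = []
    for term in query:
        tf = doc_terms.count(term)
        if tf > 0:
            beliefs.append(math.log(1.0 + tf / stats[term]))
    return beliefs

def rank(docs, query, stats):
    # Inner map: one list of term beliefs per document.
    beliefs_per_doc = [get_bl(d, query, stats) for d in docs]
    # Outer map: combine the beliefs of each document with a sum.
    return [sum(b) for b in beliefs_per_doc]

docs = [["a", "c", "c", "a", "c"], ["a", "e", "b", "b", "e"]]
stats = {"a": 2, "b": 1}  # assumed document frequencies for the toy collection
scores = rank(docs, ["a", "b"], stats)
```

Because both steps are plain maps over the collection, a selection could be applied before or after either of them without changing the ranking code.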

The representation of the logical IR structures at the physical level is termed the flattened representation of the content. It consists of three binary tables (BATs), storing the frequency $tf(t_i, d_j)$ of term $t_i$ in document $d_j$, for each term $t_i$ occurring in document $d_j$. Table 1 illustrates this for a collection $\{d_1, d_2\}$ with documents $d_1 = [a, c, c, a, c]$ and $d_2 = [a, e, b, b, e]$. Computing the probability of relevance of the objects for query $q = [a, b]$ proceeds as follows. First, a table with the query terms is joined with the document terms in ti (the result is called qti). Next, (using additional joins) the document identifiers and the term frequencies are looked up (qdj and qtfij).² Finally, the retrieval status values are computed with some variant of the popular tf×idf ranking formula.

document collection
dj  ti  tfij
1   a   2
1   c   3
2   a   1
2   b   2
2   e   2

intermediate results
qdj  qti  qtfij  qntfij
1    a    2      0.796578
2    a    1      0.621442
2    b    2      0.900426

Table 1: Representation of content in BATs

¹ Monet is used successfully on a commercial basis by Data Distilleries, a start-up specializing [...]
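
The join sequence described for Table 1 can be sketched in Python. This is an illustration under assumptions: BATs are modelled as lists of (oid, value) pairs, the function names are hypothetical, and the normalized column qntfij is not reproduced because its normalization is not specified here.

```python
d1 = ["a", "c", "c", "a", "c"]
d2 = ["a", "e", "b", "b", "e"]

def flatten(collection):
    """Build the dj, ti and tfij columns from {doc_id: term list}."""
    dj, ti, tfij = [], [], []
    oid = 0
    for doc_id, terms in sorted(collection.items()):
        for term in sorted(set(terms)):
            # The three columns share one object identifier per (doc, term).
            dj.append((oid, doc_id))
            ti.append((oid, term))
            tfij.append((oid, terms.count(term)))
            oid += 1
    return dj, ti, tfij

def rank_inputs(query, dj, ti, tfij):
    """Join the query terms with ti, then look up dj and tfij by oid."""
    query_terms = set(query)
    qti = [(oid, term) for oid, term in ti if term in query_terms]
    dj_by_oid, tf_by_oid = dict(dj), dict(tfij)
    return [(dj_by_oid[oid], term, tf_by_oid[oid]) for oid, term in qti]

dj, ti, tfij = flatten({1: d1, 2: d2})
rows = rank_inputs(["a", "b"], dj, ti, tfij)
# rows holds (qdj, qti, qtfij) triples: (1, 'a', 2), (2, 'a', 1), (2, 'b', 2)
```

The three resulting triples match the first three columns of the intermediate results in Table 1.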

To support these computations, Monet's physical algebra has to be extended with new operators, either in C or C++, or as a MIL procedure. The latter is preferable for easy experimentation; for example, the following MIL procedure computes the term probabilities given normalized term frequency and inverse document frequency using the LMM model:

PROC bel( nidfi, ntfij ) := {
RETURN log( 1.0 + nidfi * ntfij * C );
}

An evaluation run processes 50 topics in batch, but the client interfaces of the Mirror DBMS have been designed for interactive sessions with an end-user. Also, transferring the data from Monet to the Moa client has not been implemented optimally. Furthermore, optimizations such as using materialized views are not performed in the current Moa rewriter. These minor flaws would have incurred an unfair performance penalty in the evaluation of the architecture, and made logging the results rather cumbersome. Therefore, as a (temporary) solution, the MIL program generated by the Moa rewriter has been manually edited to loop over the 50 topics, log the computed ranking for each topic, and use two additional tables: one with precomputed normalized inverse document frequencies (a materialized view), and one with the document-specific constants for normalizing the term frequencies.

² Note that these joins are executed very efficiently, because the Moa structures make sure [...]

4 Experimental setup and results

Collection fusion is the process of merging the results of retrieval runs on separate, autonomous document collections into an effective combined result [VGJL95]. We have focused on this problem because large collections will be fragmented (horizontally) in several partitions, each managed by a separate server. Maintaining the exact global statistics induces an extra overhead, that may not be necessary if the fragments are sufficiently large.

Collection fusion is a trivial task for exact matching retrieval systems like systems using Boolean retrieval, but more complicated if a ranked retrieval system is used. In a number of publications on collection fusion it is argued that simply comparing similarity measures across subcollections leads to unsatisfactory results because of differences in the collection-dependent frequency counts [Bau97, CLC95, VF95, VGJL95]. One of the objectives of the TREC-8 evaluation described in this paper is to question this hypothesis. We feel that similarity measures across subcollections might in fact be comparable, but show worse evaluation results because of the evaluation setup.

4.1 Evaluation using the TREC collection

Relevance assessments on the TREC test collections are assembled by the pooling method: a pool of possibly relevant documents is created by taking a sample of documents retrieved by each participating system. This pool is then shown to the human assessors [VH99a]. The sampling method used in TREC takes the top 100 of the retrieved documents of each participating system.

Since the start of TREC in 1992, the test collections have been used in numerous evaluations outside the official TREC. For these evaluations, all documents that were not in the top 100 of any of the official participating systems are assumed to be not relevant. But evaluations that did not contribute to the TREC pool probably have unjudged documents in the top 100, making these evaluations less reliable than the official TREC evaluation. This is especially true for new, previously unexplored approaches to retrieval. If a system finds relevant documents that no system was able to find before, then these documents will probably not be judged in an old TREC collection. The only way to check the relevance of these documents is by official TREC participation.
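
The pooling method and its consequence for runs that did not contribute can be sketched as follows. The run names and document identifiers are made up for illustration, and the pool depth is reduced from 100 to 2 to keep the example small.

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` documents of every contributing run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

def unjudged(ranking, pool, depth=100):
    """Documents in the top `depth` of a ranking that are not in the pool."""
    return [doc for doc in ranking[:depth] if doc not in pool]

# Hypothetical contributing runs and a later, non-contributing run.
runs = {"sysA": ["d1", "d2", "d3"], "sysB": ["d2", "d4", "d5"]}
pool = build_pool(runs, depth=2)
new_run = ["d4", "d6", "d1"]
missing = unjudged(new_run, pool, depth=2)
```

Here d6 was never shown to an assessor, so an evaluation of new_run would count it as not relevant regardless of its actual relevance.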

4.2 Conditions for naive collection fusion

Let us define `naive' collection fusion as the process of merging the search results on the subcollections based on the document similarities. The first condition for naive collection fusion is that each subcollection uses the same retrieval [...] subcollection uses the same indexing vocabulary [Bau97]. A third condition is that subcollections are sufficiently large to allow for the reliable local estimation of document frequencies. If the subcollections are too small, ineffective retrieval on the subcollections will affect the merged result.
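
Naive collection fusion as defined above amounts to a merge of sorted result lists on their similarity values. A minimal Python sketch, assuming each subcollection returns (score, doc_id) pairs in descending score order; the function name and the sample scores are illustrative only.

```python
import heapq
from itertools import islice

def naive_merge(results_per_subcollection, k=1000):
    """Merge per-subcollection rankings purely on their similarity scores.

    Each input ranking must already be sorted by descending score; the
    scores are taken at face value, with no renormalization.
    """
    merged = heapq.merge(*results_per_subcollection,
                         key=lambda pair: pair[0], reverse=True)
    return list(islice(merged, k))

fr = [(0.9, "FR-1"), (0.2, "FR-7")]
ft = [(0.7, "FT-3"), (0.4, "FT-9")]
top = naive_merge([fr, ft], k=3)
```

Whether scores taken at face value are in fact comparable across subcollections is exactly the question the three conditions address.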

An evaluation of Callan et al. [CLC95] under these conditions for TREC topics 51-150 showed that naive merging was significantly worse than ranking based on globally estimated document frequencies, causing losses from 10-20% in average precision. But the results of naive merging reported by Callan et al. [CLC95] were not part of an official TREC participation. It is likely that their merged run has a worse coverage of judgements, because the TREC-2 and 3 pools were (almost) only created by systems that use a central index for retrieval. Maybe their merged run was as good as the central index run after all. To check this hypothesis, we decided to put up a retrieval run using naive merging for judging.

4.3 Some theoretical back-up for naive merging

The Mirror DBMS uses the linguistically motivated probabilistic model of information retrieval [Hie99, HK99]. The model builds a simple statistical language model for each document in the collection. The probability that a query $T_1, T_2, \ldots, T_n$ of length $n$ is generated by the language model of the document with identifier $D$ is defined by the following equation:

$$P(T_1 = t_1, \ldots, T_n = t_n \mid D = d) = \prod_{i=1}^{n} \left( \alpha_1 \frac{df(t_i)}{\sum_t df(t)} + \alpha_2 \frac{tf(t_i, d)}{\sum_t tf(t, d)} \right) \qquad (1)$$

Equation 1 can be rewritten to a vector product formula by first dividing it by $\prod_{i=1}^{n} \left( \alpha_1 \, df(t_i) / \sum_t df(t) \right)$ [Hie99]. This will not affect the ranking within a subcollection, but it will affect the final ranking after merging the search results of the separate subcollections, because we divided by collection-specific document frequencies. It can be shown that the ranking of the vector product formula in Table 2 approximates the ranking defined by the conditional probability $P(D \mid T_1, T_2, \ldots, T_n)$ of a document being relevant given a query.

vector product formula:   $similarity(Q, D) = \sum_{k=1}^{l} w_{qk} \cdot w_{dk}$
query term weight:        $w_{qk} = tf(t_k, q)$
document term weight:     $w_{dk} = \log \left( 1 + \frac{tf(t_k, d)}{df(t_k)} \cdot \frac{\alpha_2 \sum_t df(t)}{\alpha_1 \sum_t tf(t, d)} \right)$

Table 2: tf×idf term weighting algorithm
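
The claim that the division does not change the ranking within one subcollection can be checked numerically. The Python sketch below implements equation 1 and the Table 2 weights side by side on a toy collection; the names alpha1 and alpha2 for the two interpolation weights and their values are assumptions made for this example.

```python
import math

def lm_probability(query, doc, df, alpha1, alpha2):
    """Equation 1: product over query terms of the interpolated estimates."""
    df_total = sum(df.values())
    p = 1.0
    for t in query:
        p *= alpha1 * df[t] / df_total + alpha2 * doc.count(t) / len(doc)
    return p

def similarity(query, doc, df, alpha1, alpha2):
    """Table 2: sum over distinct query terms of w_qk * w_dk."""
    df_total = sum(df.values())
    score = 0.0
    for t in set(query):
        w_q = query.count(t)
        w_d = math.log(1.0 + (doc.count(t) / df[t])
                       * (alpha2 * df_total) / (alpha1 * len(doc)))
        score += w_q * w_d
    return score

docs = {1: ["a", "c", "c", "a", "c"], 2: ["a", "e", "b", "b", "e"]}
df = {"a": 2, "b": 1, "c": 1, "e": 1}  # document frequencies of the toy collection
query = ["a", "b"]
alpha1, alpha2 = 0.85, 0.15            # assumed weights, not taken from the paper

by_eq1 = sorted(docs, reverse=True,
                key=lambda d: lm_probability(query, docs[d], df, alpha1, alpha2))
by_tab2 = sorted(docs, reverse=True,
                 key=lambda d: similarity(query, docs[d], df, alpha1, alpha2))
```

Within this single collection both orderings agree, because the divisor depends only on the query and the collection. Across subcollections the divisor differs, which is why merged rankings can diverge from the global one.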

From Bayes' rule we know that dividing equation 1 by $P(T_1, T_2, \ldots, T_n)$ and multiplying it by $P(D)$ results in $P(D \mid T_1, T_2, \ldots, T_n)$. For a large collection and a query that has a small number of hits, $tf(t, d) = 0$ for most terms $t$ and documents $d$. Therefore, $\prod_{i=1}^{n} \left( \alpha_1 \, df(t_i) / \sum_t df(t) \right)$ approximates the marginal probability $P(T_1, T_2, \ldots, T_n)$ and the ranking defined by Table 2 approximates the ranking defined by $P(D \mid T_1, T_2, \ldots, T_n)$. The a-priori probability $P(D = d)$ of a document $d$ being relevant can be included by adding the logarithm of equation 2 to the similarities of Table 2 as a final step.

$$P(D = d) = \frac{\sum_t tf(t, d)}{\sum_t \sum_d tf(t, d)} \qquad (2)$$

We hypothesise that, if the approximation is not too far off, the result after merging is not significantly worse than what would have been possible with a central index.
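
Adding the document prior of equation 2 as a final step can be sketched in Python. The similarity values below are placeholders rather than measured results, and the function name is hypothetical.

```python
import math

def log_prior(doc_id, docs):
    """Logarithm of equation 2: document length over total collection length."""
    total_terms = sum(len(terms) for terms in docs.values())
    return math.log(len(docs[doc_id]) / total_terms)

docs = {1: ["a", "c", "c", "a", "c"], 2: ["a", "e", "b"]}
similarities = {1: 0.16, 2: 0.39}  # placeholder scores, for illustration only
final = {d: similarities[d] + log_prior(d, docs) for d in docs}
```

In this toy case the longer document overtakes the shorter one once the prior is added, which shows that the final step can change the ranking.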

4.4 Official results

Table 3 lists the official TREC runs. Global runs denote runs using the global collection statistics. Local runs denote the naive collection fusion runs, using local collection statistics on the four TREC subcollections: Federal Register, Foreign Broadcast Information Services, Los Angeles Times and Financial Times.

run name  description                                   avg. prec.
UT800     global run                                    0.260
UT803     global run; LCA                               0.176
UT803b    global run; LCA from F.Times and LA Times     0.260
UT810     local run (judged)                            0.043
UT813     local run; LCA from local                     0.145

Table 3: official results

Unfortunately, our submitted official runs have been degraded by two bugs, which affected in particular the naive merging run that was judged by NIST. By our own mistake, the global runs have used the wrong (local) normalizing constant for the idf; an error in Monet's join implementation resulted in random answers for three of the four local runs. After fixing these bugs, the results of the global run UT800 improved from 0.260 to 0.275 and the results of the local run UT810 improved from 0.043 to 0.260. Table 4 lists the results on the four subcollections. Except for the Federal Register, which has hits for only 19 topics anyway, the average precision on the subcollections does not differ much at all. Unofficial runs, with these bugs fixed, are indicated in this paper by a `u' postfix (so `UT800u' is the fixed `UT800' run).

run name          Fed.Reg.  FBIS   LA Times  F.Times  merged
UT800u (global)   0.326     0.317  0.279     0.356    0.275
UT810u (local)    0.351     0.319  0.276     0.356    0.260
topics w. hits    19        43     45        49       50

Table 4: average precision per subcollection after bug-fix

The merged local run is about 6% worse than the global run. This might be a significant difference according to some significance test, e.g. the t-test [Hul93]; but, if so, it is still not valid to draw the conclusion that the global approach is indeed better than the naive merging approach. This conclusion would only be valid if both evaluations were done under identical, controlled conditions, which they are not: the two runs were not both judged by NIST, and we do not control the other systems that contributed to the pool. Almost all systems that contributed to the TREC-8 pool were systems using the global approach. Therefore, the pool favours central index approaches over distributed index approaches if it is used to evaluate runs that did not contribute to the pool. This can be shown by looking at the percentage of documents that are judged for different cut-off levels of the fixed UT800u and UT810u runs. The percentage of documents in a run that are judged will be called the judged fraction.
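
The judged fraction can be computed next to precision at the same cut-off level. A minimal Python sketch for a single topic, with made-up document identifiers; the function names are illustrative, and averaging over topics is left out.

```python
def judged_fraction(ranking, judged, cutoff):
    """Fraction of the top `cutoff` documents that have a relevance judgement."""
    top = ranking[:cutoff]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in judged) / len(top)

def precision_at(ranking, relevant, cutoff):
    """Precision at `cutoff`; unjudged documents count as not relevant."""
    top = ranking[:cutoff]
    return sum(1 for doc in top if doc in relevant) / cutoff

ranking = ["d1", "d2", "d3", "d4"]
judged = {"d1", "d2", "d4"}   # documents that were in the assessment pool
relevant = {"d1"}             # judged documents that were marked relevant
j_at_4 = judged_fraction(ranking, judged, 4)
p_at_4 = precision_at(ranking, relevant, 4)
```

A judged fraction below 1 means that the reported precision is a lower bound: each unjudged document in the top of the ranking might have been relevant.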

a) precision
run name          P at 10  P at 30  P at 100  P at R  avg. P
UT800u (global)   0.496    0.378    0.234     0.319   0.275
UT810u (local)    0.436    0.343    0.222     0.310   0.260

b) judged fraction
run name          J at 10  J at 30  J at 100  J at R
UT800u (global)   1.000    1.000    0.996     0.987
UT810u (local)    0.984    0.978    0.952     0.947

Table 5: merged results after bug-fix: a) precision; b) judged fraction

Table 5a and b show the precision and the judged fraction of the global and the local run at different cut-off levels. There is a major difference between the judged fractions of the global run and the local run. For the global run, 0.4% of the documents in its top 100 are unjudged. For the local run, 4.8% of the documents in the top 100 are unjudged, and some documents are unjudged even in the top 10.

4.5 Local context analysis

Based on its success on InQuery at previous TREC conferences, we expected a significant improvement by using topics expanded with LCA [XC96]. Also, when investigating the expansion terms, LCA seemed to do a good job. For example, on topic 311 (which is about industrial espionage), it finds terms like `spy', `intelligence', and `counterintelligence', and from the Financial Times sub-collection it even identifies `Opel', `Volkswagen', and `Lopez' as relevant terms. But instead of improving the effectiveness of retrieval, the measured performance turned out to have degraded. Some tweaking of the parameters, reducing the weights of [...] upon the baseline, but only slightly; on the runs submitted for TREC-8, it has degraded performance.

A possible explanation for these disappointing results is that the algorithm has been applied to documents instead of passages (as done in [XC96]), and the TREC collection itself was used to find expansion terms instead of another, larger collection. One result was that the varying length of documents had a large impact on the expansion terms chosen, which is undesirable. Another explanation is that LMM weighting provides such a high baseline that it is very hard to improve upon. A comparison with the (impressive) baseline results of LMM on TREC-6 favours the latter explanation: the performance of the Mirror DBMS with the LMM weighting scheme, without LCA, was almost as good as InQuery's performance after using LCA. With the tweaked LCA, LMM weighting performed better on all reported precision and recall points, except for the precision at twenty retrieved documents, at which InQuery performed slightly better. On the TREC-8 topics it did not contribute positively to the results.

5 Discussion

Although more of anecdotal than scientific value, the story of our participation in TREC-8 with the Mirror DBMS illustrates the suitability of this architecture for experimental IR. Eight days before the deadline, it still seemed impossible to participate in this year's TREC, as Monet kept crashing while indexing the data; until, on the seventh day, the new release suddenly made things work! We decided to try our luck and see how far we could get in a week; and we should admit, it has been a crazy week. It meant running the topics on TREC-6 first, to compare the results with the runs performed before, as well as changing the ranking formula to integrate document length normalization. In the weekend, we implemented the use of co-occurrence statistics (which has turned out to be not as useful as expected). So, in one week we managed to index the data, perform various experiments for calibration, run the best experiments on TREC-8, and submit five runs, just before the final deadline.

5.1 Efficiency

The machine on which the experiments have been performed is a Sun Ultra 4 with 1 GB of main memory, running SunOS 5.6. The machine is not a dedicated server, but shared with some other research groups as a `compute server'. Monet effectively claims one processor completely while indexing the collection, or processing the fifty topics on each of the sub-collections. The division of the complete collection in five sub-collections (as it comes on different compact discs) is maintained. The topics are first run in each sub-collection, and the [...] estimating the top 1000 ranking takes between 20 seconds and two minutes per topic. How to further improve this execution performance is discussed below.

Preparation of the five sub-collections takes about six hours in total. Computing the table with document-specific term frequencies is performed using Monet's module for crosstables. But using the grouping operation for all documents at once allocates all available memory, and eventually crashes the DBMS because it cannot get more, if it is run on the complete set of documents of any but the smallest sub-collection.⁴ Therefore, the indexing scripts run on fragments of the sub-collections at a time, and frequently write intermediate results to disk, obviously slowing down the process more than necessary.

5.2 The road ahead

The execution performance of the Mirror DBMS on TREC is clearly better than a naive (nested-loop) implementation in any imperative programming language, but it is not yet good enough to beat the better stand-alone IR systems that also participate in TREC. Compared to the techniques used in systems like InQuery (see [Bro95]), the current mapping between the logical and physical level is too straightforward: it does not use inverted files, has not fragmented the terms using their document frequency, and it ranks all documents even if only the beliefs for the top 1000 are used. Also, Monet should make it relatively easy to take advantage of parallelism in modern SMP workstations.

The merits of some possible improvements can only be evaluated experimentally. For example, it is not so clear beforehand whether inverted files are really the way to go. Query processing with inverted files requires merging the inverted lists before beliefs can be computed, which is hard to perform without frequently thrashing the memory caches; this has been shown to be a significant performance bottleneck on modern system architectures (see e.g. [BMK99] for experiments demonstrating this for Monet).

Even without experiments, much improvement can be expected from fragmentation of the document representation BATs based on the document frequency, in combination with the `unsafe' techniques for ranking reported in [Bro95]. Some preliminary experiments indicate a 100 times improvement with only a small loss in precision. Such (domain-specific) optimization techniques are easy to integrate in the mapping from Moa structures to MIL, thanks to the declarative nature of the algebraic approach. A similar argument applies to extending the Mirror DBMS with the buffer management techniques discussed in [JFS98]. In MIL, buffer management is equivalent to directing Monet to load and unload its tables. By integrating such directives in the generated MIL programs, it is expected that these improvements can also be added without many complications.

⁴ Notice that such problems are not necessarily solved by using commercial systems; Sarawagi et al. report similar memory problems with DB2 when using normal SQL queries [...]

6 Conclusions

Without any additional algorithms, LMM ranking produces reasonably good results. Unfortunately, due to the bugs in our experiments, we cannot yet give conclusive answers about the difference between using local or global statistics; but we may conclude that the difference is rather small. Our current use of co-occurrence statistics has not improved our results, but further research is necessary in this area.

Despite the flaws in the current implementation, we believe that the Mirror DBMS has proven to be a useful platform for IR experiments on the TREC data. The true benefits of its design will only be exploited when the system is developed further, and the indexing task is more challenging. Next year, the Mirror DBMS should be ready to participate in the large WEB track.

Acknowledgements

Many thanks go to Peter Boncz and the other members of the CWI Monet team, for their great support. The work reported in this paper is funded in part by the Dutch Telematics Institute project DRUID.

References

[Bau97] C. Baumgarten. A probabilistic model of distributed information retrieval. In Proceedings of the 20th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'97), pages 258-266, 1997.
[BK95] P.A. Boncz and M.L. Kersten. Monet: An impressionist sketch of an advanced database system. In BIWIT'95: Basque international workshop on information technology, July 1995.
[BMK99] P.A. Boncz, S. Manegold, and M.L. Kersten. Database architecture optimized for the new bottleneck: Memory access. In Proceedings of 25th International Conference on Very Large Databases (VLDB'99), Edinburgh, Scotland, UK, September 1999. To appear.
[Bro95] E.W. Brown. Execution performance issues in full-text information retrieval. PhD thesis, University of Massachusetts, Amherst, October 1995. Also appears as technical report 95-81.
[CLC95] J.P. Callan, Z. Lu, and W.B. Croft. Searching distributed collections with inference networks. In Proceedings of the 18th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95), pages 21-28, [...]
[dV98] A.P. de Vries. Mirror: Multimedia query processing in extensible databases. In Proceedings of the fourteenth Twente workshop on language technology (TWLT14): Language Technology in Multimedia Information Retrieval, pages 37-48, Enschede, The Netherlands, December 1998.
[dV99] A.P. de Vries. Content and multimedia database management systems. PhD thesis, University of Twente, Enschede, The Netherlands, December 1999. To appear.
[dVvDBA99] A.P. de Vries, M.G.L.M. van Doorn, H.M. Blanken, and P.M.G. Apers. The Mirror MMDBMS architecture. In Proceedings of 25th International Conference on Very Large Databases (VLDB'99), Edinburgh, Scotland, UK, September 1999. Technical demo.
[dVW99] A.P. de Vries and A.N. Wilschut. On the integration of IR and databases. In Database issues in multimedia; short paper proceedings, international conference on database semantics (DS-8), Rotorua, New Zealand, January 1999.
[Hie99] D. Hiemstra. A probabilistic justification for using tf×idf term weighting in information retrieval. International Journal on Digital Libraries, 1999. To appear.
[HK99] D. Hiemstra and W. Kraaij. Twenty-One at TREC-7: Ad-hoc and cross-language track. In Voorhees and Harman [VH99b].
[HT98] Laura M. Haas and Ashutosh Tiwary, editors. SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. ACM Press, 1998.
[Hul93] D. Hull. Using statistical testing in the evaluation of retrieval experiments. In Proceedings of the 16th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'93), pages 329-338, 1993.
[JFS98] B.Th. Jonsson, M.J. Franklin, and D. Srivastava. Interaction of query evaluation and buffer management for information retrieval. In Haas and Tiwary [HT98], pages 118-129.
[Ses98] P. Seshadri. Enhanced abstract data types in object-relational databases. The VLDB Journal, 7(3):130-140, 1998.
[STA98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating mining with relational database systems: Alternatives and implications. In Haas and Tiwary [HT98], pages 343-354.
[VF95] C.L. Viles and J.C. French. Dissemination of collection wide information in a distributed information retrieval system. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'95), pages 167-174, 1995.
[VGJL95] E.M. Voorhees, N.K. Gupta, and B. Johnson-Laird. The collection fusion problem. In D.K. Harman, editor, Proceedings of the 3rd Text Retrieval Conference TREC-3, number 500-225 in NIST Special Publications, pages 95-104, 1995.
[VH99a] E.M. Voorhees and D.K. Harman. Overview of the 7th text retrieval conference. In Proceedings of the 7th Text Retrieval Conference TREC-7 [VH99b], pages 1-23.
[VH99b] E.M. Voorhees and D.K. Harman, editors. Proceedings of the Seventh Text Retrieval Conference (TREC-7), number 500-242 in NIST Special publications, 1999.
[XC96] J. Xu and W.B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th International Conference on Research and Development in Information Retrieval (SIGIR'96), pages 4-11, Zurich, Switzerland,