Arjen P. de Vries and Djoerd Hiemstra

{arjen,hiemstra}@ctit.utwente.nl

CTIT, University of Twente

The Netherlands

Abstract

The database group at University of Twente participates in TREC-8 using the Mirror DBMS, a prototype database system especially designed for multimedia and web retrieval. From a database perspective, the purpose has been to check whether we can get sufficient performance, and to prepare for the very large corpus track in which we plan to participate next year. From an IR perspective, the experiments have been designed to learn more about the effect of the global statistics on the ranking.

1 Introduction

The Mirror DBMS [dV99] combines content management and data management in a single system. The main advantage of such integration is the facility to combine IR with traditional data retrieval. Furthermore, IR researchers can experiment more easily with new retrieval models, using and combining various sources of information. This is an important benefit for advanced IR research; web retrieval, speech retrieval, and cross-language retrieval each require the use of several representations of content, which is hard to handle in the traditional file-based approach, and becomes too slow in traditional database systems.

In the Mirror DBMS, the IR retrieval model is completely integrated in the database architecture, emphasizing efficient set-oriented query processing. The support for information retrieval in our system is presented in detail in [dV98] and [dVW99]. It supports other types of media as well, which has been demonstrated in the image retrieval system prototype described in [dVvDBA99]. The main goal of our participation in TREC is to test if our system can handle larger data sets without too many problems. Also, we wanted to find out the effect of global statistics on the ranking.

This paper is organized as follows. Sections 2 and 3 review the design of the Mirror DBMS and its support for IR, and discuss its use for TREC processing. Section 4 explains the experimental setup and interprets our results. Section 5 discusses our experience with using the Mirror DBMS for TREC, followed by conclusions.

[Figure: three layered architecture diagrams composed of the labels Extension 1 ... Extension n (Enhanced ADT 1 ... Enhanced ADT n in the third diagram), query language, logical algebra, physical algebra (a combined `logical and physical algebras' layer in the third diagram), and storage layer.]

Figure 1: The multi-model DBMS architecture next to the extended relational and E-ADT DBMS architectures (from left to right).

2 Design

A complete overview and motivation of all aspects of the design of the Mirror DBMS is presented in [dV99]. Although following a traditional three-schema architecture, it uses different data models at different levels: we therefore classify its design as multi-model DBMS architecture. The crucial architectural difference from other extensible database systems is that query processing at the logical layer uses only operators that are provided by the physical layer (see also Figure 1), and domain-specific query processing (such as an IR extension) is defined at the logical level primarily. This choice enforces a system-wide physical data model and algebra spanning all extensions. Of course, the physical algebra can also be extended if necessary, i.e. when logical operations cannot be expressed efficiently in the physical algebra. The strict separation between the logical and physical levels allows using algebraic query optimization techniques, a key property of relational database management systems but hardly ever used in non-business application areas like content management.

The multi-model architecture provides the query processor with transparency through the layers. Put informally, query evaluation can `look down' from the original request through all layers of the architecture. This should enable set-oriented query evaluation for almost every request, and allow maximal exploitation of parallelization and pipelining. In contrast, the black-box ADTs of `object-relational' database systems restrict the DBMS in the possible manipulations of the query plans. This makes it more complicated to distribute and parallelize the query plans, or change the buffer strategy for iterative query processing as proposed in [JFS98]. Another alternative, the enhanced ADTs proposed by Seshadri [Ses98], provides little opportunity for optimizations that [...] architectures schematically.

3 Implementation

The prototype implementation of the Mirror DBMS uses Moa at the logical level, and Monet at the physical level. Monet is a parallel main-memory database system under development at the CWI in Amsterdam [BK95, BMK99], that is targeted as a backend system for various (query-intensive) application domains, such as GIS and data mining.¹ Moa is an object algebra studied in the database group at University of Twente, that is extensible with domain-specific structures. The Moa tools transform expressions in this algebra into sequences of operations in MIL, an algebra for the binary relational data model supported by Monet.

For the support of IR, we extended Moa with new structures at the logical level to handle document representation, ranking, and the computation of co-occurrence statistics. In combination with Moa's kernel support for collections and tuples, these structures can model a wide variety of IR retrieval models: the current prototype supports the well-known Okapi ranking scheme, InQuery's inference network retrieval model, as well as the linguistically motivated retrieval model (LMM, presented in Section 4.3). To illustrate, the following Moa expression ranks a collection of documents:

    map[sum(THIS)](
        map[getBL(THIS, query, stats)]( docs )
    );

The first map operation computes term probabilities for the query terms occurring in the document, using the global statistics specified in structure stats. The subsequent map combines these probabilities using a sum operation. Although this particular expression may not seem very interesting, the IR ranking operators can be combined with other operators such as select, resulting in a powerful query language.
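As a rough illustration only (Python, not Moa), the nested map expression can be read as two passes over the collection. The paper does not define getBL, so the belief function `get_bl`, the layout of `stats`, and the toy values below are assumptions made up for this sketch.

```python
import math

def get_bl(doc, query, stats):
    """Hypothetical stand-in for getBL: one belief per query term found in doc."""
    doc_len = len(doc)
    beliefs = []
    for term in query:
        tf = doc.count(term)
        if tf > 0:
            # Assumed belief: log(1 + normalized tf * normalized idf).
            beliefs.append(math.log(1.0 + (tf / doc_len) * stats[term]))
    return beliefs

def rank(docs, query, stats):
    # Inner map: compute the list of beliefs for every document.
    per_doc_beliefs = [get_bl(doc, query, stats) for doc in docs]
    # Outer map: combine each list of beliefs into a single score with sum.
    return [sum(beliefs) for beliefs in per_doc_beliefs]

docs = [["a", "c", "c", "a", "c"], ["a", "e", "b", "b", "e"]]
stats = {"a": 1.0, "b": 2.0}  # made-up normalized idf values
scores = rank(docs, ["a", "b"], stats)
```

The point of the sketch is only the shape of the computation: a per-document map that yields beliefs, followed by a per-document map that aggregates them.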

The representation of the logical IR structures at the physical level is termed the flattened representation of the content. It consists of three binary tables (BATs), storing the frequency $tf(t_i, d_j)$ of term $t_i$ in document $d_j$, for each term $t_i$ occurring in document $d_j$. Table 1 illustrates this for a collection $\{d_1, d_2\}$ with documents $d_1 = [a, c, c, a, c]$ and $d_2 = [a, e, b, b, e]$. Computing the probability of relevance of the objects for query $q = [a, b]$ proceeds as follows. First, a table with the query terms is joined with the document terms in ti (the result is called qti). Next, (using additional joins) the document identifiers and the term frequencies are looked up (qdj and qtfij).² Finally, the retrieval status values are computed with some variant of the popular tf·idf ranking formula.

document collection

    dj   ti   tfij
    1    a    2
    1    c    3
    2    a    1
    2    b    2
    2    e    2

intermediate results

    qdj  qti  qtfij  qntfij
    1    a    2      0.796578
    2    a    1      0.621442
    2    b    2      0.900426

Table 1: Representation of content in BATs

¹ Monet is used successfully on a commercial basis by Data Distilleries, a start-up specializing [...]
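The join steps described for Table 1 can be traced with a small Python sketch. Modelling each BAT as a dictionary keyed by a shared object identifier is an assumption of this example, not Monet's actual storage format, and the normalization that yields the qntfij column is not reproduced because the paper does not spell it out at this point.

```python
# Flattened representation of the document collection in Table 1:
# three binary tables that share the same object identifier (oid).
dj = {0: 1, 1: 1, 2: 2, 3: 2, 4: 2}             # oid -> document id
ti = {0: "a", 1: "c", 2: "a", 3: "b", 4: "e"}   # oid -> term
tfij = {0: 2, 1: 3, 2: 1, 3: 2, 4: 2}           # oid -> term frequency

query_terms = {"a", "b"}

# Step 1: join the query terms with ti; the surviving oids form qti.
qti = {oid: term for oid, term in ti.items() if term in query_terms}

# Step 2: look up document ids and term frequencies for those oids.
qdj = {oid: dj[oid] for oid in qti}
qtfij = {oid: tfij[oid] for oid in qti}

rows = [(qdj[oid], qti[oid], qtfij[oid]) for oid in sorted(qti)]
# rows == [(1, 'a', 2), (2, 'a', 1), (2, 'b', 2)],
# i.e. the first three columns of the intermediate results in Table 1.
```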

To support these computations, Monet's physical algebra has to be extended with new operators, either in C or C++, or as a MIL procedure. The latter is preferable for easy experimentation; for example, the following MIL procedure computes the term probabilities given normalized term frequency and inverse document frequency using the LMM model:

PROC bel( nidfi, ntfij ) := {

RETURN log( 1.0 + nidfi * ntfij * C );

}

An evaluation run processes 50 topics in batch, but the client interfaces of the Mirror DBMS have been designed for interactive sessions with an end-user. Also, transferring the data from Monet to the Moa client has not been implemented optimally. Furthermore, optimizations such as using materialized views are not performed in the current Moa rewriter. These minor flaws would have incurred an unfair performance penalty to the evaluation of the architecture, and made logging the results rather cumbersome. Therefore, as a (temporary) solution, the MIL program generated by the Moa rewriter has been manually edited to loop over the 50 topics, log the computed ranking for each topic, and use two additional tables, one with precomputed normalized inverse document frequencies (a materialized view), and one with the document-specific constants for normalizing the term frequencies.

² Note that these joins are executed very efficiently, because the Moa structures make sure [...]

4 Experimental setup and results

Collection fusion is the process of merging the results of retrieval runs on separate, autonomous document collections into an effective combined result [VGJL95]. We have focused on this problem because large collections will be fragmented (horizontally) in several partitions, each managed by a separate server. Maintaining the exact global statistics induces an extra overhead, that may not be necessary if the fragments are sufficiently large.

Collection fusion is a trivial task for exact matching retrieval systems like systems using Boolean retrieval, but more complicated if a ranked retrieval system is used. In a number of publications on collection fusion it is argued that simply comparing similarity measures across subcollections leads to unsatisfactory results because of differences in the collection-dependent frequency counts [Bau97, CLC95, VF95, VGJL95]. One of the objectives of the TREC-8 evaluation described in this paper is to question this hypothesis. We feel that similarity measures across subcollections might in fact be comparable, but show worse evaluation results because of the evaluation setup.

4.1 Evaluation using the TREC collection

Relevance assessments on the TREC test collections are assembled by the pooling method: a pool of possibly relevant documents is created by taking a sample of documents retrieved by each participating system. This pool is then shown to the human assessors [VH99a]. The sampling method used in TREC takes the top 100 of the retrieved documents of each participating system.
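A minimal sketch of the pooling method in Python, assuming each run is simply a ranked list of document identifiers; the run names and identifiers below are invented for illustration.

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents of every contributing run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

# Two hypothetical runs, pooled to depth 2 to keep the example small.
runs = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d3", "d5", "d1", "d6"],
}
pool = build_pool(runs, depth=2)
# d4 and d6 fall outside the pool, so they would never be shown to an assessor.
```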

Since the start of TREC in 1992, the test collections have been used in numerous evaluations outside the official TREC. For these evaluations, all documents that were not in the top 100 of any of the official participating systems are assumed to be not relevant. But, evaluations that did not contribute to the TREC pool probably have unjudged documents in the top 100, making these evaluations less reliable than the official TREC evaluation. This is especially true for new, previously unexplored approaches to retrieval. If a system finds relevant documents that no system was able to find before, then these documents will probably not be judged in an old TREC collection. The only way to check the relevance of these documents is by official TREC participation.

4.2 Conditions for naive collection fusion

Let us define `naive' collection fusion as the process of merging the search results on the subcollections based on the document similarities. The first condition for naive collection fusion is that each subcollection uses the same retrieval [...] subcollection uses the same indexing vocabulary [Bau97]. A third condition is that subcollections are sufficiently large to allow for the reliable local estimation of document frequencies. If the subcollections are too small, ineffective retrieval on the subcollections will affect the merged result.
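Naive collection fusion as defined above amounts to a merge of ranked lists on their raw similarity scores. The following Python sketch assumes each subcollection returns a list of (document id, score) pairs already sorted by descending score; the identifiers and scores are invented.

```python
import heapq
from itertools import islice

def naive_merge(results_per_subcollection, k=1000):
    """Merge per-subcollection rankings purely on their similarity scores."""
    # heapq.merge expects inputs sorted ascending by the key,
    # so the negated score gives a descending-score merge.
    merged = heapq.merge(*results_per_subcollection, key=lambda hit: -hit[1])
    return list(islice(merged, k))

ft = [("FT-1", 0.91), ("FT-2", 0.40)]
la = [("LA-1", 0.75), ("LA-2", 0.55)]
top3 = naive_merge([ft, la], k=3)
# top3 == [('FT-1', 0.91), ('LA-1', 0.75), ('LA-2', 0.55)]
```

No score normalization is applied here, which is exactly what makes the merge `naive': scores computed with local statistics are compared as if they were on one scale.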

An evaluation of Callan et al. [CLC95] under these conditions for TREC topics 51-150 showed that naive merging was significantly worse than ranking based on globally estimated document frequencies, causing losses from 10-20% in average precision. But, the results of naive merging reported by Callan et al. [CLC95] were not part of an official TREC participation. It is likely that their merged run has a worse coverage of judgements, because the TREC-2 and 3 pools were (almost) only created by systems that use a central index for retrieval. Maybe, their merged run was as good as the central index run after all. To check this hypothesis, we decided to put up a retrieval run using naive merging for judging.

4.3 Some theoretical back-up for naive merging

The Mirror DBMS uses the linguistically motivated probabilistic model of information retrieval [Hie99, HK99]. The model builds a simple statistical language model for each document in the collection. The probability that a query $T_1, T_2, \ldots, T_n$ of length $n$ is generated by the language model of the document with identifier $D$ is defined by the following equation:

$$P(T_1 = t_1, \ldots, T_n = t_n \mid D = d) = \prod_{i=1}^{n} \left( \alpha_1 \frac{df(t_i)}{\sum_t df(t)} + \alpha_2 \frac{tf(t_i, d)}{\sum_t tf(t, d)} \right) \qquad (1)$$
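Equation 1 can be evaluated directly on the toy collection of Table 1. In this Python sketch the two mixture weights of Equation 1 are called `alpha1` and `alpha2`; their values (0.85 and 0.15) are made up for the example, and every query term is assumed to occur in the vocabulary.

```python
from collections import Counter

def query_likelihood(query, doc, df, alpha1=0.85, alpha2=0.15):
    """Product over query terms of the collection and document components."""
    tf = Counter(doc)
    doc_len = sum(tf.values())    # sum over t of tf(t, d)
    df_total = sum(df.values())   # sum over t of df(t)
    p = 1.0
    for t in query:
        p *= alpha1 * df[t] / df_total + alpha2 * tf[t] / doc_len
    return p

d1 = ["a", "c", "c", "a", "c"]
d2 = ["a", "e", "b", "b", "e"]
df = {"a": 2, "b": 1, "c": 1, "e": 1}   # document frequencies over {d1, d2}

p1 = query_likelihood(["a", "b"], d1, df)   # 0.40 * 0.17
p2 = query_likelihood(["a", "b"], d2, df)   # 0.37 * 0.23
```

With these assumed weights, d2 scores higher than d1 because it contains both query terms, while d1 only receives the collection component for b.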

Equation 1 can be rewritten to a vector product formula by first dividing it by $\prod_{i=1}^{n} \left( \alpha_1 \, df(t_i) / \sum_t df(t) \right)$ [Hie99]. This will not affect the ranking within a subcollection, but it will affect the final ranking after merging the search results of the separate subcollections, because we divided by collection specific document frequencies. It can be shown that the ranking of the vector product formula in Table 2 approximates the ranking defined by the conditional probability $P(D \mid T_1, T_2, \ldots, T_n)$ of a document being relevant given a query.

vector product formula: $similarity(Q, D) = \sum_{k=1}^{l} w_{qk} \cdot w_{dk}$

query term weight: $w_{qk} = tf(t_k, q)$

document term weight: $w_{dk} = \log\left( 1 + \frac{tf(t_k, d)}{df(t_k) \sum_t tf(t, d)} \cdot \frac{\alpha_2 \sum_t df(t)}{\alpha_1} \right)$

Table 2: tf·idf term weighting algorithm

From Bayes' rule we know that dividing equation 1 by $P(T_1, T_2, \ldots, T_n)$ and multiplying it by $P(D)$ results in $P(D \mid T_1, T_2, \ldots, T_n)$. For a large collection and a query that has a small number of hits, $tf(t, d) = 0$ for most terms $t$ and documents $d$. Therefore, $\prod_{i=1}^{n} \left( \alpha_1 \, df(t_i) / \sum_t df(t) \right)$ approximates the marginal probability $P(T_1, T_2, \ldots, T_n)$ and the ranking defined by Table 2 approximates the ranking defined by $P(D \mid T_1, T_2, \ldots, T_n)$. The a-priori probability $P(D = d)$ of a document $d$ being relevant can be included by adding the logarithm of equation 2 to the similarities of Table 2 as a final step.

$$P(D = d) = \frac{\sum_t tf(t, d)}{\sum_t \sum_d tf(t, d)} \qquad (2)$$

We hypothesise that, if the approximation is not too far off, the result after merging is not significantly worse than what would have been possible with a central index.
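The weighting of Table 2 and the prior of equation 2 can be sketched in Python as follows. The mixture weights `alpha1` and `alpha2` (0.85 and 0.15) and the toy collection are assumptions of this example, and every query term is assumed to have a non-zero document frequency.

```python
import math
from collections import Counter

def similarity(query, doc, df, alpha1=0.85, alpha2=0.15):
    """Vector product of query term weights and document term weights."""
    tf_d = Counter(doc)
    tf_q = Counter(query)
    doc_len = sum(tf_d.values())    # sum over t of tf(t, d)
    df_total = sum(df.values())     # sum over t of df(t)
    score = 0.0
    for term, w_q in tf_q.items():
        ratio = (tf_d[term] * alpha2 * df_total) / (df[term] * doc_len * alpha1)
        score += w_q * math.log(1.0 + ratio)
    return score

def log_prior(doc, collection):
    """Logarithm of equation 2: document length over total collection length."""
    return math.log(len(doc) / sum(len(d) for d in collection))

d1 = ["a", "c", "c", "a", "c"]
d2 = ["a", "e", "b", "b", "e"]
df = {"a": 2, "b": 1, "c": 1, "e": 1}
q = ["a", "b"]

s1 = similarity(q, d1, df) + log_prior(d1, [d1, d2])
s2 = similarity(q, d2, df) + log_prior(d2, [d1, d2])
```

Within a single collection the divisor is the same for every document, so the difference s2 - s1 equals the logarithm of the ratio of the two equation 1 probabilities (here the priors are equal and cancel). Across subcollections the divisor differs, which is why the merged ranking is only an approximation.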

4.4 Official results

Table 3 lists the official TREC runs. Global runs denote runs using the global collection statistics. Local runs denote the naive collection fusion runs, using local collection statistics on the four TREC subcollections: Federal Register, Foreign Broadcast Information Services, Los Angeles Times and Financial Times.

    run name   description                                  avg. prec.
    UT800      global run                                   0.260
    UT803      global run; LCA                              0.176
    UT803b     global run; LCA from F. Times and LA Times   0.260
    UT810      local run (judged)                           0.043
    UT813      local run; LCA from local                    0.145

Table 3: official results

Unfortunately, our submitted official runs have been degraded by two bugs, that affected in particular the naive merging run that was judged by NIST. By our own mistake, the global runs have used the wrong (local) normalizing constant for the idf;³ an error in Monet's join implementation resulted in random answers for three of the four local runs. After fixing these bugs, the results of the global run UT800 improved from 0.260 to 0.275 and the results of the local run UT810 improved from 0.043 to 0.260. Table 4 lists the results on the four subcollections. Except for the Federal Register, which has hits for only 19 topics anyway, the average precision on the subcollections does not differ much at all. Unofficial runs, with these bugs fixed, are indicated in this paper by a `u' postfix (so `UT500u' is the fixed `UT500' run).

The merged local run is about 6% worse than the global run. This might be a significant difference according to some significance test, like e.g. the t-test [Hul93]; but, if so, it is still not valid to draw the conclusion that the global approach is indeed better than the naive merging approach. This conclusion would only be valid if both evaluations were done under identical, controlled, conditions; which they are not, because both runs were not judged by NIST and we do not control the other systems that contributed to the pool. Almost all systems that contributed to the TREC-8 pool were systems using the global approach. Therefore, the pool favours central index approaches over distributed index approaches if it is used to evaluate runs that did not contribute to the pool. This can be shown by looking at the percentage of documents that are judged for different cut-off levels of the fixed UT800u and UT810u runs. The percentage of documents in a run that are judged, will be called the judged fraction.

    run name          Fed. Reg.   FBIS    LA Times   F. Times   merged
    UT800u (global)   0.326       0.317   0.279      0.356      0.275
    UT810u (local)    0.351       0.319   0.276      0.356      0.260
    topics w. hits    19          43      45         49         50

Table 4: average precision per subcollection after bug-fix

³ [...]

a)

    run name          P at 10   P at 30   P at 100   P at R   avg. P
    UT800u (global)   0.496     0.378     0.234      0.319    0.275
    UT810u (local)    0.436     0.343     0.222      0.310    0.260

b)

    run name          J at 10   J at 30   J at 100   J at R
    UT800u (global)   1.000     1.000     0.996      0.987
    UT810u (local)    0.984     0.978     0.952      0.947

Table 5: merged results after bug-fix: a) precision; b) judged fraction

Table 5a and b show the precision and the judged fraction of the global and the local run at different cut-off levels. There is a major difference between the judged fractions of the global run and the local run. In the global run, 0.4% of the documents in the top 100 are unjudged. In the local run, 4.8% of the documents in the top 100 are unjudged, and some unjudged documents even appear in the top 10.
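The judged fraction at a cut-off level is straightforward to compute once the set of judged documents for a topic is known. A minimal Python sketch, assuming a ranking is a non-empty list of document identifiers and the judgements are available as a set (both invented here):

```python
def judged_fraction(ranking, judged, cutoff):
    """Share of the top-`cutoff` documents that have a relevance judgement."""
    top = ranking[:cutoff]
    return sum(1 for doc in top if doc in judged) / len(top)

ranking = ["d1", "d2", "d3", "d4"]   # hypothetical ranking for one topic
judged = {"d1", "d2", "d4"}          # documents that were in the pool

j_at_2 = judged_fraction(ranking, judged, 2)   # both top-2 documents judged
j_at_4 = judged_fraction(ranking, judged, 4)   # d3 was never judged
```

Averaging this value over all topics would give figures comparable to the J columns of Table 5; the averaging step is left out of the sketch.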

4.5 Local context analysis

Based on its success on InQuery at previous TREC conferences, we expected a significant improvement by using topics expanded with LCA [XC96]. Also, investigating the expansion terms, LCA seemed to do a good job. For example, on topic 311 (which is about industrial espionage), it finds terms like `spy', `intelligence', and `counterintelligence', and from the Financial Times sub-collection it even identifies `Opel', `Volkswagen', and `Lopez' as relevant terms. But, instead of improving the effectiveness of retrieval, the measured performance turned out to have degraded. Some tweaking of the parameters, reducing the weights of [...] upon the baseline, but only slightly; on the runs submitted for TREC-8, it has degraded performance.

A possible explanation for these disappointing results is that the algorithm has been applied to documents instead of passages (as done in [XC96]), and the TREC collection itself was used to find expansion terms instead of another, larger collection. One result was that the varying length of documents had a large impact on the expansion terms chosen, which is undesirable. Another explanation is that LMM weighting provides such a high baseline, that it is very hard to improve upon. A comparison of the (impressive) baseline results of LMM on TREC-6 favours the latter explanation: the performance of the Mirror DBMS with LMM weighting scheme, without LCA, was almost as good as InQuery's performance after using LCA. With the tweaked LCA, LMM weighting performed better on all reported precision and recall points, except for the precision at twenty retrieved documents, at which InQuery performed slightly better. On the TREC-8 topics it did not contribute positively to the results.

5 Discussion

Although more of anecdotal than scientific value, the story of our participation in TREC-8 with the Mirror DBMS illustrates the suitability of this architecture for experimental IR. Eight days before the deadline, it still seemed impossible to participate with this year's TREC, as Monet kept crashing while indexing the data; until, the seventh day, the new release suddenly made things work! We decided to try our luck and see how far we could get in a week; and we should admit, it has been a crazy week. It meant running the topics on TREC-6 first, to compare the results with the runs performed before; as well as changing the ranking formula to integrate document length normalization. In the weekend, we implemented the use of co-occurrence statistics (which has turned out to be not so useful as expected). So, in one week we managed to index the data, perform various experiments for calibration, run the best experiments on TREC-8, and submit five runs, just before the final deadline.

5.1 Efficiency

The machine on which the experiments have been performed is a Sun Ultra 4 with 1 Gb of main-memory, running SunOS 5.6. The machine is not a dedicated server, but shared with some other research groups as a `compute server'. Monet effectively claims one processor completely while indexing the collection, or processing the fifty topics on each of the sub-collections. The division of the complete collection in five sub-collections (as it comes on different compact discs) is maintained. The topics are first run in each sub-collection, and the [...] estimating the top 1000 ranking takes between 20 seconds and two minutes per topic. How to further improve this execution performance is discussed below.

Preparation of the five sub-collections takes about six hours in total. Computing the table with document-specific term frequencies is performed using Monet's module for crosstables. But, using the grouping operation for all documents at once allocates all available memory, and eventually crashes the DBMS because it cannot get more, if it is run on the complete set of documents of any but the smallest sub-collection.⁴ Therefore, the indexing scripts run on fragments of the sub-collections at a time, and frequently write intermediate results to disk, obviously slowing down the process more than necessary.

5.2 The road ahead

The execution performance of the Mirror DBMS on TREC is clearly better than a naive (nested-loop) implementation in any imperative programming language, but the obtained efficiency is not good enough to beat the better stand-alone IR systems that also participate in TREC. Compared to the techniques used in systems like InQuery (see [Bro95]), the current mapping between the logical and physical level is too straightforward: it does not use inverted files, has not fragmented the terms using their document frequency, and it ranks all documents even if only the beliefs for the top 1000 are used. Also, Monet should make it relatively easy to take advantage of parallelism in modern SMP workstations.

The merits of some possible improvements can only be evaluated experimentally. For example, it is not so clear beforehand whether inverted files are really the way to go. Query processing with inverted files requires merging the inverted lists before beliefs can be computed, which is hard to perform without thrashing the memory caches frequently; this has been shown to be a significant performance bottleneck on modern system architectures (see e.g. [BMK99] for experiments demonstrating this for Monet).

Without experiments, much improvement can be expected from fragmentation of the document representation BATs based on the document frequency, in combination with the `unsafe' techniques for ranking reported in [Bro95]. Some preliminary experiments indicate a 100 times improvement with only a small loss in precision. Such (domain-specific) optimization techniques are easy to integrate in the mapping from Moa structures to MIL, thanks to the declarative nature of the algebraic approach. A similar argument applies to extending the Mirror DBMS with the buffer management techniques discussed in [JFS98]. In MIL, buffer management is equivalent to directing Monet to load and unload its tables. By integrating such directives in the generated MIL programs, it is expected that these improvements can also be added without many complications.

⁴ Notice that such problems are not necessarily solved by using commercial systems; Sarawagi et al. report similar memory problems with DB2 when using normal SQL queries [...]

6 Conclusions

Without any additional algorithms, LMM ranking produces reasonably good results. Unfortunately, due to the bug in our experiments, we cannot yet give conclusive answers about the difference between using local or global statistics; but, we may conclude that the difference is rather small. Our current use of co-occurrence statistics has not improved our results, but further research is necessary in this area.

Despite the flaws in the current implementation, we believe that the Mirror DBMS has proven to be a useful platform for IR experiments on the TREC data. The true benefits of its design will only be exploited when the system is developed further, and the indexing task is more challenging. Next year, the Mirror DBMS should be ready to participate in the large WEB track.

Acknowledgements

Many thanks go to Peter Boncz and the other members of the CWI Monet team, for their great support. The work reported in this paper is funded in part by the Dutch Telematics Institute project DRUID.

References

[Bau97] C. Baumgarten. A probabilistic model of distributed information retrieval. In Proceedings of the 20th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'97), pages 258-266, 1997.

[BK95] P.A. Boncz and M.L. Kersten. Monet: An impressionist sketch of an advanced database system. In BIWIT'95: Basque international workshop on information technology, July 1995.

[BMK99] P.A. Boncz, S. Manegold, and M.L. Kersten. Database architecture optimized for the new bottleneck: Memory access. In Proceedings of 25th International Conference on Very Large Databases (VLDB'99), Edinburgh, Scotland, UK, September 1999. To appear.

[Bro95] E.W. Brown. Execution performance issues in full-text information retrieval. PhD thesis, University of Massachusetts, Amherst, October 1995. Also appears as technical report 95-81.

[CLC95] J.P. Callan, Z. Lu, and W.B. Croft. Searching distributed collections with inference networks. In Proceedings of the 18th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95), pages 21-28,

[dV98] A.P. de Vries. Mirror: Multimedia query processing in extensible databases. In Proceedings of the fourteenth Twente workshop on language technology (TWLT14): Language Technology in Multimedia Information Retrieval, pages 37-48, Enschede, The Netherlands, December 1998.

[dV99] A.P. de Vries. Content and multimedia database management systems. PhD thesis, University of Twente, Enschede, The Netherlands, December 1999. To appear.

[dVvDBA99] A.P. de Vries, M.G.L.M. van Doorn, H.M. Blanken, and P.M.G. Apers. The Mirror MMDBMS architecture. In Proceedings of 25th International Conference on Very Large Databases (VLDB'99), Edinburgh, Scotland, UK, September 1999. Technical demo.

[dVW99] A.P. de Vries and A.N. Wilschut. On the integration of IR and databases. In Database issues in multimedia; short paper proceedings, international conference on database semantics (DS-8), Rotorua, New Zealand, January 1999.

[Hie99] D. Hiemstra. A probabilistic justification for using tf·idf term weighting in information retrieval. International Journal on Digital Libraries, 1999. To appear.

[HK99] D. Hiemstra and W. Kraaij. Twenty-One at TREC-7: Ad-hoc and cross-language track. In Voorhees and Harman [VH99b].

[HT98] Laura M. Haas and Ashutosh Tiwary, editors. SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. ACM Press, 1998.

[Hul93] D. Hull. Using statistical testing in the evaluation of retrieval experiments. In Proceedings of the 16th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'93), pages 329-338, 1993.

[JFS98] B.Th. Jonsson, M.J. Franklin, and D. Srivastava. Interaction of query evaluation and buffer management for information retrieval. In Haas and Tiwary [HT98], pages 118-129.

[Ses98] P. Seshadri. Enhanced abstract data types in object-relational databases. The VLDB Journal, 7(3):130-140, 1998.

[STA98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating mining with relational database systems: Alternatives and implications. In Haas and Tiwary [HT98], pages 343-354.

[VF95] C.L. Viles and J.C. French. Dissemination of collection wide information in a distributed information retrieval system. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'95), pages 167-174, 1995.

[VGJL95] E.M. Voorhees, N.K. Gupta, and B. Johnson-Laird. The collection fusion problem. In D.K. Harman, editor, Proceedings of the 3rd Text Retrieval Conference TREC-3, number 500-225 in NIST Special Publications, pages 95-104, 1995.

[VH99a] E.M. Voorhees and D.K. Harman. Overview of the 7th text retrieval conference. In Proceedings of the 7th Text Retrieval Conference TREC-7 [VH99b], pages 1-23.

[VH99b] E.M. Voorhees and D.K. Harman, editors. Proceedings of the Seventh Text Retrieval Conference (TREC-7), number 500-242 in NIST Special publications, 1999.

[XC96] J. Xu and W.B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th International Conference on Research and Development in Information Retrieval (SIGIR'96), pages 4-11, Zurich, Switzerland,