Language Independent Search in MediaEval's Spoken Web Search Task

ScienceDirect

Computer Speech and Language xxx (2014) xxx–xxx

Florian Metze a,∗, Xavier Anguera c, Etienne Barnard b, Marelie Davel b, Guillaume Gravier d

a Carnegie Mellon University, Pittsburgh, PA, USA

b North-West University, Vanderbijlpark, South Africa

c Telefonica Research, Barcelona, Spain

d CNRS–IRISA, Rennes, France

Received 2 August 2012; received in revised form 13 October 2013; accepted 29 December 2013

Abstract

In this paper, we describe several approaches to language-independent spoken term detection and compare their performance on a common task, namely “Spoken Web Search”. The goal of this part of the MediaEval initiative is to perform low-resource language-independent audio search using audio as input. The data was taken from “spoken web” material collected over mobile phone connections by IBM India as well as from the LWAZI corpus of African languages. As part of the 2011 and 2012 MediaEval benchmark campaigns, a number of diverse systems were implemented by independent teams, and submitted to the “Spoken Web Search” task. This paper presents the 2011 and 2012 results, and compares the relative merits and weaknesses of approaches developed by participants, providing analysis and directions for future research, in order to improve voice access to spoken information in low resource settings.

Keywords: Low-resource speech technology; Evaluation; Spoken web; Spoken term detection

1. Introduction

In recent years, speech technology has emerged as an enabling technology for increasing the accessibility of information for a number of quite diverse use cases. These include searching large archives of audio-visual material, dialog systems for access to personal information and (mobile) web search, as well as applications in language learning and pronunciation training. A particularly deserving aspect of these is the potential of speech technologies to foster participation of disabled, low-literate, or “minority” users in the information society.

The last case has proven to be particularly challenging, because resources of any kind are usually scarce for minority languages, dialects and other non-mainstream conditions, which can therefore not be approached with the typical “there is no data like more data” engineering approach. Clearly, society would benefit greatly from the ability to easily process audio in any language (or dialect), or language independently, without having to spend resources on language-specific development, but significant research is still needed in that area.

∗ Corresponding author. Tel.: +1 412 2688984.

E-mail address: fmetze@cs.cmu.edu (F. Metze).

“Spoken Web Search” involves searching for audio content, within audio content, using an audio query, in a language, dialect, or domain for which only very limited resources are available. The original motivation for this task was to be able to provide voice-based access to spoken documents created by local community efforts in rural India. Because no experts are available to create and maintain dedicated speech dialog systems, acoustic similarity and keyword search could be used to access information. A caller would for example say “weather tomorrow” and retrieve a spoken document which contains the phrase “the weather tomorrow will be...” (in a matching dialect). While such a retrieval-based approach is clearly limited when compared to fully developed dialog systems, it is still preferable to not having any capability to access information at all. The fact that users in such applications will often be repeat callers, and therefore will be familiar with the system, also enhances the potential efficacy of this approach.

The main challenge is therefore to develop approaches to spoken term detection (rather than full speech-to-text) that scale to many languages, dialects, and domains very quickly, without requiring data and language or technology experts. An efficient large-scale deployment with many users is desirable, but not the primary goal. To solve this problem, two research avenues present themselves: port existing speech recognition approaches, or build dedicated solutions. When starting from existing speech recognition approaches and resources, techniques have to be found which will make them useful in low-resource settings. Multi-lingual modeling and cross-lingual (or -dialectal) transfer have been used in the past to do that. As an alternative, limited keyword search or acoustic pattern matching systems can be tailored specifically to the target use case. It may therefore be possible to develop them with relatively little data, or even zero (outside) resources.

To compare these two approaches, and analyze the trade-offs entailed by such a design decision, “Spoken Web Search” (SWS) was run as a challenge-style task at MediaEval 2011 (Rajput and Metze, 2011) and MediaEval 2012 (Metze et al., 2012). This evaluation attempts to provide a common evaluation corpus and baseline for research on language-independent search and retrieval of real-world speech data, with a special focus on low-resource languages.

In Section 2, we survey the field of low-resource acoustic pattern matching and spoken term detection. Section 3 presents the “Spoken Web Search” task as a public data set and evaluation campaign to initiate research and discussion in this research area. A unified view and discussion of the approaches implemented by the participants is given in Sections 4–7, following the different steps of a typical system. In Section 8, we discuss results achieved in the 2011 and 2012 evaluations, and provide research directions for the future.

2. Related work

The World Wide Web has changed the information landscape for the developed world, where citizens are used to accessing information about almost anything, anytime and on any device. The Web 2.0 has democratized the web further by enabling user-generated content through wikis, blogs, and more recently through social networking websites. However, the developing regions still face several challenges in being part of this information revolution: low literacy (Education, 2010) and lack of internet penetration (Internet, 2010) being some of them.

On the other hand, India, China, Brazil and Indonesia taken together have more mobile phones than the Top 50 developed countries taken together (Internet, 2010). Voice-based information systems are therefore evolving as an alternative to the text/visual web, and could potentially achieve significant penetration. Users in developing regions are now able to create, access and share content using just their voice and a phone. Several audio-based information systems have been deployed in the last five years. The Healthline (Sherwani et al., 2007) system provides reliable health information for community health workers in developing countries. The AudioWiki system (Kotkar et al., 2008) provides a repository of spoken content that can be modified through a low-cost phone and can support any language. Adoption issues have recently been studied in Raza et al. (2013). The Spoken Web system has been used in agriculture (Patel et al., 2010), employment (Kumar et al., 2008), social (Agarwal et al., 2009, 2010) and several other settings (Agarwal et al., 2010; Diao et al., 2010).

While these examples clearly illustrate the usefulness of a voice-based information system, they also pose challenging research problems. In some of these systems, most of the content is user-generated and in local languages and dialects. The content is therefore mostly spontaneous and colloquial in style, with a large amount and a high variety of background noise. Voice being a sequential modality, navigation by commands is often a challenge, suggesting that audio content search be applied in these information systems. Leaving aside questions of usability and deployment, efficient keyword-style search is a key requirement in order to create a usable and low-cost (for both the provider and consumer) audio information system for the target user group.

The “Spoken Web Search” task therefore sits at the intersection of two research domains that have recently seen significant activity, namely speech technology for under-resourced languages, as discussed above, and spoken term detection (STD).

Most research in speech technology has traditionally been conducted on a small set of well-resourced languages, but the need to extend such technology to many more languages is widely recognized (Patel et al., 2010; Barnard et al., 2010). A variety of approaches for tasks such as corpus development (de Vries et al., 2011) and rapid recognizer bootstrapping (Hughes et al., 2010; Schultz, 2000) have been developed and applied. Spoken Term Detection for under-resourced languages has recently attracted attention, in the form of systems (Hazen et al., 2009; Zhang and Glass, 2009; Muscariello et al., 2011) that employ dynamic time warping to find matches between query terms and the material to be searched. In order to achieve speaker independence, both query and reference speech are represented in terms of posteriorgrams – that is, a frame-synchronous series of vectors, each containing estimated posterior probabilities. In Hazen et al. (2009), these are posterior probabilities of the phonetic classes, whereas Zhang and Glass (2009) and Zhang et al. (2012) employ posterior probabilities of classes determined with unsupervised clustering. Note that the query is itself assumed to exist in spoken form – hence, these methods start from somewhat different assumptions than the conventional STD systems, which assume queries in text form. Spoken input of query terms is characterized as “query by example” in Hazen et al. (2009).

Conventional state-of-the-art STD methods employ general-purpose speech-recognition systems to generate a lattice of word hypotheses (Miller et al., 2007; Chelba et al., 2008). The lattice is then used to create an index of word occurrences within the audio data, and the corresponding confidence scores for each of these detections. During retrieval, it is then simply a matter of finding those word strings that correspond to the query terms. One complication to this conceptually simple approach is that the search terms may not be present in the recognition vocabulary that was used for speech recognition. Hence, alternative representations that can also capture out-of-vocabulary words are required – for example, each word in the recognized lattice can be expressed in terms of its constituent phones. During retrieval, out-of-vocabulary words can then be found by matching their sub-word (for example, phonetic) representation against such phone-based lattices (Wallace et al., 2007). Other sub-word decompositions have also been employed successfully (Szöke, 2010), but in most cases detection of out-of-vocabulary words remains substantially inferior to that of in-vocabulary words.

Although the query by example methods in Hazen et al. (2009), Zhang and Glass (2009), and Zhang et al. (2012) achieve promising detection rates, retrieval is significantly more demanding computationally than with index-based approaches (Wallace et al., 2007). In Hazen et al. (2009), it is therefore recommended that such methods be used as a rescoring mechanism for terms retrieved by a cruder (index-based) approach. Also, these methods have to date been assessed on well-resourced languages; hence, issues in their application in real under-resourced environments have not been explored.

The work presented in this paper is different from IARPA’s currently active “Babel” program (IARPA, 2011) in two key aspects: we typically use an order of magnitude less data per language, and the focus is on language independent approaches, rather than a capability to rapidly bootstrap systems in new languages. SWS-like technologies could also be helpful to the intelligence or military community, for example in quickly changing theaters of operations that span multiple linguistic groups, to develop triage capabilities for intercepted radio communications, or as part of other tactical solutions. Another useful aspect of the described work is that it could be used to implement (initial) speech processing capabilities that non-speech recognition experts can apply in other research contexts (Kumar et al., 2013). More discussion of related work in ASR can be found in Miller et al. (2007), Shen et al. (2009), Larson et al. (2012), and (Zero resource, 2012).

3. Spoken Web Search at MediaEval

MediaEval is a benchmarking initiative dedicated to evaluating new algorithms for multimedia access and retrieval (MediaEval, 2014). The “Spoken Web Search” (SWS) task was run in 2011 (Rajput and Metze, 2011), 2012 (Metze et al., 2012), and 2013 (MediaEval, 2013), and will again be run in October 2014, following a model in which participants receive labeled development data, before receiving unseen evaluation data a couple of months later, on which they blindly submit results to the organizers for scoring.

Table 1

Development (Dev) and evaluation (Eval) corpora used for the “Spoken Web Search Task” at MediaEval (Metze et al., 2012, 2013). Even though we pose the evaluation as a “Spoken Term Detection” problem, SWS can also be seen as an Information Retrieval (IR) problem, involving “queries” and “documents”.

Category       2011 (“Indian”)                     2012 (“African”)
               # Utts   Total (h:mm:ss)  Avg. (s)  # Utts   Total (h:mm:ss)  Avg. (s)
Dev docs       400      2:02:22          18.3      1,580    3:41:52          8.4
Dev queries    64       0:01:19          1.2       100      0:02:22          1.4
Eval docs      200      0:47:04          14.1      1,660    3:52:32          8.4
Eval queries   36       0:00:58          1.6       100      0:02:32          1.5
Total          700      2:51:42          14.7      3,440    7:35:18          7.9

By design, the pilot “Indian” dataset, which was used in the 2011 evaluation (Rajput and Metze, 2011), and was retained as a “progress” set for the 2012 evaluation (Metze et al., 2012), consisted of only 700 utterances in telephony quality (8 kHz/16 bit) from four Indian languages (English, Gujarati, Hindi, Telugu). The data was provided by the Spoken Web team at IBM Research India (Kumar et al., 2007) for research purposes. The “African” dataset replaced the Indian data as primary condition in 2012 to provide more data and variety, while attempting to match the Indian dataset’s overall characteristics. It comprises more than 3000 utterances from isiNdebele, Siswati, Tshivenda, and Xitsonga, taken from the LWAZI corpus (Barnard et al., 2009).

For the African dataset, query terms ranging from one to three words per term were selected from the development and evaluation sets, in such a way as to produce a range of occurrence frequencies in both these sets. No speaker overlap was allowed between the development set, the evaluation set and/or the query set. As some of the languages are agglutinative, possible search terms that overlap with very frequently occurring words were excluded from the set of queries used. If for example ‘asajamile’ occurs frequently, then ‘ajamile’ would be excluded. Asking participants to make this distinction would require them to perform morphological analysis – a task outside the scope of the current challenge. Table 1 lists the characteristics of the SWS datasets.

Languages were represented equally within the respective sets, and language labels were provided only on the development data, in order to represent a realistic scenario. Word-level transcriptions (and corresponding phonetic dictionary entries) were also made available for the training and development sections of the data only. The task therefore required researchers to build a language-independent audio search system so that, given a query, it should be able to find the appropriate audio file(s) and the (approximate) location of a query term within the audio file(s).

Targets were defined by an exact string match of the query term in the reference transcription. A target was “hit” if the system returned a positive detection decision within a temporal window around the reference alignment, while it was “missed” if no positive detection was returned. As detailed in Section 3.1, the temporal “cushion” was set to a large value for the “Indian” data, because no exact temporal alignment was available, and utterances were quite short anyway.

By design, performing language identification, followed by standard speech-to-text, is not an appropriate approach to SWS, because full-fledged recognizers are typically not available in these languages.

In order to not restrict participation, the use of external resources was permitted, as long as their use had been declared (“open” condition). Systems that were only developed on the provided data are called “restricted”. Throughout this paper, we will use the nomenclature shown in Table 2 when discussing and comparing approaches and systems.

3.1. Evaluation and scoring

SWS experiments were scored with a modified version of the NIST 2006 STD evaluation scoring software (Fiscus et al., 2007). The primary evaluation metric was ATWV (Actual Term-Weighted Value), while MTWV (Maximum Term-Weighted Value) was also reported. According to Fiscus et al. (2007), the Term-Weighted Value is computed as a function of the miss and false alarm probabilities (Pmiss, PFA) averaged over all query terms. The “Actual” TWV is obtained by computing the value for a given operating point set by the participant, while the “Max” TWV is determined by selecting the optimum operating point over all queries.

Table 2

Classification of (primary) SWS submissions: “open” means that external data sources were used, “restricted” means that only the resources provided during that year’s evaluation were used. A “symbol-based” system computes similarities at some symbolic level, while a “frame-based” system computes and sums distances over time during decision making.

Approach       Condition
               2011 Open               2011 Restricted          2012 Open                   2012 Restricted
Symbol-based   Barnard et al. (2011),  –                        Abad and Astudillo (2012),  Szöke et al. (2012)
               Szöke et al. (2011),                             Varona et al. (2012), and
               and Mantena et al.                               Buzo et al. (2012)
               (2011)
Frame-based    –                       Anguera (2011) and       Wang and Lee (2012)         Anguera (2012), Jansen et al.
                                       Muscariello and                                      (2012), Joder et al. (2012),
                                       Gravier (2011)                                       and Vavrek et al. (2012)

As no accurate word alignment was available for the “Indian” data, a hypothesized match was considered correct provided that it occurred in the correct reference file, by setting wide temporal padding parameters. On the “African” data, accurate ground truth alignments were available, therefore these parameters were set back to the NIST standard parameters. Additionally, in 2012 a modification was incorporated to weight missed and false alarm detections differently in the final metric. In the default NIST setting, the impact of a false alarm is three orders of magnitude that of a miss. While these settings are adequate for a large data monitoring scenario, they might not be appropriate for a retrieval scenario, like the one proposed in this evaluation. The exact impact of an individual false alarm detection varies with the length of the reference database and the number of query terms used, so that care has to be taken when partitioning or combining datasets or query lists. For SWS 2012, settings in the NIST scoring scripts were modified to ensure that Pmiss = PFA for an equal number of misses and false alarms in the output.

A patched scoring package was distributed along with the data.
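
To make the metric described in this section concrete, the following is a minimal Python sketch of the term-weighted value. It assumes the usual NIST STD 2006 formulation, TWV = 1 − average over terms of (Pmiss + β · PFA), with false alarm trials counted at one per second of speech and a default of β = 999.9 (consistent with the three orders of magnitude mentioned above). The function names and data layout are hypothetical choices for illustration; this is not the patched scoring package, and the modified SWS 2012 weighting is not reproduced here (a different β can be passed in).

```python
def term_weighted_value(counts, speech_seconds, beta=999.9):
    """TWV from per-term counts (n_true, n_correct, n_false_alarm).

    beta=999.9 is assumed to be the NIST STD 2006 default; any other
    weighting has to be passed in explicitly.
    """
    penalties = []
    for n_true, n_correct, n_false_alarm in counts:
        if n_true == 0:
            continue  # P_miss is undefined for terms that never occur
        p_miss = 1.0 - n_correct / n_true
        # Assumption: one non-target trial per second of speech.
        p_fa = n_false_alarm / (speech_seconds - n_true)
        penalties.append(p_miss + beta * p_fa)
    if not penalties:
        raise ValueError("no term occurs in the reference")
    return 1.0 - sum(penalties) / len(penalties)


def twv_at_threshold(detections, n_true, speech_seconds, threshold, beta=999.9):
    """Apply one global decision threshold, then score (as for ATWV).

    detections: {term: [(score, is_correct), ...]}, n_true: {term: count}.
    """
    counts = []
    for term, n_ref in n_true.items():
        kept = [ok for score, ok in detections.get(term, []) if score >= threshold]
        n_correct = sum(1 for ok in kept if ok)
        counts.append((n_ref, n_correct, len(kept) - n_correct))
    return term_weighted_value(counts, speech_seconds, beta)


def max_twv(detections, n_true, speech_seconds, beta=999.9):
    """MTWV: best TWV over all candidate thresholds (the observed scores)."""
    thresholds = {score for hits in detections.values() for score, _ in hits}
    thresholds.add(float("inf"))  # rejecting every detection
    return max(
        twv_at_threshold(detections, n_true, speech_seconds, t, beta)
        for t in thresholds
    )
```

In this sketch, a system that detects everything without false alarms reaches a TWV of 1, a system that returns nothing reaches 0, and negative values are possible when false alarms dominate.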

3.2. Overall system architecture

The 2011 pilot evaluation attracted the interest of 5 sites, while 9 teams participated in the 2012 evaluation. Competing systems implemented a wide range of solutions. Nevertheless, all systems follow the same overall architecture, illustrated in Figure 1. The first step is front-end processing, which comprises several steps such as silence detection, feature extraction and normalization. The second step corresponds to the actual search, where the database is matched against the query using either a symbolic representation or frame-based pattern matching. The search procedure computes a score for each attempted match for the utterances in the database. The last step is the decision making step, where actual answers to the query are selected from the search result. This step is often limited to the comparison to a decision threshold, possibly after score normalization.

We review each of these three steps in turn in the next sections: “Front-End” and “Database” in Section 4, Frame-based “Search” in Section 5, Symbolic “Search” in Section 6, and the “Decision” step in Section 7.

4. Front-end

In this section, we review fundamental techniques used for low-resource STD, such as feature extraction, feature normalization, voice activity detection and tokenizer development.

4.1. Feature extraction

The most common acoustic features included standard MFCC (Joder et al., 2012; Wang and Lee, 2012; Vavrek et al., 2012; Barnard et al., 2011; Mantena et al., 2011), Bottle-neck (Szöke et al., 2011, 2012), Frequency domain linear prediction (FDLP-S) (Jansen et al., 2012) and PLP features (Abad and Astudillo, 2012). Apart from being used in the development of full-fledged tokenizers (cf. Section 4.4), these features were also used in direct frame-based comparisons, or as a preliminary step to obtain posteriorgram-based features (Wang and Lee, 2012; Muscariello and Gravier, 2011).

Posteriorgrams were obtained in different manners. One approach was the use of phoneme tokenizers trained using external data. In this case the posterior probabilities of each phoneme were concatenated into a feature vector. As phoneme tokenizers are inherently dependent on a language, other common approaches were to use posterior probabilities from language independent models such as GMMs directly trained on the data (Anguera, 2011, 2012; Muscariello and Gravier, 2011; Wang and Lee, 2012) or using automatically derived acoustic units obtained on the development dataset (Wang and Lee, 2012).
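
As an illustration of the GMM-based variant, the sketch below turns a sequence of feature frames into a posteriorgram, given an already trained diagonal-covariance GMM. It is a minimal example under stated assumptions: the GMM parameters (`weights`, `means`, `variances`) are taken as given, training is not shown, and the function names are hypothetical rather than taken from any participant's system.

```python
import math


def gaussian_log_density(frame, mean, var):
    """Log density of a diagonal-covariance Gaussian at one feature frame."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(frame, mean, var)
    )


def posteriorgram(frames, weights, means, variances):
    """Map each frame to the posterior probabilities of the GMM components."""
    result = []
    for frame in frames:
        log_joint = [
            math.log(w) + gaussian_log_density(frame, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(log_joint)  # subtract the maximum for numerical stability
        unnorm = [math.exp(lj - peak) for lj in log_joint]
        total = sum(unnorm)
        result.append([u / total for u in unnorm])
    return result
```

Each output vector sums to one, so query and reference frames can be compared as probability distributions over components instead of raw spectral features.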

4.2. Feature normalization

Some systems applied methods to normalize the features so as to minimize dependencies on speakers and acoustic conditions, which is especially relevant for frame-based systems. In particular, it was observed that applying cepstral mean and variance normalization improved the matching accuracies (Muscariello and Gravier, 2011; Anguera, 2012; Joder et al., 2012). In addition, two systems showed substantial improvements by applying vocal tract length normalization (VTLN) (Wang and Lee, 2012; Szöke et al., 2012).
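
Cepstral mean and variance normalization can be sketched in a few lines. The example below normalizes each feature dimension over one utterance; whether participants normalized per utterance, per speaker or over a sliding window is not stated here, so the per-utterance scope and the name `cmvn` are assumptions made for illustration.

```python
import math


def cmvn(frames, eps=1e-8):
    """Per-utterance mean and variance normalization of feature frames.

    frames: non-empty list of equally long feature vectors (e.g. MFCCs).
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    # eps guards against division by zero for constant dimensions
    return [
        [(f[d] - means[d]) / max(stds[d], eps) for d in range(dims)]
        for f in frames
    ]
```

After this step every dimension has zero mean and (unless it is constant) unit variance within the utterance, which removes stationary channel offsets before frames are compared.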

4.3. Voice activity detection

Given the spontaneous nature of the acoustic data and the fact that in MediaEval 2012 some of the queries contained multiple words, variable amounts of non-speech were present in the recordings, which always causes a problem for frame-based matching systems and can potentially lead symbol-based systems to obtain wrong decodings. For this reason, several systems proposed and implemented various voice activity detection algorithms (Anguera, 2012; Wang and Lee, 2012; Jansen et al., 2012; Szöke et al., 2012). Examples include unsupervised training of a 2-class speech/non-speech classifier using GMMs on MFCC features (Anguera, 2012), or the use of posteriors from given phone tokenizers (Szöke et al., 2012).

4.4. Tokenizer development

All symbol-based and a few frame-based approaches required the development of a full-fledged phonetic tokenizer, or a set of tokenizers. For symbol-based approaches, the tokenizers produce phone strings or lattices for further analysis; in the frame-based approach, the tokenizer is used to generate posteriorgrams.

The vast majority of systems based phonetic tokenizers on hidden Markov models (Mantena et al., 2011; Barnard et al., 2011; Varona et al., 2012; Szöke et al., 2011), using standard training procedures. Most teams utilized external data in order to optimize the accuracy of their tokenizers, apart from Wang and Lee (2012), where phonetic-like units were automatically derived from the development data. This is described in more detail in Section 6.3.

Systems varied quite significantly with regard to the actual number of base units in the model; typically, symbol-based techniques are very sensitive toward this parameter. Base units were in all cases approximations of phones or phone groups, with phone sets mostly reduced, often quite aggressively, for example, from 77 to 28 in the case of Buzo et al. (2012) and 62 to 43, and then to 21 in Barnard et al. (2011). Systems tried to compensate for limited acoustic data (and language mismatch when using external data) by modeling broad phonemic classes rather than detailed phonemes. This reduction was mostly achieved by merging similar phones based on their IPA identity (Buzo et al., 2012; Varona et al., 2012; Barnard et al., 2011), and in one case, by using articulatory features to combine models (Mantena et al., 2011).

5. Frame-based approaches

Frame-based approaches (often also referred to as pattern-based or pattern matching approaches) perform the matching of query audio and the reference data at the acoustic feature frame level. Such matching relies only on the local similarity between frame pairs and the subsequent time alignment (allowing for frame insertions and deletions) of pairs in sequence. All frame-based approaches submitted to the evaluation use dynamic programming algorithms inspired by Dynamic Time Warping (DTW) for this time alignment, often implementing segmental versions to account for unknown start and end points of matches in the search database. Several variations were proposed, depending on the front-end processing and on the time-alignment algorithm.

When used in MediaEval’s frame-based approaches, posteriorgrams turned out to generally outperform raw acoustic features such as MFCCs due to their increased robustness to speaker and acoustic variability.

Many variants of dynamic alignments were used for frame-based search. Standard DTW was considered in some systems, with start and end points determined prior to DTW. In Mantena et al. (2011), a rough acoustic decoding step based on articulatory features is used to find putative matching regions on which DTW is to be applied. A brute-force approach was taken in Szöke et al. (2011), where the DTW similarity is calculated for all the possible combinations of start and end points. As an alternative to standard DTW, segmental variants were used, where start and end points were decided as part of the optimal alignments rather than defined in a pre-processing step. In Muscariello and Gravier (2011), segmental locally normalized DTW (Muscariello et al., 2012) was used to find potential matches of the query. In Anguera (2011), a sub-sequence DTW algorithm (Anguera and Ferrarons, 2013) was used, using either posterior probability features or binary features for efficiency. Another variant of DTW, called “cumulative DTW”, was used in Joder et al. (2012), where the usual maximization is replaced by a soft-max rule. Moreover, the pairwise local distances were replaced by step functions resulting from the combination of feature functions where the combination parameters were learned from the data.
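
The idea of letting the alignment itself choose the start and end points can be illustrated with a textbook subsequence DTW. The sketch below is a generic formulation, not the algorithm of any particular submission: the step pattern (diagonal, vertical, horizontal), the normalization by query length, and the function name are all assumptions made for illustration, and the local distance `dist` is supplied by the caller.

```python
def subsequence_dtw(query, reference, dist):
    """Find the best match of `query` inside `reference`.

    Returns (cost, start, end): the accumulated distance divided by the
    query length, and the first and last matched reference indices.
    Both sequences are assumed to be non-empty.
    """
    n, m = len(query), len(reference)
    inf = float("inf")
    # acc[i][j]: cost of the best path that aligns query[0..i] and
    # ends at reference[j]; start[i][j]: where that path began.
    acc = [[inf] * m for _ in range(n)]
    start = [[0] * m for _ in range(n)]
    for j in range(m):
        acc[0][j] = dist(query[0], reference[j])  # a match may start anywhere
        start[0][j] = j
    for i in range(1, n):
        acc[i][0] = acc[i - 1][0] + dist(query[i], reference[0])
        for j in range(1, m):
            best_cost, best_start = min(
                (acc[i - 1][j - 1], start[i - 1][j - 1]),  # diagonal step
                (acc[i - 1][j], start[i - 1][j]),          # query advances
                (acc[i][j - 1], start[i][j - 1]),          # reference advances
            )
            acc[i][j] = best_cost + dist(query[i], reference[j])
            start[i][j] = best_start
    end = min(range(m), key=lambda j: acc[n - 1][j])  # a match may end anywhere
    return acc[n - 1][end] / n, start[n - 1][end], end
```

For posteriorgram frames, `dist` would typically compare two probability vectors, whereas for raw features a Euclidean distance is a common choice; the sketch leaves that decision open. Its cost is proportional to the product of query and reference lengths, which is one reason indexing and pruning strategies matter for larger databases.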

Interestingly, a different search strategy was used in Jansen et al. (2012), based on Hough transforms to find near-diagonal lines in a sparse similarity matrix obtained from locality sensitive hashing (LSH) of raw features. The combination of a fast Hough transform and frame indexing (Jansen and Durme, 2012) offers substantial potential in terms of speed and scalability.

Each of the search strategies mentioned above returned a set of putative matching segments which were, in most cases, post-processed to refine the matches before making a decision. In Vavrek et al. (2012), SVM classification was applied to the output of the DTW alignment to decide whether the alignment corresponds to a match or not. In Muscariello and Gravier (2011), an image-based comparison of the query and reference segments represented as self-similarity matrices was performed to increase robustness to speaker and acoustic conditions (Muscariello et al., 2012). A self-similarity matrix is a square, symmetric matrix of distances computed between the individual frames of an utterance. If two sequences are similar, so are the respective self-similarity matrices, whose difference can therefore itself serve as a distance measure. Similarly, putative matches resulting from the Hough transform in Jansen et al. (2012) were further validated using standard DTW.
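
The self-similarity idea can be sketched as follows. This is a simplified illustration, not the image-based comparison of Muscariello et al. (2012): it assumes that the two segments already have the same number of frames (for example after time alignment), uses a Euclidean frame distance, and compares the matrices by their mean absolute difference; all names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def self_similarity_matrix(frames, dist=euclidean):
    """Square, symmetric matrix of distances between all frame pairs."""
    return [[dist(a, b) for b in frames] for a in frames]


def matrix_distance(ssm_a, ssm_b):
    """Mean absolute difference between two equally sized matrices."""
    n = len(ssm_a)
    if n != len(ssm_b):
        raise ValueError("matrices must have the same size")
    total = sum(
        abs(x - y)
        for row_a, row_b in zip(ssm_a, ssm_b)
        for x, y in zip(row_a, row_b)
    )
    return total / (n * n)
```

Because only distances within each utterance enter the matrix, a constant offset added to all frames of one segment leaves its matrix unchanged, which gives an intuition for the robustness to speaker and channel effects mentioned above.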

6. Symbol-based approaches

Symbol-based approaches to the SWS task first convert the query and the content into a symbolic representation, on which the best match is then computed without taking temporal alignment into account. While these symbols can, in principle, be any categorical representation, all SWS submissions used symbol sets that were either based on phones or broad phonemic classes, defined at a coarser granularity than would typically be used in standard speech recognition.

Since the success of symbol-based approaches relies heavily on the accuracy of the tokenizer, it is not surprising that most participants used tokenizers from the “open” category, and used additional resources during system development. Only one participant (Szöke et al., 2012) created a symbol-based system without utilizing any external resources.

Two main approaches were used:

• Acoustic keyword spotting (AKWS), in which a tokenized query string competes with a background and/or filler model during decoding (Abad and Astudillo, 2012; Szöke et al., 2012, 2011).

• String matching, using a form of DTW at the symbol level. Queries were mostly converted to single strings, while content utterances were alternatively represented as strings, n-best lists, lattices or confusion networks (Varona et al., 2012; Buzo et al., 2012; Barnard et al., 2011; Mantena et al., 2011).

6.1. Acoustic keyword spotting

Apart from the development of the actual tokenizers (cf. Section 4.4), approaches differed mainly with regard to the architecture of the AKWS system and normalizations applied (cf. Section 7).

The system in Szöke et al. (2012) created an HMM for each query, and calculated the log-likelihood ratio between the query and a background/filler model (Schwarz, 2009; Szöke et al., 2005), implementing the background model as a free phone loop without any weighting. In Abad and Astudillo (2012), a hybrid ANN/HMM approach is employed, i.e., the HMM is used to model the speech signal, and the ANN to estimate the posterior phone probabilities. A sliding window was used to process each file, with a uniform 1-gram language model formed by the target query and a competing speech background model being used, and the weight of the background model tuned on the development set.

In all cases query pronunciations were obtained by tokenizing the audio automatically (Szöke et al., 2012, 2011), with Abad and Astudillo (2012) optimizing on a development set to obtain the correct number of phonemes. All systems used single pronunciations, with some experiments in creating additional variants yielding no improvement (Abad and Astudillo, 2012). In a separate experiment (inadmissible for the primary condition) the sensitivity of the system was tested toward the accuracy of the derived dictionary, finding that force-aligned transcriptions resulted in a significant TWV increase of about 0.28 (Szöke et al., 2012).

6.2. String matching

All string matching approaches used some form of DTW to match query to content, and differed mainly with regard to the content representation used and the cost function employed during DTW. Apart from one use of an articulatory-inspired phone set (only differentiating between articulation effects rather than individual phones) (Mantena et al., 2011), all systems used fairly standard phone sets, reducing the number of phonemes for increased generalization and improved accuracy given that these had been usually trained with data from other languages.

Basic string-based DTW, with each content utterance also being represented as a single string, was implemented by most participants (Buzo et al., 2012; Barnard et al., 2011; Mantena et al., 2011). The audio is tokenized into strings, DTW is applied with a sliding window, and some form of filtering is used to remove overlapping matches. In addition, DTW search was also performed within confusion networks (Mangu et al., 2000), with the score weighted according to the alignment of the query with the confusion network (Buzo et al., 2012), and within lattices (Barnard et al., 2011; Varona et al., 2012). Specifically, the Varona et al. (2012) system extracted the N phone decodings with the highest likelihoods from the phone lattice and then converted them to multigrams (Wang et al., 2011). Results were filtered based on rank as well as score, after normalization.

One of the key elements of string matching is the cost matrix, which holds the cost of substituting one unit (e.g., phone) with another. Cost matrices were either flat (Mantena et al., 2011), linguistically motivated (Barnard et al., 2011), or estimated from development data.
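
The role of the substitution cost can be illustrated with a small approximate substring search over phone strings. The sketch below is a standard dynamic programming formulation with a free start and end in the content string; it is not the implementation of any submission. The insertion/deletion cost of 1, the toy broad-class table `BROAD_CLASS` and the cost of 0.5 within a class are hypothetical values chosen only for the example.

```python
def best_substring_match(query, content, sub_cost, indel_cost=1.0):
    """Return (cost, end) of the cheapest match of `query` inside `content`.

    `end` is the number of content symbols consumed up to the match end.
    """
    n = len(query)
    # prev[i]: cheapest cost of aligning query[:i] with some content
    # substring that ends at the previous content position.
    prev = [i * indel_cost for i in range(n + 1)]
    best = (prev[n], 0)
    for j, symbol in enumerate(content, start=1):
        cur = [0.0]  # an empty query prefix matches anywhere at no cost
        for i in range(1, n + 1):
            cur.append(min(
                prev[i - 1] + sub_cost(query[i - 1], symbol),  # substitute/match
                prev[i] + indel_cost,     # extra symbol in the content
                cur[i - 1] + indel_cost,  # query symbol missing in the content
            ))
        if cur[n] < best[0]:
            best = (cur[n], j)
        prev = cur
    return best


def flat_cost(a, b):
    """Flat cost matrix: every substitution is equally expensive."""
    return 0.0 if a == b else 1.0


# Hypothetical broad classes, for illustration only.
BROAD_CLASS = {"p": "stop", "b": "stop", "t": "stop", "k": "stop",
               "a": "vowel", "o": "vowel"}


def class_cost(a, b):
    """Linguistically motivated cost: cheaper within a broad class."""
    if a == b:
        return 0.0
    class_a = BROAD_CLASS.get(a)
    same_class = class_a is not None and class_a == BROAD_CLASS.get(b)
    return 0.5 if same_class else 1.0
```

With a flat matrix, confusing two stops costs as much as confusing a stop with a vowel; a class-based or data-driven matrix lets plausible tokenizer confusions count for less, which is the motivation given above for the non-flat variants.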

6.3. Cross-language data sharing

While the main objective of the “Spoken Web Search” task is to search languages with very limited training resources, many systems in the MediaEval evaluation found it useful to utilize data resources from other languages.

Systems in the “open” category differed substantially with regard to the amount and type of data that were incorporated from additional sources; an approach that was only used as basis for symbol-based techniques. Only two groups used data of the same language family as the target data, these being Telugu (Mantena et al., 2011) and Hindi (Barnard et al., 2011) in MediaEval 2011. For the rest, data was sourced from various other languages (Romanian, Czech, English, Hungarian, Levantine, Polish, Russian, Slovak, European and Brazilian Portuguese, European Spanish, American English, etc.), often based on the BUT phone recognizers (Phoneme recognizer, 2009).

External data was incorporated into primary systems by either using the foreign models directly, tokenizing the queries and content using the foreign tokenizers directly (Szöke et al., 2011; Abad and Astudillo, 2012), or by developing tokenizers with foreign data, and then adapting these. Adaptation was performed in different ways. The Buzo et al. (2012) system used MAP adaptation, first low-pass filtering broad-band data, then mapping phones using International Phonetic Alphabet tables or a confusion matrix, tuning the phone error rate on the MediaEval development set. The Barnard et al. (2011) system MAP-adapted the same source data to the four target languages to create four separate tokenizers. The Szöke et al. (2011) system used language-specific Karhunen–Loève transforms during adaptation.

7. Normalization and decision making

As a result of the search step, a set of possible matches is obtained for every query term. For scoring, a match must include the matching utterance ID and the start-end points where the query is thought to appear, together with a relevance score. Several methods have been proposed in both symbol-based and frame-based approaches to improve the matching results at this point. These include score normalization techniques and fusion of different system outputs.

7.1. Score normalization

The queries that were used in the evaluation have very different acoustic characteristics. On the one hand, their length (before any voice activity detection was applied) ranged from 0.39 s to 4.12 s in the development set, and from 0.38 s to 5.96 s in the evaluation set. On the other hand, some queries consisted of one single word while others had two or more words. In addition, each phonetic class has different average matching scores: stable parts in vowels and silences have a very good intra-class match, for example, while consonants achieve lower direct matching scores. For these reasons, the distributions of scores for reference matching sequences to each query usually differ quite a bit among queries.

Several methods have been proposed to normalize such scores in order to allow for the application of a single optimal detection threshold. In Wang and Lee (2012), Joder et al. (2012), and Jansen et al. (2012) various flavors of z-norm normalization were applied. In Wang and Lee (2012) the normalization mean and variance for each query in the test was estimated by using development data. In contrast, Jansen et al. (2012) used the set of possible matches of test queries with the test data to compute such parameters. Similarly, in Joder et al. (2012) the test data was used to find appropriate normalization parameters, although the authors avoid using the 10% best-matching scores to avoid a bias with the actual matches.
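
A per-query z-norm can be sketched as below. The example is a generic illustration rather than any participant's exact recipe: it assumes that higher scores indicate better matches, estimates the statistics from the scores of the query itself, and offers an optional `trim_top` parameter (a hypothetical name) that leaves the best-scoring fraction out of the estimate, in the spirit of the 10% exclusion mentioned above.

```python
import math


def znorm_scores(scores, trim_top=0.0, eps=1e-8):
    """Z-normalize the match scores of a single query.

    trim_top: fraction of the highest scores excluded when estimating
    the mean and standard deviation (assumes higher means better).
    """
    if not scores:
        return []
    ranked = sorted(scores)
    n_drop = int(len(ranked) * trim_top)
    # fall back to all scores if trimming would leave nothing
    background = ranked[: len(ranked) - n_drop] or ranked
    mean = sum(background) / len(background)
    var = sum((s - mean) ** 2 for s in background) / len(background)
    std = max(math.sqrt(var), eps)  # eps avoids division by zero
    return [(s - mean) / std for s in scores]
```

After normalization, the scores of different queries live on a comparable scale, so a single global threshold can be applied when making detection decisions.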

A totally different approach was followed by Szöke et al. (2012), where a linear regression model was trained using development data to predict the ideal threshold. Parameters such as the query length, total amount of detected silence in the query, number of phonemes, and so on were used. In all cases, the systems were reported to improve results by using such approaches.
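
The regression idea can be sketched in a few lines of Python. This is a deliberately reduced illustration, assuming a single predictor (query duration) and ordinary least squares; the system described above used several query parameters, and the function names `fit_threshold_model` and `predict_threshold` are hypothetical.

```python
def fit_threshold_model(durations, best_thresholds):
    """Least-squares fit of: threshold = a * duration + b.

    durations: query durations from development data (one predictor only,
               which is a simplification made for this sketch).
    best_thresholds: the per-query threshold that worked best on that data.
    """
    n = len(durations)
    mean_x = sum(durations) / n
    mean_y = sum(best_thresholds) / n
    sxx = sum((x - mean_x) ** 2 for x in durations)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(durations, best_thresholds))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict_threshold(model, duration):
    """Predict a detection threshold for an unseen query."""
    a, b = model
    return a * duration + b
```

A model fitted on development queries would then be applied to each evaluation query before thresholding its candidate matches.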

7.2. Intra-system fusion

Several groups submitted the output of a fusion of multiple systems as their primary submission, and reported consistent gains. A multitude of techniques were tried:

Wang and Lee (2012) achieved system combination (or fusion) by averaging the distance matrices computed with different tokenizers before computing DTW (Wang et al., 2013). Individual tokenizers were based on different training data, or target sets. When going from a two-system combination to a seven-system combination, gains reach 0.1 (in terms of ATWV) on the dev data, and 0.15 on the eval data; when going from a five-system combination to the seven-system combination, gains are less than or equal to 0.02.
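
The principle of combining distance matrices before alignment can be sketched as follows. This is a simplified illustration and not the CUHK implementation: it assumes equally weighted, equally sized query-by-document distance matrices, a basic subsequence DTW with three local steps, and normalization by query length; the names `average_matrices` and `subsequence_dtw_cost` are hypothetical.

```python
def average_matrices(matrices):
    """Element-wise mean of equally sized distance matrices (lists of rows)."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / len(matrices)
             for j in range(cols)]
            for i in range(rows)]


def subsequence_dtw_cost(dist):
    """Lowest cost of aligning the whole query (rows) to any contiguous
    region of the document (columns), divided by the query length."""
    n_query, n_doc = len(dist), len(dist[0])
    prev = list(dist[0])  # the match may start at any document frame
    for i in range(1, n_query):
        cur = [float("inf")] * n_doc
        cur[0] = prev[0] + dist[i][0]
        for j in range(1, n_doc):
            # Allowed predecessors: vertical, diagonal, horizontal.
            cur[j] = dist[i][j] + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return min(prev) / n_query  # the match may end at any document frame
```

Each input matrix would come from one tokenizer (for example, distances between posteriorgram frames); a single DTW pass is then run on the averaged matrix instead of fusing the scores of separate DTW runs.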

Abad and Astudillo (2012) explored system combination using “AND”, “OR”, and “MAJORITY” operations on four individual sub-system outputs. Majority voting was found to give the best result, but no performance numbers have been published for the individual sub-systems.
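
The three voting operations can be illustrated with a short Python sketch. It makes a strong simplifying assumption that is not taken from the cited work: detections from different sub-systems are already mapped to common identifiers (here a pair of query and utterance IDs), so that agreement reduces to set membership. The function name `combine_detections` is hypothetical.

```python
from collections import Counter


def combine_detections(system_outputs, mode="MAJORITY"):
    """Combine detections from several sub-systems by voting.

    system_outputs: list of sets, one per sub-system; each element identifies
        one detection (assumption: a (query_id, utterance_id) pair).
    mode: "AND" keeps detections found by all systems, "OR" those found by
        at least one, "MAJORITY" those found by more than half.
    """
    votes = Counter(d for output in system_outputs for d in set(output))
    n = len(system_outputs)
    needed = {"AND": n, "OR": 1, "MAJORITY": n // 2 + 1}[mode]
    return {d for d, count in votes.items() if count >= needed}
```

In practice, detections carry time stamps, so deciding whether two systems agree also requires a rule for overlapping time spans, which this sketch leaves out.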

Figure 2. Results (ATWV plot and MTWV) for evaluation terms on the 2011 eval data (Metze et al., 2012). The operating point of the participants’ submissions for their respective ATWV is indicated by the marker on the line. Axes: Miss probability (in %) and False Alarm probability (in %). Legend: Random Performance; BUT-HCTLabs: MTWV = 0.131; IIITH: MTWV = 0.000; IRISA: MTWV = 0.000; MUST: MTWV = 0.114; TID: MTWV = 0.173.

In a different line of thought, pseudo-relevance feedback was used in Wang and Lee (2012), where the top matches obtained from the original query were used to rescore the remaining matches, with the goal of obtaining a better score estimation. In Anguera (2012) and Mantena et al. (2011) overlap detection on the resulting matches was used to merge overlapping results.
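
Merging overlapping results can be sketched as below. The cited systems do not specify their merging rule in this text, so the choices made here are assumptions: matches of the same query that overlap in time within one utterance are replaced by their union, and the best score is kept. The function name `merge_overlapping` is hypothetical.

```python
def merge_overlapping(matches):
    """Merge overlapping detections of one query.

    matches: list of (utterance_id, start, end, score) tuples with times in
        seconds (assumption: a higher score means a better match).
    Returns a list sorted by utterance and start time, in which overlapping
    detections are merged into one span carrying the best score.
    """
    merged = []
    for utt, start, end, score in sorted(matches):
        if merged and merged[-1][0] == utt and start <= merged[-1][2]:
            _, prev_start, prev_end, prev_score = merged[-1]
            merged[-1] = (utt, prev_start, max(prev_end, end),
                          max(prev_score, score))
        else:
            merged.append((utt, start, end, score))
    return merged
```

Such a step avoids counting several slightly shifted alignments of the same occurrence as separate detections, which would otherwise inflate the number of false alarms.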

8. Discussion

Results for all primary and the most relevant contrastive systems are presented in Figures 2 and 3 for the 2011 and 2012 benchmarks respectively. We also report ATWV in Table 3 for the 2012 data. Note that the 2011 and 2012 results were established on two distinct data sets, using different scoring parameters, and are therefore not directly comparable. In spite of this difference in the data set, it is clear that progress has been made between the two benchmarks, with new pre-processing, tokenization and normalization techniques appearing in 2012. In the following, we first review and comment on evaluation results, before presenting fusion experiments and initial analysis of how language properties impact detection performance.

Figure 3. ATWV plots and MTWV results for evaluation terms on the 2012 eval data (Metze et al., 2013). Operating points are not shown for clarity of presentation. Figures 2 and 4 show the difference between well-tuned operating points for the scoring defined on “Indian” and “African” data sets by the organizers. Legend: Random Performance; ARF: MTWV = 0.310; BUT: MTWV = 0.530; BUT-g: MTWV = 0.488; CUHK: MTWV = 0.762; CUHK-g: MTWV = 0.643; JHU-HLTCOE: MTWV = 0.384; L2F: MTWV = 0.523; TUKE: MTWV = 0.000; TUM: MTWV = 0.296; TID: MTWV = 0.342; GTTS: MTWV = 0.081.

Table 3
Results (actual TWV) for selected SWS 2012 systems (Metze et al., 2013).

| System | Type | Dev | Eval | See |
|---|---|---|---|---|
| cuhk phnrecgmmasm p-fusionprf (CUHK) | Open | 0.782 | 0.743 | Wang and Lee (2012) |
| cuhk spchp-gmmasmprf (CUHK-g) | Restricted | 0.678 | 0.635 | Wang and Lee (2012) |
| l2f12spchp-phonetic4fusionmv | Open | 0.531 | 0.520 | Abad and Astudillo (2012) |
| butspchp-akws-devterms (BUT) | Open | 0.488 | 0.492 | Szöke et al. (2012) |
| butspchg-DTW-devterms (BUT-g) | Open | 0.443 | 0.448 | Szöke et al. (2012) |
| jhuallspchp-rails (JHU-HLTCOE) | Restricted | 0.381 | 0.369 | Jansen et al. (2012) |
| tidsws2012IRDTW | Restricted | 0.387 | 0.330 | Anguera (2012) |
| tumspchp-cdtw | Restricted | 0.263 | 0.290 | Joder et al. (2012) |
| arfspchp-asrDTWAlignw15a08b04 | Open | 0.411 | 0.245 | Buzo et al. (2012) |
| gttsspchp-phonelattice | Open | 0.098 | 0.081 | Varona et al. (2012) |
| tukespchp-dtwsvm | Restricted | 0 | 0 | Vavrek et al. (2012) |

While in 2011 few systems implemented frame-based approaches using pattern matching techniques, such approaches were implemented in the majority of the 2012 submissions. Moreover, in both benchmarks, the best results were obtained by template matching systems. Frame-based, template matching techniques are gaining interest in the community and can achieve the same performance as symbol-based approaches on the “operating point” chosen for MediaEval with respect to amount and kind of data. The BUT experiments (Szöke et al., 2012) show that not having a lexicon available greatly impacts the performance of AKWS systems, which are limited to extracting information from one, isolated query only in this scenario. Computation may be an issue for frame-based systems, but techniques have been developed to search even large amounts of data efficiently (Jansen et al., 2012). The best systems tend to combine multiple representations and techniques, achieving significant gains in the process; we speculate that SWS could be an interesting, low-complexity test-bed to develop complementary data representations, matching techniques, and search approaches. It is however interesting to note that under the SWS evaluation conditions, the zero-knowledge (“restricted”) approaches performed quite similarly to “open” (typically model-based) approaches, which typically rely on the availability of matching data from other languages. On the 2012 evaluation, the difference in ATWV is about 0.1 for the two CUHK systems (Wang and Lee, 2012) and 0.05 for the two BUT systems (Szöke et al., 2012).

As a first observation, the larger set of participants also resulted in a variety of new signal-processing techniques being introduced or old techniques being re-introduced, such as Vocal Tract Length Normalization (VTLN), which boosted performance significantly for the CUHK system (Wang and Lee, 2012), although no detailed analysis has been performed on this isolated aspect. Similarly, cepstral mean subtraction and variance normalization have been successfully applied to most pattern matching systems. The 2011 results had also alerted participants to the importance of silence segmentation for this type of task, as silence segments should not be counted in any distance or matching function. With model-based tokenizers increasingly exploiting techniques from speech and speaker recognition, we expect that normalization techniques, such as VTLN, SAT or factor analysis, will be adapted to the spoken Web search task in the near future.

Secondly, efforts have been devoted between 2011 and 2012 to the selection of suitable acoustic units in the tokenizers. A greater variety of languages were used, and data-driven units have successfully been combined with phonetic units in the “open” condition. The best performing system in 2012 is a frame-based system linearly combining similarity matrices obtained in different ways, either in a restricted mode (GMM, self-trained acoustic units) or in an open mode (phone models from various languages). Intermediate results reported in Wang and Lee (2012) show that the combination of phone models from 5 different languages (cz, hu, ru, ma, en) gives better results than the combination of GMM and ASM posteriorgrams (ATWV 0.72 vs. 0.59 on the 2012 evaluation data), while using all 7 tokenizers outperforms both settings (ATWV of 0.74). A more detailed description of this system is available in Wang et al. (2013). The combination of distance matrices obtained from posteriorgrams with different tokenizers could also have contributed to an increased robustness to speaker variability.


Table 4
TWV values on the 2012 data set for the different African languages.

| Language | ATWV (Open) | ATWV (Restricted) | MTWV (Open) | MTWV (Restricted) |
|---|---|---|---|---|
| IsiNdebele | 0.609 | 0.512 | 0.717 | 0.593 |
| Siswati | 0.644 | 0.583 | 0.782 | 0.709 |
| Tshivenda | 0.718 | 0.604 | 0.718 | 0.612 |
| Xitsonga | 0.650 | 0.559 | 0.650 | 0.613 |
| All | 0.698 | 0.586 | 0.763 | 0.658 |

Finally, score normalization techniques derived from speaker verification were introduced by most participants in 2012 so as to normalize scores on a per-query basis, to good effect.

Most participants chose a z-norm-like scheme, normalizing scores with respect to the query, which is fairly easy to implement, and reported benefits of such a normalization in the working note papers (MediaEval Benchmark, 2011, 2012). Various flavors of z-norm were used (under different names), but the absence of contrastive results does not allow us to recommend one over another. At the 2012 evaluation workshop, participants expressed an interest in exploring t-norm techniques, aimed at normalizing the decision score with respect to the document in which the query is searched for, thereby integrating more insight from speaker recognition work. While such techniques would undoubtedly improve all approaches, their implementation is both algorithmically complex and computationally intensive and was not considered so far.

In light of these elements of analysis, we believe that speaker normalization techniques (VTLN, mean and variance normalization), along with the combination of multiple distance matrices obtained from different posteriorgrams, are the key features explaining the success of the CUHK system (Wang and Lee, 2012; Wang et al., 2013) over other participants.

Focusing on frame-based systems, results from TID (Anguera, 2012; Mantena and Anguera, 2013) and JHU-HLTCOE (Jansen et al., 2012) are of particular interest: both systems implement indexing techniques for efficient pattern matching, where other frame-based approaches rely on segmental variants of the dynamic time warping algorithm. Such variants are typically computationally expensive and severely limit the scalability of frame-based approaches. In contrast, indexing techniques enable the efficient computation of a sparse similarity matrix, whose sparsity in turn enables fast matching. While the two approaches exploiting indexing techniques do not outperform others, they exhibit a fair performance level while being scalable. These two systems thus clearly demonstrate that combining indexing techniques such as LSH or efficient approximate nearest neighbor search with pattern matching is a valid research trend for fast and scalable language-independent spoken term detection. During the evaluation, participants were encouraged to measure and report the computational requirements of their approaches; however, the wide variety of resources used makes a fair comparison between systems difficult at this point.

8.1. Multi-site fusion

Similarly to the well-known ROVER approach to combining multiple speech-to-text systems (Fiscus, 1997), the results of multiple, independent STD systems can also be combined. Several methods can be employed to generate appropriate combination weights, such as maximum entropy or linear regression (Norouzian et al., 2012). For symbol-based systems, the combination of pronunciation dictionaries, which were generated using different approaches, is viable (Wang, 2010). This approach however is not suitable for dictionary-less systems, such as the “frame-based” approaches discussed here. In any case, scores or posteriors need to be normalized suitably across systems for successful combination. Several well-performing participants also performed system combination using score averaging (Wang and Lee, 2012) or voting (Abad and Astudillo, 2012).

Figure 4. Results (ATWV plots and MTWV) on the 2012 evaluation data on a per language basis for a combination of open systems. Legend: Random Performance; ALL: MTWV = 0.763; IsiNdebele: MTWV = 0.717; Siswati: MTWV = 0.782; Tshivenda: MTWV = 0.718; Xitsonga: MTWV = 0.650.

To combine the output of all MediaEval submissions, we employed the CombMNZ algorithm (Fox and Shaw, 1994). CombMNZ is a general data-fusion technique, which still requires some score normalization, as previously discussed. In the authors’ own work on IARPA’s Babel program (IARPA, 2011), this algorithm almost always provided the best performance across a wide range of conditions, and was certainly the most robust fusion technique. Table 4 shows the results of this “fused” system for the 2012 evaluation data. Figures 4 and 5 show that the corresponding curves are somewhat more “well-behaved”, even if the Maximum TWV could not be improved, so there is some benefit from system fusion even in this case.
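
The CombMNZ rule sums the normalized scores that the individual systems assign to an item and multiplies this sum by the number of systems that returned the item. The sketch below illustrates the rule only and is not the organizers' fusion code: it assumes that detections from different systems are already mapped to common keys, and it uses per-system min-max normalization, which is one common choice rather than the one documented for this evaluation. The function name `comb_mnz` is hypothetical.

```python
def comb_mnz(system_scores):
    """CombMNZ fusion of several scored result lists.

    system_scores: list of dicts, one per system, mapping a detection key
        to its raw score (assumption: higher means better).
    Returns a dict mapping each detection key to its fused score.
    """
    summed, hits = {}, {}
    for scores in system_scores:
        if not scores:
            continue  # a system without detections contributes nothing
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0  # guard against identical scores
        for key, score in scores.items():
            # Min-max normalization to [0, 1] is an assumption of this sketch.
            summed[key] = summed.get(key, 0.0) + (score - low) / span
            hits[key] = hits.get(key, 0) + 1
    return {key: summed[key] * hits[key] for key in summed}
```

Multiplying by the hit count rewards detections that several systems agree on, which is one plausible reason for the smoother curves of the fused system noted above.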

Given the large advantage that the CUHK systems (Wang and Lee, 2012) enjoyed over the other participants in the 2012 evaluation, the organizers were not able to improve the performance significantly by merging multiple system outputs (ranked lists), but the combination of several other systems provided good gains, for both the “open” and “restricted” cases. It may be possible to achieve gains by using a matrix combination technique as described in Wang et al. (2013).

Figure 5. Legend: Random Performance; ALL: MTWV = 0.658; IsiNdebele: MTWV = 0.593; Siswati: MTWV = 0.709; Tshivenda: MTWV = 0.612; Xitsonga: MTWV = 0.559.


8.2. Influence of language

Figures 4 and 5 provide a breakdown of the 2012 evaluation results for the four languages represented in the database. Results are provided for a fusion of the best open-resource systems and a fusion of the best restricted systems, open systems performing slightly better than restricted ones. In both cases, Xitsonga appears more difficult than the other languages while Siswati yields the best performance. We believe that these results can be partially explained by the average word length, which differs between languages, where a longer average word length leads to better results. Regardless of the word length consideration, one can expect that languages most similar to Germanic languages, i.e. Tshivenda and Xitsonga, benefit most from the open condition (Zulu et al., 2008). However, this linguistically motivated expectation was not met (see Table 4), probably because of the influence of stronger non-linguistic factors such as word length and the randomness due to the choice of the queries.

9. Conclusion and outlook

The capability to detect spoken terms or recognize keywords in low or zero resource settings is an important capability which could boost the use of speech technology in developing regions significantly. When there are neither experts, who could develop speech recognition systems and maintain the infrastructure required for designing speech dialogs and indexing audio content, nor databases on which speech interfaces could be developed, zero resource spoken term detection as presented here could provide a “winning combination”. In this paper we presented the main findings of the “Spoken Web Search” task within the MediaEval evaluation benchmark, as run in 2011 and 2012. We believe the results achieved in the evaluations show that these techniques could be applied in practical settings, even though user tests are certainly needed to determine the overall performance, and an acceptable ratio of false alarms and missed detections for a given application. It is interesting to note that the very diverse systems presented, which cover a wide range of possible approaches, could achieve very similar results, and future work should include more evaluation criteria, such as amount of external data used, processing time(s), etc., which were deliberately left unrestricted in this evaluation, to encourage participation.

With respect to the amount of actual data available, the SWS task is much harder than the research goals proposed by, for example, IARPA’s Babel (IARPA, 2011) program, where up to 100 h of transcribed data per language are available, and the language of a test query is known. The SWS task is targeted primarily at communities that currently do not have access to the Internet at all. Many target users have low literacy skills, and many speak in languages or dialects for which fully developed speech recognition systems will not exist even for years to come. We hope that the recent surge of activity in zero resource approaches (see e.g. Zero resource, 2012; Jansen et al., 2013) will result in further progress, which will advance the state of the art in spoken term detection and document retrieval significantly, specifically when large data sets and databases are not available.

Acknowledgments

The authors would like to acknowledge the MediaEval Multimedia Benchmark (MediaEval Benchmark, 2012). We especially thank Martha Larson from TU Delft for organizing this event, and the participants for their hard work on this evaluation. The “Spoken Web Search” task was originally proposed by researchers from IBM India (Agarwal et al., 2010). North-West University and IBM Research India collected and provided the audio data and references used in these evaluations.

References

Abad, A., Astudillo, R.F., 2012. The L2F spoken web search system for MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/

Agarwal, S., Kumar, A., Nanavati, A.A., Rajput, N., 2009. Content creation and dissemination by-and-for users in rural areas. In: Proc. Intl. Conf. Information and Communication Technologies and Development (ICTD), Doha, Qatar.

Agarwal, S., Dhanesha, K., Jain, A., Kumar, A., Menon, S., Nanavati, A., Rajput, N., Srivastava, K., Srivastava, S., 2010. Organizational, social and operational implications in delivering ICT solutions: a telecom web case-study. In: Proc. ICTD, London, UK.


Agarwal, S., Jain, A., Kumar, A., Rajput, N., 2010. The world wide telecom web browser. In: Proc. First ACM Symposium on Computing for Development. ACM, London, UK.

Anguera, X., Ferrarons, M., 2013. Memory efficient subsequence DTW for query-by-example spoken term detection. In: ICME: International Conference on Multimedia and Expo, San Jose, CA, USA.

Anguera, X., 2012. Telefonica system for the spoken web search task at MediaEval 2011. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/

Anguera, X., 2012. Telefonica research system for the spoken web search task at MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/

Barnard, E., Davel, M.H., van Heerden, C., 2009. ASR corpus design for resource-scarce languages. In: Proc. INTERSPEECH. ISCA, Brighton, UK, pp. 2847–2850.

Barnard, E., van Schalkwyk, J., van Heerden, C., Moreno, P.J., 2010. Voice search for development. In: Proc. INTERSPEECH. ISCA, Makuhari, Japan, pp. 282–285.

Barnard, E., Davel, M.H., van Heerden, C., Kleynhans, N., Bali, K., 2011. Phone recognition for spoken web search. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/

Buzo, A., Cucu, H., Safta, M., Ionescu, B., Burileanu, C., 2012. ARF @ MediaEval 2012: a Romanian ASR-based approach to spoken term detection. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/

Chelba, C., Hazen, T.J., Saraçlar, M., 2008. Retrieval and browsing of spoken content. IEEE Sign. Process. Mag. 25 (3), 39–49.

de Vries, N.J., Badenhorst, J., Davel, M.H., Barnard, E., de Waal, A., 2011. Woefzela – an open-source platform for ASR data collection in the developing world. In: Proc. INTERSPEECH. ISCA, Florence, Italy, pp. 3177–3180.

Diao, M., Mukherjea, S., Rajput, N., Srivastava, K., 2010. Faceted search and browsing of audio content on spoken web. In: CIKM ’10: Proceedings of the nineteenth international conference on Information and knowledge management, Toronto, Canada.

Education for all global monitoring report – reaching the marginalized, 2010. http://unesdoc.unesco.org/images/0018/001866/186606E.pdf. Last accessed: March 1, 2014.

Fiscus, J., Ajot, J., Garofolo, J., Doddington, G., 2007. Results of the 2006 spoken term detection evaluation. In: Proc. SSCS, Amsterdam, Netherlands.

Fiscus, J., 1997. A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER). In: Proc. Automatic Speech Recognition and Understanding Workshop. IEEE, Santa Barbara, CA, USA, pp. 347–354.

Fox, E.A., Shaw, J.A., 1994. Combination of multiple searches. In: Proc. 2nd Text REtrieval Conference (TREC-2), Gaithersburg, MD, USA, pp. 243–252.

Hazen, T.J., Shen, W., White, C., 2009. Query-by-example spoken term detection using phonetic posteriorgram templates. In: Proc. ASRU. IEEE, Merano, Italy.

Hughes, T., Nakajima, K., Ha, L., Vasu, A., Moreno, P., LeBeau, M., 2010. Building transcribed speech corpora quickly and cheaply for many languages. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Makuhari, Japan, pp. 1914–1917.

Intelligence Advanced Research Projects Activity, 2011. IARPA-BAA-11-02, http://www.iarpa.gov/Programs/ia/Babel/babel.html. Last accessed: March 1, 2014.

Internet Usage World-Wide by Country, 2010. http://www.infoplease.com/ipa/A0933606.html. Last accessed: March 1, 2014.

Jansen, A., Durme, B.V., 2012. Indexing raw acoustic features for scalable zero resource search. In: Proc. INTERSPEECH. ISCA, Portland, OR, USA.

Jansen, A., van Durme, B., Clark, P., 2012. The JHU-HLTCOE spoken web search system for MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/

Jansen, A., Dupoux, E., Goldwater, S., Johnson, M., Khudanpur, S., Church, K., Feldman, N., Hermansky, H., Metze, F., Rose, R., Seltzer, M., Clark, P., McGraw, I., Varadarajan, B., Bennett, E., Borschinger, B., Chiu, J., Dunbar, E., Fourtassi, A., Harwath, D., Lee, C.-Y., Levin, K., Norouzian, A., Peddinti, V., Richardson, R., Schatz, T., Thomas, S., 2013. A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition. In: Proc. ICASSP. IEEE, Vancouver, BC, Canada.

Joder, C., Weninger, F., Wöllmer, M., Schuller, B., 2012. The TUM cumulative DTW approach for the MediaEval 2012 spoken web search task. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/

Kotkar, P., Thies, W., Amarasinghe, S., 2008. An audio Wiki for publishing user-generated content in the developing world. In: HCI for Community and International Development (Workshop at CHI), Florence, Italy.

Kumar, A., Rajput, N., Chakraborty, D., Agarwal, S., Nanavati, A.A., 2007. WWTW: a world wide telecom web for developing regions. In: ACM SIGCOMM Workshop on Networked Systems For Developing Regions, Kyoto, Japan.

Kumar, A., Rajput, N., Agarwal, S., Chakraborty, D., Nanavati, A.A., 2008. Organizing the unorganized – employing it to empower the under-privileged. In: Proceedings of the World Wide Web, Beijing, China.

Kumar, A., Metze, F., Wang, W., Kam, M., 2013. Formalizing expert knowledge for developing accurate speech recognizers. In: Proc. INTERSPEECH. ISCA, Lyon, France.

Larson, M., de Jong, F., Kraaij, W., Renals, S., 2012. Introduction to special issue on searching speech. ACM Trans. Inform. Systems 30 (3).

Mangu, L., Brill, E., Stolcke, A., 2000. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Comput. Speech Language 14 (4), 373–400.

Mantena, G., Anguera, X., 2013. Speed improvements to information retrieval-based dynamic time warping using hierarchical k-means clustering. In: Proc. ICASSP. IEEE, Vancouver, Canada.

Mantena, G.V., Babu, B., Prahallad, K., 2011. SWS task: articulatory phonetic units and sliding DTW. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/


MediaEval Benchmark, MediaEval 2012 Workshop, http://www.multimediaeval.org/mediaeval2013/; http://ceur-ws.org/Vol-1043/

MediaEval Benchmark, MediaEval 2013 Workshop, 2013, http://www.multimediaeval.org/mediaeval2013/; http://ceur-ws.org/Vol-1043/

MediaEval Benchmark, 2014. http://www.multimediaeval.org/

Metze, F., Rajput, N., Anguera, X., Davel, M., Gravier, G., van Heerden, C., Mantena, G.V., Muscariello, A., Prahallad, K., Szöke, I., Tejedor, J., 2012. The spoken web search task at MediaEval 2011. In: Proc. ICASSP. IEEE, Kyoto, Japan.

Metze, F., Barnard, E., Davel, M., van Heerden, C., Anguera, X., Gravier, G., Rajput, N., 2012. The spoken web search task. In: Proc. MediaEval Workshop, http://www.multimediaeval.org/mediaeval2012/; http://www.multimediaeval.org/mediaeval2012/sws2012/

Metze, F., Anguera, X., Barnard, E., Davel, M., Gravier, G., 2013. The spoken web search task at MediaEval 2012. In: Proc. ICASSP. IEEE, Vancouver, BC, Canada.

Miller, D.R.H., Kleber, M., Kao, C.-L., Kimball, O., Colthurst, T., Lowe, S.A., Schwartz, R.M., Gish, H., 2007. Rapid and accurate spoken term detection. In: Proc. INTERSPEECH. ISCA, Antwerpen, Belgium.

Muscariello, A., Gravier, G., 2012. IRISA MediaEval 2011 spoken web search system. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/

Muscariello, A., Gravier, G., Bimbot, F., 2011. A zero-resource system for audio-only spoken term detection using a combination of pattern matching techniques. In: Proc. INTERSPEECH. ISCA, Florence, Italy.

Muscariello, A., Gravier, G., Bimbot, F., 2012. Unsupervised motif acquisition in speech via seeded discovery and template matching combination. IEEE Trans. Audio Speech Language 20 (7), 2031–2044.

Norouzian, A., Jansen, A., Rose, R., Thomas, S., 2012. Exploiting discriminative point process models for spoken term detection. In: Proc. INTERSPEECH. ISCA, Portland, OR, USA.

Patel, N., Chittamuru, D., Jain, A., Dave, P., Parikh, T.S., 2010. Avaaj Otalo: a field study of an interactive voice forum for small farmers in rural India. In: CHI ’10: Proceedings of the 28th International Conference on Human Factors in Computing Systems. ACM, Atlanta, GA, USA, pp. 733–742.

Phoneme recognizer based on long temporal context, 2009. http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context. Last accessed: March 1, 2014.

Rajput, N., Metze, F., 2011. Spoken web search. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/

Raza, A.A., Haq, F.U., Tariq, Z., Pervaiz, M., Razaq, S., Saif, U., Rosenfeld, R., 2013. Job opportunities through entertainment: virally spread speech-based services for low-literate users. In: Proc. CHI. ACM, Paris, France.

Schultz, T., 2000. Multilinguale Spracherkennung: Kombination akustischer Modelle zur Portierung auf neue Sprachen. Universität Karlsruhe, Institut für Logik, Komplexität und Deduktionssysteme (Ph.D. thesis).

Schwarz, P., 2009. Phoneme recognition based on long temporal context. Faculty of Information Technology, Brno University of Technology (BUT) (Ph.D. thesis). http://www.fit.vutbr.cz/research/viewpub.php?id=9132

Shen, W., White, C., Hazen, T.J., 2009. A comparison of query-by-example methods for spoken term detection. In: Proc. INTERSPEECH. ISCA, Brighton, UK.

Sherwani, J., Ali, N., Mirza, S., Fatma, A., Memon, Y., Karim, M., Tongia, R., Rosenfeld, R., 2007. Healthline: speech-based access to health information by low-literate users. In: Proc. IEEE/ACM Int’l Conference on Information and Communication Technologies and Development, Bangalore, India.

Szöke, I., Schwarz, P., Matějka, P., Burget, L., Karafiát, M., Černocký, J., 2005. Phoneme based acoustics keyword spotting in informal continuous speech. LNAI 3658, 302–309, http://www.fit.vutbr.cz/research/viewpub.php?id=7882

Szöke, I., Tejedor, J., Fapšo, M., Colás, J., 2011. BUT-HCTLab approaches for spoken web search. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/

Szöke, I., Fapšo, M., Veselý, K., 2012. BUT 2012 approaches for spoken web search – MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/

Szöke, I., 2010. Hybrid word-subword spoken term detection. Faculty of Information Technology BUT (Ph.D. thesis). http://www.fit.vutbr.cz/research/viewpub.php?id=9375

Varona, A., Penagarikano, M., Rodriguez-Fuentes, L.J., Bordel, G., Diez, M., 2012. GTTS system for the spoken web search task at MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/

Vavrek, J., Pleva, M., Juhár, J., 2012. TUKE MediaEval 2012: spoken web search using DTW and unsupervised SVM. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/

Wallace, R.G., Vogt, R.J., Sridharan, S., 2007. A phonetic search approach to the 2006 NIST spoken term detection evaluation. In: Proc. INTERSPEECH. ISCA, Antwerpen, Belgium.

Wang, H., Lee, T., 2012. CUHK system for the spoken web search task at MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/

Wang, D., King, S., Frankel, J., 2011. Stochastic pronunciation modelling for out-of-vocabulary spoken term detection. IEEE Trans. Audio, Speech, Language Process. 9 (4), http://homepages.inf.ed.ac.uk/v1dwang2/public/tools/index.html

Wang, H., Lee, T., Leung, C.-C., Ma, B., Li, H., 2013. Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection. In: Proc. ICASSP. IEEE, Vancouver, Canada.

Wang, D., 2010. Out-of-vocabulary spoken term detection. University of Edinburgh (Ph.D. thesis).

Zero resource speech technologies and models of early language acquisition, 2012. http://www.clsp.jhu.edu/workshops/archive/ws-12/groups/mini-workshop/ Last accessed: March 1, 2014.

Zhang, Y., Glass, J., 2009. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In: Proc. ASRU, IEEE, Merano, Italy.


Zhang, Y., Salakhutdinov, R., Chang, H.-A., Glass, J., 2012. Resource configurable spoken query detection using deep Boltzmann machines. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 5161–5164.

Zulu, P.N., Botha, G., Barnard, E., 2008. Orthographic measures of language distances between the official South African languages. Literator 29 (1), 1–20.