Computer Speech and Language xxx (2014) xxx–xxx
Language independent search in MediaEval’s Spoken Web Search task
Florian Metze a,∗, Xavier Anguera c, Etienne Barnard b, Marelie Davel b, Guillaume Gravier d
a Carnegie Mellon University, Pittsburgh, PA, USA
b North-West University, Vanderbijlpark, South Africa
c Telefonica Research, Barcelona, Spain
d CNRS–IRISA, Rennes, France
Received 2 August 2012; received in revised form 13 October 2013; accepted 29 December 2013
Abstract
In this paper, we describe several approaches to language-independent spoken term detection and compare their performance on a common task, namely “Spoken Web Search”. The goal of this part of the MediaEval initiative is to perform low-resource language-independent audio search using audio as input. The data was taken from “spoken web” material collected over mobile phone connections by IBM India as well as from the LWAZI corpus of African languages. As part of the 2011 and 2012 MediaEval benchmark campaigns, a number of diverse systems were implemented by independent teams, and submitted to the “Spoken Web Search” task. This paper presents the 2011 and 2012 results, and compares the relative merits and weaknesses of approaches developed by participants, providing analysis and directions for future research, in order to improve voice access to spoken information in low resource settings.
© 2014 Elsevier Ltd. All rights reserved.
Keywords: Low-resource speech technology; Evaluation; Spoken web; Spoken term detection
1. Introduction
In recent years, speech technology has emerged as an enabling technology for increasing the accessibility of information for a number of quite diverse use cases. These include searching large archives of audio-visual material, dialog systems for access to personal information and (mobile) web search, as well as applications in language learning and pronunciation training. A particularly deserving aspect of these is the potential of speech technologies to foster participation of disabled, low-literate, or “minority” users in the information society.
The last case has proven to be particularly challenging, because resources of any kind are usually scarce for minority languages, dialects and other non-mainstream conditions, which can therefore not be approached with the typical “there is no data like more data” engineering approach. Clearly, society would benefit greatly from the ability to easily process audio in any language (or dialect), or language independently, without having to spend resources on language-specific development, but significant research is still needed in that area.
∗ Corresponding author. Tel.: +1 412 268 8984.
E-mail address: fmetze@cs.cmu.edu (F. Metze).
0885-2308/$ – see front matter © 2014 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.csl.2013.12.004
“Spoken Web Search” involves searching for audio content, within audio content, using an audio query, in a language, dialect, or domain for which only very limited resources are available. The original motivation for this task was to be able to provide voice-based access to spoken documents created by local community efforts in rural India. Because no experts are available to create and maintain dedicated speech dialog systems, acoustic similarity and keyword search could be used to access information. A caller would for example say “weather tomorrow” and retrieve a spoken document which contains the phrase “the weather tomorrow will be ...” (in a matching dialect). While such a retrieval-based approach is clearly limited when compared to fully developed dialog systems, it is still preferable to not having any capability to access information at all. The fact that users in such applications will often be repeat callers, and therefore will be familiar with the system, also enhances the potential efficacy of this approach.
The main challenge is therefore to develop approaches to spoken term detection (rather than full speech-to-text) that scale to many languages, dialects, and domains very quickly, without requiring data and language or technology experts. An efficient large-scale deployment with many users is desirable, but not the primary goal. To solve this problem, two research avenues present themselves: port existing speech recognition approaches, or build dedicated solutions. When starting from existing speech recognition approaches and resources, techniques have to be found which will make them useful in low-resource settings. Multi-lingual modeling and cross-lingual (or -dialectal) transfer have been used in the past to do that. As an alternative, limited keyword search or acoustic pattern matching systems can be tailored specifically to the target use case. It may therefore be possible to develop them with relatively little data, or even zero (outside) resources.
To compare these two approaches, and analyze the trade-offs entailed by such a design decision, “Spoken Web Search” (SWS) was run as a challenge-style task at MediaEval 2011 (Rajput and Metze, 2011) and MediaEval 2012 (Metze et al., 2012). This evaluation attempts to provide a common evaluation corpus and baseline for research on language-independent search and retrieval of real-world speech data, with a special focus on low-resource languages.
In Section 2, we survey the field of low-resource acoustic pattern matching and spoken term detection. Section 3 presents the “Spoken Web Search” task as a public dataset and evaluation campaign to initiate research and discussion in this research area. A unified view and discussion of the approaches implemented by the participants is given in Sections 4–7, following the different steps of a typical system. In Section 8, we discuss results achieved in the 2011 and 2012 evaluations, and provide research directions for the future.
2. Related work
The World Wide Web has changed the information landscape for the developed world, where citizens are used to accessing information about almost anything, anytime and on any device. The Web 2.0 has democratized the web further by enabling user-generated content through wikis, blogs, and more recently through social networking websites. However, the developing regions still face several challenges in being part of this information revolution: low literacy (Education, 2010) and lack of internet penetration (Internet, 2010) being some of them.
On the other hand, India, China, Brazil and Indonesia taken together have more mobile phones than the Top 50 developed countries taken together (Internet, 2010). Voice-based information systems are therefore evolving as an alternative to the text/visual web, and could potentially achieve significant penetration. Users in developing regions are now able to create, access and share content using just their voice and a phone. Several audio-based information systems have been deployed in the last five years. The Healthline (Sherwani et al., 2007) system provides reliable health information for community health workers in developing countries. The AudioWiki system (Kotkar et al., 2008) provides a repository of spoken content that can be modified through a low-cost phone and can support any language. Adoption issues have recently been studied in Raza et al. (2013). The Spoken Web system has been used in agriculture (Patel et al., 2010), employment (Kumar et al., 2008), social (Agarwal et al., 2009, 2010) and several other settings (Agarwal et al., 2010; Diao et al., 2010).
While these examples clearly illustrate the usefulness of a voice-based information system, they also pose challenging research problems. In some of these systems, most of the content is user-generated and in local languages and dialects. The content is therefore mostly spontaneous and colloquial in style, with a large amount and a high variety of background noise. Voice being a sequential modality, navigation by commands is often a challenge, suggesting that audio content search be applied in these information systems. Leaving aside questions of usability and deployment, efficient keyword-style search is a key requirement in order to create a usable and low-cost (for both the provider and consumer) audio information system for the target user group.
The “Spoken Web Search” task therefore sits at the intersection of two research domains that have recently seen significant activity, namely speech technology for under-resourced languages, as discussed above, and spoken term detection (STD).
Most research in speech technology has traditionally been conducted on a small set of well-resourced languages, but the need to extend such technology to many more languages is widely recognized (Patel et al., 2010; Barnard et al., 2010). A variety of approaches for tasks such as corpus development (de Vries et al., 2011) and rapid recognizer bootstrapping (Hughes et al., 2010; Schultz, 2000) have been developed and applied. Spoken Term Detection for under-resourced languages has recently attracted attention, in the form of systems (Hazen et al., 2009; Zhang and Glass, 2009; Muscariello et al., 2011) that employ dynamic time warping to find matches between query terms and the material to be searched. In order to achieve speaker independence, both query and reference speech are represented in terms of posteriorgrams – that is, a frame-synchronous series of vectors, each containing estimated posterior probabilities. In Hazen et al. (2009), these are posterior probabilities of the phonetic classes, whereas Zhang and Glass (2009) and Zhang et al. (2012) employ posterior probabilities of classes determined with unsupervised clustering. Note that the query is itself assumed to exist in spoken form – hence, these methods start from somewhat different assumptions than the conventional STD systems, which assume queries in text form. Spoken input of query terms is characterized as “query by example” in Hazen et al. (2009).
Conventional state-of-the-art STD methods employ general-purpose speech-recognition systems to generate a lattice of word hypotheses (Miller et al., 2007; Chelba et al., 2008). The lattice is then used to create an index of word occurrences within the audio data, and the corresponding confidence scores for each of these detections. During retrieval, it is then simply a matter of finding those word strings that correspond to the query terms. One complication to this conceptually simple approach is that the search terms may not be present in the recognition vocabulary that was used for speech recognition. Hence, alternative representations that can also capture out-of-vocabulary words are required – for example, each word in the recognized lattice can be expressed in terms of its constituent phones. During retrieval, out-of-vocabulary words can then be found by matching their sub-word (for example, phonetic) representation against such phone-based lattices (Wallace et al., 2007). Other sub-word decompositions have also been employed successfully (Szöke, 2010), but in most cases detection of out-of-vocabulary words remains substantially inferior to that of in-vocabulary words.
Although the query by example methods in Hazen et al. (2009), Zhang and Glass (2009), and Zhang et al. (2012) achieve promising detection rates, retrieval is significantly more demanding computationally than with index-based approaches (Wallace et al., 2007). In Hazen et al. (2009), it is therefore recommended that such methods be used as a rescoring mechanism for terms retrieved by a cruder (index-based) approach. Also, these methods have to date been assessed on well-resourced languages; hence, issues in their application in real under-resourced environments have not been explored.
The work presented in this paper is different from IARPA’s currently active “Babel” program (IARPA, 2011) in two key aspects: we use typically an order of magnitude less data per language and the focus is on language independent approaches, rather than a capability to rapidly bootstrap systems in new languages. SWS-like technologies could also be helpful to the intelligence or military community, for example in quickly changing theaters of operations that span multiple linguistic groups, to develop triage capabilities for intercepted radio communications, or as part of other tactical solutions. Another useful aspect of the described work is that it could be used to implement (initial) speech processing capabilities that non-experts in speech recognition can apply in other research contexts (Kumar et al., 2013). More discussion of related work in ASR can be found in Miller et al. (2007), Shen et al. (2009), Larson et al. (2012), and Zero resource (2012).
3. Spoken Web Search at MediaEval
MediaEval is a benchmarking initiative dedicated to evaluating new algorithms for multimedia access and retrieval (MediaEval, 2014). The “Spoken Web Search” (SWS) task was run in 2011 (Rajput and Metze, 2011), 2012 (Metze et al., 2012), and 2013 (MediaEval, 2013), and will again be run in October 2014, following a model in which participants receive labeled development data, before receiving unseen evaluation data a couple of months later, on which they blindly submit results to the organizers for scoring.
Table 1
Development (Dev) and evaluation (Eval) corpora used for the “Spoken Web Search Task” at MediaEval (Metze et al., 2012, 2013). Even though we pose the evaluation as a “Spoken Term Detection” problem, SWS can also be seen as an Information Retrieval (IR) problem, involving “queries” and “documents”.
Category       2011 (“Indian”)                       2012 (“African”)
               # Utts   Total (h:mm:ss)   Avg. (s)   # Utts   Total (h:mm:ss)   Avg. (s)
Dev docs       400      2:02:22           18.3       1,580    3:41:52           8.4
Dev queries    64       0:01:19           1.2        100      0:02:22           1.4
Eval docs      200      0:47:04           14.1       1,660    3:52:32           8.4
Eval queries   36       0:00:58           1.6        100      0:02:32           1.5
Total          700      2:51:42           14.7       3,440    7:35:18           7.9
By design, the pilot “Indian” dataset, which was used in the 2011 evaluation (Rajput and Metze, 2011), and was retained as a “progress” set for the 2012 evaluation (Metze et al., 2012), consisted of only 700 utterances in telephony quality (8 kHz/16 bit) from four Indian languages (English, Gujarati, Hindi, Telugu). The data was provided by the Spoken Web team at IBM Research India (Kumar et al., 2007) for research purposes. The “African” dataset replaced the Indian data as primary condition in 2012 to provide more data and variety, while attempting to match the Indian dataset’s overall characteristics. It comprises more than 3000 utterances from isiNdebele, Siswati, Tshivenda, and Xitsonga, taken from the LWAZI corpus (Barnard et al., 2009).
For the African dataset, query terms ranging from one to three words per term were selected from the development and evaluation sets, in such a way as to produce a range of occurrence frequencies in both these sets. No speaker overlap was allowed between the development set, the evaluation set, and/or the query set. As some of the languages are agglutinative, possible search terms that overlap with very frequently occurring words were excluded from the set of queries used. If for example ‘asajamile’ occurs frequently, then ‘ajamile’ would be excluded. Asking participants to make this distinction would require them to perform morphological analysis – a task outside the scope of the current challenge. Table 1 lists the characteristics of the SWS datasets.
Languages were represented equally within the respective sets, and language labels were provided only on the development data, in order to represent a realistic scenario. Word-level transcriptions (and corresponding phonetic dictionary entries) were also made available for the training and development sections of the data only. The task therefore required researchers to build a language-independent audio search system so that, given a query, it should be able to find the appropriate audio file(s) and the (approximate) location of a query term within the audio file(s).
Targets were defined by an exact string match of the query term in the reference transcription. A target was “hit” if the system returned a positive detection decision within a temporal window around the reference alignment, while it was “missed” if no positive detection was returned. As detailed in Section 3.1, the temporal “cushion” was set to a large value for the “Indian” data, because no exact temporal alignment was available, and utterances were quite short anyway.
By design, performing language identification followed by standard speech-to-text is not an appropriate approach to SWS, because full-fledged recognizers are typically not available in these languages.
In order to not restrict participation, the use of external resources was permitted, as long as their use had been declared (“open” condition). Systems that were only developed on the provided data are called “restricted”. Throughout this paper, we will use the nomenclature shown in Table 2 when discussing and comparing approaches and systems.
3.1. Evaluation and scoring
SWS experiments were scored with a modified version of the NIST 2006 STD evaluation scoring software (Fiscus et al., 2007). The primary evaluation metric was ATWV (Actual Term-Weighted Value), while MTWV (Maximum Term-Weighted Value) was also reported. According to Fiscus et al. (2007), the Term-Weighted Value is computed as a function of the miss and false alarm probabilities (Pmiss, PFA) averaged over all query terms. The “Actual” TWV is obtained by computing the value for a given operating point set by the participant, while the “Max” TWV is determined by selecting the optimum operating point over all queries.
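To make the metric concrete, the following is a minimal sketch of how a term-weighted value could be computed from per-term counts. It assumes the NIST 2006 STD conventions as we understand them: one trial per second of speech, terms without reference occurrences left out of the average, and a default weight of beta = 999.9. The function name and the argument layout are illustrative and are not taken from the NIST scoring package.

```python
def term_weighted_value(term_stats, speech_seconds, beta=999.9):
    """Sketch of TWV = 1 - mean over terms of (Pmiss + beta * PFA).

    term_stats: list of (n_true, n_correct, n_false_alarm), one per query term.
    speech_seconds: duration of the searched audio (assumed: one trial/second).
    beta: weight of false alarms relative to misses (assumed NIST 2006 default).
    """
    losses = []
    for n_true, n_correct, n_false_alarm in term_stats:
        if n_true == 0:
            continue  # Pmiss is undefined for terms that never occur
        p_miss = 1.0 - n_correct / n_true
        # Non-target trials: every second that does not hold a true occurrence.
        p_fa = n_false_alarm / (speech_seconds - n_true)
        losses.append(p_miss + beta * p_fa)
    if not losses:
        raise ValueError("no query term has reference occurrences")
    return 1.0 - sum(losses) / len(losses)
```

Evaluating such a function at the threshold chosen by the participant gives an ATWV-style figure; repeating it over a sweep of thresholds and keeping the maximum gives an MTWV-style figure.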
As no accurate word alignment was available for the “Indian” data, a hypothesized match was considered correct provided that it occurred in the correct reference file, by setting wide temporal padding parameters. On the “African” data, accurate ground truth alignments were available, therefore these parameters were set back to the NIST standard parameters. Additionally, in 2012 a modification was incorporated to weight missed and false alarm detections differently in the final metric. In the default NIST setting, the impact of a false alarm is three orders of magnitude larger than that of a miss. While these settings are adequate for a large data monitoring scenario, they might not be appropriate for a retrieval scenario, like the one proposed in this evaluation. The exact impact of an individual false alarm detection varies with the length of the reference database and the number of query terms used, so that care has to be taken when partitioning or combining datasets or query lists. For SWS 2012, settings in the NIST scoring scripts were modified to ensure that Pmiss = PFA for an equal number of misses and false alarms in the output.
A patched scoring package was distributed along with the data.
Table 2
Classification of (primary) SWS submissions: “open” means that external data sources were used, “restricted” means that only the resources provided during that year’s evaluation were used. A “symbol-based” system computes similarities at some symbolic level, while a “frame-based” system computes and sums distances over time during decision making.
Approach       Year   Condition    Submissions
Symbol-based   2011   Open         Barnard et al. (2011), Szöke et al. (2011), and Mantena et al. (2011)
Symbol-based   2011   Restricted   –
Symbol-based   2012   Open         Abad and Astudillo (2012), Varona et al. (2012), and Buzo et al. (2012)
Symbol-based   2012   Restricted   Szöke et al. (2012)
Frame-based    2011   Open         –
Frame-based    2011   Restricted   Anguera (2011) and Muscariello and Gravier (2011)
Frame-based    2012   Open         Wang and Lee (2012)
Frame-based    2012   Restricted   Anguera (2012), Jansen et al. (2012), Joder et al. (2012), and Vavrek et al. (2012)
3.2. Overall system architecture
The 2011 pilot evaluation attracted the interest of 5 sites, while 9 teams participated in the 2012 evaluation. Competing systems implemented a wide range of solutions. Nevertheless, all systems follow the same overall architecture, illustrated in Figure 1. The first step is front-end processing, which comprises several steps such as silence detection, feature extraction and normalization. The second step corresponds to the actual search, where the database is matched against the query using either a symbolic representation or frame-based pattern matching. The search procedure computes a score for each attempted match for the utterances in the database. The last step is the decision making step, where actual answers to the query are selected from the search result. This step is often limited to the comparison to a decision threshold, possibly after score normalization.
We review each of these three steps in turn in the next sections: “Front-End” and “Database” in Section 4, Frame-based “Search” in Section 5, Symbolic “Search” in Section 6, and the “Decision” step in Section 7.
4. Front-end
In this section, we review fundamental techniques used for low-resource STD, such as feature extraction, feature normalization, voice activity detection and tokenizer development.
4.1. Feature extraction
The most common acoustic features included standard MFCC (Joder et al., 2012; Wang and Lee, 2012; Vavrek et al., 2012; Barnard et al., 2011; Mantena et al., 2011), Bottle-neck (Szöke et al., 2011, 2012), Frequency domain linear prediction (FDLP-S) (Jansen et al., 2012) and PLP features (Abad and Astudillo, 2012). Apart from being used in the development of full-fledged tokenizers (cf. Section 4.4), these features were also used in direct frame-based comparisons, or as a preliminary step to obtain posteriorgram-based features (Wang and Lee, 2012; Muscariello and Gravier, 2011).
Posteriorgrams were obtained in different manners. One approach was the use of phoneme tokenizers trained using external data. In this case the posterior probabilities of each phoneme were concatenated into a feature vector. As phoneme tokenizers are inherently dependent on a language, other common approaches were to use posterior probabilities from language independent models such as GMMs directly trained on the data (Anguera, 2011, 2012; Muscariello and Gravier, 2011; Wang and Lee, 2012) or using automatically derived acoustic units obtained on the development dataset (Wang and Lee, 2012).
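As an illustration of the GMM-based variant, the sketch below turns a sequence of feature frames into a posteriorgram, given the parameters of an already trained diagonal-covariance GMM. The function name, the argument layout and the restriction to diagonal covariances are assumptions made for this example; they do not describe the implementation of any particular submission.

```python
import math


def gmm_posteriorgram(frames, weights, means, variances):
    """Return one posterior vector per frame for a diagonal-covariance GMM.

    frames: list of feature vectors (lists of floats).
    weights, means, variances: per-component GMM parameters (assumed trained).
    """
    posteriorgram = []
    for x in frames:
        log_joint = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
            log_joint.append(ll)
        # Normalize in the log domain to avoid underflow.
        top = max(log_joint)
        unnorm = [math.exp(l - top) for l in log_joint]
        total = sum(unnorm)
        posteriorgram.append([u / total for u in unnorm])
    return posteriorgram
```

Each output row sums to one, so two frames can be compared through their posterior vectors rather than through the raw acoustic features.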
4.2. Feature normalization
Some systems applied methods to normalize the features so as to minimize dependencies on speakers and acoustic conditions, which is especially relevant for frame-based systems. In particular, it was observed that applying cepstral mean and variance normalization improved the matching accuracies (Muscariello and Gravier, 2011; Anguera, 2012; Joder et al., 2012). In addition, two systems showed substantial improvements by applying vocal tract length normalization (VTLN) (Wang and Lee, 2012; Szöke et al., 2012).
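A minimal sketch of cepstral mean and variance normalization is given below. It assumes that the statistics are estimated per utterance and per feature dimension, which is one common choice; the submissions may have used other estimation windows. The function name and the small constant eps are illustrative.

```python
import math


def cmvn(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: list of feature vectors from one utterance (assumed scope).
    eps: small constant guarding against division by zero.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for f in frames
    ]
```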
4.3. Voice activity detection
Given the spontaneous nature of the acoustic data and the fact that in MediaEval 2012 some of the queries contained multiple words, variable amounts of non-speech were present in the recordings, which always causes a problem for frame-based matching systems and can potentially lead symbol-based systems to obtain wrong decodings. For this reason, several systems proposed and implemented various voice activity detection algorithms (Anguera, 2012; Wang and Lee, 2012; Jansen et al., 2012; Szöke et al., 2012). Examples include unsupervised training of a 2-class speech/non-speech classifier using GMMs on MFCC features (Anguera, 2012), or the use of posteriors from given phone tokenizers (Szöke et al., 2012).
4.4. Tokenizer development
All symbol-based and a few frame-based approaches required the development of a full-fledged phonetic tokenizer, or a set of tokenizers. For symbol-based approaches, the tokenizers produce phone strings or lattices for further analysis; in the frame-based approach, the tokenizer is used to generate posteriorgrams.
The vast majority of systems based phonetic tokenizers on hidden Markov models (Mantena et al., 2011; Barnard et al., 2011; Varona et al., 2012; Szöke et al., 2011), using standard training procedures. Most teams utilized external data in order to optimize the accuracy of their tokenizers, apart from Wang and Lee (2012), where phonetic-like units were derived automatically from the development data. This is described in more detail in Section 6.3.
Systems varied quite significantly with regard to the actual number of base units in the model; typically, symbol-based techniques are very sensitive toward this parameter. Base units were in all cases approximations of phones or phone groups, with phone sets mostly reduced, often quite aggressively, for example, from 77 to 28 in the case of Buzo et al. (2012) and 62 to 43, and then to 21 in Barnard et al. (2011). Systems tried to compensate for limited acoustic data (and language mismatch when using external data) by modeling broad phonemic classes rather than detailed phonemes. This reduction was mostly achieved by merging similar phones based on their IPA identity (Buzo et al., 2012; Varona et al., 2012; Barnard et al., 2011), and in one case, by using articulatory features to combine models (Mantena et al., 2011).
5. Frame-based approaches
Frame-based approaches (often also referred to as pattern-based or pattern matching approaches) perform the matching of query audio and the reference data at the acoustic feature frame level. Such matching relies only on the local similarity between frame pairs and posterior time-alignment (allowing for frame insertions and deletions) of pairs in sequence. All frame-based approaches submitted to the evaluation use dynamic programming algorithms inspired by Dynamic Time Warping (DTW) for posterior time alignments, often implementing segmental versions to account for unknown start and end points of matches in the search database. Several variations were proposed, depending on the front-end processing and on the time-alignment algorithm.
When used in MediaEval’s frame-based approaches, posteriorgrams turned out to generally outperform raw acoustic features such as MFCCs due to their increased robustness to speaker and acoustic variability.
Many variants of dynamic alignments were used for frame-based search. Standard DTW was considered in some systems, with start and end points determined prior to DTW. In Mantena et al. (2011), a rough acoustic decoding step based on articulatory features is used to find putative matching regions on which DTW is to be applied. A brute-force approach was taken in Szöke et al. (2011), where the DTW similarity is calculated for all the possible combinations of start and end points. As an alternative to standard DTW, segmental variants were used, where start and end points were decided as part of the optimal alignments rather than defined in a pre-processing step. In Muscariello and Gravier (2011), segmental locally normalized DTW (Muscariello et al., 2012) was used to find potential matches of the query. In Anguera (2011), a sub-sequence DTW algorithm (Anguera and Ferrarons, 2013) was used, using either posterior probability features or binary features for efficiency. Another variant of DTW, called “cumulative DTW”, was used in Joder et al. (2012), where the usual maximization is replaced by a soft-max rule. Moreover, the pairwise local distances were replaced by step functions resulting from the combination of feature functions where the combination parameters were learned from the data.
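The idea of letting the alignment choose its own start and end points can be sketched as a textbook subsequence DTW, shown below. This is a generic illustration and not the algorithm of any particular submission: the cosine distance, the three allowed step directions, the normalization by query length and all function names are assumptions of this example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def subsequence_dtw(query, doc, dist=cosine_distance):
    """Find the best match of `query` anywhere inside `doc`.

    Returns (cost normalized by query length, start frame, end frame).
    """
    n, m = len(query), len(doc)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]   # accumulated alignment cost
    start = [[0] * m for _ in range(n)]   # start frame of the best path
    for j in range(m):
        acc[0][j] = dist(query[0], doc[j])  # free start anywhere in doc
        start[0][j] = j
    for i in range(1, n):
        for j in range(m):
            best, s = acc[i - 1][j], start[i - 1][j]
            if j > 0:
                if acc[i - 1][j - 1] <= best:
                    best, s = acc[i - 1][j - 1], start[i - 1][j - 1]
                if acc[i][j - 1] < best:
                    best, s = acc[i][j - 1], start[i][j - 1]
            acc[i][j] = best + dist(query[i], doc[j])
            start[i][j] = s
    end = min(range(m), key=lambda j: acc[n - 1][j])  # free end
    return acc[n - 1][end] / n, start[n - 1][end], end
```

The first row is initialized without any accumulated cost and the minimum is taken over the whole last row, which is what frees the start and end points; a fixed-endpoint DTW would instead pin both to the corners of the matrix.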
Interestingly, a different search strategy was used in Jansen et al. (2012), based on Hough transforms to find near-diagonal lines in a sparse similarity matrix obtained from locality sensitive hashing (LSH) of raw features. The combination of a fast Hough transform and frame indexing (Jansen and Durme, 2012) offers substantial potential in terms of speed and scalability.
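The intuition behind line finding in a sparse similarity matrix can be illustrated with a much reduced voting scheme. The sketch below only considers lines of slope one, so it is a special case of a Hough transform rather than the method of Jansen et al. (2012); the function names, the input format and the binning parameter are assumptions of this example.

```python
from collections import Counter


def diagonal_votes(matches, bin_width=1):
    """Vote for the diagonal offset of each sparse match.

    matches: iterable of (query_frame, doc_frame) index pairs, assumed to be
    the non-zero entries of a sparse similarity matrix.
    """
    votes = Counter()
    for query_frame, doc_frame in matches:
        votes[(doc_frame - query_frame) // bin_width] += 1
    return votes


def best_diagonal(matches, bin_width=1):
    """Return (offset, vote count) of the most supported diagonal."""
    votes = diagonal_votes(matches, bin_width)
    # Ties are broken in favor of the smaller offset.
    offset_bin, count = max(votes.items(), key=lambda kv: (kv[1], -kv[0]))
    return offset_bin * bin_width, count
```

A diagonal that collects many votes indicates a stretch of the document whose frames match the query frames in order, and is therefore a candidate for closer inspection.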
All of the search strategies mentioned above returned a set of putative matching segments which were, in most cases, post-processed to refine the matches before making a decision. In Vavrek et al. (2012), SVM classification was applied to the output of the DTW alignment to decide whether the alignment corresponds to a match or not. In Muscariello and Gravier (2011), an image-based comparison of the query and reference segments represented as self-similarity matrices was performed to increase robustness to speaker and acoustic conditions (Muscariello et al., 2012). A self-similarity matrix is a square, symmetric matrix of distances computed between the individual frames of an utterance. If two sequences are similar, so are the respective self-similarity matrices, so that comparing these matrices can itself serve as a distance measure. Similarly, putative matches resulting from the Hough transform in Jansen et al. (2012) were further validated using standard DTW.
6. Symbol-based approaches
Symbol-based approaches to the SWS task first convert the query and the content into a symbolic representation, on which the best match is then computed without taking temporal alignment into account. While these symbols can, in principle, be any categorical representation, all SWS submissions used symbol sets that were either based on phones or broad phonemic classes, defined at a coarser granularity than would typically be used in standard speech recognition.
Since the success of symbol-based approaches relies heavily on the accuracy of the tokenizer, it is not surprising that most participants used tokenizers from the “open” category, and used additional resources during system development. Only one participant (Szöke et al., 2012) created a symbol-based system without utilizing any external resources.
Two main approaches were used:
• Acoustic keyword spotting (AKWS), in which a tokenized query string competes with a background and/or filler model during decoding (Abad and Astudillo, 2012; Szöke et al., 2012, 2011).
• String matching, using a form of DTW at the symbol level. Queries were mostly converted to single strings, while content utterances were alternatively represented as strings, n-best lists, lattices or confusion networks (Varona et al., 2012; Buzo et al., 2012; Barnard et al., 2011; Mantena et al., 2011).
6.1. Acoustic keyword spotting
Apart from the development of the actual tokenizers (cf. Section 4.4), approaches differed mainly with regard to the architecture of the AKWS system and normalizations applied (cf. Section 7).
The system in Szöke et al. (2012) created an HMM for each query, and calculated the log-likelihood ratio between the query and a background/filler model (Schwarz, 2009; Szöke et al., 2005), implementing the background model as a free phone loop without any weighting. In Abad and Astudillo (2012), a hybrid ANN/HMM approach is employed, i.e., the HMM is used to model the speech signal, and the ANN to estimate the posterior phone probabilities. A sliding window was used to process each file, with a uniform 1-gram language model formed by the target query and a competing speech background model being used, and the weight of the background model tuned on the development set.
In all cases query pronunciations were obtained by tokenizing the audio automatically (Szöke et al., 2012, 2011), with Abad and Astudillo (2012) optimizing on a development set to obtain the correct number of phonemes. All systems used single pronunciations, with some experiments in creating additional variants yielding no improvement (Abad and Astudillo, 2012). In a separate experiment (inadmissible for the primary condition), the sensitivity of the system to the accuracy of the derived dictionary was tested, finding that force-aligned transcriptions resulted in a significant TWV increase of about 0.28 (Szöke et al., 2012).
6.2. String matching
All string matching approaches used some form of DTW to match query to content, and differed mainly with regard to the content representation used and the cost function employed during DTW. Apart from one use of an articulatory-inspired phone set (only differentiating between articulation effects rather than individual phones) (Mantena et al., 2011), all systems used fairly standard phone sets, reducing the number of phonemes for increased generalization and improved accuracy, given that these had usually been trained with data from other languages.
Basic string-based DTW, with each content utterance also being represented as a single string, was implemented by most participants (Buzo et al., 2012; Barnard et al., 2011; Mantena et al., 2011). Strings are tokenized, DTW is applied with a sliding window and some form of filtering used to remove overlapping queries. In addition, DTW search was also performed within confusion networks (Mangu et al., 2000), with the score weighted according to the alignment of the query with the confusion network (Buzo et al., 2012), and within lattices (Barnard et al., 2011; Varona et al., 2012). Specifically, the Varona et al. (2012) system extracted the N phone decodings with the highest likelihoods from the phone lattice and then converted them to multigrams (Wang et al., 2011). Results were filtered based on rank as well as score, after normalization.
One of the key elements of string matching is the cost matrix, which holds the cost of substituting one unit (e.g., phone) with another. Cost matrices were either flat (Mantena et al., 2011), linguistically motivated (Barnard et al., 2011), or estimated from development data.
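The role of the cost matrix can be illustrated with a small approximate substring matcher, sketched below as a weighted edit distance with free start and end points in the content string. This is a generic stand-in for symbol-level DTW and not the implementation of any submission; the function name, the dictionary-based cost matrix and the unit default costs are assumptions of this example.

```python
def substring_match_cost(query, content, sub_cost, ins_del_cost=1.0):
    """Best weighted edit cost of `query` against any substring of `content`.

    query, content: sequences of symbols (e.g. phone labels).
    sub_cost: dict mapping (query_symbol, content_symbol) to a substitution
        cost; pairs that are absent cost 1.0, identical symbols cost 0.0.
    Returns (cost, end), where the match ends just before content[end].
    """
    n, m = len(query), len(content)
    prev = [0.0] * (m + 1)  # an empty query matches anywhere at zero cost
    for i in range(1, n + 1):
        cur = [prev[0] + ins_del_cost] + [0.0] * m
        for j in range(1, m + 1):
            q, c = query[i - 1], content[j - 1]
            s = 0.0 if q == c else sub_cost.get((q, c), 1.0)
            cur[j] = min(
                prev[j - 1] + s,             # substitution or exact match
                prev[j] + ins_del_cost,      # query symbol left unmatched
                cur[j - 1] + ins_del_cost,   # extra content symbol
            )
        prev = cur
    end = min(range(m + 1), key=lambda j: prev[j])
    return prev[end], end
```

With a flat cost matrix every substitution costs the same; a linguistically motivated or data-driven matrix would simply supply smaller entries in sub_cost for confusable pairs, as in the pair ("b", "p") used in the test below.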
6.3. Cross-language data sharing
While the main objective of the “Spoken Web Search” task is to search languages with very limited training resources, many systems in the MediaEval evaluation found it useful to utilize data resources from other languages.
Systems in the “open” category differed substantially with regard to the amount and type of data that were incorporated from additional sources, an approach that was only used as a basis for symbol-based techniques. Only two groups used data of the same language family as the target data, these being Telugu (Mantena et al., 2011) and Hindi (Barnard et al., 2011) in MediaEval 2011. For the rest, data was sourced from various other languages (Romanian, Czech, English, Hungarian, Levantine, Polish, Russian, Slovak, European and Brazilian Portuguese, European Spanish, American English, etc.), often based on the BUT phone recognizers (Phoneme recognizer, 2009).
External data was incorporated into primary systems by either using the foreign models directly, tokenizing the queries and content using the foreign tokenizers directly (Szöke et al., 2011; Abad and Astudillo, 2012), or by developing tokenizers with foreign data, and then adapting these. Adaptation was performed in different ways. The Buzo et al. (2012) system used MAP adaptation, first low-pass filtering broad-band data, then mapping phones using International Phonetic Alphabet tables or a confusion matrix, tuning the phone error rate on the MediaEval development set. The Barnard et al. (2011) system MAP-adapted the same source data to the four target languages to create four separate tokenizers. The Szöke et al. (2011) system used language-specific Karhunen–Loève transforms during adaptation.
7. Normalization and decision making
As a result of the search step, a set of possible matches is obtained for every query term. For scoring, a match must include the matching utterance ID and the start-end points where the query is thought to appear, together with a relevance score. Several methods have been proposed in both symbol-based and frame-based approaches to improve the matching results at this point. These include score normalization techniques and fusion of different system outputs.
7.1. Score normalization
The queries that were used in the evaluation have very different acoustic characteristics. On the one hand, their length (before any voice activity detection was applied) ranged from 0.39 s to 4.12 s in the development set, and between 0.38 s and 5.96 s in the evaluation set. On the other hand, some queries consisted of one single word while others had two or more words. In addition, each phonetic class has different average matching scores: stable parts in vowels and silences have a very good intra-class match, for example, while consonants achieve lower direct matching scores. For these reasons, the distributions of scores for reference matching sequences to each query usually differ quite a bit among queries.
Several methods have been proposed to normalize such scores in order to allow for the application of a single optimal detection threshold. In Wang and Lee (2012), Joder et al. (2012), and Jansen et al. (2012) various flavors of z-norm normalization were applied. In Wang and Lee (2012) the normalization mean and variance for each query in the test was estimated by using development data. In contrast, Jansen et al. (2012) used the set of possible matches of test queries with the test data to compute such parameters. Similarly, in Joder et al. (2012) the test data was used to find appropriate normalization parameters, although the authors avoid using the 10% best-matching scores to avoid a bias with the actual matches.
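A per-query z-norm of this kind can be sketched as follows. The example assumes that higher scores indicate better matches and that the statistics are estimated from the candidate scores of the same query; the optional exclusion of the top-scoring fraction loosely mirrors the idea of leaving out the best matches, but the function name, the parameter and the handling of zero variance are choices made for this illustration only.

```python
import statistics


def z_normalize(scores, drop_top_fraction=0.0):
    """Normalize the candidate scores of one query to zero mean, unit variance.

    scores: raw match scores for a single query (higher = better, assumed).
    drop_top_fraction: share of the best scores left out when estimating the
        mean and standard deviation (e.g. 0.1 for the top 10%).
    """
    ranked = sorted(scores, reverse=True)
    n_drop = int(len(ranked) * drop_top_fraction)
    background = ranked[n_drop:]
    mean = statistics.mean(background)
    std = statistics.pstdev(background)
    if std == 0.0:
        std = 1.0  # degenerate case: all background scores identical
    return [(s - mean) / std for s in scores]
```

After such a normalization, a single threshold on the normalized scores has a comparable meaning across queries, which is the property the decision step relies on.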
A totally different approach was followed by Szöke et al. (2012), where a linear regression model was trained using development data to predict the ideal threshold. Parameters such as the query length, total amount of detected silence in the query, number of phonemes, and so on were used. In all cases, the systems were reported to improve results by using such approaches.
7.2. Intra-system fusion
Several groups submitted the output of a fusion of multiple systems as their primary submission, and reported consistent gains. A multitude of techniques were tried:
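As a rough illustration of threshold prediction by linear regression, the following sketch fits an ordinary least-squares model with NumPy. The feature set, the toy numbers, and the function names `fit_threshold_model` and `predict_threshold` are hypothetical; the actual features and training procedure of Szöke et al. (2012) are not reproduced here.

```python
import numpy as np

def fit_threshold_model(features, thresholds):
    """Least-squares fit of per-query thresholds from query features.

    features: one row per development query (e.g. length in seconds,
    amount of silence); thresholds: the ideal threshold of each query.
    Returns the weight vector, with the bias term as last element.
    """
    x = np.asarray(features, dtype=float)
    design = np.column_stack([x, np.ones(len(x))])  # append bias column
    coef, *_ = np.linalg.lstsq(design, np.asarray(thresholds, dtype=float),
                               rcond=None)
    return coef

def predict_threshold(coef, feature_vector):
    """Predict the detection threshold for an unseen query."""
    return float(np.dot(coef[:-1], feature_vector) + coef[-1])

# Toy development data (made up): [query length, silence amount].
dev_features = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 2.0]]
dev_thresholds = [1.5, 1.8, 2.5, 2.6]
model = fit_threshold_model(dev_features, dev_thresholds)
threshold = predict_threshold(model, [2.0, 0.0])
```

The appeal of this design is that the threshold of a new query depends only on properties of the query itself, so no statistics over the test collection are needed at search time.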
Wang and Lee (2012) achieved system combination (or fusion) by averaging the distance matrices computed with different tokenizers before computing DTW (Wang et al., 2013). Individual tokenizers were based on different training data, or target sets. When going from a two-system combination to a seven-system combination, gains reach 0.1 (in terms of ATWV) on the dev data, and 0.15 on the eval data; when going from a five-system combination to the seven-system combination, gains are less than or equal to 0.02.
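The principle of combining distance matrices before alignment can be sketched as follows. This is a simplified illustration, not the CUHK system: the function names, the optional weights, and the plain subsequence DTW without path-length normalization are assumptions.

```python
import numpy as np

def fuse_distance_matrices(matrices, weights=None):
    """Combine query-by-utterance distance matrices from several tokenizers.

    All matrices must have the same shape (query frames x utterance
    frames). Without weights, a plain average is returned.
    """
    stack = np.stack([np.asarray(m, dtype=float) for m in matrices])
    if weights is None:
        return stack.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), stack, axes=1)

def subsequence_dtw_cost(dist):
    """Lowest accumulated cost of aligning the whole query (rows)
    to any contiguous stretch of the utterance (columns)."""
    n_q, n_u = dist.shape
    acc = np.full((n_q, n_u), np.inf)
    acc[0, :] = dist[0, :]  # the match may start at any utterance frame
    for i in range(1, n_q):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, n_u):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    end = int(np.argmin(acc[-1, :]))  # the match may end at any frame
    return float(acc[-1, end]), end

# Two toy distance matrices (2 query frames, 3 utterance frames).
d_a = np.array([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
d_b = d_a + 2.0
fused = fuse_distance_matrices([d_a, d_b])
cost, end_frame = subsequence_dtw_cost(fused)
```

Fusing at the matrix level means that the alignment is computed only once, on evidence from all tokenizers, instead of merging several independently produced detection lists afterwards.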
Abad and Astudillo (2012) explored system combination using “AND”, “OR”, and “MAJORITY” operations on four individual sub-system outputs. Majority voting was found to give the best result, but no performance numbers have been published for the individual sub-systems.
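The three voting rules can be written down in a few lines. The sketch below assumes that the detections of all sub-systems have already been mapped to shared candidate keys (here simple strings), and it interprets “MAJORITY” as a strict majority; both choices are assumptions rather than details taken from Abad and Astudillo (2012).

```python
from collections import Counter

def vote(system_outputs, rule="MAJORITY"):
    """Combine detection sets from several sub-systems by voting.

    system_outputs: list of iterables of hashable candidate keys, one
    per sub-system. Returns the set of candidates that pass the rule.
    """
    n_systems = len(system_outputs)
    counts = Counter(key for output in system_outputs for key in set(output))
    if rule == "AND":
        needed = n_systems           # every sub-system must agree
    elif rule == "OR":
        needed = 1                   # a single sub-system suffices
    elif rule == "MAJORITY":
        needed = n_systems // 2 + 1  # strict majority (assumed)
    else:
        raise ValueError("unknown rule: " + rule)
    return {key for key, count in counts.items() if count >= needed}

# Four toy sub-system outputs over hypothetical candidate keys.
outputs = [{"u1", "u2", "u3"}, {"u1", "u2"}, {"u1", "u4"}, {"u1", "u2", "u4"}]
accepted = vote(outputs, "MAJORITY")
```

“AND” favors a low false alarm rate, “OR” favors a low miss rate, and majority voting sits between the two.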
In a different line of thought, pseudo-relevance feedback was used in Wang and Lee (2012), where the top matches obtained from the original query were used to rescore the remaining matches, with the goal to obtain a better score
Figure 2. Results (ATWV plot and MTWV) for evaluation terms on the 2011 eval data (Metze et al., 2012). The operating point of the participants’ submissions for their respective ATWV is indicated by the marker on the line. [Plot of miss probability (in %) against false alarm probability (in %). Legend: Random Performance; BUT-HCTLabs: MTWV = 0.131; IIITH: MTWV = 0.000; IRISA: MTWV = 0.000; MUST: MTWV = 0.114; TID: MTWV = 0.173.]
estimation. In Anguera (2012) and Mantena et al. (2011) overlap detection on the resulting matches was used to merge overlapping results.
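Merging overlapping detections can be sketched as a single pass over the sorted matches of one query. The tuple layout `(utterance_id, start, end, score)` and the choice to keep the union of the time spans together with the best score are assumptions for illustration; the cited systems may merge differently.

```python
def merge_overlaps(detections):
    """Merge detections of one query that overlap in the same utterance.

    detections: iterable of (utterance_id, start, end, score) tuples,
    with times in seconds and higher scores meaning better matches.
    """
    merged = []
    for utt, start, end, score in sorted(detections):
        previous = merged[-1] if merged else None
        if previous and previous[0] == utt and start <= previous[2]:
            # Overlap with the previous match: extend it, keep best score.
            merged[-1] = (utt, previous[1], max(previous[2], end),
                          max(previous[3], score))
        else:
            merged.append((utt, start, end, score))
    return merged

# Toy detections: the first two overlap and collapse into one match.
toy = [("u1", 0.0, 1.0, 0.5), ("u1", 0.8, 1.5, 0.9),
       ("u1", 2.0, 2.5, 0.1), ("u2", 0.5, 1.0, 0.3)]
result = merge_overlaps(toy)
```

Without such a step, several slightly shifted alignments of the same occurrence would be scored as one hit plus several false alarms.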
8. Discussion
Results for all primary and the most relevant contrastive systems are presented in Figures 2 and 3 for the 2011 and 2012 benchmarks respectively. We also report ATWV in Table 3 for the 2012 data. Note that the 2011 and 2012 results were established on two distinct data sets, using different scoring parameters, and are therefore not directly comparable. In spite of this difference in the data set, it is clear that progress has been made between the two benchmarks, with new
Figure 3. ATWV plots and MTWV results for evaluation terms on the 2012 eval data (Metze et al., 2013). Operating points are not shown for clarity of presentation. Figures 2 and 4 show the difference between well-tuned operating points for the scoring defined on “Indian” and “African” data sets by the organizers. [Plot of miss probability (in %) against false alarm probability (in %). Legend: Random Performance; ARF: MTWV = 0.310; BUT: MTWV = 0.530; BUT-g: MTWV = 0.488; CUHK: MTWV = 0.762; CUHK-g: MTWV = 0.643; JHU-HLTCOE: MTWV = 0.384; L2F: MTWV = 0.523; TUKE: MTWV = 0.000; TUM: MTWV = 0.296; TID: MTWV = 0.342; GTTS: MTWV = 0.081.]
Table 3
Results (actual TWV) for selected SWS 2012 systems (Metze et al., 2013).
System                                  Type        Dev    Eval   See
cuhk phnrecgmmasm p-fusionprf (CUHK)    Open        0.782  0.743  Wang and Lee (2012)
cuhk spchp-gmmasmprf (CUHK-g)           Restricted  0.678  0.635  Wang and Lee (2012)
l2f12spchp-phonetic4fusionmv            Open        0.531  0.520  Abad and Astudillo (2012)
butspchp-akws-devterms (BUT)            Open        0.488  0.492  Szöke et al. (2012)
butspchg-DTW-devterms (BUT-g)           Open        0.443  0.448  Szöke et al. (2012)
jhuallspchp-rails (JHU-HLTCOE)          Restricted  0.381  0.369  Jansen et al. (2012)
tidsws2012IRDTW                         Restricted  0.387  0.330  Anguera (2012)
tumspchp-cdtw                           Restricted  0.263  0.290  Joder et al. (2012)
arfspchp-asrDTWAlignw15a08b04           Open        0.411  0.245  Buzo et al. (2012)
gttsspchp-phonelattice                  Open        0.098  0.081  Varona et al. (2012)
tukespchp-dtwsvm                        Restricted  0      0      Vavrek et al. (2012)
pre-processing, tokenization and normalization techniques appearing in 2012. In the following, we first review and comment on evaluation results, before presenting fusion experiments and initial analysis of how language properties impact detection performance.
While in 2011 few systems implemented frame-based approaches using pattern matching techniques, such approaches were implemented in the majority of the 2012 submissions. Moreover, in both benchmarks, the best results were obtained by template matching systems. Frame-based, template matching techniques are gaining interest in the community and can achieve the same performance as symbol-based approaches on the “operating point” chosen for MediaEval with respect to amount and kind of data. The BUT experiments (Szöke et al., 2012) show that not having a lexicon available greatly impacts the performance of AKWS systems, which are limited to extracting information from one, isolated query only in this scenario. Computation may be an issue for frame-based systems, but techniques have been developed to search even large amounts of data efficiently (Jansen et al., 2012). The best systems tend to combine multiple representations and techniques, achieving significant gains in the process; we speculate that SWS could be an interesting, low-complexity test-bed to develop complementary data representations, matching techniques, and search approaches. It is however interesting to note that under the SWS evaluation conditions, the zero-knowledge (“restricted”) approaches performed quite similarly to “open” (typically model-based) approaches, which typically rely on the availability of matching data from other languages. On the 2012 evaluation, the difference in ATWV is about 0.1 for the two CUHK systems (Wang and Lee, 2012) and 0.05 for the two BUT systems (Szöke et al., 2012).
As a first observation, the larger set of participants also resulted in a variety of new signal-processing techniques being introduced or old techniques being re-introduced, such as Vocal Tract Length Normalization (VTLN), which boosted performance significantly for the CUHK system (Wang and Lee, 2012), although no detailed analysis has been performed on this isolated aspect. Similarly, cepstral mean subtraction and variance normalization have been successfully applied to most pattern matching systems. The 2011 results had also alerted participants to the importance of silence segmentation for this type of task, as silence segments should not be counted in any distance or matching function. With model-based tokenizers increasingly exploiting techniques from speech and speaker recognition, we expect that normalization techniques, such as VTLN, SAT or factor analysis, will be adapted to the spoken web search task in the near future.
Secondly, efforts have been devoted between 2011 and 2012 to the selection of suitable acoustic units in the tokenizers. A greater variety of languages were used, and data-driven units have successfully been combined with phonetic units in the “open” condition. The best performing system in 2012 is a frame-based system linearly combining similarity matrices obtained in different ways, either in a restricted mode (GMM, self-trained acoustic units) or in an open mode (phone models from various languages). Intermediate results reported in Wang and Lee (2012) show that the combination of phone models from 5 different languages (cz, hu, ru, ma, en) gives better results than the combination of GMM and ASM posteriorgrams (ATWV 0.72 vs. 0.59 on the 2012 evaluation data), while using all 7 tokenizers outperforms both settings (ATWV of 0.74). A more detailed description of this system is available in Wang et al. (2013). The combination of distance matrices obtained from posteriorgrams with different tokenizers could also have contributed to an increased robustness to speaker variability.
Table 4
TWV values on the 2012 data set for the different African languages.
Language     ATWV (Open)  ATWV (Restricted)  MTWV (Open)  MTWV (Restricted)
IsiNdebele   0.609        0.512              0.717        0.593
Siswati      0.644        0.583              0.782        0.709
Tshivenda    0.718        0.604              0.718        0.612
Xitsonga     0.650        0.559              0.650        0.613
All          0.698        0.586              0.763        0.658
Finally, score normalization techniques derived from speaker verification were introduced by most participants in 2012 so as to normalize scores on a per-query basis, to good effect.
Most participants chose a z-norm-like scheme, normalizing scores with respect to the query, which is fairly easy to implement, and reported benefits of such a normalization in the working note papers (MediaEval Benchmark, 2011, 2012). Various flavors of z-norm were used (under different names), but the absence of contrastive results does not allow us to recommend one over another. At the 2012 evaluation workshop, participants expressed an interest in exploring t-norm techniques, aimed at normalizing the decision score with respect to the document in which the query is searched for, thereby integrating more insight from speaker recognition work. While such techniques would undoubtedly improve all approaches, their implementation is both algorithmically complex and computationally intensive and has not been considered so far.
In light of these elements of analysis, we believe that speaker normalization techniques (VTLN, mean and variance normalization), along with the combination of multiple distance matrices obtained from different posteriorgrams are the key features explaining the success of the CUHK system (Wang and Lee, 2012; Wang et al., 2013) over other participants.
Focusing on frame-based systems, results from TID (Anguera, 2012; Mantena and Anguera, 2013) and JHU-HLTCOE (Jansen et al., 2012) are of particular interest: both systems implement indexing techniques for efficient pattern matching, where other frame-based approaches rely on segmental variants of the dynamic time warping algorithm. Such variants are typically computationally expensive and severely limit the scalability of frame-based approaches. On the contrary, indexing techniques enable the efficient computation of a sparse similarity matrix, whose sparsity in turn enables fast matching. While the two approaches exploiting indexing techniques do not outperform others, they exhibit a fair performance level while being scalable. These two systems thus clearly demonstrate that combining indexing techniques such as LSH or efficient approximate nearest neighbor search with pattern matching is a valid research trend for fast and scalable language-independent spoken term detection. During the evaluation, participants were encouraged to measure and report the computational requirements of their approaches; however, the wide variety of resources used makes a fair comparison between systems difficult at this point.
8.1. Multi-site fusion
Similarly to the well-known ROVER approach to combining multiple speech-to-text systems (Fiscus, 1997), the results of multiple, independent STD systems can also be combined. Several methods can be employed to generate appropriate combination weights, such as maximum entropy or linear regression (Norouzian et al., 2012). For symbol-based systems, the combination of pronunciation dictionaries which were generated using different approaches is viable (Wang, 2010). This approach however is not suitable for dictionary-less systems, such as the “frame-based” approaches discussed here. In any case, scores or posteriors need to be normalized suitably across systems for successful combination. Several well-performing participants also performed system combination using score averaging (Wang and Lee, 2012) or voting (Abad and Astudillo, 2012).
To combine the output of all MediaEval submissions, we employed the CombMNZ algorithm (Fox and Shaw, 1994). CombMNZ is a general data-fusion technique, which still requires some score normalization, as previously discussed. In the authors’ own work on IARPA’s Babel program (IARPA, 2011), this algorithm provided almost always the best performance across a wide range of conditions, and was certainly the most robust fusion technique. Table 4 shows the
Figure 4. Results (ATWV plots and MTWV) on the 2012 evaluation data on a per language basis for a combination of open systems. [Plot of miss probability (in %) against false alarm probability (in %). Legend: Random Performance; ALL: MTWV = 0.763; IsiNdebele: MTWV = 0.717; Siswati: MTWV = 0.782; Tshivenda: MTWV = 0.718; Xitsonga: MTWV = 0.650.]
results of this “fused” system for the 2012 evaluation data. Figures 4 and 5 show that the corresponding curves are somewhat more “well-behaved”, even if the Maximum TWV could not be improved, so there is some benefit from system fusion even in this case.
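The CombMNZ rule used for this fusion multiplies the sum of the normalized scores of a candidate by the number of systems that returned it. The sketch below is a minimal illustration under stated assumptions: detections from different systems are taken to be already mapped to shared candidate keys, min-max scaling stands in for the score normalization step, and a candidate counts as a hit for every system that returned it. It is not the organizers’ actual fusion code.

```python
def min_max(scores):
    """Scale the scores of one system to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 1.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}

def comb_mnz(system_scores):
    """Fuse per-system score dictionaries with the CombMNZ rule.

    system_scores: list of dicts mapping a candidate key to the raw
    score one system assigned to it (higher means better).
    """
    total, hits = {}, {}
    for scores in system_scores:
        if not scores:
            continue  # a system without detections contributes nothing
        for key, value in min_max(scores).items():
            total[key] = total.get(key, 0.0) + value
            hits[key] = hits.get(key, 0) + 1
    # Sum of normalized scores times number of systems returning the key.
    return {key: total[key] * hits[key] for key in total}

# Two toy systems scoring hypothetical candidates "x", "y" and "z".
fused = comb_mnz([{"x": 1.0, "y": 0.0, "z": 0.5}, {"x": 2.0, "z": 4.0}])
```

The multiplication rewards candidates that several systems agree on, which is one plausible reason why the fused curves behave more smoothly than those of individual systems.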
Given the large advantage that the CUHK systems (Wang and Lee, 2012) enjoyed over the other participants in the 2012 evaluation, the organizers were not able to improve the performance significantly by merging multiple system outputs (ranked lists), but the combination of several other systems provided good gains, for both the “open” and “restricted” cases. It may be possible to achieve gains by using a matrix combination technique as described in Wang et al. (2013).
Figure 5. [Plot of miss probability (in %) against false alarm probability (in %). Legend: Random Performance; ALL: MTWV = 0.658; IsiNdebele: MTWV = 0.593; Siswati: MTWV = 0.709; Tshivenda: MTWV = 0.612; Xitsonga: MTWV = 0.559.]
8.2. Influence of language
Figures 4 and 5 provide a breakdown of the 2012 evaluation results for the four languages represented in the database. Results are provided for a fusion of the best open resources systems and a fusion of the best restricted systems, open systems performing slightly better than restricted ones. In both cases, Xitsonga appears more difficult than the other languages while Siswati yields the best performance. We believe that these results can be partially explained by the average word length, which differs between languages, where a longer average word length leads to better results. Regardless of the word length consideration, one can expect that languages most similar to Germanic languages, i.e. Tshivenda and Xitsonga, benefit most from the open condition (Zulu et al., 2008). However, this linguistically motivated expectation was not met (see Table 4), probably because of the influence of stronger non-linguistic factors such as word length and the randomness due to the choice of the queries.
9. Conclusion and outlook
The ability to detect spoken terms or recognize keywords in low or zero resource settings is an important capability which could boost the use of speech technology in developing regions significantly. When there are neither experts, who could develop speech recognition systems and maintain the infrastructure required for designing speech dialogs and indexing audio content, nor databases on which speech interfaces could be developed, zero resource spoken term detection as presented here could provide a “winning combination”. In this paper we presented the main findings of the “Spoken Web Search” task within the MediaEval evaluation benchmark, as run in 2011 and 2012. We believe the results achieved in the evaluations show that these techniques could be applied in practical settings, even though user tests are certainly needed to determine the overall performance, and an acceptable ratio of false alarms and missed detections for a given application. It is interesting to note that the very diverse systems presented, which cover a wide range of possible approaches, could achieve very similar results, and future work should include more evaluation criteria, such as amount of external data used, processing time(s), etc., which were deliberately left unrestricted in this evaluation, to encourage participation.
With respect to the amount of actual data available, the SWS task is much harder than the research goals proposed by, for example, IARPA’s Babel (IARPA, 2011) program, where up to 100 h of transcribed data per language are available, and the language of a test query is known. The SWS task is targeted primarily at communities that currently do not have access to the Internet at all. Many target users have low literacy skills, and many speak in languages or dialects for which fully developed speech recognition systems will not exist even for years to come. We hope that the recent surge of activity in zero resource approaches (see e.g. Zero resource, 2012; Jansen et al., 2013) will result in further progress, which will advance the state of the art in spoken term detection and document retrieval significantly, specifically when large data sets and databases are not available.
Acknowledgments
The authors would like to acknowledge the MediaEval Multimedia Benchmark (MediaEval Benchmark, 2012). We especially thank Martha Larson from TU Delft for organizing this event, and the participants for their hard work on this evaluation. The “Spoken Web Search” task was originally proposed by researchers from IBM India (Agarwal et al., 2010). North-West University and IBM Research India collected and provided the audio data and references used in these evaluations.
References
Abad, A., Astudillo, R.F., 2012. The L2F spoken web search system for MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/
Agarwal, S., Kumar, A., Nanavati, A.A., Rajput, N., 2009. Content creation and dissemination by-and-for users in rural areas. In: Proc. Intl. Conf. Information and Communication Technologies and Development (ICTD), Doha, Qatar.
Agarwal, S., Dhanesha, K., Jain, A., Kumar, A., Menon, S., Nanavati, A., Rajput, N., Srivastava, K., Srivastava, S., 2010. Organizational, social and operational implications in delivering ICT solutions: a telecom web case-study. In: Proc. ICTD, London, UK.
Agarwal, S., Jain, A., Kumar, A., Rajput, N., 2010. The world wide telecom web browser. In: Proc. First ACM Symposium on Computing for Development. ACM, London, UK.
Anguera, X., Ferrarons, M., 2013. Memory efficient subsequence DTW for query-by-example spoken term detection. In: ICME: International Conference on Multimedia and Expo, San Jose, CA, USA.
Anguera, X., 2012. Telefonica system for the spoken web search task at MediaEval 2011. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/
Anguera, X., 2012. Telefonica research system for the spoken web search task at MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/
Barnard, E., Davel, M.H., van Heerden, C., 2009. ASR corpus design for resource-scarce languages. In: Proc. INTERSPEECH. ISCA, Brighton, UK, pp. 2847–2850.
Barnard, E., van Schalkwyk, J., van Heerden, C., Moreno, P.J., 2010. Voice search for development. In: Proc. INTERSPEECH. ISCA, Makuhari, Japan, pp. 282–285.
Barnard, E., Davel, M.H., van Heerden, C., Kleynhans, N., Bali, K., 2011. Phone recognition for spoken web search. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/
Buzo, A., Cucu, H., Safta, M., Ionescu, B., Burileanu, C., 2012. ARF @ MediaEval 2012: a Romanian ASR-based approach to spoken term detection. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/
Chelba, C., Hazen, T.J., Saraçlar, M., 2008. Retrieval and browsing of spoken content. IEEE Sign. Process. Mag. 25 (3), 39–49.
de Vries, N.J., Badenhorst, J., Davel, M.H., Barnard, E., de Waal, A., 2011. Woefzela – an open-source platform for ASR data collection in the developing world. In: Proc. INTERSPEECH. ISCA, Florence, Italy, pp. 3177–3180.
Diao, M., Mukherjea, S., Rajput, N., Srivastava, K., 2010. Faceted search and browsing of audio content on spoken web. In: CIKM ’10: Proceedings of the nineteenth international conference on Information and knowledge management, Toronto, Canada.
Education for all global monitoring report – reaching the marginalized, 2010. http://unesdoc.unesco.org/images/0018/001866/186606E.pdf. Last accessed: March 1, 2014.
Fiscus, J., Ajot, J., Garofolo, J., Doddington, G., 2007. Results of the 2006 spoken term detection evaluation. In: Proc. SSCS, Amsterdam, Netherlands.
Fiscus, J., 1997. A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER). In: Proc. Automatic Speech Recognition and Understanding Workshop. IEEE, Santa Barbara, CA, USA, pp. 347–354.
Fox, E.A., Shaw, J.A., 1994. Combination of multiple searches. In: Proc. 2nd Text REtrieval Conference (TREC-2), Gaithersburg, MD, USA, pp. 243–252.
Hazen, T.J., Shen, W., White, C., 2009. Query-by-example spoken term detection using phonetic posteriorgram templates. In: Proc. ASRU. IEEE, Merano, Italy.
Hughes, T., Nakajima, K., Ha, L., Vasu, A., Moreno, P., LeBeau, M., 2010. Building transcribed speech corpora quickly and cheaply for many languages. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Makuhari, Japan, pp. 1914–1917.
Intelligence Advanced Research Projects Activity, 2011. IARPA-BAA-11-02, http://www.iarpa.gov/Programs/ia/Babel/babel.html. Last accessed: March 1, 2014.
Internet Usage World-Wide by Country, 2010. http://www.infoplease.com/ipa/A0933606.html. Last accessed: March 1, 2014.
Jansen, A., Durme, B.V., 2012. Indexing raw acoustic features for scalable zero resource search. In: Proc. INTERSPEECH. ISCA, Portland, OR, USA.
Jansen, A., van Durme, B., Clark, P., 2012. The JHU-HLTCOE spoken web search system for MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/
Jansen, A., Dupoux, E., Goldwater, S., Johnson, M., Khudanpur, S., Church, K., Feldman, N., Hermansky, H., Metze, F., Rose, R., Seltzer, M., Clark, P., McGraw, I., Varadarajan, B., Bennett, E., Borschinger, B., Chiu, J., Dunbar, E., Fourtassi, A., Harwath, D., Lee, C.-Y., Levin, K., Norouzian, A., Peddinti, V., Richardson, R., Schatz, T., Thomas, S., 2013. A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition. In: Proc. ICASSP. IEEE, Vancouver, BC, Canada.
Joder, C., Weninger, F., Wöllmer, M., Schuller, B., 2012. The TUM cumulative DTW approach for the MediaEval 2012 spoken web search task. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/
Kotkar, P., Thies, W., Amarasinghe, S., 2008. An audio Wiki for publishing user-generated content in the developing world. In: HCI for Community and International Development (Workshop at CHI), Florence, Italy.
Kumar, A., Rajput, N., Chakraborty, D., Agarwal, S., Nanavati, A.A., 2007. WWTW: a world wide telecom web for developing regions. In: ACM SIGCOMM Workshop on Networked Systems For Developing Regions, Kyoto, Japan.
Kumar, A., Rajput, N., Agarwal, S., Chakraborty, D., Nanavati, A.A., 2008. Organizing the unorganized – employing IT to empower the under-privileged. In: Proceedings of the World Wide Web, Beijing, China.
Kumar, A., Metze, F., Wang, W., Kam, M., 2013. Formalizing expert knowledge for developing accurate speech recognizers. In: Proc. INTERSPEECH. ISCA, Lyon, France.
Larson, M., de Jong, F., Kraaij, W., Renals, S., 2012. Introduction to special issue on searching speech. ACM Trans. Inform. Systems 30 (3).
Mangu, L., Brill, E., Stolcke, A., 2000. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Comput. Speech Language 14 (4), 373–400.
Mantena, G., Anguera, X., 2013. Speed improvements to information retrieval-based dynamic time warping using hierarchical k-means clustering. In: Proc. ICASSP. IEEE, Vancouver, Canada.
Mantena, G.V., Babu, B., Prahallad, K., 2011. SWS task: articulatory phonetic units and sliding DTW. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/
MediaEval Benchmark, MediaEval 2012 Workshop, http://www.multimediaeval.org/mediaeval2013/; http://ceur-ws.org/Vol-1043/
MediaEval Benchmark, MediaEval 2013 Workshop, 2013, http://www.multimediaeval.org/mediaeval2013/; http://ceur-ws.org/Vol-1043/
MediaEval Benchmark, 2014. http://www.multimediaeval.org/
Metze, F., Rajput, N., Anguera, X., Davel, M., Gravier, G., van Heerden, C., Mantena, G.V., Muscariello, A., Prahallad, K., Szöke, I., Tejedor, J., 2012. The spoken web search task at MediaEval 2011. In: Proc. ICASSP. IEEE, Kyoto, Japan.
Metze, F., Barnard, E., Davel, M., van Heerden, C., Anguera, X., Gravier, G., Rajput, N., 2012. The spoken web search task. In: Proc. MediaEval Workshop, http://www.multimediaeval.org/mediaeval2012/; http://www.multimediaeval.org/mediaeval2012/sws2012/
Metze, F., Anguera, X., Barnard, E., Davel, M., Gravier, G., 2013. The spoken web search task at MediaEval 2012. In: Proc. ICASSP. IEEE, Vancouver, BC, Canada.
Miller, D.R.H., Kleber, M., Kao, C.-L., Kimball, O., Colthurst, T., Lowe, S.A., Schwartz, R.M., Gish, H., 2007. Rapid and accurate spoken term detection. In: Proc. INTERSPEECH. ISCA, Antwerpen, Belgium.
Muscariello, A., Gravier, G., 2012. IRISA MediaEval 2011 spoken web search system. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/
Muscariello, A., Gravier, G., Bimbot, F., 2011. A zero-resource system for audio-only spoken term detection using a combination of pattern matching techniques. In: Proc. INTERSPEECH. ISCA, Florence, Italy.
Muscariello, A., Gravier, G., Bimbot, F., 2012. Unsupervised motif acquisition in speech via seeded discovery and template matching combination. IEEE Trans. Audio Speech Language 20 (7), 2031–2044.
Norouzian, A., Jansen, A., Rose, R., Thomas, S., 2012. Exploiting discriminative point process models for spoken term detection. In: Proc. INTERSPEECH. ISCA, Portland, OR, USA.
Patel, N., Chittamuru, D., Jain, A., Dave, P., Parikh, T.S., 2010. Avaaj Otalo: a field study of an interactive voice forum for small farmers in rural India. In: CHI ’10: Proceedings of the 28th International Conference on Human Factors in Computing Systems. ACM, Atlanta, GA, USA, pp. 733–742.
Phoneme recognizer based on long temporal context, 2009. http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context. Last accessed: March 1, 2014.
Rajput, N., Metze, F., 2011. Spoken web search. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/
Raza, A.A., Haq, F.U., Tariq, Z., Pervaiz, M., Razaq, S., Saif, U., Rosenfeld, R., 2013. Job opportunities through entertainment: virally spread speech-based services for low-literate users. In: Proc. CHI. ACM, Paris, France.
Schultz, T., 2000. Multilinguale Spracherkennung: Kombination akustischer Modelle zur Portierung auf neue Sprachen. Universität Karlsruhe, Institut für Logik, Komplexität und Deduktionssysteme (Ph.D. thesis).
Schwarz, P., 2009. Phoneme recognition based on long temporal context. Faculty of Information Technology, Brno University of Technology (BUT) (Ph.D. thesis). http://www.fit.vutbr.cz/research/viewpub.php?id=9132
Shen, W., White, C., Hazen, T.J., 2009. A comparison of query-by-example methods for spoken term detection. In: Proc. INTERSPEECH. ISCA, Brighton, UK.
Sherwani, J., Ali, N., Mirza, S., Fatma, A., Memon, Y., Karim, M., Tongia, R., Rosenfeld, R., 2007. Healthline: speech-based access to health information by low-literate users. In: Proc. IEEE/ACM Int’l Conference on Information and Communication Technologies and Development, Bangalore, India.
Szöke, I., Schwarz, P., Matějka, P., Burget, L., Karafiát, M., Černocký, J., 2005. Phoneme based acoustics keyword spotting in informal continuous speech. LNAI 3658, 302–309, http://www.fit.vutbr.cz/research/viewpub.php?id=7882
Szöke, I., Tejedor, J., Fapšo, M., Colás, J., 2011. BUT-HCTLab approaches for spoken web search. In: Proc. MediaEval 2011, http://www.multimediaeval.org/mediaeval2011/; http://ceur-ws.org/Vol-807/
Szöke, I., Fapšo, M., Veselý, K., 2012. BUT 2012 approaches for spoken web search – MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/
Szöke, I., 2010. Hybrid word-subword spoken term detection. Faculty of Information Technology BUT (Ph.D. thesis). http://www.fit.vutbr.cz/research/viewpub.php?id=9375
Varona, A., Penagarikano, M., Rodriguez-Fuentes, L.J., Bordel, G., Diez, M., 2012. GTTS system for the spoken web search task at MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/
Vavrek, J., Pleva, M., Juhár, J., 2012. TUKE MediaEval 2012: spoken web search using DTW and unsupervised SVM. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/
Wallace, R.G., Vogt, R.J., Sridharan, S., 2007. A phonetic search approach to the 2006 NIST spoken term detection evaluation. In: Proc. INTERSPEECH. ISCA, Antwerpen, Belgium.
Wang, H., Lee, T., 2012. CUHK system for the spoken web search task at MediaEval 2012. In: Proc. MediaEval 2012, http://www.multimediaeval.org/mediaeval2012/; http://ceur-ws.org/Vol-927/
Wang, D., King, S., Frankel, J., 2011. Stochastic pronunciation modelling for out-of-vocabulary spoken term detection. IEEE Trans. Audio, Speech, Language Process. 9 (4), http://homepages.inf.ed.ac.uk/v1dwang2/public/tools/index.html
Wang, H., Lee, T., Leung, C.-C., Ma, B., Li, H., 2013. Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection. In: Proc. ICASSP. IEEE, Vancouver, Canada.
Wang, D., 2010. Out-of-vocabulary spoken term detection. University of Edinburgh (Ph.D. thesis).
Zero resource speech technologies and models of early language acquisition, 2012. http://www.clsp.jhu.edu/workshops/archive/ws-12/groups/mini-workshop/ Last accessed: March 1, 2014.
Zhang, Y., Glass, J., 2009. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In: Proc. ASRU, IEEE, Merano, Italy.
Zhang, Y., Salakhutdinov, R., Chang, H.-A., Glass, J., 2012. Resource configurable spoken query detection using deep Boltzmann machines. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 5161–5164.
Zulu, P.N., Botha, G., Barnard, E., 2008. Orthographic measures of language distances between the official south African languages. Literator 29 (1), 1–20.