Query by Example Search on Speech at MediaEval 2014

Xavier Anguera, Telefonica Research, Barcelona, Spain (xanguera@tid.es)
Luis Javier Rodriguez-Fuentes, University of the Basque Country, Leioa, Spain (luisjavier.rodriguez@ehu.es)
Igor Szöke, Brno University of Technology, Brno, Czech Republic (szoke@fit.vutbr.cz)
Andi Buzo, University Politehnica of Bucharest, Bucharest, Romania (andi.buzo@upb.ro)
Florian Metze, Carnegie Mellon University, Pittsburgh, PA, U.S.A. (fmetze@cs.cmu.edu)

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT
In this paper, we describe the "Query by Example Search on Speech Task" (QUESST, formerly SWS, "Spoken Web Search"), held as part of the MediaEval 2014 evaluation campaign. As in previous years, the proposed task requires performing language-independent audio search in a low-resource scenario. This year, the task has been designed to get as close as possible to a practical use case, in which a user would like to retrieve, using speech, utterances containing a given word or short sentence, including those with limited inflectional variations of words, some filler content and/or word re-orderings.

1. INTRODUCTION
After three years running as SWS ("Spoken Web Search") [4, 3, 1, 2], the task has been renamed QUESST ("QUery by Example Search on Speech Task") to better reflect its nature: to search FOR audio content WITHIN audio content USING an audio query. As in previous years, the search database was collected from heterogeneous sources, covering multiple languages and diverse acoustic conditions. Some of these languages are resource-limited, some are recorded in challenging acoustic conditions and some contain heavily accented speech (typically from non-native speakers). No transcriptions, language tags or any other metadata are provided to participants. The task therefore requires researchers to build a language-independent audio-to-audio search system. As in previous years, the database will be made publicly available for research purposes after the evaluation concludes.

Three main changes were introduced for this year's evaluation, namely on the search task, on the evaluation metrics, and on the types of query matchings. First, the task no longer requires the localization (time stamps) of query matchings within audio files (which, on the other hand, are relatively short: less than 30 seconds long). However, systems must provide a score (a real number) for each query matching: the higher (the more positive) the score, the more likely that the query appears in the audio file. Second, the normalized cross entropy cost (Cnxe) [5] is used as the primary metric, whereas the Actual Term Weighted Value (ATWV, used as the primary metric in previous years) is kept as a secondary metric for diagnostic purposes, which means that systems must provide not only scores, but also Yes/No decisions. And third, three types of query matchings are considered: the first one is the "exact match" case used in previous years, whereas the second one, which allows for inflectional variations of words, and the third one, which allows for word re-orderings and some filler content between words, are "approximate matches" that simulate how we imagine users would want to use this technology.

2. BRIEF TASK DESCRIPTION
QUESST is part of the MediaEval 2014 evaluation campaign (http://www.multimediaeval.org/mediaeval2014/). As usual, two separate sets of queries are provided, for development and evaluation, along with a single set of audio files on which both sets of queries must be searched. The set of development queries and the set of audio files are distributed early (June 2nd), including the ground-truth and the scoring scripts, so that participants can develop and evaluate their systems. The set of evaluation queries is distributed one month later (July 1st). System results (for both sets of queries) must be returned by the evaluation deadline (September 9th), including a likelihood score and a Yes/No decision for each (query, audio file) pair. Note that not every query necessarily appears in the set of audio files, and that several queries may appear in the same audio file. Also, there could be some overlap between evaluation and development queries. Multiple system results can be submitted (up to 5), but one of them (presumably the best one) must be identified as primary. In addition, although participants are encouraged to train their systems using only the data released for this year's evaluation, they are allowed to use any additional resources they might have available, as long as their use is documented in their system papers. System results are then scored and returned to participants (by September 16th), who must prepare a two-page working notes paper describing their systems and return it to the organizers (by September 28th). Finally, systems are presented and results discussed at the MediaEval workshop, which serves to meet fellow participants, share ideas and bootstrap future collaborations.
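For concreteness, the required system output can be thought of as one scored, labelled trial per (query, audio file) pair. The short Python sketch below writes such a list; the whitespace-separated column layout, the identifiers and the decision threshold are hypothetical stand-ins of ours, since the official submission format is the one defined by the task's evaluation plan and scoring scripts.

# Minimal sketch: one real-valued score and one Yes/No decision per
# (query, audio file) pair. The layout and identifiers are hypothetical;
# the official format is defined by the task's scoring scripts.

def write_results(scores, path, threshold=0.0):
    """scores: dict mapping (query_id, audio_id) to a detection score,
    where higher (more positive) means the query is more likely present."""
    with open(path, "w") as f:
        for (query_id, audio_id), score in sorted(scores.items()):
            decision = "YES" if score >= threshold else "NO"  # system-dependent threshold
            f.write(f"{query_id} {audio_id} {score:.4f} {decision}\n")

# Hypothetical example with two trials for one query:
write_results({("dev_query_0001", "utterance_00001"): 2.3,
               ("dev_query_0001", "utterance_00002"): -5.1},
              "primary_results.txt")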
3. THE QUESST 2014 DATASET
The QUESST 2014 dataset is the result of a joint effort by several institutions to put together a sizable amount of data to be used in this evaluation and for later research on the topic of query-by-example search on speech (a download link will be provided after the evaluation). The search corpus is composed of around 23 hours of audio (12492 files) in the following 6 languages: Albanian, Basque, Czech, non-native English, Romanian and Slovak, with different amounts of audio per language. The search utterances, which are relatively short (6.6 seconds long on average), were automatically extracted from longer recordings and manually checked to avoid very short or very long utterances. The QUESST 2014 dataset includes 560 development queries and 555 evaluation queries, the number of queries per language being more or less balanced with the amount of audio available in the search corpus. A big effort has been made to manually record most of the queries, in order to avoid problems observed in previous years due to the acoustic context carried over when queries are cut from longer sentences. Speakers recruited to record the queries were asked to maintain a normal speaking speed and a clear speaking style. All audio files are PCM encoded at 8 kHz, 16 bits/sample, and stored in WAV format.

4. THE GROUND-TRUTH
The biggest novelty in this year's evaluation comes from the new (relaxed) concept of a query match, which strongly affects the ground-truth definition and thus the way systems are expected to work. Besides the "exact matching" used in previous years, two types of "approximate matchings" are considered. We denote these matchings as Type 1, 2 and 3, respectively; they are defined as follows:

Type 1 (Exact): Only occurrences that exactly match the lexical representation of the query are considered as hits, just like in previous years. For example, the query "white horse" would match the utterance "My white horse is beautiful".

Type 2 (Variant): In this case, query occurrences that slightly differ from the lexical representation of the query, either at its beginning or at its end, are considered as hits. Systems therefore need to account for small portions of audio that do not match the query's lexical representation. When producing the ground-truth for this type of matching, the matching part of any query was required to exceed 5 phonemes (250 ms), and the non-matching part was required to be much smaller than the matching part (a simple check of these constraints is sketched at the end of this section). For example, the query "researcher" would match an audio file containing "research" (note that the query "research" would also match an audio file containing "researcher").

Type 3 (Reordering/Filler): Given a multi-word query, a hit is required to contain all the words in the query, but possibly in a different order and with some small amount of filler content between words; slight differences between word occurrences and their lexical representations are also allowed (as in Type 2). For example, the query "white snow" would match an utterance containing either "snow is white", "whitest snow" or "whiter than snow". Note that queries provided in this evaluation are spoken continuously, with no silences between words, and thus participants should develop robust techniques to account for partial matchings. Note also that, when producing the ground-truth for this type of matching, hits were allowed to contain a large amount of filler content between words.

The ground truth was created either manually by native speakers or automatically by speech recognition engines tuned to each particular language, and was provided by the task organizers following the format of NIST's Spoken Term Detection evaluations. The development package contains a general ground-truth folder (the one that must be used to score system results on the development set), which considers all types of matchings, but also three ground-truth folders specific to each type of matching, to allow participants to evaluate their progress on each condition during system development.
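As a purely illustrative aid, the snippet below encodes the quantitative Type 2 constraints stated above (matching part longer than 5 phonemes and 250 ms, non-matching part much smaller than the matching part). The task description does not quantify "much smaller", so the 0.5 ratio, the function name and the approximate phoneme counts in the example are our assumptions, not part of the official ground-truth tooling.

# Sketch of the Type 2 ground-truth constraints described above.
# The "much smaller" condition is not quantified in the task description;
# the 0.5 ratio used here is an arbitrary illustrative choice.

def qualifies_as_type2_hit(match_phones, match_secs, nonmatch_phones,
                           max_nonmatch_ratio=0.5):
    """Return True if a candidate occurrence satisfies the Type 2 criteria.

    match_phones / match_secs: size of the part that matches the query's
    lexical representation, in phonemes and seconds.
    nonmatch_phones: size of the differing prefix or suffix, in phonemes.
    """
    long_enough = match_phones > 5 and match_secs > 0.25   # > 5 phonemes, > 250 ms
    small_difference = nonmatch_phones <= max_nonmatch_ratio * match_phones
    return long_enough and small_difference

# "researcher" vs. an utterance containing "research": the shared part
# ("research", roughly 6 phonemes) is long and the extra suffix ("-er",
# roughly 1 phoneme) is short, so the occurrence qualifies.
print(qualifies_as_type2_hit(match_phones=6, match_secs=0.5, nonmatch_phones=1))  # True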
5. PERFORMANCE METRICS
The primary metric used in QUESST 2014 is the normalized cross entropy cost (Cnxe), already used in SWS 2013 as a secondary metric [1]. This metric has been used for several years in the language and speaker recognition fields to calibrate system scores, and it shows interesting properties. Furthermore, we found experimentally that Cnxe and ATWV performances correlate quite well. A scoring script has been specifically prepared for this year's evaluation, so that NIST software is not required anymore (thanks to Mikel Peñagarikano, from the University of the Basque Country, for creating the scoring script). For the Cnxe scores to be meaningful, participants are requested either to return a score (that will be taken as a log-likelihood ratio) for every (query, audio file) pair, or alternatively, to define a default (floor) score for all the pairs not included in the results file. TWV metrics are computed with the following application parameters: Ptarget = 0.0008, Cfa = 1 and Cmiss = 100. Participants are also required to report on their real-time running factor, hardware characteristics and peak memory requirements, in order to profile the different approaches applied. See [5] for further information on how the metrics work and how they are computed.
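To make the two metrics concrete, here is a minimal sketch of how they can be computed from system output, following the standard definitions of normalized cross-entropy and term-weighted value summarized in [5]: scores are treated as log-likelihood ratios and mapped to posteriors at a detection prior for Cnxe, while ATWV averages the term-weighted value obtained from the Yes/No decisions. The helper names, the aggregated-count input layout and the reuse of Ptarget as the Cnxe prior are our simplifications; the official scoring script remains the reference implementation.

import math

P_TARGET, C_FA, C_MISS = 0.0008, 1.0, 100.0
BETA = (C_FA / C_MISS) * (1.0 / P_TARGET - 1.0)   # about 12.49 at the official settings

def cnxe(target_llrs, nontarget_llrs, p_target=P_TARGET):
    """Normalized cross-entropy of log-likelihood-ratio scores (lower is better).
    The prior used here is an assumption; the evaluation plan [5] fixes the actual one."""
    logit_prior = math.log(p_target / (1.0 - p_target))
    def post(llr):  # posterior probability of a hit, given the score and the prior
        return 1.0 / (1.0 + math.exp(-(llr + logit_prior)))
    xe = -(p_target * sum(math.log2(post(s)) for s in target_llrs) / len(target_llrs)
           + (1.0 - p_target) * sum(math.log2(1.0 - post(s)) for s in nontarget_llrs)
           / len(nontarget_llrs))
    prior_xe = -(p_target * math.log2(p_target)
                 + (1.0 - p_target) * math.log2(1.0 - p_target))
    return xe / prior_xe

def atwv(counts_by_query):
    """counts_by_query: query -> (n_correct_yes, n_true, n_false_alarms, n_nontarget_trials)."""
    values = []
    for n_hit, n_true, n_fa, n_nt in counts_by_query.values():
        if n_true == 0:          # queries with no true occurrence do not contribute
            continue
        p_miss = 1.0 - n_hit / n_true
        p_fa = n_fa / n_nt if n_nt else 0.0
        values.append(1.0 - p_miss - BETA * p_fa)
    return sum(values) / len(values)

As a sanity check, a system returning completely non-informative scores (log-likelihood ratio 0 for every pair) obtains Cnxe close to 1, so informative and well-calibrated systems should fall well below that value, while a perfect ATWV is 1.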
6. ACKNOWLEDGEMENTS
We would like to thank the MediaEval organizers for their support and all the participants for their hard work. Data were provided by the QUESST organizers and by the Technical University of Kosice (TUKE), Slovak Republic. Igor Szöke was supported by the Czech Science Foundation, under project GPP202/12/P567.

7. REFERENCES
[1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes. The Spoken Web Search Task. In Proc. MediaEval 2013 Workshop, 2013.
[2] F. Metze, X. Anguera, E. Barnard, M. Davel, and G. Gravier. Language Independent Search in MediaEval's Spoken Web Search Task. Computer Speech and Language, Special Issue on Information Extraction & Retrieval, 2014.
[3] F. Metze, E. Barnard, M. Davel, C. van Heerden, X. Anguera, G. Gravier, and N. Rajput. The Spoken Web Search Task. In Proc. MediaEval 2012 Workshop, 2012.
[4] N. Rajput and F. Metze. Spoken Web Search. In Proc. MediaEval 2011 Workshop, 2011.
[5] L. J. Rodriguez-Fuentes and M. Penagarikano. MediaEval 2013 Spoken Web Search Task: System Performance Measures. Technical Report TR-2013-1, DEE, University of the Basque Country, 2013. Online: http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf.