The Spoken Web Search Task

Xavier Anguera*, Florian Metze†, Andi Buzo‡, Igor Szoke♯, Luis Javier Rodriguez-Fuentes§

ABSTRACT

In this paper we describe the “Spoken Web Search” task, held as part of the 2013 MediaEval campaign. The purpose of this task is to perform audio search in multiple languages and acoustic conditions, with very few resources available for each individual language. This year the data contains audio from nine different languages and is much larger than in previous years, mimicking realistic low/zero-resource settings.

1. INTRODUCTION

The “Spoken Web Search” (SWS) task of MediaEval 2013 [3] involves searching for audio content within audio content using an audio query. The task requires researchers to build a language-independent audio search system such that, given an audio query, it can find the appropriate audio file(s) and the exact location(s) of the query term within those file(s). Evaluation is performed using standard NIST metrics [1] in addition to some other indicators.

The 2013 evaluation expands on the MediaEval 2011 and 2012 “Spoken Web Search” tasks [6, 7] by increasing the size of the test dataset and the number of languages (recorded in different acoustic conditions). In addition, a baseline system is being offered this year to first-time participants as a virtual kitchen appliance.

2. MOTIVATION AND RELATED WORK

Imagine you want to build a simple speech recognition system, or at least a spoken term detection (STD) or keyword search (KWS) system, for a new dialect, language or acoustic condition for which only very few audio examples are available. Perhaps there are not even any transcripts for that data. Is it possible to do something useful (e.g. identify the topic of a query) using only those very limited resources? Full-fledged speech recognition may be unrealistic for such a task, and may not even be required to solve a specific information access or search problem.

This task was originally proposed by IBM Research India, who provided the 2011 data [2]. In 2012, the evaluation was performed on new data gathered from 4 different African languages [5]. The 2012 data is made available to participants to help them in their system development.

3. TASK DESCRIPTION

Participants receive audio data as well as development and evaluation (audio) queries, described in more detail below. Only the occurrences of the development queries in the data are provided.

Participants are required to identify and submit which query (or queries, from the set of evaluation queries) occur(s) in each utterance (0-n matches per term, i.e. not every term necessarily occurs, but multiple matches are possible per utterance). There may be partial overlap between evaluation and development queries. In addition, participants are asked to submit their development output (i.e. the detections of the development queries on the data) for comparison purposes.

Participants can submit multiple systems, but need to designate one primary system. Participants are encouraged to submit a system trained only on data released for the 2013 SWS task, but are allowed to use any additional resources they might have available, as long as their use is documented.

For the first time this year, a “Speech Recognition Virtual Kitchen” appliance [8] is made available to participants as a baseline system to experiment with. It consists of a Linux-based virtual machine running a complete SWS system.
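Although participants are free to use any approach, a common language-independent technique for this kind of zero-resource search is query-by-example matching with subsequence dynamic time warping (DTW) over acoustic features. The following minimal Python/NumPy sketch illustrates the core idea only; all names are ours, and this is not necessarily how the baseline appliance or any participant system works.

import numpy as np

def subsequence_dtw(query, utterance):
    """Best match of `query` (frames x dims) anywhere inside `utterance`.
    Returns (length-normalized cost, index of the last matched frame)."""
    # Cosine distance between every query frame and every utterance frame.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    dist = 1.0 - q @ u.T                      # shape: (len_q, len_u)

    # Subsequence DTW: the match may start at any utterance frame, so the
    # first row is not accumulated along the utterance axis.
    acc = np.empty_like(dist)
    acc[0, :] = dist[0, :]
    for i in range(1, dist.shape[0]):
        for j in range(dist.shape[1]):
            best = acc[i - 1, j]              # only a vertical step at j == 0
            if j > 0:
                best = min(best, acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i, j] + best
    end = int(np.argmin(acc[-1, :]))          # free end point
    return acc[-1, end] / query.shape[0], end

# Toy usage with random "features" standing in for real MFCCs:
rng = np.random.default_rng(0)
cost, end = subsequence_dtw(rng.standard_normal((20, 13)),
                            rng.standard_normal((500, 13)))

A detection would be hypothesized wherever the length-normalized cost falls below a tuned threshold, with (for example) the negative cost serving as the detection score.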
3.1 Development and Evaluation Data

As a result of a joint effort between several institutions, a challenging new dataset, together with accompanying queries, has been put together for the 2013 evaluation. This dataset is composed of 20 hours of audio in the following 9 languages: Albanian, Basque, Czech, non-native English, isiXhosa, isiZulu, Romanian, Sepedi and Setswana. The recording conditions are not constant across languages: some recordings were obtained with in-room microphones, while others were made on the street with cellphones. All data has been converted to 8 kHz, 16-bit WAV files. Moreover, the amount of audio available differs from language to language. This database is over five times the size of the 2012 databases. The development and evaluation queries are mutually exclusive segments defined within the same data collection. For this reason, no information on the language being spoken or the transcription of the files is released with the development data. We believe that, with such a variety of data, the concept of over-fitting to the dev-test set is quite diluted; if anything, it should be seen as a good thing for systems to be able to take advantage of knowing the possible acoustics of the test languages.

Accompanying the dataset, two sets of queries have been created for use in development and evaluation, each one containing two subsets of basic and extended queries. Basic sets of 500+ queries each are to be used by participants in their required runs. In addition, for some of the basic queries, alternative spoken instances of the same lexical terms have been gathered and are made available to participants to be used (together with the basic queries) in their extended runs. Such extended runs are intended to show how results would vary if systems could take advantage of multiple repetitions of each query.

In addition to the main database used this year, the 2012 “African” database [4] is also being made available to participants, in the hope that it will be of help in the development phase. It consists of over 1580 files, with 100 queries each for development and evaluation, recorded in 4 African languages. Participants should note that the acoustic conditions of this dataset only match those of a small part of the 2013 dataset.

A “termlist” XML file and a transcription RTTM file are provided with the development data, following the guidelines of the NIST STD 2006 evaluation [1]. This year the reference files do not contain any information regarding the language or the content spoken in each file; only the locations of the queries are given in the reference RTTM file. This is done in order not to give away any extra information about the dataset when releasing the development data, as it is shared with the evaluation queries.
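For illustration only (identifiers and values below are hypothetical; the authoritative schema is the one shipped with the data, following [1]), a termlist entry and a reference RTTM line look roughly as follows:

<term termid="sws2013_dev_0001">
  <termtext>example query</termtext>
</term>

LEXEME sws2013_file_0001 1 12.34 0.56 example lex speaker_01 0.95

In the RTTM line, the fields are the entry type, file identifier, channel, begin time and duration in seconds, orthography, subtype, speaker name and confidence.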
4. EVALUATION OF RESULTS

The ground truth for this year has been created in a variety of ways. Sometimes it has been created manually by native speakers, while in other cases a speech recognition system has been used to force-align the transcripts at the word level. Note that word alignments might not be perfect, which is why a margin of error is allowed by the scoring scripts.

The main evaluation metric this year remains the same as in previous years, following the principles and using the tools of NIST’s Spoken Term Detection (STD) evaluations. The primary evaluation metric is ATWV (Actual Term-Weighted Value), as used in the NIST 2006 Spoken Term Detection (STD) evaluation [1]. A scoring package with easy-to-use scripts and an example scoring setup has been made available to participants with the development data. This year we are again applying a different scoring working point, modifying the miss and false alarm costs to better match the new test data.

In addition, two secondary metrics are being introduced this year. On the one hand, the normalized cross-entropy metric Cnxe evaluates the information provided by system scores (in contrast to TWV, which uses system decisions). This metric originates from the NIST SRE evaluations and is computed assuming that submitted scores can be interpreted as log-likelihood ratios. On the other hand, the real-time factor (the processing time divided by the duration of the processed audio) evaluates the computational resources required by the systems. Participants are also requested to indicate the type of machines used in the evaluation and (approximately) the peak memory usage, in order for the organizers to compute a global processing load metric per system. See [9] for a detailed description of these metrics.
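For reference, the term-weighted value underlying ATWV can be sketched as follows (following the NIST STD 2006 definitions [1]; the actual miss/false alarm costs and term prior used in SWS 2013 are those configured in the distributed scoring package):

\mathrm{TWV}(\theta) = 1 - \frac{1}{|T|} \sum_{t \in T} \left[ P_{\mathrm{miss}}(t, \theta) + \beta \, P_{\mathrm{fa}}(t, \theta) \right],
\qquad
\beta = \frac{C_{\mathrm{fa}}}{C_{\mathrm{miss}}} \left( \mathrm{Pr}_{\mathrm{term}}^{-1} - 1 \right)

Here T is the set of evaluated terms, and ATWV is the TWV obtained at the decision threshold θ that the system actually applied.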
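Similarly, the normalized cross-entropy can be sketched as (a summary of the definitions detailed in [9]):

C_{\mathrm{nxe}} = \frac{C_{\mathrm{xe}}}{C_{\mathrm{xe}}^{\mathrm{prior}}},
\qquad
C_{\mathrm{xe}}^{\mathrm{prior}} = -P_{\mathrm{tar}} \log_2 P_{\mathrm{tar}} - (1 - P_{\mathrm{tar}}) \log_2 (1 - P_{\mathrm{tar}})

where C_xe is the empirical binary cross-entropy of the posteriors derived from the submitted log-likelihood-ratio scores at the evaluation prior P_tar, and the denominator is the cross-entropy of a trivial system that only knows the prior. An informative system yields Cnxe well below 1, and a perfect one approaches 0.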
5. OUTLOOK

Low (or even zero) resource speech recognition is currently receiving a lot of attention and will soon be mature enough to be useful in real-life scenarios. The “Spoken Web Search” task originated as an alternative to standard techniques for low/zero-resourced languages for which good speech recognizers do not exist. This year we have extended this paradigm to include audio data about which not much is known a priori, by mixing several languages and acoustic conditions in the same test dataset. By comparing the results obtained by the different systems in this friendly evaluation, we expect to help push forward the state of the art in this area.

6. ACKNOWLEDGMENTS

The SWS task organizers would like to thank the MediaEval organizers [3] and all the participants for putting a lot of hard work into their submissions. The “African” data [5] for this year and last year was kindly collected by CSIR and made available by Charl van Heerden at NWU. Igor Szöke was supported by the Grant Agency of the Czech Republic, post-doctoral project No. GPP202/12/P567.

7. REFERENCES

[1] J. Fiscus, J. Ajot, J. Garofolo, and G. Doddington, “Results of the 2006 Spoken Term Detection Evaluation,” in Proc. ACM SIGIR Workshop on Searching Spontaneous Conversational Speech (SSCS), 2007.
[2] M. Diao, S. Mukherjea, N. Rajput, and K. Srivastava, “Faceted Search and Browsing of Audio Content on Spoken Web,” in Proc. CIKM, 2010.
[3] http://www.multimediaeval.org/mediaeval2013/index.html
[4] F. Metze, E. Barnard, M. Davel, C. van Heerden, X. Anguera, G. Gravier, and N. Rajput, “The Spoken Web Search Task,” in Workshop Notes of MediaEval 2012, Pisa, Italy, 2012.
[5] E. Barnard, M. Davel, and C. van Heerden, “ASR Corpus Design for Resource-Scarce Languages,” in Proc. INTERSPEECH, Brighton, UK, Sep. 2009, pp. 2847-2850.
[6] F. Metze, N. Rajput, X. Anguera, M. Davel, G. Gravier, C. van Heerden, G. V. Mantena, A. Muscariello, K. Prahallad, I. Szöke, and J. Tejedor, “The Spoken Web Search Task at MediaEval 2011,” in Proc. ICASSP, Kyoto, Japan, Mar. 2012. IEEE.
[7] F. Metze, X. Anguera, E. Barnard, M. Davel, and G. Gravier, “The Spoken Web Search Task at MediaEval 2012,” in Proc. ICASSP, Vancouver, Canada, May 2013. IEEE.
[8] F. Metze and E. Fosler-Lussier, “The Speech Recognition Virtual Kitchen: An Initial Prototype,” in Proc. INTERSPEECH, Portland, OR, USA, Sep. 2012.
[9] L. J. Rodriguez-Fuentes and M. Penagarikano, “MediaEval 2013 Spoken Web Search Task: System Performance Measures,” Technical Report no. TR-2013-1, Department of Electricity and Electronics, University of the Basque Country, 2013. http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf

* Telefonica Research, Barcelona, Spain; xanguera@tid.es
† Carnegie Mellon University, Pittsburgh, PA, USA; fmetze@cs.cmu.edu
‡ University Politehnica of Bucharest, Romania; andi.buzo@upb.ro
♯ Brno University of Technology, Czech Republic; szoke@fit.vutbr.cz
§ University of the Basque Country, Spain; luisjavier.rodriguez@ehu.es

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.