The Spoken Web Search Task

Xavier Anguera*, Florian Metze†, Andi Buzo‡, Igor Szoke♯, Luis Javier Rodriguez-Fuentes§

ABSTRACT

In this paper we describe the “Spoken Web Search” task, held as part of the 2013 MediaEval campaign. The purpose of this task is to perform audio search in multiple languages and acoustic conditions, with very few resources available for each individual language. This year the data contains audio from nine different languages and is much larger than in previous years, mimicking realistic low/zero-resource settings.

1. INTRODUCTION

The “Spoken Web Search” (SWS) task of MediaEval 2013 [3] involves searching for audio content within audio content using an audio query. The task requires researchers to build a language-independent audio search system such that, given an audio query, it can find the appropriate audio file(s) and the exact location(s) of the query term within those file(s). Evaluation is performed using standard NIST metrics [1] in addition to some other indicators.

The 2013 evaluation expands on the MediaEval 2011 and 2012 “Spoken Web Search” tasks [6, 7] by increasing the size of the test dataset and the number of languages (recorded in different acoustic conditions). In addition, a baseline system is being offered this year to first-time participants as a virtual kitchen appliance.

2. MOTIVATION AND RELATED WORK

Imagine you want to build a simple speech recognition system, or at least a spoken term detection (STD) or keyword search (KWS) system, for a new dialect, language or acoustic condition for which only very few audio examples are available. Perhaps there are not even any transcripts for that data. Is it possible to do something useful (e.g. identify the topic of a query) using only those very limited resources? Full-fledged speech recognition may be unrealistic for such a task, and may not even be required to solve a specific information access or search problem.

This task was originally proposed by IBM Research India, who provided the 2011 data [2]. In 2012, the evaluation was performed on new data gathered from 4 different African languages [5]. The 2012 data is made available to participants to help them in their system development.

3. TASK DESCRIPTION

Participants receive audio data as well as development and evaluation (audio) queries, described in more detail below. Only the occurrences of the development queries in the data are provided.

Participants are required to identify and submit which query (or queries, from the set of evaluation queries) occur(s) in each utterance (0-n matches per term, i.e. not every term necessarily occurs, but multiple matches are possible per utterance). There may be partial overlap between evaluation and development queries. In addition, participants are asked to submit their development output (i.e. the detections of the development queries on the data) for comparison purposes.

Participants can submit multiple systems, but need to designate one primary system. Participants are encouraged to submit a system trained only on data released for the 2013 SWS task, but are allowed to use any additional resources they might have available, as long as their use is documented.

For the first time this year, a “Speech Recognition Virtual Kitchen” appliance [8] is made available to participants as a baseline system to experiment with. It consists of a Linux-based virtual machine running a complete SWS system.
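Although participants are free to use any approach, a common language-independent technique for this kind of zero-resource search is query-by-example matching with subsequence dynamic time warping (DTW) over acoustic features. The following minimal Python/NumPy sketch illustrates the core idea only; all names are ours, and this is not necessarily how the baseline appliance or any participant system works.

import numpy as np

def subsequence_dtw(query, utterance):
    """Best match of `query` (frames x dims) anywhere inside `utterance`.
    Returns (length-normalized cost, index of the last matched frame)."""
    # Cosine distance between every query frame and every utterance frame.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    dist = 1.0 - q @ u.T                      # shape: (len_q, len_u)

    # Subsequence DTW: the match may start at any utterance frame, so the
    # first row is not accumulated along the utterance axis.
    acc = np.empty_like(dist)
    acc[0, :] = dist[0, :]
    for i in range(1, dist.shape[0]):
        for j in range(dist.shape[1]):
            best = acc[i - 1, j]              # only a vertical step at j == 0
            if j > 0:
                best = min(best, acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i, j] + best
    end = int(np.argmin(acc[-1, :]))          # free end point
    return acc[-1, end] / query.shape[0], end

# Toy usage with random "features" standing in for real MFCCs:
rng = np.random.default_rng(0)
cost, end = subsequence_dtw(rng.standard_normal((20, 13)),
                            rng.standard_normal((500, 13)))

A detection would be hypothesized wherever the length-normalized cost falls below a tuned threshold, with (for example) the negative cost serving as the detection score.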
3.1 Development and Evaluation Data

As a result of a joint effort between several institutions, a challenging new dataset, together with accompanying queries, has been put together for the 2013 evaluation. This dataset is composed of 20 hours of audio in the following 9 languages: Albanian, Basque, Czech, non-native English, isiXhosa, isiZulu, Romanian, Sepedi and Setswana. The recording conditions are not constant across languages: some recordings were obtained with in-room microphones, while others were made on the street with cellphones. All data has been converted to 8 kHz, 16-bit WAV files. Moreover, the amount of audio available differs from language to language. This database is over five times the size of the 2012 databases. The development and evaluation queries are mutually exclusive segments defined within the same data collection. For this reason, no information on the language being spoken or the transcription of the files is released with the development data. We believe that, with such a variety of data, the concept of over-fitting to the dev-test set is quite diluted; if anything, it should be seen as a good thing for systems to be able to take advantage of knowing the possible acoustics of the test languages.

Accompanying the dataset, two sets of queries have been created for use in development and evaluation, each one containing two subsets of basic and extended queries. Basic sets of 500+ queries each are to be used by participants in their required runs. In addition, for some of the basic queries, alternative spoken instances of the same lexical terms have been gathered and are made available to participants to be used (together with the basic queries) in their extended runs. Such extended runs are intended to show how results would vary if systems could take advantage of multiple repetitions of each query.

In addition to the main database used this year, the 2012 “African” database [4] is also being made available to participants, in the hope that it will be of help in the development phase. It consists of over 1580 files, with 100 queries each for development and evaluation, recorded in 4 African languages. Participants should note that the acoustic conditions of this dataset only match those of a small part of the 2013 dataset.

A “termlist” XML file and a transcription RTTM file are provided with the development data, following the guidelines of the NIST STD 2006 evaluation [1]. This year the reference files do not contain any information regarding the language or the content spoken in each file; only the locations of the queries are given in the reference RTTM file. This is done in order not to give away any extra information about the dataset when releasing the development data, as it is shared with the evaluation queries.
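For illustration only (identifiers and values below are hypothetical; the authoritative schema is the one shipped with the data, following [1]), a termlist entry and a reference RTTM line look roughly as follows:

<term termid="sws2013_dev_0001">
  <termtext>example query</termtext>
</term>

LEXEME sws2013_file_0001 1 12.34 0.56 example lex speaker_01 0.95

In the RTTM line, the fields are the entry type, file identifier, channel, begin time and duration in seconds, orthography, subtype, speaker name and confidence.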
4. EVALUATION OF RESULTS

The ground truth for this year has been created in a variety of ways. Sometimes it has been created manually by native speakers, while in other cases a speech recognition system has been used to force-align the transcripts at the word level. Note that word alignments might not be perfect, which is why a margin of error is allowed by the scoring scripts.

The main evaluation metric this year remains the same as in previous years, following the principles and using the tools of NIST’s Spoken Term Detection (STD) evaluations. The primary evaluation metric is ATWV (Actual Term-Weighted Value), as used in the NIST 2006 Spoken Term Detection (STD) evaluation [1]. A scoring package with easy-to-use scripts and an example scoring setup has been made available to participants with the development data. This year we are again applying a different scoring working point, modifying the miss and false alarm costs to better match the new test data.

In addition, two secondary metrics are being introduced this year. On the one hand, the normalized cross-entropy metric Cnxe evaluates the information provided by system scores (in contrast to TWV, which uses system decisions). This metric originates from the NIST SRE evaluations and is computed assuming that submitted scores can be interpreted as log-likelihood ratios. On the other hand, the real-time factor (the processing time divided by the duration of the processed audio) evaluates the computational resources required by the systems. Participants are also requested to indicate the type of machines used in the evaluation and (approximately) the peak memory usage, in order for the organizers to compute a global processing load metric per system. See [9] for a detailed description of these metrics.
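For reference, the term-weighted value underlying ATWV can be sketched as follows (following the NIST STD 2006 definitions [1]; the actual miss/false alarm costs and term prior used in SWS 2013 are those configured in the distributed scoring package):

\mathrm{TWV}(\theta) = 1 - \frac{1}{|T|} \sum_{t \in T} \left[ P_{\mathrm{miss}}(t, \theta) + \beta \, P_{\mathrm{fa}}(t, \theta) \right],
\qquad
\beta = \frac{C_{\mathrm{fa}}}{C_{\mathrm{miss}}} \left( \mathrm{Pr}_{\mathrm{term}}^{-1} - 1 \right)

Here T is the set of evaluated terms, and ATWV is the TWV obtained at the decision threshold θ that the system actually applied.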
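Similarly, the normalized cross-entropy can be sketched as (a summary of the definitions detailed in [9]):

C_{\mathrm{nxe}} = \frac{C_{\mathrm{xe}}}{C_{\mathrm{xe}}^{\mathrm{prior}}},
\qquad
C_{\mathrm{xe}}^{\mathrm{prior}} = -P_{\mathrm{tar}} \log_2 P_{\mathrm{tar}} - (1 - P_{\mathrm{tar}}) \log_2 (1 - P_{\mathrm{tar}})

where C_xe is the empirical binary cross-entropy of the posteriors derived from the submitted log-likelihood-ratio scores at the evaluation prior P_tar, and the denominator is the cross-entropy of a trivial system that only knows the prior. An informative system yields Cnxe well below 1, and a perfect one approaches 0.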
5. OUTLOOK

Low (or even zero) resource speech recognition is currently receiving a lot of attention and will soon be mature enough to be useful in real-life scenarios. The “Spoken Web Search” task originated as an alternative to standard techniques for low/zero-resourced languages for which good speech recognizers do not exist. This year we have extended this paradigm to include audio data about which not much is known a priori, by mixing several languages and acoustic conditions in the same test dataset. By comparing the results obtained by the different systems in this friendly evaluation, we expect to help push forward the state of the art in this area.

6. ACKNOWLEDGMENTS

The SWS task organizers would like to thank the MediaEval organizers [3] and all the participants for putting a lot of hard work into their submissions. The “African” data [5] for this year and last year was kindly collected by CSIR and made available by Charl van Heerden at NWU. Igor Szöke was supported by the Grant Agency of the Czech Republic, post-doctoral project No. GPP202/12/P567.

7. REFERENCES

[1] J. Fiscus, J. Ajot, J. Garofolo, and G. Doddington, “Results of the 2006 Spoken Term Detection Evaluation,” in Proc. ACM SIGIR Workshop on Searching Spontaneous Conversational Speech (SSCS), 2007.
[2] M. Diao, S. Mukherjea, N. Rajput, and K. Srivastava, “Faceted Search and Browsing of Audio Content on Spoken Web,” in Proc. CIKM, 2010.
[3] http://www.multimediaeval.org/mediaeval2013/index.html
[4] F. Metze, E. Barnard, M. Davel, C. van Heerden, X. Anguera, G. Gravier, and N. Rajput, “The Spoken Web Search Task,” in Workshop Notes of MediaEval 2012, Pisa, Italy, 2012.
[5] E. Barnard, M. Davel, and C. van Heerden, “ASR Corpus Design for Resource-Scarce Languages,” in Proc. INTERSPEECH, Brighton, UK, Sep. 2009, pp. 2847-2850.
[6] F. Metze, N. Rajput, X. Anguera, M. Davel, G. Gravier, C. van Heerden, G. V. Mantena, A. Muscariello, K. Prahallad, I. Szöke, and J. Tejedor, “The Spoken Web Search Task at MediaEval 2011,” in Proc. ICASSP, Kyoto, Japan, Mar. 2012. IEEE.
[7] F. Metze, X. Anguera, E. Barnard, M. Davel, and G. Gravier, “The Spoken Web Search Task at MediaEval 2012,” in Proc. ICASSP, Vancouver, Canada, May 2013. IEEE.
[8] F. Metze and E. Fosler-Lussier, “The Speech Recognition Virtual Kitchen: An Initial Prototype,” in Proc. INTERSPEECH, Portland, OR, USA, Sep. 2012.
[9] L. J. Rodriguez-Fuentes and M. Penagarikano, “MediaEval 2013 Spoken Web Search Task: System Performance Measures,” Technical Report no. TR-2013-1, Department of Electricity and Electronics, University of the Basque Country, 2013. http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf

* Telefonica Research, Barcelona, Spain; xanguera@tid.es
† Carnegie Mellon University, Pittsburgh, PA, USA; fmetze@cs.cmu.edu
‡ University Politehnica of Bucharest, Romania; andi.buzo@upb.ro
♯ Brno University of Technology, Czech Republic; szoke@fit.vutbr.cz
§ University of the Basque Country, Spain; luisjavier.rodriguez@ehu.es

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.