Query by Example Search on Speech at MediaEval 2014

Xavier Anguera, Telefonica Research, Barcelona, Spain (xanguera@tid.es)
Luis Javier Rodriguez-Fuentes, University of the Basque Country, Leioa, Spain (luisjavier.rodriguez@ehu.es)
Igor Szöke, Brno University of Technology, Brno, Czech Republic (szoke@fit.vutbr.cz)
Andi Buzo, University Politehnica of Bucharest, Bucharest, Romania (andi.buzo@upb.ro)
Florian Metze, Carnegie Mellon University, Pittsburgh, PA, U.S.A. (fmetze@cs.cmu.edu)

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT
In this paper, we describe the "Query by Example Search on Speech Task" (QUESST, formerly SWS, "Spoken Web Search"), held as part of the MediaEval 2014 evaluation campaign. As in previous years, the proposed task requires performing language-independent audio search in a low-resource scenario. This year, the task has been designed to get as close as possible to a practical use case, in which a user would like to retrieve, using speech, utterances containing a given word or short sentence, including those with limited inflectional variations of words, some filler content and/or word re-orderings.

1. INTRODUCTION
After three years running as SWS ("Spoken Web Search") [4, 3, 1, 2], the task has been renamed QUESST ("QUery by Example Search on Speech Task") to better reflect its nature: to search FOR audio content WITHIN audio content USING an audio query. As in previous years, the search database was collected from heterogeneous sources, covering multiple languages and diverse acoustic conditions. Some of these languages are resource-limited, some are recorded in challenging acoustic conditions and some contain heavily accented speech (typically from non-native speakers). No transcriptions, language tags or any other metadata are provided to participants. The task therefore requires researchers to build a language-independent audio-to-audio search system. As in previous years, the database will be made publicly available for research purposes after the evaluation concludes.

Three main changes were introduced for this year's evaluation, namely on the search task, on the evaluation metrics, and on the types of query matchings. First, the task no longer requires the localization (time stamps) of query matchings within audio files (which, on the other hand, are relatively short: less than 30 seconds long). However, systems must provide a score (a real number) for each query matching: the higher (the more positive) the score, the more likely that the query appears in the audio file. Second, the normalized cross entropy cost (Cnxe) [5] is used as the primary metric, whereas the Actual Term Weighted Value (ATWV, used as the primary metric in previous years) is kept as a secondary metric for diagnostic purposes, which means that systems must provide not only scores, but also Yes/No decisions. And third, three types of query matchings are considered: the first one is the "exact match" case used in previous years, whereas the second one, which allows for inflectional variations of words, and the third one, which allows for word re-orderings and some filler content between words, are "approximate matches" that simulate how we imagine users would want to use this technology.

2. BRIEF TASK DESCRIPTION
QUESST is part of the MediaEval 2014 evaluation campaign (http://www.multimediaeval.org/mediaeval2014/). As usual, two separate sets of queries are provided, for development and evaluation, along with a single set of audio files on which both sets of queries must be searched. The set of development queries and the set of audio files are distributed early (June 2nd), including the ground-truth and the scoring scripts, so that participants can develop and evaluate their systems. The set of evaluation queries is distributed one month later (July 1st). System results (for both sets of queries) must be returned by the evaluation deadline (September 9th), including a likelihood score and a Yes/No decision for each (query, audio file) pair. Note that not every query necessarily appears in the set of audio files, and that several queries may appear in the same audio file. Also, there could be some overlap between evaluation and development queries. Multiple system results can be submitted (up to 5), but one of them (presumably the best one) must be identified as primary. In addition, although participants are encouraged to train their systems using only the data released for this year's evaluation, they are allowed to use any additional resources they might have available, as long as their use is documented in their system papers. System results are then scored and returned to participants (by September 16th), who must prepare a two-page working notes paper describing their systems and return it to the organizers (by September 28th). Finally, systems are presented and results discussed at the MediaEval workshop, which serves to meet fellow participants, share ideas and bootstrap future collaborations.
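For concreteness, the required system output can be thought of as one scored, labelled trial per (query, audio file) pair. The short Python sketch below writes such a list; the whitespace-separated column layout, the identifiers and the decision threshold are hypothetical stand-ins of ours, since the official submission format is the one defined by the task's evaluation plan and scoring scripts.

# Minimal sketch: one real-valued score and one Yes/No decision per
# (query, audio file) pair. The layout and identifiers are hypothetical;
# the official format is defined by the task's scoring scripts.

def write_results(scores, path, threshold=0.0):
    """scores: dict mapping (query_id, audio_id) to a detection score,
    where higher (more positive) means the query is more likely present."""
    with open(path, "w") as f:
        for (query_id, audio_id), score in sorted(scores.items()):
            decision = "YES" if score >= threshold else "NO"  # system-dependent threshold
            f.write(f"{query_id} {audio_id} {score:.4f} {decision}\n")

# Hypothetical example with two trials for one query:
write_results({("dev_query_0001", "utterance_00001"): 2.3,
               ("dev_query_0001", "utterance_00002"): -5.1},
              "primary_results.txt")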
3. THE QUESST 2014 DATASET
The QUESST 2014 dataset is the result of a joint effort by several institutions to put together a sizable amount of data to be used in this evaluation and for later research on the topic of query-by-example search on speech (a download link will be provided after the evaluation). The search corpus is composed of around 23 hours of audio (12492 files) in the following 6 languages: Albanian, Basque, Czech, non-native English, Romanian and Slovak, with different amounts of audio per language. The search utterances, which are relatively short (6.6 seconds long on average), were automatically extracted from longer recordings and manually checked to avoid very short or very long utterances. The QUESST 2014 dataset includes 560 development queries and 555 evaluation queries, the number of queries per language being more or less balanced with the amount of audio available in the search corpus. A big effort has been made to manually record most of the queries, in order to avoid problems observed in previous years due to the acoustic context carried over when queries are cut from longer sentences. Speakers recruited to record the queries were asked to maintain a normal speaking speed and a clear speaking style. All audio files are PCM encoded at 8 kHz, 16 bits/sample, and stored in WAV format.

4. THE GROUND-TRUTH
The biggest novelty in this year's evaluation comes from the new (relaxed) concept of a query match, which strongly affects the ground-truth definition and thus the way systems are expected to work. Besides the "exact matching" used in previous years, two types of "approximate matchings" are considered. We denote these matchings as Type 1, 2 and 3, respectively; they are defined as follows:

Type 1 (Exact): Only occurrences that exactly match the lexical representation of the query are considered as hits, just like in previous years. For example, the query "white horse" would match the utterance "My white horse is beautiful".

Type 2 (Variant): In this case, query occurrences that slightly differ from the lexical representation of the query, either at its beginning or at its end, are considered as hits. Systems therefore need to account for small portions of audio that do not match the query's lexical representation. When producing the ground-truth for this type of matching, the matching part of any query was required to exceed 5 phonemes (250 ms), and the non-matching part was required to be much smaller than the matching part (a simple check of these constraints is sketched at the end of this section). For example, the query "researcher" would match an audio file containing "research" (note that the query "research" would also match an audio file containing "researcher").

Type 3 (Reordering/Filler): Given a multi-word query, a hit is required to contain all the words in the query, but possibly in a different order and with some small amount of filler content between words; slight differences between word occurrences and their lexical representations are also allowed (as in Type 2). For example, the query "white snow" would match an utterance containing either "snow is white", "whitest snow" or "whiter than snow". Note that queries provided in this evaluation are spoken continuously, with no silences between words, and thus participants should develop robust techniques to account for partial matchings. Note also that, when producing the ground-truth for this type of matching, hits were allowed to contain a large amount of filler content between words.

The ground truth was created either manually by native speakers or automatically by speech recognition engines tuned to each particular language, and was provided by the task organizers following the format of NIST's Spoken Term Detection evaluations. The development package contains a general ground-truth folder (the one that must be used to score system results on the development set), which considers all types of matchings, but also three ground-truth folders specific to each type of matching, to allow participants to evaluate their progress on each condition during system development.
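As a purely illustrative aid, the snippet below encodes the quantitative Type 2 constraints stated above (matching part longer than 5 phonemes and 250 ms, non-matching part much smaller than the matching part). The task description does not quantify "much smaller", so the 0.5 ratio, the function name and the approximate phoneme counts in the example are our assumptions, not part of the official ground-truth tooling.

# Sketch of the Type 2 ground-truth constraints described above.
# The "much smaller" condition is not quantified in the task description;
# the 0.5 ratio used here is an arbitrary illustrative choice.

def qualifies_as_type2_hit(match_phones, match_secs, nonmatch_phones,
                           max_nonmatch_ratio=0.5):
    """Return True if a candidate occurrence satisfies the Type 2 criteria.

    match_phones / match_secs: size of the part that matches the query's
    lexical representation, in phonemes and seconds.
    nonmatch_phones: size of the differing prefix or suffix, in phonemes.
    """
    long_enough = match_phones > 5 and match_secs > 0.25   # > 5 phonemes, > 250 ms
    small_difference = nonmatch_phones <= max_nonmatch_ratio * match_phones
    return long_enough and small_difference

# "researcher" vs. an utterance containing "research": the shared part
# ("research", roughly 6 phonemes) is long and the extra suffix ("-er",
# roughly 1 phoneme) is short, so the occurrence qualifies.
print(qualifies_as_type2_hit(match_phones=6, match_secs=0.5, nonmatch_phones=1))  # True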
5. PERFORMANCE METRICS
The primary metric used in QUESST 2014 is the normalized cross entropy cost (Cnxe), already used in SWS 2013 as a secondary metric [1]. This metric has been used for several years in the language and speaker recognition fields to calibrate system scores, and it shows interesting properties. Furthermore, we found experimentally that Cnxe and ATWV performances correlate quite well. A scoring script has been specifically prepared for this year's evaluation, so that NIST software is not required anymore (thanks to Mikel Peñagarikano, from the University of the Basque Country, for creating the scoring script). For the Cnxe scores to be meaningful, participants are requested either to return a score (that will be taken as a log-likelihood ratio) for every (query, audio file) pair, or alternatively, to define a default (floor) score for all the pairs not included in the results file. TWV metrics are computed with the following application parameters: Ptarget = 0.0008, Cfa = 1 and Cmiss = 100. Participants are also required to report on their real-time running factor, hardware characteristics and peak memory requirements, in order to profile the different approaches applied. See [5] for further information on how the metrics work and how they are computed.
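To make the two metrics concrete, here is a minimal sketch of how they can be computed from system output, following the standard definitions of normalized cross-entropy and term-weighted value summarized in [5]: scores are treated as log-likelihood ratios and mapped to posteriors at a detection prior for Cnxe, while ATWV averages the term-weighted value obtained from the Yes/No decisions. The helper names, the aggregated-count input layout and the reuse of Ptarget as the Cnxe prior are our simplifications; the official scoring script remains the reference implementation.

import math

P_TARGET, C_FA, C_MISS = 0.0008, 1.0, 100.0
BETA = (C_FA / C_MISS) * (1.0 / P_TARGET - 1.0)   # about 12.49 at the official settings

def cnxe(target_llrs, nontarget_llrs, p_target=P_TARGET):
    """Normalized cross-entropy of log-likelihood-ratio scores (lower is better).
    The prior used here is an assumption; the evaluation plan [5] fixes the actual one."""
    logit_prior = math.log(p_target / (1.0 - p_target))
    def post(llr):  # posterior probability of a hit, given the score and the prior
        return 1.0 / (1.0 + math.exp(-(llr + logit_prior)))
    xe = -(p_target * sum(math.log2(post(s)) for s in target_llrs) / len(target_llrs)
           + (1.0 - p_target) * sum(math.log2(1.0 - post(s)) for s in nontarget_llrs)
           / len(nontarget_llrs))
    prior_xe = -(p_target * math.log2(p_target)
                 + (1.0 - p_target) * math.log2(1.0 - p_target))
    return xe / prior_xe

def atwv(counts_by_query):
    """counts_by_query: query -> (n_correct_yes, n_true, n_false_alarms, n_nontarget_trials)."""
    values = []
    for n_hit, n_true, n_fa, n_nt in counts_by_query.values():
        if n_true == 0:          # queries with no true occurrence do not contribute
            continue
        p_miss = 1.0 - n_hit / n_true
        p_fa = n_fa / n_nt if n_nt else 0.0
        values.append(1.0 - p_miss - BETA * p_fa)
    return sum(values) / len(values)

As a sanity check, a system returning completely non-informative scores (log-likelihood ratio 0 for every pair) obtains Cnxe close to 1, so informative and well-calibrated systems should fall well below that value, while a perfect ATWV is 1.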
6. ACKNOWLEDGEMENTS
We would like to thank the MediaEval organizers for their support and all the participants for their hard work. Data were provided by the QUESST organizers and by the Technical University of Kosice (TUKE), Slovak Republic. Igor Szöke was supported by the Czech Science Foundation, under project GPP202/12/P567.

7. REFERENCES
[1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes. The Spoken Web Search Task. In Proc. MediaEval 2013 Workshop, 2013.
[2] F. Metze, X. Anguera, E. Barnard, M. Davel, and G. Gravier. Language Independent Search in MediaEval's Spoken Web Search Task. Computer Speech and Language, Special Issue on Information Extraction & Retrieval, 2014.
[3] F. Metze, E. Barnard, M. Davel, C. van Heerden, X. Anguera, G. Gravier, and N. Rajput. The Spoken Web Search Task. In Proc. MediaEval 2012 Workshop, 2012.
[4] N. Rajput and F. Metze. Spoken Web Search. In Proc. MediaEval 2011 Workshop, 2011.
[5] L. J. Rodriguez-Fuentes and M. Penagarikano. MediaEval 2013 Spoken Web Search Task: System Performance Measures. Technical Report TR-2013-1, DEE, University of the Basque Country, 2013. Online: http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf.