     Query by Example Search on Speech at Mediaeval 2015

Igor Szoke, Brno University of Technology, Brno, Czech Republic, szoke@fit.vutbr.cz
Luis Javier Rodriguez-Fuentes, University of the Basque Country, Leioa, Spain, luisjavier.rodriguez@ehu.es
Andi Buzo, University Politehnica of Bucharest, Bucharest, Romania, andi.buzo@upb.ro
Xavier Anguera, Sinkronigo S.L., Barcelona, Spain, xanguera@gmail.com
Florian Metze, Carnegie Mellon University, Pittsburgh, PA, U.S.A., fmetze@cs.cmu.edu
Jorge Proenca, University of Coimbra, Coimbra, Portugal, jproenca@co.it.pt
Martin Lojka, Technical University of Kosice, Kosice, Slovakia, martin.lojka@tuke.sk
Xiao Xiong, Temasek Laboratories, Nanyang Technological University, Singapore, xiaoxiong@ntu.edu.sg

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.


ABSTRACT
In this paper, we describe the “Query by Example Search on Speech Task” (QUESST), held as part of the MediaEval 2015 evaluation campaign. As in previous years, the proposed task requires performing language-independent audio search in a low-resource scenario. This year, the task has been designed to get as close as possible to a practical use case, in which a user would like to retrieve, using speech, utterances containing a given word or short sentence, including those with limited inflectional variations of words, some filler content and/or word re-orderings. We also stressed the systems with a channel mismatch caused by added noise and reverberation.

1. INTRODUCTION
This is the fifth year of query-by-example search on speech evaluations [9, 6, 1, 2]. The task of QUESST (“QUery by Example Search on Speech Task”) is to search FOR audio content WITHIN audio content USING an audio query. As in previous years, the search database was collected from heterogeneous sources, covering multiple languages and diverse acoustic conditions. Some of these languages are resource-limited, some are recorded in challenging acoustic conditions, and some contain heavily accented speech (typically from non-native speakers). No transcriptions, language tags or any other metadata are provided to participants. The task therefore requires researchers to build a language-independent audio-to-audio search system.

Compared to the previous year, two main changes were introduced for this year’s evaluation. First, we provide queries with longer context, so that participants can use this surrounding speech to adapt their systems. Second, we artificially add noise and reverberation to the data, in order to measure the robustness of particular feature extraction methods and search algorithms under heavy channel mismatch.

As in the previous year, the proposed task does not require the localization (time stamps) of query matchings within audio files. However, systems must provide a score (a real number) for each query matching. The higher (the more positive) the score, the more likely it is that the query appears in the audio file. The normalized cross entropy cost (Cnxe) [3, 10] is used as the primary metric, whereas the Actual Term Weighted Value (ATWV) is kept as a secondary metric for diagnostic purposes, which means that systems must provide not only scores, but also Yes/No decisions. Three types of query matchings are considered: the first one (T1) involves “exact matches”, whereas the second one (T2) allows for inflectional variations of words or word re-orderings (that is, “approximate matches”); the third one (T3) is similar to T2, but queries are drawn from conversational speech, thus containing strong coarticulations and some filler content between words.

2. BRIEF TASK DESCRIPTION
QUESST is part of the MediaEval 2015 evaluation campaign (http://www.multimediaeval.org/mediaeval2015/). As usual, two separate sets of queries were provided, for development and evaluation, along with a single set of audio files on which both sets of queries had to be searched. The set of development queries and the set of audio files were distributed early (April 1st), including the ground truth and the scoring scripts, for the participants to develop and evaluate their systems. The set of evaluation queries was distributed one month later (May 1st). System results (for both sets of queries) had to be returned by the evaluation deadline (July 22nd), including a likelihood score and a Yes/No decision for each (query, audio file) pair. Note that not every query necessarily appears in the set of audio files, and that several queries may appear in the same audio file.

Multiple system results could be submitted (up to 5), but one of them (presumably the best one) had to be identified as primary. Also, although participants were encouraged to train their systems using only the data released for this year’s evaluation, they were allowed to use any additional resources they might have available, as long as their use was documented in their system description papers. System results were then scored and returned to participants (by July 29th), who had to prepare a working notes (two-page) paper describing their systems and return it to the organizers (by August 28th). Finally, systems were presented and results discussed at the MediaEval workshop, which serves to meet fellow participants, to share ideas and to bootstrap future collaborations.
3. THE QUESST 2015 DATASET
The QUESST 2015 dataset is the result of a joint effort by several institutions to put together a sizable amount of data to be used in this evaluation and for later research on the topic of query-by-example search on speech. The search corpus is composed of around 18 hours of audio (11662 files) in the following 7 languages: Albanian, Czech, English, Mandarin, Portuguese, Romanian and Slovak [8], with different amounts of audio per language. The search utterances, which are relatively short (5.8 seconds long on average), were automatically extracted from longer recordings and manually checked to avoid very short or very long utterances. The QUESST 2015 dataset includes 445 development queries and 447 evaluation queries, the number of queries per language being more or less balanced with the amount of audio available in the search corpus. A big effort was made to manually record the queries, in order to avoid problems observed in previous years due to the acoustic context introduced by cutting queries out of longer sentences. Speakers recruited for recording the queries were asked to maintain a normal speaking speed and a clear speaking style. All audio files are PCM encoded at 8 kHz, 16 bits/sample, and stored in WAV format.

The data was then artificially noised and reverberated, yielding equal amounts of clean, noisy, reverberated and noisy+reverberated speech. We used both stationary and transient noises downloaded from https://www.freesound.org. Reverberation was obtained by passing the audio through a filter with an artificially generated room impulse response (RIR) [5].
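For illustration, the following sketch shows one way such an augmentation could be implemented. It is not the organizers’ exact pipeline: the file names, the 10 dB SNR and the RIR source are placeholders (a synthetic RIR could, for instance, be generated with the image method of [5]).

# Minimal sketch of noise + reverberation augmentation (illustrative only;
# not the organizers' exact pipeline). File names and the 10 dB SNR are
# placeholders.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

def add_noise(speech, noise, snr_db):
    """Mix a noise sample into speech at the given SNR (in dB)."""
    # Loop/trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, keeping the original length."""
    return fftconvolve(speech, rir, mode="full")[:len(speech)]

sr, speech = wavfile.read("utterance.wav")   # 8 kHz, 16-bit PCM search utterance
_, noise = wavfile.read("noise.wav")         # e.g. a freesound.org noise sample
speech = speech.astype(np.float64)
noise = noise.astype(np.float64)
rir = np.loadtxt("rir.txt")                  # synthetic RIR, e.g. generated as in [5]

noisy_reverb = add_noise(add_reverb(speech, rir), noise, snr_db=10.0)
out = np.clip(noisy_reverb, -32768, 32767).astype(np.int16)  # avoid wrap-around
wavfile.write("utterance_noisy_reverb.wav", sr, out)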
4. THE GROUND-TRUTH
Similarly to last year’s evaluation, we have applied a relaxed concept of a query match, which strongly affects the ground-truth definition and thus the way systems are expected to work. Besides “exact matchings” (Type 1), two types of “approximate matchings” (Types 2 and 3) are considered, which are defined as follows:
Type 1 (Exact match): Occurrences of single or multiple word queries in utterances should exactly match the lexical representation of the query. An example of this case is the query “white horse”, which should match the utterance “My white horse is beautiful” but should not match “The whiter horse is faster”.
Type 2 (Re-ordering and small lexical variations): Here the search algorithm should cope with:
• Lexical variations. Occurrences of single/multiple word queries might differ slightly, with small portions of audio either at the beginning or the end of the segment that do not match the lexical form of the reference. An example of this type of search would be “researcher” matching an utterance containing “research” (note that the inverse would also be possible).
• Word re-orderings and small filler content between words. For example, when searching for the query “white horse”, systems should be able to match both “My horse is white” and “I have two white and beautiful horses”. Note that the matching words may also contain slight variations with regard to the lexical form of the query.
Type 3 (Conversational queries in context): This type of search is another step towards realistic use-case scenarios. In this case, the spoken query is just part of a sentence that may contain silent/filled pauses and irrelevant words. For example, “Google, let me find some red [uh] white [pau] horse to ride today” could be one of these complex queries. As it is extremely difficult to distinguish between query words (“white [pau] horse”) and “fillers” (“Google, let me find some red [uh]” and “to ride today”), we provide timing meta-data of the relevant segment inside the spoken query.
As Type 3 queries required timing meta-data, all the queries were recorded within a context and timing information was provided for all of them. The context of Types 1 and 2 was “artificial”: speakers were asked to say several words before and after the query, with significant pauses around the query to avoid coarticulation. Participants are free to use the “context” of the spoken query (e.g. for adaptation).
The ground truth was created either manually by native speakers or automatically by speech recognition engines tuned to each particular language, and was provided by the task organizers following the format of NIST’s Spoken Term Detection evaluations. The development package contains a general ground-truth folder (the one that must be used to score system results on the development set), which considers all types of matchings, but also three ground-truth folders, one per type of matching, to allow participants to evaluate their progress on each condition during system development.
5. PERFORMANCE METRICS
In QUESST 2015, Cnxe and ATWV are used as primary and secondary metrics, respectively. For the Cnxe scores to be meaningful, participants are requested either to return a score (which will be taken as a log-likelihood ratio) for every (query, audio file) pair, or alternatively, to define a default (floor) score for all the pairs not included in the results file. Participants are also required to report their real-time running factor, hardware characteristics and peak memory requirements, in order to profile the different approaches applied. See [10] for further information on how the metrics work and how they are computed.
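As a rough illustration of the primary metric (see [10] for the exact formulation and for the operating parameters, such as the target prior, used in the official scoring): Cnxe compares the empirical cross entropy of the submitted log-likelihood-ratio scores with that of a trivial system that only knows the prior, so a perfect system approaches 0 and a non-informative one stays around 1. ATWV, in turn, averages per-query miss and (weighted) false alarm rates over all queries, using the submitted Yes/No decisions. The sketch below is a simplified Cnxe computation under that definition; the prior value is a placeholder, not the evaluation setting.

# Simplified sketch of the normalized cross entropy cost (Cnxe), following the
# general definition in [10]; the prior (0.5) is a placeholder, not the value
# used in the official scoring.
import numpy as np

def cnxe(llr_target, llr_nontarget, p_target=0.5):
    """llr_*: arrays of log-likelihood-ratio scores for target / non-target trials."""
    theta = np.log(p_target / (1.0 - p_target))            # prior log odds
    # Empirical cross entropy of the posteriors implied by the scores.
    c_target = np.mean(np.log2(1.0 + np.exp(-(llr_target + theta))))
    c_nontarget = np.mean(np.log2(1.0 + np.exp(llr_nontarget + theta)))
    c_xe = p_target * c_target + (1.0 - p_target) * c_nontarget
    # Cross entropy of a trivial system that only knows the prior.
    c_prior = -(p_target * np.log2(p_target)
                + (1.0 - p_target) * np.log2(1.0 - p_target))
    return c_xe / c_prior

# Example: scores for (query, audio file) pairs split according to the ground truth.
# cnxe(np.array([2.1, 0.3]), np.array([-1.5, -3.0, 0.2]))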
6. PROVIDED TOOLS
We offered some basic tools to make the participants’ “first contact” with QUESST easier. We provided a Bottle-Neck feature extraction tool [4] trained on Russian and Hungarian SpeechDat-E data. Next, a calibration and fusion script based on logistic regression [11] and a DTW search tool [12], both developed at BUT, were provided. Finally, the data and all the scripts were set up in a Virtual Machine, which was provided to the participants through the Speech Recognition Virtual Kitchen (http://speechkitchen.org/, [7]).
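To make the role of the DTW search concrete, the sketch below shows the core of a subsequence-DTW query-by-example scorer in the spirit of [12]; it is not the released BUT tool, and all names are our own. It takes two feature matrices (e.g. bottleneck features [4], one frame per row) and returns a score in which higher means a more likely match; such scores would then typically be calibrated and fused across systems, e.g. with the logistic regression script [11].

# Minimal subsequence-DTW sketch for query-by-example search (illustrative;
# not the released BUT tool).
import numpy as np

def qbe_dtw_score(query, utterance):
    """Return a match score (higher = better) for a query inside an utterance."""
    # Cosine distance matrix between query frames and utterance frames.
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-12)
    u = utterance / (np.linalg.norm(utterance, axis=1, keepdims=True) + 1e-12)
    dist = 1.0 - q @ u.T                          # shape: (len_query, len_utt)

    nq, nu = dist.shape
    acc = np.full((nq, nu), np.inf)
    acc[0, :] = dist[0, :]                        # the match may start anywhere
    for i in range(1, nq):
        acc[i, 0] = dist[i, 0] + acc[i - 1, 0]
        for j in range(1, nu):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # query frame repeated
                                         acc[i, j - 1],      # utterance frame skipped
                                         acc[i - 1, j - 1])  # diagonal step
    # The path may end at any utterance frame; normalize (crudely) by query
    # length and flip the sign so that higher scores mean more likely matches.
    return -np.min(acc[-1, :]) / nq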
7. ACKNOWLEDGEMENTS
We would like to thank the MediaEval organizers for their support and all the participants for their hard work. Portuguese data were provided thanks to the Tecnovoz project PMDT No. 03/165.
8.   REFERENCES
 [1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J.
     Rodriguez-Fuentes. The Spoken Web Search Task. In
     Proc. Mediaeval 2013 Workshop.
 [2] X. Anguera, L. J. Rodriguez-Fuentes, I. Szoke,
     A. Buzo, and F. Metze. Query by Example Search on
     Speech at Mediaeval 2014. In Proc. Mediaeval 2014
     Workshop.
 [3] X. Anguera, L.-J. Rodriguez-Fuentes, A. Buzo,
     F. Metze, I. Szoke, and M. Penagarikano.
     QUESST2014: Evaluating Query-by-Example Speech
     Search in a Zero-Resource Setting with Real-Life
     Queries. In Proc. ICASSP, pages 5833–5837, 2015.
 [4] F. Grézl, M. Karafiát, S. Kontár, and J. Černocký.
     Probabilistic and bottle-neck features for LVCSR of
     meetings. In Proc. IEEE International Conference on
     Acoustics, Speech and Signal Processing (ICASSP
     2007), pages 757–760. IEEE Signal Processing Society,
     2007.
 [5] S. G. McGovern. Fast image method for impulse
     response calculations of box-shaped rooms. Applied
     Acoustics, 70(1):182–189, 2009.
 [6] F. Metze, E. Barnard, M. Davel, C. van Heerden,
     X. Anguera, G. Gravier, and N. Rajput. The Spoken
     Web Search Task. In Proc. Mediaeval 2012 Workshop.
 [7] F. Metze, E. Riebling, E. Fosler-Lussier, A. Plummer,
     and R. Bates. The speech recognition virtual kitchen
     turns one. In Proc. INTERSPEECH (to appear). ISCA, 2015.
 [8] M. Pleva and J. Juhar. TUKE-BNews-SK: Slovak
     broadcast news corpus construction and evaluation. In
     N. Calzolari (Conference Chair), K. Choukri, T. Declerck, H. Loftsson,
     B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and
     S. Piperidis, editors, Proceedings of the Ninth
     International Conference on Language Resources and
     Evaluation (LREC’14), Reykjavik, Iceland, May 2014.
     European Language Resources Association (ELRA).
 [9] N. Rajput and F. Metze. Spoken Web Search. In Proc.
     Mediaeval 2011 Workshop.
[10] L. J. Rodriguez-Fuentes and M. Penagarikano.
     MediaEval 2013 Spoken Web Search Task: System
     Performance Measures. Technical Report TR-2013-1,
     DEE, University of the Basque Country, 2013. Online:
     http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf.
[11] I. Szoke, L. Burget, F. Grézl, J. Černocký, and
     L. Ondel. Calibration and fusion of query-by-example
     systems - BUT SWS 2013. In Proceedings of ICASSP
     2014, pages 7899–7903. IEEE Signal Processing
     Society, 2014.
[12] I. Szoke, M. Skacel, L. Burget, and J. H. Cernocky.
     Coping with Channel Mismatch in Query-by-Example
     - BUT QUESST 2014. In Proc. ICASSP, pages
     5838–5843, 2015.