<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Query by Example Search on Speech at Mediaeval 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xavier Anguera</string-name>
          <email>xanguera@tid.es</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Javier Rodriguez-Fuentes</string-name>
          <email>luisjavier.rodriguez@ehu.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor Szöke</string-name>
          <email>szoke@fit.vutbr.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andi Buzo</string-name>
          <email>andi.buzo@upb.ro</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Metze</string-name>
          <email>fmetze@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Brno University of Technology</institution>
          ,
          <addr-line>Brno</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">U.S.A.</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of the Basque Country</institution>
          ,
          <addr-line>Leioa</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Telefonica Research</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <addr-line>Bucharest</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>In this paper, we describe the "Query by Example Search on Speech Task" (QUESST, formerly SWS, "Spoken Web Search"), held as part of the MediaEval 2014 evaluation campaign. As in previous years, the proposed task requires performing language-independent audio search in a low-resource scenario. This year, the task has been designed to get as close as possible to a practical use case, in which a user would like to retrieve, using speech, utterances containing a given word or short sentence, including those with limited inflectional variations of words, some filler content and/or word re-orderings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        After three years running as SWS ("Spoken Web Search")
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref6">4, 3, 1, 2</xref>
        ], the task has been renamed to QUESST ("QUery
by Example Search on Speech Task") to better reflect its
nature: to search FOR audio content WITHIN audio content
USING an audio query. As in previous years, the search
database was collected from heterogeneous sources,
covering multiple languages, and under diverse acoustic
conditions. Some of these languages are resource-limited, some
are recorded in challenging acoustic conditions and some
contain heavily accented speech (typically from non-native
speakers). No transcriptions, language tags or any other
metadata are provided to participants. The task therefore
requires researchers to build a language-independent
audio-to-audio search system. As in previous years, the database
will be made publicly available for research purposes after
the evaluation concludes.
      </p>
      <p>
        Three main changes were introduced for this year's
evaluation, affecting the search task, the evaluation
metrics, and the types of query matchings. First, the task
no longer requires the localization (time stamps) of query
matchings within audio files (which, on the other hand, are
relatively short: less than 30 seconds long). However,
systems must provide a score (a real number) for each query
matching: the higher (the more positive) the score, the more
likely it is that the query appears in the audio file. Second,
the normalized cross entropy cost (C<sub>nxe</sub>) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is used as the
primary metric, whereas the Actual Term Weighted Value
(ATWV, used as primary metric in previous years) is kept
as a secondary metric for diagnostic purposes, which means
that systems must provide not only scores, but also Yes/No
decisions. Third, three types of query matchings are
considered: the first one is the "exact match" case used in
previous years, whereas the second one, which allows for
inflectional variations of words, and the third one, which
allows for word re-orderings and some filler content between
words, are "approximate matches" that simulate how we
imagine that users would want to use this technology.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. BRIEF TASK DESCRIPTION</title>
      <p>QUESST is part of the Mediaeval 2014 evaluation
campaign<sup>1</sup>. As usual, two separate sets of queries are provided,
for development and evaluation, along with a single set of
audio files, on which both sets of queries must be searched.
The set of development queries and the set of audio files
are distributed early (June 2nd), including the ground truth
and the scoring scripts, for the participants to develop and
evaluate their systems. The set of evaluation queries is
distributed one month later (July 1st). System results (for both
sets of queries) must be returned by the evaluation deadline
(September 9th), including a likelihood score and a Yes/No
decision for each pair (query, audio file). Note that not
every query necessarily appears in the set of audio files, and
that several queries may appear in the same audio file. Also,
there could be some overlap between evaluation and
development queries. Multiple system results can be submitted
(up to 5), but one of them (presumably the best one) must
be identified as primary. Also, although participants are
encouraged to train their systems using only the data released
for this year's evaluation, they are allowed to use any
additional resources they might have available, as long as their
use is documented in their system papers. System results
are then scored and returned to participants (by September
16th), who must prepare a two-page working notes paper
describing their systems and return it to the organizers (by
September 28th). Finally, systems are presented and results
discussed in the Mediaeval workshop, which serves to meet
fellow participants, to share ideas and to bootstrap future
collaborations.</p>
      <p><sup>1</sup> http://www.multimediaeval.org/mediaeval2014/</p>
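      <p>As a rough illustration of the requirement that every (query, audio file) pair receives a score and a Yes/No decision, the following minimal sketch fills in pairs that a system did not score explicitly with a default floor score (an option the task allows, as described in the section on performance metrics). The function name, the in-memory layout, the identifiers and the threshold are hypothetical; the actual results file format is defined by the task's scoring scripts and is not reproduced here.</p>
      <preformat><![CDATA[
```python
def complete_results(scores, queries, audio_files, floor_score, threshold):
    """Return one (query, audio_file, score, decision) row per pair.

    scores maps (query, audio_file) -> float for the pairs a system
    actually scored; every other pair receives floor_score.
    The layout is illustrative, not the official results format.
    """
    rows = []
    for q in queries:
        for f in audio_files:
            s = scores.get((q, f), floor_score)
            # Hard decision derived from the score (threshold is assumed).
            decision = "YES" if s > threshold else "NO"
            rows.append((q, f, s, decision))
    return rows


# Toy usage with made-up identifiers and scores.
rows = complete_results(
    scores={("q1", "a1"): 2.3, ("q2", "a2"): -0.7},
    queries=["q1", "q2"],
    audio_files=["a1", "a2"],
    floor_score=-10.0,
    threshold=0.0,
)
```
]]></preformat>
      <p>In this toy example only two of the four pairs were scored by the system; the other two are emitted with the floor score and a "NO" decision.</p>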
    </sec>
    <sec id="sec-3">
      <title>3. THE QUESST 2014 DATASET</title>
      <p>The QUESST 2014 dataset<sup>2</sup> is the result of a joint effort
by several institutions to put together a sizable amount of
data to be used in this evaluation and for later research on
the topic of query-by-example search on speech. The search
corpus is composed of around 23 hours of audio (12492
files) in the following 6 languages: Albanian, Basque, Czech,
non-native English, Romanian and Slovak, with different
amounts of audio per language. The search utterances, which
are relatively short (6.6 seconds long on average), were
automatically extracted from longer recordings and manually
checked to avoid very short or very long utterances. The
QUESST 2014 dataset includes 560 development queries and
555 evaluation queries, the number of queries per language
being more or less balanced with the amount of audio
available in the search corpus. A big effort has been made to
manually record most of the queries, in order to avoid
problems observed in previous years due to acoustic context
derived from cutting the queries from longer sentences.
Speakers recruited for recording the queries were asked to maintain
a normal speaking speed and a clear speaking style. All
audio files are PCM encoded at 8 kHz, 16 bits/sample, and
stored in WAV format.</p>
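      <p>A participant may want to verify that a downloaded file matches the stated encoding before feature extraction. The sketch below uses Python's standard wave module to check the sampling rate and sample width and to report the duration; the function name and the path argument are hypothetical, and the number of channels is not checked because the text does not state it.</p>
      <preformat><![CDATA[
```python
import wave


def check_quesst_wav(path, expected_rate=8000, expected_sampwidth=2):
    """Check sampling rate and sample width of a PCM WAV file.

    Returns (ok, duration_in_seconds). expected_sampwidth is in bytes,
    so 2 corresponds to 16 bits/sample.
    """
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        width = w.getsampwidth()
        duration = w.getnframes() / float(rate)
    ok = (rate == expected_rate) and (width == expected_sampwidth)
    return ok, duration
```
]]></preformat>
      <p>The returned duration can also be used to confirm that search utterances stay below the 30-second bound mentioned in the introduction.</p>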
    </sec>
    <sec id="sec-4">
      <title>4. THE GROUND-TRUTH</title>
      <p>The biggest novelty in this year's evaluation comes from
the new (relaxed) concept of a query match, which strongly
affects the ground-truth definition and thus the way systems
are expected to work. Besides the "exact matching" used in
previous years, two types of "approximate matchings" are
considered. We denote these matchings as Type 1, 2 and
3, respectively; they are defined as follows:</p>
      <p>Type 1 (Exact): Only occurrences that exactly match the
lexical representation of the query are considered as
hits, just like in previous years. For example, the query
"white horse" would match the utterance "My white
horse is beautiful".</p>
      <p>Type 2 (Variant): In this case, occurrences that
differ slightly from the query's lexical representation, either at
the beginning or at the end of the query, are
considered as hits. Systems therefore need to account for
small portions of audio that do not match the query's
lexical representation. When producing the ground-truth
for this type of matchings, the matching part of any
query was required to exceed 5 phonemes (250 ms),
and the non-matching part was required to be much
smaller than the matching part. For example, the
query "researcher" would match an audio file
containing "research" (note that the query "research" would
also match an audio file containing "researcher").</p>
      <p>Type 3 (Reordering/Filler): Given a multi-word query,
a hit is required to contain all the words in the query,
but possibly in a different order and with some small
amount of filler content between words; slight
differences between word occurrences and their lexical
representations are also allowed (like in Type 2). For
example, the query "white snow" would match an utterance
containing either "snow is white", "whitest snow" or
"whiter than snow". Note that queries provided in this
evaluation are spoken continuously, with no silences
between words, and thus participants should develop
robust techniques to account for partial matchings.
Note also that, when producing the ground-truth for
this type of matchings, hits were allowed to contain
a large amount of filler content between words.</p>
      <p><sup>2</sup> A download link will be provided after the evaluation.</p>
      <p>The ground truth was created either manually by
native speakers or automatically by speech recognition engines
tuned to each particular language, and provided by the task
organizers, following the format of NIST's Spoken Term
Detection evaluations. The development package contains a
general ground-truth folder (the one that must be used to
score system results on the development set) which
considers all types of matchings, but also three ground-truth
folders specific to each type of matchings, to allow participants
to evaluate their progress on each condition during system
development.</p>
    </sec>
    <sec id="sec-5">
      <title>5. PERFORMANCE METRICS</title>
      <p>
        The primary metric used in QUESST 2014 is the
normalized cross entropy cost (C<sub>nxe</sub>), already used in SWS 2013
as a secondary metric [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This metric has been used for
several years in the language and speaker recognition fields
to calibrate system scores, and shows interesting
properties. Furthermore, we found experimentally that C<sub>nxe</sub> and
ATWV performances correlate quite well. A scoring script
has been specifically prepared for this year's evaluation, so
that NIST software is not required anymore<sup>3</sup>. For the C<sub>nxe</sub>
scores to be meaningful, participants are requested either to
return a score (that will be taken as a log-likelihood ratio)
for every pair (query, audio file), or alternatively, to define
a default (floor) score for all the pairs not included in the
results file. TWV metrics are computed with the
following application parameters: P<sub>target</sub> = 0.0008, C<sub>fa</sub> = 1 and
C<sub>miss</sub> = 100. Participants are also required to report on
their real-time running factor, hardware characteristics and
peak memory requirements, in order to profile the different
approaches applied. See [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for further information on how
the metrics work and how they are computed.
      </p>
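      <p>To make the two metrics more concrete, the following minimal sketch computes a normalized cross entropy from log-likelihood-ratio scores and the false-alarm weight used in the term weighted value. It follows the common definition of normalized cross entropy from the speaker and language recognition literature and the usual TWV form 1 - (P_miss + beta * P_fa); the official scoring script and [5] remain the authoritative definitions and may differ in details such as per-query averaging. Reusing P_target = 0.0008 as the prior for the cross entropy is an assumption, and all function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def _neg_log2_sigmoid(x):
    """Return -log2(sigmoid(x)) = log2(1 + exp(-x)), computed stably."""
    if x >= 0:
        return math.log1p(math.exp(-x)) / math.log(2)
    return (-x + math.log1p(math.exp(x))) / math.log(2)


def cnxe(target_llrs, nontarget_llrs, p_target=0.0008):
    """Normalized cross entropy of log-likelihood-ratio scores.

    Both score lists must be non-empty. A value of 1.0 means the scores
    carry no more information than the prior alone (assumed prior:
    p_target).
    """
    prior_logit = math.log(p_target / (1.0 - p_target))
    # Posterior log-odds of a target trial = LLR + prior log-odds.
    c_tar = sum(_neg_log2_sigmoid(s + prior_logit) for s in target_llrs)
    c_tar /= len(target_llrs)
    c_non = sum(_neg_log2_sigmoid(-(s + prior_logit)) for s in nontarget_llrs)
    c_non /= len(nontarget_llrs)
    c_xe = p_target * c_tar + (1.0 - p_target) * c_non
    # Cross entropy of a system that outputs LLR = 0 for every trial.
    c_prior = -(p_target * math.log2(p_target)
                + (1.0 - p_target) * math.log2(1.0 - p_target))
    return c_xe / c_prior


def twv_beta(p_target=0.0008, c_fa=1.0, c_miss=100.0):
    """Weight of false alarms relative to misses in the TWV formula."""
    return (c_fa / c_miss) * (1.0 - p_target) / p_target


def twv(p_miss, p_fa, beta):
    """Term weighted value for given miss and false-alarm rates."""
    return 1.0 - (p_miss + beta * p_fa)
```
]]></preformat>
      <p>With the application parameters given above, twv_beta() evaluates to about 12.49, and a system that returns a score of zero for every pair obtains a normalized cross entropy of 1 under this definition.</p>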
    </sec>
    <sec id="sec-6">
      <title>6. ACKNOWLEDGEMENTS</title>
      <p>We would like to thank the Mediaeval organizers for their
support and all the participants for their hard work. Data
were provided by QUESST organizers and by the
Technical University of Kosice (TUKE), Slovak Republic. Igor
Szöke was supported by the Czech Science Foundation,
under project GPP202/12/P567.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          .
          <article-title>The Spoken Web Search Task</article-title>
          .
          <source>In Proc. Mediaeval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Barnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Davel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          .
          <article-title>Language Independent Search in MediaEval's Spoken Web Search Task</article-title>
          .
          <source>Computer Speech and Language, Special Issue on Information Extraction &amp; Retrieval</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
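          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Barnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Davel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>van Heerden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajput</surname>
          </string-name>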
          .
          <article-title>The Spoken Web Search Task</article-title>
          .
          <source>In Proc. Mediaeval 2012 Workshop</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajput</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <article-title>Spoken Web Search</article-title>
          .
          <source>In Proc. Mediaeval 2011 Workshop</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          .
          <article-title>MediaEval 2013 Spoken Web Search Task: System Performance Measures</article-title>
          .
          <source>Technical Report TR-2013-1</source>
          , DEE, University of the Basque Country,
          <year>2013</year>
          . Online: http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <sup>3</sup> Thanks to Mikel Peñagarikano, from the University of the Basque Country, for creating the scoring script.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>