<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Florian Metze</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andi Buzo</string-name>
          <email>andi.buzo@upb.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>In this paper, we describe the “Spoken Web Search” Task, which is being held as part of the 2013 MediaEval campaign. The purpose of this task is to perform audio search in multiple languages and acoustic conditions, with very few resources being available for each individual language. This year the data contains audio from nine different languages and is much larger than in previous years, mimicking realistic low/zero-resource settings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The “Spoken Web Search” (SWS) task of MediaEval 2013 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
involves searching for audio content within audio content using an
audio query. The task requires researchers to build a
language-independent audio search system that, given an audio query,
finds the appropriate audio file(s) and the exact location(s) of the
query term within these audio file(s). Evaluation is
performed using standard NIST metrics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] in addition to some
other indicators.
      </p>
      <p>
        The 2013 evaluation expands on the MediaEval 2011 and 2012
“Spoken Web Search” tasks [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ] by increasing the size of the test
dataset and the number of languages (which were recorded in
different acoustic conditions). In addition, a baseline system is
being offered this year to first-time participants as a virtual
kitchen appliance.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. MOTIVATION AND RELATED WORK</title>
      <p>
        Imagine you want to build a simple speech recognition system, or
at least a spoken term detection (STD) or keyword search (KWS)
system, in a new dialect, language or acoustic condition for which
only very few audio examples are available. Maybe there are not
even any transcripts available for that data. Is it possible to do
something useful (e.g. identify the topic of a query) using only the
very limited resources available? Full-fledged speech recognition
may be unrealistic for such a task, and may not even be required to
solve a specific information access or search problem.
This task was originally proposed by IBM Research India, which
provided the 2011 data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In 2012, the evaluation was performed
on new data gathered from 4 different African languages [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The
2012 data is made available to participants to help them in their
system development.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. TASK DESCRIPTION</title>
      <p>Participants receive audio data as well as development and
evaluation (audio) queries, described in more detail below. Only
the occurrences of the development queries in the data are provided.
Participants are required to identify and submit which query (or
queries, from the set of evaluation queries) occur(s) in each
utterance (0-n matches per term, i.e. not every term necessarily
occurs, but multiple matches are possible per utterance). There
may be partial overlap between evaluation and development
queries. In addition, participants are asked to submit their
development output (i.e. the detection of development queries on
the data) for comparison purposes.</p>
      <p>Participants can submit multiple systems, but need to designate
one primary system. Participants are encouraged to submit a
system trained only on data released for the 2013 SWS task, but
are allowed to use any additional resources they might have
available, as long as their use is documented.</p>
      <p>
        For the first time this year, a “Speech Recognition Virtual
Kitchen” appliance [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is made available to participants as a
baseline system to experiment with. This consists of a Linux-based
virtual machine running a complete SWS system.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Development and Evaluation Data</title>
      <p>As a result of a joint effort between several institutions, a
challenging new dataset, together with accompanying queries, has
been put together for the 2013 evaluation. This dataset is
composed of 20 hours of audio in the following 9 languages:
Albanian, Basque, Czech, non-native English, isiXhosa, isiZulu,
Romanian, Sepedi and Setswana. The acoustic recording
conditions are not constant across languages: some recordings were
made with in-room microphones, while others were made on the
street with cellphones. All data has
been converted to 8 kHz/16-bit WAV files. Moreover, the amount
of audio available differs from language to language.
This database is over five times the size of the 2012
database. The development and evaluation queries are mutually
exclusive segments defined within the same data collection. For
this reason, no information on the language being spoken or the
transcription of the files is released with the development runs.
We believe that with such a variety of data the concept of
overfitting to the development set is quite diluted; if anything, it
should be seen as a good thing for systems to be able to take
advantage of knowing the possible acoustics of the test languages.</p>
      <p>Accompanying the dataset, two sets of queries have been created
for use in the development and evaluation, each one containing
two subsets of basic and extended queries. The basic sets, of 500+
queries each, are to be used by participants in their required runs.
In addition, for some of the basic queries, alternative spoken
instances of the same lexical terms have also been gathered and
are made available to participants to be used (together with the
basic queries) in their extended runs. Such extended runs are
intended to represent how results would vary if systems could take
advantage of multiple repeated queries.</p>
      <p>
        In addition to the main database used for this year, the 2012
“African” database [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is also being made available to participants
in the hope that it will help in the development phase. It consists of
over 1580 files and 100 queries each for development and evaluation,
recorded in 4 African languages. Participants should note that the
acoustic conditions of this dataset only match those of a small part
of the 2013 dataset.
      </p>
      <p>
        A "termlist" XML file and a transcription RTTM file are provided
with the development data, following the guidelines of the
NIST STD 2006 evaluation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This year the reference files do not
contain any information regarding the language or the content
spoken in each file; only the locations of the queries are given
in the reference RTTM file. This is done in order not to give
away any extra information about the dataset when releasing the
development data, as it is shared with the evaluation queries.
      </p>
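As a rough illustration of how the reference RTTM file can be consumed, the sketch below parses query locations from RTTM lines. The field layout follows the general NIST RTTM convention (type, file, channel, onset, duration, orthography, ...); the example lines and identifiers are hypothetical, and the exact field usage in the SWS 2013 release may differ.

```python
# Minimal sketch: extract query-term locations from NIST-style RTTM lines.
# LEXEME entries carry (file id, onset, duration, orthography); other
# entry types (e.g. SPKR-INFO) are skipped.
from dataclasses import dataclass


@dataclass
class RttmEntry:
    file_id: str
    onset: float     # seconds from start of file
    duration: float  # seconds
    term: str        # query-term orthography


def parse_rttm(lines):
    """Collect LEXEME entries from an iterable of RTTM lines."""
    entries = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "LEXEME":
            continue
        entries.append(RttmEntry(
            file_id=fields[1],
            onset=float(fields[3]),
            duration=float(fields[4]),
            term=fields[5],
        ))
    return entries


# Hypothetical example lines, for illustration only:
example = [
    "LEXEME sws2013_dev_001 1 4.25 0.61 query0001 lex <NA> 0.5",
    "SPKR-INFO sws2013_dev_001 1 <NA> <NA> <NA> unknown spkr1",
]
hits = parse_rttm(example)
```

Scoring then reduces to matching each system detection against these reference intervals, within the margin of error allowed by the scoring scripts.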
    </sec>
    <sec id="sec-5">
      <title>4. EVALUATION OF RESULTS</title>
      <p>The ground truth for this year has been created in a variety of
ways. Sometimes it has been created manually by native speakers,
while in other cases a speech recognition system has been used to
force-align the transcripts at word level. Note that word
alignments might not be perfect, which is why a margin of error is
allowed by the scoring scripts.</p>
      <p>
        The main evaluation metric this year remains the same as in
previous years, following the principles and using the tools of
NIST's Spoken Term Detection (STD) evaluations. The primary
evaluation metric is ATWV (Actual Term-Weighted Value), as
used in the NIST 2006 STD evaluation
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A scoring package with easy-to-use scripts and an example
scoring setup have been made available to participants with the
development data. This year we are again using a different
scoring operating point, modifying the miss and false alarm costs
to better match the new test data.
      </p>
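For orientation, the sketch below computes a term-weighted value in the spirit of the NIST STD 2006 definition: one minus the average, over terms, of the miss probability plus a weighted false-alarm probability. The default constants (target prior, cost ratio) and the counting of one candidate trial per second of speech are assumptions taken from the 2006 setting; as noted above, the 2013 task modifies the miss and false-alarm costs, so the actual weights differ.

```python
# Sketch of a Term-Weighted Value (TWV) computation, NIST STD 2006 style.
# per_term_stats maps each term to (n_true, n_correct, n_false_alarm).
def twv(per_term_stats, speech_seconds, p_target=1e-4, cost_ratio=0.1):
    # Weight on false alarms; ~999.9 with the 2006 defaults.
    beta = cost_ratio * (1.0 / p_target - 1.0)
    penalties = []
    for n_true, n_correct, n_fa in per_term_stats.values():
        if n_true == 0:
            continue  # terms with no true occurrences are excluded
        p_miss = 1.0 - n_correct / n_true
        # Assumed trial count: one candidate per second of speech,
        # minus the true occurrences of the term.
        n_non_target = speech_seconds - n_true
        p_fa = n_fa / n_non_target
        penalties.append(p_miss + beta * p_fa)
    return 1.0 - sum(penalties) / len(penalties)


# A system with no misses and no false alarms reaches TWV = 1.0:
stats = {"query0001": (3, 3, 0), "query0002": (5, 5, 0)}
print(twv(stats, speech_seconds=72000.0))  # → 1.0
```

Because beta is large, even a handful of false alarms per term can outweigh many correct detections, which is why systems tune their decision thresholds to this operating point.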
      <p>
        In addition, two secondary metrics are being introduced this year.
On the one hand, the normalized cross-entropy metric Cnxe
evaluates the information provided by system scores (in contrast
to TWV, which uses system decisions). This metric originates
from the NIST SRE evaluations and is computed assuming that
submitted scores can be interpreted as log-likelihood ratios. On
the other hand, the real-time factor measures the computational
resources used by the systems. In addition, participants are
requested to indicate the type of machines used in the evaluation
and (approximately) the peak memory usage in order for
organizers to compute a global processing load metric per system.
See [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for a detailed description of these metrics.
      </p>
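A minimal sketch of one possible normalized cross-entropy computation is given below, assuming system scores are log-likelihood ratios (natural log) and a flat target prior: the empirical cross-entropy of the posteriors implied by the scores is divided by the entropy of the prior, i.e. by what a "know-nothing" system achieves. The official SWS 2013 definition is given in [9] and may differ in details such as the prior used.

```python
# Sketch of a normalized cross-entropy (Cnxe) computation over
# log-likelihood-ratio scores for target and non-target trials.
import math


def cnxe(target_llrs, nontarget_llrs, p_target=0.5):
    prior_lo = math.log(p_target / (1.0 - p_target))
    # Mean log-cost of target trials (penalizes low llrs) ...
    c_tar = sum(math.log1p(math.exp(-(s + prior_lo)))
                for s in target_llrs) / len(target_llrs)
    # ... and of non-target trials (penalizes high llrs).
    c_non = sum(math.log1p(math.exp(s + prior_lo))
                for s in nontarget_llrs) / len(nontarget_llrs)
    c_xe = (p_target * c_tar + (1.0 - p_target) * c_non) / math.log(2.0)
    # Normalize by the prior entropy, the cost of ignoring the scores.
    h_prior = -(p_target * math.log2(p_target)
                + (1.0 - p_target) * math.log2(1.0 - p_target))
    return c_xe / h_prior


# All-zero llrs carry no information, so Cnxe is exactly 1:
print(cnxe([0.0, 0.0], [0.0, 0.0]))  # → 1.0
```

Under this convention, well-calibrated informative scores give Cnxe below 1, while miscalibrated scores can push it above 1 even when the underlying detector ranks trials well.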
    </sec>
    <sec id="sec-6">
      <title>5. OUTLOOK</title>
      <p>Low (or even zero) resource speech recognition is currently
receiving a lot of attention and will soon be mature enough to be
useful in real-life scenarios. The “Spoken Web Search” task
originated as an alternative to standard techniques for
low/zero-resourced languages, for which good speech recognizers do
not exist. This year we have extended this paradigm to include
audio data for which not much is known a priori, by mixing several
languages and acoustic conditions in the same test dataset. By
comparing the results obtained by the different systems in this
friendly evaluation, we expect to help push forward the
state-of-the-art in this area.</p>
    </sec>
    <sec id="sec-7">
      <title>6. ACKNOWLEDGMENTS</title>
      <p>
        The SWS task organizers would like to thank the MediaEval
organizers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and all the participants for putting a lot of hard
work into submitting their systems. The “African” data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for this
year and last year was kindly collected by CSIR and made
available by Charl van Heerden at NWU. Igor Szöke was
supported by Grant Agency of the Czech Republic post-doctoral
project No. GPP202/12/P567.
      </p>
      <p>Carnegie Mellon University; Pittsburgh, PA, U.S.A.; fmetze@cs.cmu.edu</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fiscus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ajot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Garofolo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Doddington</surname>
          </string-name>
          ,
          <article-title>"Results of the 2006 Spoken Term Detection Evaluation,"</article-title>
          <source>Proc. ACM SIGIR</source>
          <year>2007</year>
          , Workshop on Searching Spontaneous Conversational Speech (SSCS).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Diao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mukherjea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajput</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>"Faceted Search and Browsing of Audio Content on Spoken Web,"</article-title>
          <source>Proc. CIKM</source>
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] http://www.multimediaeval.org/mediaeval2013/index.html</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Barnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Davel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>van Heerden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajput</surname>
          </string-name>
          , “
          <article-title>The spoken web search task</article-title>
          ”,
          <source>in Workshop notes of MediaEval</source>
          <year>2012</year>
          , Pisa, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Barnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Davel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>van Heerden</surname>
          </string-name>
          ,
          <article-title>"ASR Corpus design for resource-scarce languages,"</article-title>
          <source>in Proc. INTERSPEECH</source>
          , Brighton, UK; Sep.
          <year>2009</year>
          , pp.
          <fpage>2847</fpage>
          -
          <lpage>2850</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajput</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Davel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>van Heerden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Mantena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muscariello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Prahallad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tejedor</surname>
          </string-name>
          .
          “
          <article-title>The Spoken Web Search task at MediaEval 2011</article-title>
          ”.
          <source>In Proc. ICASSP</source>
          , Kyoto; Mar.
          <year>2012</year>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Barnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Davel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          .
          <article-title>The Spoken Web Search task at MediaEval 2012</article-title>
          .
          <source>In Proc. ICASSP</source>
          , Vancouver; May.
          <year>2013</year>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Fosler-Lussier</surname>
          </string-name>
          , “
          <article-title>The Speech Recognition Virtual Kitchen: An Initial Prototype”</article-title>
          ,
          <source>in Proc. Interspeech</source>
          <year>2012</year>
          , Portland, USA
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <article-title>"MediaEval 2013 Spoken Web Search Task: System Performance Measures"</article-title>
          , Technical Report
          <source>TR-2013-1</source>
          , Department of Electricity and Electronics, University of the Basque Country,
          <year>2013</year>
          . Link: http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>