<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Telefonica Research Spoken Web Search System for MediaEval 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xavier Anguera</string-name>
          <email>xanguera@tid.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miroslav Skácel</string-name>
          <email>Speech@FIT</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volker Vorwerk</string-name>
          <email>volker.vorwerk@altran.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jordi Luque</string-name>
          <email>jls@tid.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Altran</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Brno</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Telefonica Research</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>In this paper we describe the system proposed by Telefonica Research for the Spoken Web Search (SWS) task [3] within the MediaEval 2013 evaluation. This is the third year we have participated in the evaluation, and this time we submitted a system based on the recently proposed Information Retrieval-based Dynamic Time Warping (IR-DTW) algorithm. This algorithm performs a pattern matching search at frame level, similar to the DTW algorithm, but with advantages in memory usage and the possibility to pre-index the search corpora and use fast retrieval techniques. Results obtained this year have been poorer than expected, most probably due to the use of a global voice activity detector that was not adequate for the varying acoustic conditions in this year's search corpora.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In this paper we present the Telefonica Research system
proposed for the Spoken Web Search (SWS) task within
the MediaEval 2013 evaluation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The task of
Query-by-Example Spoken Term Detection (QbE-STD), also known
as Spoken Audio Search (SAS), has gained considerable
interest in recent years within the scientific community, as it
allows information to be obtained from audio documents
whose language and/or acoustic conditions do not match
those for which plenty of resources are available, and
for which systems based on supervised training techniques
usually cannot be employed.
      </p>
      <p>
        To tackle this year's evaluation we have implemented a
zero-resource system (no external data is used) based on a
frame-based pattern matching approach using the recently
proposed IR-DTW algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The IR-DTW algorithm
is inspired by the subsequence-DTW algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], to which
we add the option to pre-index the search corpus for faster
retrieval and a dynamic programming algorithm inspired by
information retrieval techniques, which allows us to perform
a time-warped matching with very limited memory
requirements. Results for this year's evaluation are poorer than
we expected, partly due to the use of a global voice activity
detector that is not well suited to the varying acoustic
conditions of this year's test data.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. THE SWS SYSTEM DESCRIPTION</title>
      <p>
        Figure 1 shows the main components of the system we
presented this year. We first perform feature extraction on
the audio signal to obtain standard MFCC features (12
cepstral coefficients plus energy, with their Δ and ΔΔ
derivatives), which are mean and variance
normalized. Then, we obtain 64-dimensional posterior
probabilities from these features using a modified background
model described in Section 2.1. Optionally, the search
corpus features can be indexed into a hierarchical tree structure
to speed up later search, as described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Next, we label
the audio frames as speech or non-speech using a global voice
activity detector (Section 2.2) and exclude the non-speech
frames from the pattern matching steps. We then apply the
IR-DTW algorithm (Section 2.3) to find all matches for
every given query sequence. Finally, the top 500 matches are
processed using a more exhaustive local subsequence-DTW
matching to obtain exact start and end points and their scores.
Z-norm is applied to the final set of results to make scores
comparable across queries. When multiple instances of a
query are available, we run the standard system on each
instance and then fuse the results as described in Section 2.4.
      </p>
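      <p>As a minimal sketch of the two normalization steps mentioned above (mean and variance normalization of the features, and Z-norm of the final scores), the following Python code assumes that feature statistics are computed per utterance and score statistics per query. Neither scope is stated in the text, and the function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def mean_variance_normalize(features, eps=1e-8):
    """Scale every feature dimension to zero mean and unit variance.

    features: array of shape (num_frames, num_dims), e.g. 39-dim MFCCs.
    Assumption: statistics are computed over the frames passed in.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)


def z_norm_scores(scores, eps=1e-8):
    """Z-normalize the detection scores obtained for one query."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + eps)


# Synthetic stand-in for one utterance of 39-dimensional features.
rng = np.random.default_rng(0)
mfcc = rng.normal(loc=3.0, scale=2.0, size=(200, 39))
normalized = mean_variance_normalize(mfcc)
```
]]></preformat>
      <p>After Z-norm, the scores of each query have zero mean and unit variance, so a single decision threshold can be shared across queries.</p>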
    </sec>
    <sec id="sec-3">
      <title>2.1 Background Model Creation</title>
      <p>
        The use of a Gaussian Mixture Model (GMM) to
obtain posterior probability features from the input features
was introduced to QbE-STD by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we proposed an
alternative to the standard GMM by training a
background model that focuses on creating more discriminative
Gaussian mixtures. A novelty we introduce this year in
our system is inspired by last year's system submission from
CUHK [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], where VTLN is applied to obtain a vocal tract
normalized GMM. In our system we initially train
a background model as described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] using all search
corpus data. Then we estimate VTLN coefficients for each
utterance in the corpus using this model. The
VTLN-normalized features are then used to train a new background
model from scratch. The process is repeated 3-4 times until
convergence.
      </p>
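      <p>To illustrate how posterior probability features can be derived from a trained background model, the sketch below computes per-frame Gaussian posteriors with NumPy. It assumes a diagonal-covariance GMM whose parameters are already available, and it covers neither the discriminative training of [1] nor the VTLN re-estimation loop. The function name and array layout are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def gmm_posteriors(features, weights, means, variances):
    """Posterior probability of every Gaussian component for every frame.

    features:  (T, D) feature frames
    weights:   (K,)   mixture weights
    means:     (K, D) component means
    variances: (K, D) diagonal covariances
    Returns a (T, K) array whose rows sum to one.
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    diff = features[:, None, :] - means[None, :, :]              # (T, K, D)
    mahalanobis = np.sum(diff ** 2 / variances[None], axis=2)    # (T, K)
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (K,)
    log_lik = np.log(weights)[None, :] - 0.5 * (log_norm[None, :] + mahalanobis)

    # Subtract the row maximum before exponentiating for numerical stability.
    log_lik -= log_lik.max(axis=1, keepdims=True)
    posteriors = np.exp(log_lik)
    return posteriors / posteriors.sum(axis=1, keepdims=True)


# A random 64-component model over 39-dimensional features (synthetic data).
rng = np.random.default_rng(0)
posteriorgram = gmm_posteriors(
    rng.normal(size=(100, 39)),
    np.full(64, 1.0 / 64),
    rng.normal(size=(64, 39)),
    np.ones((64, 39)),
)
```
]]></preformat>
      <p>With 64 components, each frame is mapped to a 64-dimensional vector, which matches the dimensionality of the posterior features used by the system.</p>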
    </sec>
    <sec id="sec-4">
      <title>2.2 Voice Activity Detection</title>
      <p>It is well known that silence/non-speech segments are not
suitable for frame-based pattern matching, as they tend
to generate many false alarms. For this reason, it is
necessary to detect such frames and mask them out in subsequent
matching steps. To determine whether a frame is speech
or non-speech we use two global GMMs trained on
the whole search corpus, one for speech and one for non-speech. The
models are initialized using the 10% lowest-energy
frames for the non-speech model and the rest for the
speech model, followed by several decoding and retraining
iterations. For the evaluation we used 2 Gaussians for
non-speech and 16 Gaussians for speech. Each input
frame is assigned to speech or non-speech through hypothesis
testing with both models. A final morphological filter is run
to smooth the speech and non-speech regions.</p>
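      <p>The first and last steps of this procedure can be sketched as follows. The code labels the 10% lowest-energy frames as non-speech to seed the two models, and smooths a frame-level decision with a morphological closing followed by an opening. The GMM decoding and retraining iterations are omitted, and the filter type, the window width and all function names are assumptions, since the text does not specify them.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def init_vad_labels(energy, silence_fraction=0.10):
    """Initial labels (True = speech): lowest-energy frames are non-speech."""
    energy = np.asarray(energy, dtype=float)
    num_silence = int(round(silence_fraction * energy.size))
    lowest = np.argsort(energy, kind="stable")[:num_silence]
    labels = np.ones(energy.size, dtype=bool)
    labels[lowest] = False
    return labels


def _windows(mask, width):
    """Centered sliding windows over the mask, with edge padding."""
    padded = np.pad(np.asarray(mask, dtype=bool), width // 2, mode="edge")
    return sliding_window_view(padded, width)


def dilate(mask, width):
    return _windows(mask, width).any(axis=1)


def erode(mask, width):
    return _windows(mask, width).all(axis=1)


def smooth_labels(labels, width=5):
    """Closing fills short non-speech gaps; opening removes short speech bursts."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    closed = erode(dilate(labels, width), width)
    return dilate(erode(closed, width), width)
```
]]></preformat>
      <p>In this sketch the window width controls the shortest speech or non-speech region that survives the smoothing.</p>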
    </sec>
    <sec id="sec-5">
      <title>2.3 IR-DTW and Overlap Postprocessing</title>
      <p>
        The IR-DTW algorithm used is an evolution of the
algorithm presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. IR-DTW performs
sparse dynamic programming pattern matching
only on the pairs of query and search corpus frames that exceed
a certain similarity threshold, combining them to form
possible sequences matching both query and corpus utterances.
In addition, by projecting the similarity matrix
into a one-dimensional structure, the IR-DTW algorithm
has a much smaller memory requirement. Among all possible
matching subsequences found in the search corpora, many
overlap in time with each other. We filter out these
overlapping paths in two steps: first we merge all paths
that have the same start time both in the query and in the
search corpora, and then we merge those matching paths that
overlap heavily (by more than 50%) with each other.
      </p>
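      <p>The two-step overlap filtering can be sketched as below for the matches found in a single utterance. Several details are assumptions rather than part of the description above: merging is implemented by keeping the highest-scoring path, the overlap is measured relative to the shorter of the two paths, and the <monospace>Match</monospace> fields are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    query_start: int   # first matched query frame
    corpus_start: int  # first matched corpus frame
    corpus_end: int    # one past the last matched corpus frame
    score: float       # higher is better


def overlap_ratio(a, b):
    """Temporal overlap in the corpus, relative to the shorter match."""
    shared = min(a.corpus_end, b.corpus_end) - max(a.corpus_start, b.corpus_start)
    shorter = min(a.corpus_end - a.corpus_start, b.corpus_end - b.corpus_start)
    if shared <= 0 or shorter <= 0:
        return 0.0
    return shared / shorter


def merge_paths(matches, max_overlap=0.5):
    # Step 1: paths sharing the same query and corpus start collapse into one.
    best_by_start = {}
    for m in matches:
        key = (m.query_start, m.corpus_start)
        if key not in best_by_start or m.score > best_by_start[key].score:
            best_by_start[key] = m

    # Step 2: visit paths from best to worst and drop any path that overlaps
    # an already kept path by more than max_overlap.
    kept = []
    for m in sorted(best_by_start.values(), key=lambda m: m.score, reverse=True):
        if all(overlap_ratio(m, k) <= max_overlap for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.corpus_start)
```
]]></preformat>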
    </sec>
    <sec id="sec-6">
      <title>2.4 Multiple Query Instances Fusion</title>
      <p>In addition to the regular system, we implemented a
straightforward fusion to take advantage of cases when multiple
instances of a query are available. In such cases we run the
system independently for each query instance and then merge
the z-normalized results of all instances. In
the process we merge any overlapping results, keeping the
one with the highest score.</p>
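      <p>A minimal sketch of this late fusion is given below. It pools the z-normalized detections of all instances of a query and keeps, among detections that overlap in the same utterance, only the highest-scoring one. The greedy best-first order and the <monospace>Detection</monospace> fields are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import NamedTuple


class Detection(NamedTuple):
    utterance: str
    start: float   # seconds
    end: float     # seconds
    score: float   # z-normalized score, higher is better


def overlaps(a, b):
    """True when both detections share time in the same utterance."""
    return a.utterance == b.utterance and a.start < b.end and b.start < a.end


def fuse_instances(results_per_instance):
    """Pool the detections of all instances and resolve overlaps by score."""
    pooled = [d for results in results_per_instance for d in results]
    fused = []
    for d in sorted(pooled, key=lambda d: d.score, reverse=True):
        if not any(overlaps(d, kept) for kept in fused):
            fused.append(d)
    return fused


instance_a = [Detection("utt1", 1.0, 2.0, 1.5), Detection("utt2", 0.0, 1.0, 0.2)]
instance_b = [Detection("utt1", 1.5, 2.5, 2.1), Detection("utt3", 4.0, 5.0, 0.9)]
fused = fuse_instances([instance_a, instance_b])
```
]]></preformat>
      <p>In this example the two detections in utt1 overlap, so only the one from the second instance (score 2.1) is kept.</p>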
    </sec>
    <sec id="sec-7">
      <title>3. RESULTS</title>
      <p>Table 1 shows the results obtained by our system on the
dev and eval queries, both for the core set (only one
instance of each query is available) and the extended set. Results
are lower than expected. After some preliminary testing we
observed that using a global VAD was a bad choice,
as it cannot reliably classify
speech/non-speech across the different acoustic conditions present in the
database. In addition, we did not do a good
job of setting the optimal decision threshold for the ATWV score,
with either the dev or the eval queries. On a positive note, the
extended query results are consistently better (on MTWV)
than the single-query results, which indicates that our late fusion of
results works.</p>
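      <p>To make the distinction between ATWV and MTWV concrete, the sketch below computes a term-weighted value at a given threshold and the maximum over candidate thresholds. It follows the usual definition, one minus the query-averaged sum of the miss probability and a weighted false alarm probability. The weight <monospace>beta</monospace>, the number of non-target trials and the input layout are assumptions, not the settings of the evaluation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def twv(queries, threshold, beta):
    """Term-weighted value at one decision threshold.

    queries: list of dicts with the (hypothetical) keys
      'hits'         scores of detections that match a true occurrence
      'false_alarms' scores of detections that do not
      'n_true'       number of true occurrences (must be > 0)
      'n_nontarget'  number of non-target trials
    """
    total = 0.0
    for q in queries:
        n_hit = sum(s >= threshold for s in q["hits"])
        n_fa = sum(s >= threshold for s in q["false_alarms"])
        p_miss = 1.0 - n_hit / q["n_true"]
        p_fa = n_fa / q["n_nontarget"]
        total += p_miss + beta * p_fa
    return 1.0 - total / len(queries)


def mtwv(queries, beta):
    """Best TWV over all observed scores, with the threshold achieving it."""
    candidates = sorted({s for q in queries for s in q["hits"] + q["false_alarms"]})
    best = max(candidates, key=lambda t: twv(queries, t, beta))
    return twv(queries, best, beta), best


toy = [{"hits": [0.9, 0.6], "false_alarms": [0.7, 0.3],
        "n_true": 2, "n_nontarget": 100}]
actual = twv(toy, threshold=0.8, beta=10.0)   # TWV at a preset threshold
maximum, best_threshold = mtwv(toy, beta=10.0)
```
]]></preformat>
      <p>ATWV is the value at the threshold fixed before scoring, while MTWV is the value at the best threshold found afterwards, so a poorly chosen threshold shows up as a gap between the two.</p>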
      <p>To run the system we used standard desktop Linux
machines (approx. 2.5 GHz processor speed), executing the
program on a single core. The amount of RAM used
(excluding the storage of the database in memory) is around
110 MB on average. The run-time factor is around 1e-3 RT
for the core set and 8.3e-4 RT for the extended set (core +
extended queries) for the online search (feature extraction
and background model computation are excluded from these
measurements). These values improve on last year's
system thanks to a major rework of our system and
the optimization of all algorithms. We still feel that current
processing speeds should be increased for the system to be
usable in a real-life implementation.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <title>Official Evaluation Results</title>
        </caption>
        <table>
          <thead>
            <tr>
              <th>dev queries</th>
              <th>eval queries</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>0.1158</td>
              <td>0.0925</td>
            </tr>
            <tr>
              <td>0.0961</td>
              <td>0.0793</td>
            </tr>
            <tr>
              <td>0.1303</td>
              <td>0.1035</td>
            </tr>
            <tr>
              <td>0.0845</td>
              <td>0.0928</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper we presented the Telefonica Research system
proposed for the SWS task within MediaEval 2013. The
system is based on frame-based pattern matching using a novel
DTW algorithm called IR-DTW. The results are far from
what we expected, due in part to the use of a global VAD that does
not adapt well to the varied acoustic conditions present in
the search corpus. We intend to understand and solve these
problems and to extend the IR-DTW matching algorithm to
perform dynamic programming symbol-based matching over
phoneme lattices, in order to obtain a second set of results that
can later be fused with the frame-based results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          .
          <article-title>Speaker Independent Discriminant Feature Extraction for Acoustic Pattern-Matching</article-title>
          .
          <source>In Proc. ICASSP</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          .
          <article-title>Information retrieval-based dynamic time warping</article-title>
          .
          <source>In Proc. Interspeech</source>
          , Lyon, France,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          .
          <article-title>The Spoken Web Search Task</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mantena</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          .
          <article-title>Speed Improvements to Information Retrieval-Based Dynamic Time Warping using Hierarchical k-Means Clustering</article-title>
          .
          <source>In Proc. ICASSP</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Dynamic Time Warping, chapter 4</article-title>
          .
          <source>In Information Retrieval for Music and Motion</source>
          , pages
          <fpage>69</fpage>
          –
          <lpage>84</lpage>
          . Springer-Verlag, Berlin, Germany,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>CUHK System for the Spoken Web Search task at Mediaeval 2012</article-title>
          .
          <source>In Proc. MediaEval Workshop</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Glass</surname>
          </string-name>
          .
          <article-title>Unsupervised Spoken Keyword Spotting via Segmental DTW on Gaussian Posteriorgrams</article-title>
          .
          <source>In Proc. ASRU</source>
          , pages
          <fpage>398</fpage>
          –
          <lpage>403</lpage>
          , Merano, Italy,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>