<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Telefonica System for the Spoken Web Search Task at MediaEval 2011</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xavier Anguera</string-name>
          <email>xanguera@tid.es</email>
          <aff>Telefonica Research, Barcelona, Spain</aff>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>This working paper describes the system proposed by Telefonica Research for the Spoken Web Search task within the MediaEval 2011 benchmarking evaluation campaign. The proposed system is based exclusively on a pattern-matching approach, which performs a query-by-example search with no prior knowledge of the acoustics or the language being spoken. The system's main contribution is the use of a novel method to obtain speaker-independent acoustic features, which are then matched through a DTW-like algorithm. The results obtained are promising and show, in our opinion, the potential of this class of techniques for the task.</p>
      </abstract>
      <kwd-group>
        <kwd>Pattern matching</kwd>
        <kwd>query-by-example</kwd>
        <kwd>spoken query</kwd>
        <kwd>search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The objective of the Spoken Web Search task is to search for a given audio query within a set of audio content; for a detailed explanation refer to [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ]. The audio content in this particular evaluation contains phone call excerpts recorded in four different languages within the World Wide Telecom Web project [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ] conducted by IBM. The system we propose to tackle this task is based on audio pattern matching between the query and the audio content to retrieve putative matches. No information at all is used regarding the language the queries are spoken in or their content (i.e. the transcription).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
      <p>The proposed system can be split into two main blocks: the acoustic feature extraction and the query search. The goal of the acoustic feature extraction is to obtain features that contain information about what has been said while being speaker independent, so that the system is able to recognize two instances of the same spoken word even if they were spoken by different speakers. The query search performs a search for every particular query over all acoustic material to identify whether (and where) the query appears. Transversal to both modules, we applied a simple silence detection algorithm to eliminate long silences in the queries and in the audio content. Next we describe these three modules in more detail.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Silence Detection and Removal</title>
      <p>Early on in our development we noticed that most queries were spoken in isolation. This means that the spoken query is always accompanied by some silence at the beginning and end. In addition, some phone call excerpts also contained unwanted long stretches of silence. In order to eliminate most silence regions without jeopardizing the non-silence ones, we applied a simple energy-based thresholding algorithm, individually to every file, as follows: first, we compute the average energy of the signal over windows of 200ms, every 5ms. Then we search for the smallest energy value and the average of the top 1% highest energy values (we do not choose a single value in order to mellow down the effect of outliers). Next we compute a threshold at 5% of the resulting dynamic range, above the minimum energy value. Finally, we apply this threshold to every 5ms of the input signal to differentiate between speech and silence. To avoid fast changes between speech and silence, we apply a top-hat algorithm with a window of 100ms to the binary output of the previous step to ensure that no silence/speech segments shorter than 100ms are output.</p>
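      <p>As an illustration, a minimal Python sketch of this detector could look as follows, assuming a mono signal array and the window, hop, percentile and threshold values quoted above; the function name is ours, and a simple run-length smoothing stands in for the top-hat step.</p>
      <preformat><![CDATA[
# Energy-based silence detection sketch following Section 2.1.
# Assumptions: 8 kHz mono float signal; the constants (200 ms windows,
# 5 ms hop, top 1%, 5% of the dynamic range, 100 ms minimum segment
# length) are taken from the text above.
import numpy as np

def detect_speech(signal, sr=8000, win_s=0.2, hop_s=0.005,
                  top_frac=0.01, thr_frac=0.05, min_seg_s=0.1):
    win, hop = int(win_s * sr), int(hop_s * sr)
    # Average energy over 200 ms windows, computed every 5 ms.
    starts = range(0, max(len(signal) - win, 1), hop)
    energies = np.array([np.mean(signal[s:s + win] ** 2) for s in starts])
    e_min = energies.min()
    # Average of the top 1% energies, to mellow down the effect of outliers.
    k = max(1, int(len(energies) * top_frac))
    e_top = np.sort(energies)[-k:].mean()
    # Threshold at 5% of the dynamic range above the minimum energy.
    threshold = e_min + thr_frac * (e_top - e_min)
    speech = energies > threshold
    # Flip any speech/silence run shorter than 100 ms (our stand-in for
    # the paper's top-hat smoothing of the binary output).
    min_len = int(min_seg_s / hop_s)
    i = 0
    while i < len(speech):
        j = i
        while j < len(speech) and speech[j] == speech[i]:
            j += 1
        if j - i < min_len:
            speech[i:j] = not speech[i]
        i = j
    return speech  # one boolean per 5 ms frame: True = speech
]]></preformat>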
    </sec>
    <sec id="sec-4">
      <title>2.2 Acoustic Feature Extraction</title>
      <p>Most of our effort in this year's evaluation went into designing a good acoustic feature extraction module. Our goal was to extract from the audio signal features that retain all acoustic information about what was said while being speaker and background independent. As a side objective, we also wanted to be as independent as possible of outside training data.</p>
      <p>
        We based the design of our feature extractor on previous work that started with [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] on using phone posterior probabilities as features, and was then extended by [5] to the automatic word discovery task. Similarly to [5], for our main submission we construct a Gaussian Mixture Model and store the Gaussian posterior probabilities (normalized to sum to 1) as our features. In our case we decided to use only the development data available for the SWS task; therefore no external data was used in the training of this model. In addition, once the GMM has been trained with the EM-ML algorithm, we perform a hard assignment of each frame to its most likely Gaussian and retrain the Gaussians' means and variances to optimally model these frames. This last step tries to solve a problem most EM-ML systems have: they optimize the Gaussian parameters to maximize the overall likelihood of the model on the input data, but not to discriminate between the different sounds in it. By performing this last assignment and retraining step we push the Gaussians apart from each other to better model individual groups of frames depending on their location and density.
      </p>
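      <p>A minimal sketch of this training procedure follows, using scikit-learn's GaussianMixture as a stand-in for the EM-ML training described above; the function names, the number of Gaussians and the diagonal-covariance choice are illustrative assumptions, not the exact configuration of our submission.</p>
      <preformat><![CDATA[
# Gaussian posteriorgram features with hard-assignment retraining
# (Section 2.2). scikit-learn's EM training stands in for EM-ML.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(frames, n_components=128):
    """frames: (n_frames, n_dims) acoustic features, e.g. MFCCs."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag').fit(frames)
    # Hard-assign every frame to its most likely Gaussian, then
    # re-estimate each Gaussian's mean and variance from its frames.
    labels = gmm.predict(frames)
    for g in range(n_components):
        assigned = frames[labels == g]
        if len(assigned) > 1:
            gmm.means_[g] = assigned.mean(axis=0)
            gmm.covariances_[g] = assigned.var(axis=0) + 1e-6
    # Refresh the cached precisions so later predictions use the new
    # variances (this inverse-sqrt form is valid for 'diag' only).
    gmm.precisions_cholesky_ = 1.0 / np.sqrt(gmm.covariances_)
    return gmm

def posteriorgram(gmm, frames):
    # Per-frame Gaussian posterior probabilities, normalized to sum to 1.
    return gmm.predict_proba(frames)
]]></preformat>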
      <p>
        Alternatively, we also submitted a contrastive system that consists of binarizing the posterior probabilities of each frame. This is inspired by our recent developments in speaker verification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where we show that we can effectively build binary models to discriminate between speakers. Such representations are much smaller for storage purposes and can be processed much faster, as binary distances are usually very fast to compute. In this case, for every posterior probability vector we set the 20% largest probabilities to 1, and the rest to 0. The chosen distance between two binary vectors x and y was defined as
      </p>
      <disp-formula id="eq1">
        <tex-math><![CDATA[
d(x, y) = 1 - \frac{\#(x \wedge y)}{\#(x \vee y)} \qquad (1)
]]></tex-math>
      </disp-formula>
      <p>where ∧ indicates the boolean AND operator, ∨ indicates the boolean OR operator, and #(·) counts the number of bits set to 1.</p>
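      <p>The following small sketch illustrates the binarization and the distance of Eq. (1); the function names and the use of a single posterior vector as input are illustrative assumptions.</p>
      <preformat><![CDATA[
# Binary posteriorgram features and the AND/OR distance of Eq. (1).
import numpy as np

def binarize(posteriors, keep_frac=0.2):
    """Set the 20% largest posterior probabilities to 1, the rest to 0."""
    k = max(1, int(len(posteriors) * keep_frac))
    out = np.zeros(len(posteriors), dtype=bool)
    out[np.argsort(posteriors)[-k:]] = True
    return out

def binary_distance(x, y):
    """Eq. (1): 1 minus the ratio of shared bits (AND) to active bits (OR)."""
    return 1.0 - np.logical_and(x, y).sum() / np.logical_or(x, y).sum()
]]></preformat>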
    </sec>
    <sec id="sec-5">
      <title>2.3 Query Search Algorithm</title>
      <p>Given two sequences X and Y of posterior probabilities, respectively obtained from the query and any given phone recording, we compare them using a DTW-like algorithm. The standard DTW algorithm returns the optimum alignment between any two sequences by finding the optimum path between their start (0, 0) and end (x_end, y_end) points. In our case we constrain the query signal to match between start and end, but we allow the phone recording to start its alignment at any position (0, y) and finish its alignment whenever the dynamic programming algorithm reaches x = x_end. Although we do not set any global constraints,
the local constraints are set so that at maximum 2-times or 1/2-times warping is allowed, by choosing the path that minimizes the cost to reach position (i, j) as</p>
      <disp-formula id="eq2">
        <tex-math><![CDATA[
\mathrm{cost}(i,j) = d(i,j) + \min \begin{cases}
D(i-2,\,j-1)\,/\,\bigl(\#(i-2,\,j-1)+3\bigr)\\
D(i-2,\,j-2)\,/\,\bigl(\#(i-2,\,j-2)+4\bigr)\\
D(i-1,\,j-2)\,/\,\bigl(\#(i-1,\,j-2)+3\bigr)
\end{cases} \qquad (2)
]]></tex-math>
      </disp-formula>
      <p>
where D(i, j) is the accumulated (non-normalized) distance of all optimum paths until position (i, j), d(i, j) is the local distance between frames x_i and y_j of the two compared sequences, and #(i, j) is the number of jumps of the optimum path until that point. Note that when normalizing the different possible paths we slightly favor the diagonal match.</p>
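      <p>A compact Python sketch of this search follows, implementing the local constraints and path-length normalization of Eq. (2); the function names are ours, and the local distance is left as a parameter (e.g. the binary distance above, or a distance between posteriorgram vectors).</p>
      <preformat><![CDATA[
# DTW-like subsequence search (Section 2.3): the query must match from
# its first to its last frame, while the recording may start anywhere;
# local steps and normalization follow Eq. (2).
import numpy as np

def query_search(query, recording, dist):
    nx, ny = len(query), len(recording)
    D = np.full((nx, ny), np.inf)      # accumulated (non-normalized) distance
    N = np.zeros((nx, ny), dtype=int)  # frames consumed by the best path
    for j in range(ny):                # the alignment may start at any (0, j)
        D[0, j] = dist(query[0], recording[j])
        N[0, j] = 1
    # (di, dj, frames added): 2:1 warping, diagonal, 1:2 warping, as in Eq. (2).
    steps = [(2, 1, 3), (2, 2, 4), (1, 2, 3)]
    for i in range(1, nx):
        for j in range(1, ny):
            best = np.inf
            for di, dj, add in steps:
                if i >= di and j >= dj and np.isfinite(D[i - di, j - dj]):
                    # Predecessors compete on their normalized cost.
                    cand = D[i - di, j - dj] / (N[i - di, j - dj] + add)
                    if cand < best:
                        best = cand
                        D[i, j] = D[i - di, j - dj] + dist(query[i], recording[j])
                        N[i, j] = N[i - di, j - dj] + add
    # Best normalized score over all recording positions where the whole
    # query has been matched (np.inf where no valid alignment ends).
    return (D[nx - 1] / np.maximum(N[nx - 1], 1)).min()
]]></preformat>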
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS</title>
      <p>In our submissions we report detections using a fixed threshold instead of the actual value, as we did not place much emphasis in the development stage on finding an optimum threshold for our system. Still, we observed that for any given set threshold the results remain similar both in dev-dev and in eval-eval.</p>
      <p>In general, we find the results for dev-dev and eval-eval to be very acceptable. On the other hand, we were surprised to see that our system does not work nearly as well in the cross conditions. We have observed that channel mismatch might have played a major role in these results, as in several cases the development files contain many recordings with a much poorer signal quality than those from the evaluation files. We consider that we have achieved reasonable speaker independence with our features, but we have yet to apply ways to compensate for differences in the channel.</p>
      <p>Comparing the two submissions, we observe that the binary features always outperform the standard posteriorgrams. In our view this is a very interesting finding that can be used in the near future to speed up spoken word search and automatic pattern discovery systems, which, together with the proposed novel way to compute the GMM model, can achieve fast and quite accurate results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Aradilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vepa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Bourlard</surname>
          </string-name>
          .
          <article-title>Using posterior-based features in template matching for speech recognition</article-title>
          .
          <source>In Proc. ICSLP</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Bonastre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Sierra</surname>
          </string-name>
          , and P.-M.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Bousquet</surname>
          </string-name>
          .
          <article-title>Speaker modeling using local binary decisions</article-title>
          .
          <source>In Proc. Interspeech</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajput</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Nanavati</surname>
          </string-name>
          .
          <article-title>WWTW: The World Wide Telecom Web</article-title>
          .
          <source>In Proc. NSDR 2007 (SIGCOMM workshop)</source>
          , Kyoto, Japan,
          <year>August 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajput</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <article-title>Spoken web search</article-title>
          .
          <source>In MediaEval 2011 Workshop</source>
          , Pisa, Italy,
          <year>September 1-2, 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          2) + 3) [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          .
          <article-title>Unsupervised spoken keyword spotting via segmental dtw on gaussian posteriorgrams.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>In Proc. ASRU</source>
          , pages
          <volume>398</volume>
          {
          <fpage>403</fpage>
          ,
          <string-name>
            <surname>Merano</surname>
          </string-name>
          , Italy,
          <year>December 2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>