<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The CMTECH Spoken Web Search System for MediaEval 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ciro Gracia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xavier Anguera</string-name>
          <email>xanguera@tid.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xavier Binefa</string-name>
          <email>xavier.binefa@upf.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University Pompeu Fabra, Barcelona</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Telefonica Research, Barcelona</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>We present a system for query-by-example search on zero-resource languages. The system compares speech patterns by fusing the contributions of two acoustic models that cover both their spectral characteristics and their temporal evolution. The spectral model uses standard Gaussian mixtures to model classical MFCC features; we introduce phonetic priors in order to bias the unsupervised training of this model. In addition, we extend the standard similarity metric used to compare posterior vectors by incorporating inter-cluster distances. To model temporal evolution patterns we use long temporal context models. We combine the information obtained from both models when computing the similarity matrix, allowing the subsequence-DTW algorithm to find optimal subsequence alignment paths between query and reference data. The resulting alignment paths are locally filtered and globally normalized. Our experiments on Mediaeval data show that this approach provides state-of-the-art results and significantly improves on the single-model and standard-metric baselines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The task of searching for speech queries within a speech
corpus without a priori knowledge of the language or the
acoustic conditions of the data is gaining interest in the scientific
community. In the Spoken Web Search (SWS) task of
the Mediaeval 2013 evaluation campaign [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], systems
are given a set of acoustic queries that have to be searched
for within an audio corpus spanning several languages
and different recording conditions. No information is given about
the transcription of the queries or the speech corpus, nor about
the language spoken.
      </p>
      <p>To tackle this task we propose a system that follows a
zero-resource approach, extending several ideas from the state of
the art.</p>
      <p>
        We adopt posteriorgram features [
        <xref ref-type="bibr" rid="ref5 ref9">9, 5</xref>
        ] in order to improve
the comparison between speech features. Posteriorgram features
are obtained from an acoustic model and allow acoustic vectors
to be compared consistently by removing sources of feature
variance. The difficulty at this point lies in how to obtain
meaningful acoustic models in an unsupervised manner and
how to properly compare posterior features. To obtain meaningful
acoustic models from unsupervised data, we introduce linguistic
prior information into the unsupervised training by using a
specific pre-trained model as initialization. In addition, instead
of using the standard dot product to compare normalized
posteriorgram vectors, we extend this approach by incorporating
into the comparison a specially crafted matrix defining an
inter-cluster similarity.
      </p>
      <p>
        Previous approaches [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to Mediaeval data have shown
that fusing different sources of knowledge through different
acoustic models provides a significant improvement in
evaluation. Despite that, it is important to determine which
types of information can complement each other in order
to guarantee a gain worth the extra computational cost. Our
approach to fusion is to combine temporal and spectral
information. As stated above, one of the models focuses on the
spectral configuration of the acoustic vectors, while the
complementary model focuses on modeling the temporal evolution
of the feature dimensions.
      </p>
      <p>
        For sequence matching we use the subsequence dynamic
time warping algorithm (s-DTW) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. With it we obtain the
alignment paths and the scores of all the potential matches
of the query inside the utterance. The major difficulty
lies in deciding which of the provided alignments
are acceptable as potential query instances and how to deal
with intra- and inter-query overlap of results. In our system we use
lowpass filtering to reduce the number of spurious detections
and keep only the highest score among the intra-query
overlapping paths. Inter-query overlap is complex and remains
future work. Finally, we explore two different approaches to
global score normalization: the standard Z-norm approach
and score mapping based on the continuous density function.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. THE CMTECH SYSTEM DESCRIPTION</title>
      <p>The system is based on standard MFCC39 features
computed by means of HTK (25 ms windows, 10 ms shift).</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Spectral Acoustic Model</title>
      <p>
        The first acoustic model is based on a Gaussian mixture
model (GMM). We originally trained this model using TIMIT
phonetic ground truth: we trained a 4-Gaussian GMM for
each of the 39 Lee and Hon [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] phonetic classes and then
combined all of them into a single GMM. This GMM is used
as initialization for an unsupervised training of the final
156-component GMM on the SWS2013 utterances.
      </p>
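      <p>As an illustration of the posteriorgram extraction step, the following is a minimal numpy sketch with made-up component parameters (the actual system uses the 156-component GMM over HTK MFCC39 features):</p>

```python
import numpy as np

def gmm_posteriorgram(frames, means, variances, weights):
    """Posterior probability of each diagonal-covariance Gaussian
    component for every feature frame (rows of `frames`)."""
    # log N(x; mu_k, diag(var_k)) for every frame/component pair
    diff = frames[:, None, :] - means[None, :, :]            # (T, K, d)
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)
    log_lik = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(weights)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)          # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)            # rows sum to 1

# toy model: 3 components over 2-dimensional features (illustrative only)
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
variances = np.ones((3, 2))
weights = np.array([0.5, 0.25, 0.25])
frames = rng.normal(size=(4, 2))
P = gmm_posteriorgram(frames, means, variances, weights)
```

Each row of the resulting posteriorgram is a probability distribution over components, which is what makes frames comparable independently of raw feature variance.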
      <p>Using this model we build an inter-cluster distance matrix
D (156x156) using the symmetric Kullback-Leibler divergence
between the Gaussian components i and j:</p>
      <p>D(i, j) = 1/2 [ tr(S_i^-1 S_j + S_j^-1 S_i - 2I)
+ (m_i - m_j)^T (S_i^-1 + S_j^-1) (m_i - m_j) ]   (1)</p>
      <p>where m_i, S_i are the mean and covariance of component i. When
comparing posterior features ~x, ~y we use the similarity, with e^{-D}
the element-wise exponential of -D:</p>
      <p>d_s(~x, ~y) = ~x e^{-D} ~y^T   (2)</p>
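      <p>A minimal numpy sketch of this extended comparison, assuming the inter-cluster matrix is built from the symmetric Kullback-Leibler divergence between diagonal-covariance components and converted into a similarity weighting via an element-wise exp(-D); all parameters below are toy values, not the trained SWS2013 model:</p>

```python
import numpy as np

def symmetric_kl_matrix(means, variances):
    """Pairwise symmetrised KL divergence between diagonal Gaussians."""
    K, d = means.shape
    D = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            tr = (variances[i] / variances[j] + variances[j] / variances[i]).sum()
            dm = means[i] - means[j]
            mah = (dm ** 2 * (1.0 / variances[i] + 1.0 / variances[j])).sum()
            D[i, j] = 0.5 * (tr + mah - 2 * d)
    return D

def extended_similarity(x, y, D):
    """d_s(x, y) = x . exp(-D) . y^T instead of the plain dot product:
    posterior mass on similar clusters also contributes."""
    return x @ np.exp(-D) @ y

# toy 3-cluster model and two posterior vectors
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
variances = np.ones((3, 2))
D = symmetric_kl_matrix(means, variances)
x = np.array([0.8, 0.1, 0.1])
y = np.array([0.1, 0.8, 0.1])
plain = x @ y
extended = extended_similarity(x, y, D)
```

Because exp(-D) has ones on its diagonal and positive off-diagonal entries, the extended similarity never falls below the plain dot product; it rewards posterior mass spread across acoustically close clusters.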
      <p>We found that this extended comparison provides a gain of more
than 0.05 absolute MTWV points on Mediaeval 2012 data.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Temporal Acoustic Model</title>
      <p>
        The objective of this temporal model is to extend the
context information and to effectively complement the frame-based
acoustic model. The temporal model is based on the long
temporal context approach [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and is trained on Mediaeval 2012
data. We process each of the MFCC39 dimensions
independently. We first segmented the Mediaeval 2012 data using an
unsupervised phonetic segmentation approach [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and extracted
a 150 ms context from the center of each of the segments,
forming a collection of vectors in R^31. Each context vector is
standardized to zero mean and unit variance, windowed
using a Hanning window, and decorrelated using the discrete
cosine transform; only the first 15 coefficients are kept, yielding
the final vector in R^15. The modeling is performed by hierarchical
k-medoids together with a final estimation of the covariance
matrices. The resulting model is composed of a Gaussian
mixture model of 128 components for each of the original 39
dimensions.
      </p>
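      <p>The per-dimension context processing described above can be sketched as follows (numpy only; the DCT-II basis is written out explicitly, the 31-sample context and 15 retained coefficients follow the text, and the input trajectory is a toy signal):</p>

```python
import numpy as np

def context_vector(band_trajectory, keep=15):
    """Standardize a 31-sample temporal context of one MFCC dimension,
    apply a Hanning window, decorrelate with a DCT-II, keep 15 coeffs."""
    c = np.asarray(band_trajectory, dtype=float)
    c = (c - c.mean()) / (c.std() + 1e-8)          # zero mean, unit variance
    c = c * np.hanning(len(c))                     # Hanning window
    n = len(c)
    k = np.arange(keep)[:, None]
    # DCT-II basis, rows = the first `keep` cosine basis functions
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return basis @ c                               # first 15 DCT-II coefficients

ctx = np.sin(np.linspace(0, 3 * np.pi, 31))        # toy 150 ms trajectory
v = context_vector(ctx)
```

Windowing before the DCT tapers segment edges, so the retained low-order coefficients describe the smooth temporal shape of the band rather than boundary artifacts.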
      <p>The comparison between two input vectors is done in each
band b independently by means of its model posteriors ~x_b, ~y_b,
and the per-band results are then fused using the median operator:</p>
      <p>d_t(~x, ~y, b) = (~x_b ~y_b^T) / (||~x_b|| ||~y_b||)   (3)</p>
      <p>d_t(~x, ~y) = median_b d_t(~x, ~y, b)</p>
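      <p>A small numpy sketch of the per-band comparison and median fusion (the 39 bands and 128-component posteriors follow the text; the vectors are random toy data):</p>

```python
import numpy as np

def band_similarity(xb, yb):
    """Normalized dot product of the per-band posteriors."""
    return xb @ yb / (np.linalg.norm(xb) * np.linalg.norm(yb))

def temporal_similarity(x_bands, y_bands):
    """Fuse the 39 per-band similarities with the median operator,
    which is robust to a few bands disagreeing with the rest."""
    return np.median([band_similarity(xb, yb)
                      for xb, yb in zip(x_bands, y_bands)])

rng = np.random.default_rng(1)
x_bands = rng.random((39, 128))   # one 128-component posterior per MFCC band
y_bands = rng.random((39, 128))
dt = temporal_similarity(x_bands, y_bands)
```

The median, unlike the mean, discards outlier bands entirely, so a single noisy MFCC dimension cannot dominate the fused similarity.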
      <p>On Mediaeval 2012 data, the incorporation of this
acoustic model boosted our system's MTWV results from 0.47 to
0.53 points.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Query Search</title>
      <p>For each pair of query q and utterance u patterns we build
a distance matrix M of size (|q| x |u|) using:</p>
      <p>M(q, u) = -log(d_t(q, u) d_s(q, u))   (4)</p>
      <p>We use s-DTW to obtain the score of the alignment paths for
each possible ending position in u. In order to select
relevant local score maxima, we first lowpass filter the results
using a 25-frame Gaussian window. Despite this filtering, the
resulting selected alignment paths retain their original score
values.</p>
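      <p>For illustration, a minimal subsequence-DTW in the spirit of [7]: the query may start and end anywhere in the utterance, giving one accumulated score per candidate ending frame (toy distance matrix; the real system applies this to the matrix M defined above and then lowpass filters the end scores):</p>

```python
import numpy as np

def subsequence_dtw(M):
    """Accumulated cost over a |q| x |u| distance matrix M, letting the
    query start and end anywhere in the utterance (s-DTW)."""
    Q, U = M.shape
    acc = np.full((Q, U), np.inf)
    acc[0] = M[0]                          # free start at any utterance frame
    for i in range(1, Q):
        for j in range(U):
            best = acc[i - 1, j]           # vertical step
            if j > 0:
                best = min(best, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = M[i, j] + best
    return acc[-1]                         # one score per candidate end frame

# toy matrix: a cheap band at columns 7..11 simulates a true match
M = np.ones((5, 20))
M[:, 7:12] = 0.01
end_scores = subsequence_dtw(M)            # minima fall inside the cheap band
```

Local minima of the end-score curve are the candidate detections; smoothing that curve before peak picking is what removes spurious neighboring detections.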
    </sec>
    <sec id="sec-6">
      <title>2.4 Global Normalization</title>
      <p>
        When all utterances have been processed for a given query,
we perform a normalization step. The first system presented
(primary) uses a standard Z-normalization, excluding the
first 500 results from the parameter estimation. Similarly
to the contrast enhancement performed by histogram equalization
in image processing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], our mapping approach replaces the
resulting query scores with their corresponding value of the
query probability continuous density function (cdf). This
effectively maps the score distribution to a uniform
distribution and its cdf to a linear function. Our second
system (contrastive) replaces global Z-normalization by this
cdf equalization approach.
      </p>
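      <p>Both normalization variants can be sketched as follows (numpy; the exclusion of the first 500 results mirrors the text, and the scores are random toy values):</p>

```python
import numpy as np

def z_norm(scores, skip=500):
    """Z-normalization with the top-`skip` results left out of the
    mean/std estimation (they are still normalized afterwards)."""
    order = np.argsort(scores)[::-1]           # highest score first
    rest = scores[order[skip:]]
    mu, sigma = rest.mean(), rest.std() + 1e-8
    return (scores - mu) / sigma

def cdf_equalize(scores):
    """Map each score to its empirical cdf value: the score
    distribution becomes (approximately) uniform on (0, 1]."""
    ranks = scores.argsort().argsort()         # rank of every score
    return (ranks + 1) / len(scores)

rng = np.random.default_rng(3)
scores = rng.normal(size=2000)
zn = z_norm(scores)
eq = cdf_equalize(scores)
```

Excluding the top results from the Z-norm statistics keeps likely true hits from inflating the background mean, while the cdf mapping makes per-query score distributions directly comparable regardless of their original shape.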
    </sec>
    <sec id="sec-7">
      <title>3. RESULTS</title>
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSIONS</title>
      <p>Our future work will explore the relationship between
system performance and voice activity detection. We also plan to
face the inter-query overlap problem and its inherent open-set
classification nature. We are interested in distinguishing
the key elements that guarantee the suitability of an acoustic
model for the task. Especially interesting is the exploration
of rigid and elastic distribution-matching methods, such as
maximum likelihood linear transforms, in order to adapt
pre-trained models to new data in an unsupervised manner.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Ace</surname>
          </string-name>
          .
          <article-title>Phoneme recognition based on long temporal context</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Acharya</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Ray</surname>
          </string-name>
          .
          <article-title>Image processing: principles and applications</article-title>
          . Wiley,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          .
          <article-title>The spoken web search task</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gracia</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Binefa</surname>
          </string-name>
          .
          <article-title>On hierarchical clustering for speech phonetic segmentation</article-title>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Hazen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>White</surname>
          </string-name>
          .
          <article-title>Query-by-example spoken term detection using phonetic posteriorgram templates</article-title>
          .
          <source>In Automatic Speech Recognition &amp; Understanding</source>
          ,
          <year>2009</year>
          .
          <article-title>ASRU 2009</article-title>
          . IEEE Workshop on, pages
          <volume>421</volume>
          -
          <fpage>426</fpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lopes</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Perdigão</surname>
          </string-name>
          .
          <article-title>Broad phonetic class definition driven by phone confusions</article-title>
          .
          <source>EURASIP Journal on Advances in Signal Processing</source>
          ,
          <year>2012</year>
          (
          <volume>1):1</volume>
          -
          <fpage>12</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Dynamic time warping</article-title>
          .
          <source>Information Retrieval for Music and Motion</source>
          , pages
          <volume>69</volume>
          -
          <fpage>84</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>CUHK system for the spoken web search task at MediaEval 2012</article-title>
          . In MediaEval,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Glass</surname>
          </string-name>
          .
          <article-title>Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams</article-title>
          .
          <source>In Automatic Speech Recognition &amp; Understanding</source>
          ,
          <year>2009</year>
          .
          <article-title>ASRU 2009</article-title>
          . IEEE Workshop on, pages
          <volume>398</volume>
          -
          <fpage>403</fpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>