<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miroslav Skácel</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor Szöke</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>All our systems are based on Dynamic Time Warping (DTW) and use bottle-neck (BN) features as input. The bottle-neck feature extractors were trained on the GlobalPhone Czech, Portuguese, Russian and Spanish languages, so our approach falls into the low-resource category. For the late-submission systems, we also aimed at the T1/T2/T3 types of query search. System calibration and fusion were based on binary logistic regression.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>MOTIVATION</title>
      <p>
        We developed one (single) system for the on-time submission
and two more systems for the late submission. The system
schema is in Figure 1. Similarly to last year, we used
feature extractors already available at BUT (so-called Atomic
Systems). This year we aimed only at bottle-neck features and the
DTW search approach. Our goal was to build a simple
system and to target word reordering in queries (T2/T3),
which we addressed in the late submission. On the other
hand, we have not addressed noise and reverberation in the
data (see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for details on the task).
      </p>
    </sec>
    <sec id="sec-2">
      <title>ATOMIC SYSTEMS</title>
      <p>
        All our subsystems use Artificial Neural Networks (ANNs)
to estimate per-frame phone-state probabilities (so-called
posterior-grams) and bottle-neck features. The subsystems are
based on DTW, using BNs to calculate distances between
query and test-segment frames. We re-use ANNs that were
trained for other projects as acoustic models for phone
or LVCSR recognizers: 1 SpeechDat (Hungarian;
monolingual LCRC system [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) for phone posterior-grams and
4 GlobalPhone (Czech, Portuguese, Russian, Spanish;
monolingual stacked-bottleneck systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) for BN features.
We did not exploit phone-state posterior-grams for DTW as
in last year's evaluations, due to a significant loss of
accuracy on noisy data sets. The Hungarian phone recognizer
was used only for Speech Activity Detection (SAD). Also,
we did not use Acoustic Keyword Spotting (AKWS)
subsystems or ANN adaptation on the target language as we did in
previous years.
      </p>
      <p>We ended up with 4 atomic systems and 3 subsystems
based on DTW using GlobalPhone features.</p>
      <p>[Figure 1: System schema. Query and utterance data pass through
the GP CZ, GP PO, GP RU and GP SP bottle-neck extractors; the
features are stacked (fea stack); DTW and its 2w/3w variants are run
per subsystem, followed by normalization, calibration and fusion
(late submission) producing the output detections.]</p>
    </sec>
    <sec id="sec-3">
      <title>Fusion of features</title>
      <p>
        We use concatenation of feature vectors for DTW, as
proposed by GTTS [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The feature vectors are simply stacked
on each other to create a larger feature vector. We tried
several combinations of 7 languages and ended up with a
concatenation of the Czech, Portuguese, Russian, and Spanish
GlobalPhone BNs (denoted as fea stack).
      </p>
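      <p>
        The stacking is a per-frame concatenation. A minimal sketch
(the matrix shapes and dimensions below are illustrative assumptions,
not values from this paper):
      </p>
      <preformat>
```python
import numpy as np

# Hypothetical per-language bottle-neck features for one utterance:
# each extractor outputs a frame-synchronous (T x bn_dim) matrix.
T, bn_dim = 500, 30
rng = np.random.default_rng(0)
bn_cz = rng.standard_normal((T, bn_dim))
bn_po = rng.standard_normal((T, bn_dim))
bn_ru = rng.standard_normal((T, bn_dim))
bn_sp = rng.standard_normal((T, bn_dim))

# "fea stack": concatenate the four per-frame vectors into one larger vector.
fea_stack = np.hstack([bn_cz, bn_po, bn_ru, bn_sp])
```
      </preformat>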
    </sec>
    <sec id="sec-4">
      <title>DYNAMIC TIME WARPING</title>
      <p>
        In our implementation, we follow the standard
query-by-example recipe: sub-sequence DTW [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A single DTW is
run for each combination of query and test segment, where
the query is allowed to start at any frame of the test
segment. When selecting the locally optimal path in the
standard DTW algorithm, the transition from the smallest
accumulated distance is chosen.
      </p>
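      <p>
        The sub-sequence recipe differs from full DTW only in the
initialization: the first query frame may align to any utterance frame
at zero prior cost. A minimal sketch in plain Python/NumPy (an
illustration of the technique, not the evaluated implementation):
      </p>
      <preformat>
```python
import numpy as np

def subsequence_dtw(dist):
    """Sub-sequence DTW over a (Q x U) local-distance matrix:
    the query may start at any utterance frame, so the first row
    carries no accumulated penalty from earlier frames."""
    Q, U = dist.shape
    acc = np.full((Q, U), np.inf)
    acc[0, :] = dist[0, :]                       # free starting point
    for i in range(1, Q):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, U):
            # transition from the smallest accumulated distance
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # vertical
                                         acc[i, j - 1],      # horizontal
                                         acc[i - 1, j - 1])  # diagonal
    return acc[-1, :]                            # best end score per frame
```
      </preformat>
      <p>
        The detection score for one query-utterance pair is then the
minimum over the returned last row.
      </p>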
      <p>In our implementation, we compare the accumulated
distances (including the current local distance) normalized by
online normalization. The online normalization performs
the division by the current path length on-the-fly at every step
of the calculation, to decide which step (vertical, horizontal or
diagonal) is the best to choose. The division is not saved during
the calculation; it is performed only to decide the next step.
The length normalization is done afterwards, as in the standard
approach. This leads to preferring longer paths over shorter
ones.</p>
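      <p>
        This step decision can be sketched by tracking the path length
alongside the accumulated distance; the normalized value is used only
for the comparison and never stored (a simplified illustration, not
the evaluated code):
      </p>
      <preformat>
```python
import numpy as np

def subsequence_dtw_onlinenorm(dist):
    """Sub-sequence DTW where the step decision compares accumulated
    distances divided by the current path length ("online
    normalization"); the stored value stays unnormalized and is
    length-normalized only once the path ends."""
    Q, U = dist.shape
    acc = np.full((Q, U), np.inf)
    plen = np.zeros((Q, U), dtype=int)
    acc[0, :] = dist[0, :]
    plen[0, :] = 1
    for i in range(1, Q):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        plen[i, 0] = plen[i - 1, 0] + 1
        for j in range(1, U):
            preds = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            # choose the predecessor by length-normalized distance;
            # the division itself is never stored
            pi, pj = min(preds,
                         key=lambda s: (acc[s] + dist[i, j]) / (plen[s] + 1))
            acc[i, j] = acc[pi, pj] + dist[i, j]
            plen[i, j] = plen[pi, pj] + 1
    return acc[-1, :] / plen[-1, :]  # final length normalization
```
      </preformat>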
      <p>
        As the distance metric, we used the Pearson
product-moment correlation distance [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We applied SAD to drop
non-speech frames in queries (see our previous work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
Queries having fewer than 10 frames after SAD
were discarded. The primary submitted system (denoted as
p-fea stack DTW), using the described algorithm, was the
winning one in last year's evaluations. We made a few
changes to the primary system for the late submission.
      </p>
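      <p>
        The local distance matrix fed to DTW can be sketched as follows;
we assume here the common convention of mapping the correlation
coefficient r to a distance d = (1 - r)/2 in [0, 1] (see [6] for the
exact variant used):
      </p>
      <preformat>
```python
import numpy as np

def pearson_distance_matrix(query, utt):
    """Pearson correlation distance between every query frame and every
    utterance frame, mapped to [0, 1] via d = (1 - r) / 2."""
    q = query - query.mean(axis=1, keepdims=True)
    u = utt - utt.mean(axis=1, keepdims=True)
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    r = q @ u.T                      # pairwise correlation coefficients
    return (1.0 - r) / 2.0
```
      </preformat>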
      <p>We used different step-size conditions during the
calculation of accumulated distances to control the slope of paths
(systems denoted as +slope). Each path has a local slope
within the bounds 1/2 and 2. This limitation allows us to
eliminate errors where one query frame maps perfectly to the whole
utterance or vice versa. We also experimented with
different local weights for the vertical, horizontal and diagonal
directions, but got no improvement out of it.</p>
    </sec>
    <sec id="sec-5">
      <title>Dealing with T2/T3</title>
      <p>We built additional subsystems to deal with the T2/T3 types
of queries. A query is split into equal parts and, for each part,
DTW is performed separately (denoted as bands). The
smallest accumulated distance is chosen from each band of the
given query and the results are averaged together into a
matching score. This approach allows us to search for multi-word
queries. For single-word queries, the results remain
the same. Note that the two words in a T2 query are not necessarily
separated exactly in the middle of the query.
We experimented with 1 (baseline system), 2 (denoted as
2w) and 3 (denoted as 3w) bands. These subsystems were
used only for fusion and, therefore, were not submitted as
separate late systems.</p>
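      <p>
        The banding scheme can be sketched generically; dtw_score below
is a placeholder for any scorer that returns a band's smallest
accumulated distance against the utterance:
      </p>
      <preformat>
```python
import numpy as np

def band_scores(query, utt, n_bands, dtw_score):
    """Split the query frames into n_bands roughly equal parts, score
    each band against the whole utterance, and average the bands'
    smallest accumulated distances into one matching score."""
    bands = np.array_split(query, n_bands, axis=0)
    return float(np.mean([dtw_score(band, utt) for band in bands]))
```
      </preformat>
      <p>
        With n_bands = 1 this reduces to the baseline system, so
single-word queries are unaffected.
      </p>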
    </sec>
    <sec id="sec-6">
      <title>SCORE POST-PROCESSING</title>
      <p>The global minimum of frame-by-frame detection scores is
selected as a candidate detection. There might be significant
differences between the score distributions corresponding to
different queries, so it is important to normalize the
scores for each query.</p>
      <p>
        We applied m-norm (developed in SWS2013 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) to
normalize the scores for each query, allowing a single common
threshold minimizing the Cnxe metric.
      </p>
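      <p>
        As an illustration only (the exact m-norm definition is given
in [7]), a per-query shift-and-scale normalization that enables one
global threshold could look like:
      </p>
      <preformat>
```python
import numpy as np

def per_query_norm(scores):
    """Shift and scale one query's detection scores so that a single
    global threshold can be applied across all queries (illustrative
    stand-in; the actual m-norm is defined in [7])."""
    return (scores - scores.mean()) / (scores.std() + 1e-8)
```
      </preformat>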
      <p>As the task expects only one score per query-utterance
pair, without timing, we find and return the best
score from the set of detections of a query in an utterance.</p>
    </sec>
    <sec id="sec-7">
      <title>CALIBRATION</title>
      <p>The post-processed scores were calibrated with respect to
the Cnxe scoring metric using binary logistic regression.</p>
      <p>We attached side information (sideinfo) to each score
(query-utterance pair). The sideinfo consists of: the number of
phonemes, the log of the number of phonemes, the number of speech
frames, the log of the number of speech frames, and the average
log-posterior of speech frames taken from SAD. The sideinfo was
generated for both queries and utterances, so the final "feature
vector" for calibration consists of: 1 detection score
(query-utterance pair), 5 query sideinfo and 5 utterance sideinfo
values. The parameters (11 linear weights and 1 additive constant)
were trained on the development set. We denoted these 10 sideinfo
values as QU. However, we found that the sideinfo harms performance
for the late submission, so we omitted it.</p>
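      <p>
        A minimal numpy-only sketch of this calibration model (11 linear
weights plus a constant trained by binary logistic regression; the
data below are synthetic, not evaluation trials):
      </p>
      <preformat>
```python
import numpy as np

def train_logreg(X, y, lr=0.1, iters=500):
    """Binary logistic regression via gradient descent: learns the
    linear weights and the additive constant used for calibration."""
    X1 = np.hstack([X, np.ones((len(X), 1))])   # append bias column
    w = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X1 @ w))       # sigmoid
        w -= lr * X1.T @ (p - y) / len(y)       # cross-entropy gradient
    return w

def calibrate(X, w):
    """Calibrated score = log-odds from the trained linear layer."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w

# Synthetic dev-set trials: 1 detection score + 5 query sideinfo
# + 5 utterance sideinfo = 11 features per query-utterance pair.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 11))
y = (X[:, 0] > 0).astype(float)                 # toy target labels
w = train_logreg(X, y)                          # 11 weights + 1 constant
calibrated = calibrate(X, w)
```
      </preformat>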
    </sec>
    <sec id="sec-8">
      <title>FUSION</title>
      <p>We applied fusion on the level of calibrated systems, again
using binary logistic regression.</p>
      <p>For the improved late system, we fused the primary system
with slope limitation (l-fea stack DTW+slope) and the 2w
and 3w systems. The fused system is denoted as "l-fea stack
DTW+slope+2w3w fusion" and was submitted as the second
late system.</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION</title>
      <p>First, we processed the data by the primary system
without regard to the T2 and T3 types of queries. The first general
late system, using the slope constraint, improved the score by
0.79% in min Cnxe on the eval data. The conclusion is that in such
noisy data, one or a few query frames may fit perfectly to one or a
few frames of an utterance, generating a path that is too steep or
too shallow and obviously not a matching hit.</p>
      <p>To improve the accuracy on T2 queries, we used an algorithm
where queries are split into parts and the search is performed
"per partes". The output scores of these subsystems were
not significantly better; however, they helped in fusion with the
best single system. We got an improvement of 0.6% in min Cnxe.
In more detail, T1 queries deteriorate by 0.15%, but
there is a slight improvement for T2 (0.71%) and T3
(0.94%) queries.</p>
      <p>The real-time factor is 0.009 for the primary system, 0.009
for the late system without fusion, and 0.023 for the late system
with fusion. The highest memory consumption (high-water mark) is
450 MB. The experiments were run on a hybrid cluster with Intel(R)
Xeon(R) X5670 CPUs @ 3GHz.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Igor</given-names> <surname>Szöke</surname></string-name>,
          <string-name><given-names>Luis J.</given-names> <surname>Rodriguez-Fuentes</surname></string-name>,
          Andi Buzo, Xavier Anguera, Florian Metze, Jorge Proenca,
          <string-name><given-names>Martin</given-names> <surname>Lojka</surname></string-name>, and
          <string-name><given-names>Xiao</given-names> <surname>Xiong</surname></string-name>.
          <article-title>Query by Example Search on Speech at MediaEval 2015</article-title>.
          In
          <source>Working Notes Proceedings of the MediaEval 2015 Workshop</source>,
          Wurzen, Germany, September 14-15,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Petr</given-names> <surname>Schwarz</surname></string-name>,
          Pavel Matejka, and
          <string-name><given-names>Jan</given-names> <surname>Cernocky</surname></string-name>.
          <article-title>Towards Lower Error Rates in Phoneme Recognition</article-title>.
          In
          <source>Proceedings of the 7th International Conference on Text, Speech and Dialogue</source>,
          page 8. Springer Verlag,
          <year>2004</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Frantisek</given-names> <surname>Grezl</surname></string-name>
          and
          <string-name><given-names>Martin</given-names> <surname>Karafiat</surname></string-name>.
          <article-title>Hierarchical Neural Net Architectures for Feature Extraction in ASR</article-title>.
          In
          <source>Proceedings of INTERSPEECH 2010</source>,
          pages
          <fpage>1201</fpage>-<lpage>1204</lpage>.
          International Speech Communication Association,
          <year>2010</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Luis J.</given-names> <surname>Rodriguez-Fuentes</surname></string-name>,
          Amparo Varona, Mikel Penagarikano, German Bordel, and Mireia Diez.
          <article-title>GTTS Systems for the SWS Task at MediaEval 2013</article-title>.
          In
          <source>Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop</source>,
          pages
          <fpage>1</fpage>-<lpage>2</lpage>,
          <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Meinard</given-names> <surname>Müller</surname></string-name>.
          <source>Information Retrieval for Music and Motion</source>.
          Springer-Verlag New York, Inc., Secaucus, NJ, USA,
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Igor</given-names> <surname>Szöke</surname></string-name>,
          Miroslav Skacel, Lukas Burget, and
          <string-name><given-names>Jan</given-names> <surname>Cernocky</surname></string-name>.
          <article-title>Coping with Channel Mismatch in Query-by-Example - BUT QUESST 2014</article-title>.
          In
          <source>Proceedings of ICASSP 2015</source>.
          IEEE Signal Processing Society,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Igor</given-names> <surname>Szöke</surname></string-name>,
          Lukas Burget, Frantisek Grezl, and Lucas Ondel.
          <article-title>Calibration and Fusion of Query-by-Example Systems - BUT SWS 2013</article-title>.
          In
          <source>Proceedings of ICASSP 2014</source>,
          pages
          <fpage>7899</fpage>-<lpage>7903</lpage>.
          IEEE Signal Processing Society,
          <year>2014</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>