<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Igor Szöke</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukáš Burget</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>František Grézl</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucas Ondel</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>We submitted a system composed of 26 subsystems as the required run: 13 subsystems are based on Acoustic Keyword Spotting (AKWS) and 13 on Dynamic Time Warping (DTW). All of them use three-state phoneme posteriors as input. The underlying phoneme posterior estimators were both in-language (Czech, English) and out-of-language (12 other languages). We also performed unsupervised adaptation of the artificial neural network (ANN) on the target data and fusion based on binary logistic regression. Igor Szöke was supported by Grant Agency of the Czech Republic post-doctoral project No. GP202/12/P567.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>MOTIVATION</title>
      <p>Our motivation was to use many (mostly)
out-of-target-language systems that can be combined by fusion at the
detection level. The goal was to re-use as many trained
systems available at BUT as possible. Please bear in mind
that reusing all these systems (so-called Atomic Systems)
leads to several inconsistencies among them, for example in
feature extraction and in the sizes of the ANNs.</p>
      <p>We performed unsupervised adaptation of the ANN on the
target data (utterances). Our goal was also to test the
combination of two approaches to the query-by-example task:
Acoustic Keyword Spotting (AKWS) and Dynamic Time Warping
(DTW). We also explored system calibration with respect to
the TWV metric.</p>
    </sec>
    <sec id="sec-2">
      <title>ATOMIC SYSTEMS</title>
      <p>
        All our subsystems use an ANN to estimate per-frame
phone-state probabilities (so-called posteriorgrams). The
subsystems based on DTW use the posteriorgrams as features
for calculating distances between query and test segment
frames. The subsystems based on AKWS use the
phone-state posteriors as HMM output probabilities. We re-use
ANNs that were trained for different projects as
acoustic models for phone or LVCSR recognizers. One DTW
and one AKWS system were built for each of the 13 ANNs
trained on the following datasets: 3× Speechdat (Czech,
Hungarian and Russian; monolingual LCRC systems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]),
1× BABEL (Cantonese, Pashto, Tagalog, Turkish;
multilingual stacked-bottleneck system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), 1× SWS2012
(MediaEval SWS2012 development data; multilingual
stacked-bottleneck system [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), and 8× GlobalPhone (Czech, English,
German, Portuguese, Russian, Spanish, Turkish, Vietnamese;
monolingual stacked-bottleneck systems [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ]).
      </p>
    </sec>
    <sec id="sec-3">
      <title>ACOUSTIC KEYWORD SPOTTING</title>
      <p>
        The Acoustic Keyword Spotting (AKWS) systems follow
our paper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We build an HMM for each query. For
each frame, the detection score is calculated as the
loglikelihood ratio between 1) staying in a background HMM
(free phoneme loop) and 2) escaping from it through the
query HMM.
      </p>
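      <p>As a rough illustration, the per-frame score can be sketched as follows. This is a hypothetical simplification, not our decoder: the background free phoneme loop is approximated by the best phone-state posterior in each frame, and the query HMM is reduced to a plain left-to-right chain of its phone states with self-loops.</p>
      <preformat>
```python
import numpy as np

def akws_llr(posteriors, query_states):
    """Sketch of a per-frame AKWS detection score (simplified).

    posteriors: (T, S) matrix of per-frame phone-state posteriors.
    query_states: ordered phone-state indices forming the query model.
    """
    logp = np.log(posteriors + 1e-10)
    T = posteriors.shape[0]
    Q = len(query_states)
    NEG = -1e30
    # Viterbi over a left-to-right query chain with self-loops.
    delta = np.full((T, Q), NEG)
    delta[0, 0] = logp[0, query_states[0]]
    for t in range(1, T):
        for j in range(Q):
            stay = delta[t - 1, j]
            move = delta[t - 1, j - 1] if j > 0 else NEG
            delta[t, j] = max(stay, move) + logp[t, query_states[j]]
    # Background loop approximated by the best state in each frame.
    background = np.cumsum(logp.max(axis=1))
    # Log-likelihood ratio of ending the query at frame t
    # versus having stayed in the background model.
    return delta[:, -1] - background
```
      </preformat>
      <p>A candidate detection is a frame where this ratio peaks (see Section 5 on post-processing).</p>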
      <p>For standard keyword spotting tasks (in-language task
and textual input), the query model is built using a
pronunciation dictionary. In the SWS task, however, we need to
generate the phoneme sequence for each of the query
acoustic examples, the query-to-text step. This is achieved by
decoding each example using a free phoneme loop. We cut off
initial and final silence labels (if present) and omit queries
having fewer than three non-silence phones, as such short
queries could generate huge amounts of false alarms. We
experimented with the phoneme insertion penalty in the
query-to-text step, concluded that it has no significant
impact, and set it to −1 consistently.</p>
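      <p>The silence stripping and minimum-length check can be sketched as follows (a hypothetical helper; the silence label names are assumptions):</p>
      <preformat>
```python
def select_query_phones(decoded, sil_labels=frozenset({"sil", "sp"})):
    """Strip leading/trailing silence labels from a decoded query and
    reject queries with fewer than three non-silence phones,
    as these would produce too many false alarms."""
    phones = list(decoded)
    while phones and phones[0] in sil_labels:
        phones.pop(0)
    while phones and phones[-1] in sil_labels:
        phones.pop()
    nonsil = [p for p in phones if p not in sil_labels]
    return phones if len(nonsil) >= 3 else None
```
      </preformat>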
    </sec>
    <sec id="sec-4">
      <title>DYNAMIC TIME WARPING</title>
      <p>In our implementation, we follow the standard
query-by-example recipe, subsequence DTW. A single DTW is run
for each combination of query and test segment, where the
query is allowed to start at any frame of the test segment.
When selecting the locally optimal path in the standard
DTW algorithm, the transition from the smallest accumulated
distance is chosen. In our implementation, we instead compare the
accumulated distances (including the current local distance)
normalized by the corresponding path lengths on-the-fly.
This avoids a preference for shorter paths. As the
distance metric, we used the usual negative logarithm of the
dot product of the phone-state posterior vectors.</p>
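      <p>The length-normalized transition rule can be sketched as follows. This is a minimal illustration of the idea, not our implementation: the local distance is the negative logarithm of the dot product of posterior vectors, and the query may start at any test frame.</p>
      <preformat>
```python
import numpy as np

def subsequence_dtw(query, test):
    """Sketch of subsequence DTW over posteriorgrams.

    query, test: (Q, S) and (T, S) matrices of phone-state posteriors.
    Returns the best length-normalized distance and its end frame.
    """
    Q, T = len(query), len(test)
    # Local distance: -log of the posterior dot product.
    dist = -np.log(np.clip(query @ test.T, 1e-10, None))
    acc = np.zeros((Q, T))   # accumulated distance along the best path
    ln = np.zeros((Q, T))    # length of that path
    acc[0] = dist[0]         # the query may start at any test frame
    ln[0] = 1
    for i in range(1, Q):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        ln[i, 0] = ln[i - 1, 0] + 1
        for j in range(1, T):
            # Candidate predecessors: diagonal, vertical, horizontal.
            cands = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            # Compare accumulated distances (including the current local
            # distance) normalized by path length, not raw accumulations.
            best = min(cands, key=lambda c: (acc[c] + dist[i, j]) / (ln[c] + 1))
            acc[i, j] = acc[best] + dist[i, j]
            ln[i, j] = ln[best] + 1
    norm = acc[-1] / ln[-1]
    end = int(norm.argmin())
    return float(norm[end]), end
```
      </preformat>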
      <p>In our late submission, we further improved the DTW
systems by applying VAD to cut off the initial and the final
silence from the query examples. As can be seen in Table 1,
it improved the overall system by 10% relative.</p>
    </sec>
    <sec id="sec-5">
      <title>DETECTION SCORE POST-PROCESSING</title>
      <p>For both DTW and AKWS systems, the local maxima
of the frame-by-frame detection scores are selected as candidate
detections. For overlapping detections, only the best scoring
ones are preserved. There might be significant differences
between the score distributions corresponding to different
queries, and it is important to normalize (calibrate) the
scores for each query to allow for a single common threshold
maximizing the TWV metric. We adopted a new
normalization approach, m-norm, which is motivated by the
observation that the score distributions have very long tails towards
the small scores, and these tails significantly differ in shape from
query to query. In m-norm, the score corresponding to
the mode (maximum) of the score histogram (denoted SM) is
found for each query and subtracted from the original scores,
so all modes are aligned to 0. The scores are further
divided by the standard deviation estimated only on scores larger
than SM, to unify the terms’ variances.</p>
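      <p>A minimal sketch of the per-query m-norm described above (the histogram bin count is an assumption):</p>
      <preformat>
```python
import numpy as np

def m_norm(scores, bins=50):
    """Per-query m-norm: subtract the mode of the score histogram (SM)
    and divide by the standard deviation of scores above SM only."""
    scores = np.asarray(scores, dtype=float)
    hist, edges = np.histogram(scores, bins=bins)
    k = int(hist.argmax())
    sm = 0.5 * (edges[k] + edges[k + 1])   # mode of the histogram
    upper = scores[scores > sm]            # ignore the long lower tail
    std = upper.std() if len(upper) > 1 else 1.0
    return (scores - sm) / (std if std > 0 else 1.0)
```
      </preformat>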
    </sec>
    <sec id="sec-6">
      <title>FUSION</title>
      <p>
        Normalized scores from the individual subsystems were
fused similarly to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The scores from the different subsystems
are first aligned in time and then linearly combined. The
fusion weights (and the default score for a subsystem with
no detection at the given time) are trained to minimize a
cross-entropy (binary logistic regression) objective.
      </p>
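      <p>The fusion training can be sketched as plain logistic regression on time-aligned scores. This is a simplified illustration: here the default score for a missing detection is a fixed constant, whereas in our system it is trained together with the weights.</p>
      <preformat>
```python
import numpy as np

def train_fusion_weights(scores, labels, default=-1.0, lr=0.1, iters=500):
    """Sketch: linear fusion trained with a binary cross-entropy objective.

    scores: (N, K) matrix of time-aligned subsystem scores; np.nan marks
    a subsystem with no detection and is replaced by a default score.
    labels: (N,) vector of 0/1 hit labels for the aligned detections.
    """
    X = np.where(np.isnan(scores), default, scores)
    X = np.hstack([X, np.ones((len(X), 1))])   # append a bias term
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))     # sigmoid of the fused score
        w -= lr * X.T @ (p - labels) / len(X)  # cross-entropy gradient step
    return w
```
      </preformat>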
    </sec>
    <sec id="sec-7">
      <title>RESULTS</title>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Evaluation results.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Approach</th>
              <th>eval MTWV</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>AKWSDTW-vad (late)</td><td>0.3776 (0.4835)</td></tr>
            <tr><td>DTW-vad (notsub)</td><td>0.3557 (0.4585)</td></tr>
            <tr><td>AKWS</td><td>0.3041 (0.4165)</td></tr>
            <tr><td>AKWSDTW</td><td>0.2969 (0.4081)</td></tr>
            <tr><td>AKWSDTW-treefus</td><td>0.2787 (0.3934)</td></tr>
            <tr><td>AKWSDTW-notarlang</td><td>0.2562 (0.3685)</td></tr>
            <tr><td>AKWS-notarlang</td><td>0.2778 (0.3840)</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-8">
      <title>LESSONS LEARNED</title>
    </sec>
    <sec id="sec-9">
      <title>NN adaptation</title>
      <p>
        We experimented with three types of NN adaptation
using the BABEL system NN. This network was initially trained
with 4 independent softmax non-linearities in the output
layer, one softmax per language (Cantonese, Pashto,
Tagalog, Turkish). The original network had 1065 phoneme-state
outputs (355 phonemes over the 4 languages). We decoded the
SWS-dev data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] using a free phoneme loop phone recognizer
based on this network and found that 37 of the 355
phonemes were never activated. We also filtered out 95 phonemes
with less than 10 seconds of occurrences in total. We ended up
with 220 “active” phonemes, denoted as orig. Next, we used
this orig network to label the SWS-dev data again. Using
this labeling, we 1) adapted (re-trained) the original NN on the
SWS-dev data (denoted as adapt) and 2) completely
retrained the NN from scratch on the SWS-dev data (denoted as rtfs).
In the stacked bottleneck NN hierarchy, only the merger was
adapted in the adapt case.
      </p>
      <p>In terms of MTWV (UBTWV), our results on SWS-dev
with the BABEL AKWS subsystem are as follows: orig 0.0443
(0.1154), adapt 0.0569 (0.1355), and rtfs 0.0769 (0.1630).</p>
    </sec>
    <sec id="sec-10">
      <title>Calibration</title>
      <p>
        As the TWV metric was set to drastically penalize false
alarms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], proper calibration and a good choice of the global
threshold were very important this year. We experimented
with two approaches to score normalization. First, we tried
z-norm, which worked well in last year's SWS
evaluations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
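      <p>Per-query z-norm is simply mean and variance normalization of the detection scores, sketched as:</p>
      <preformat>
```python
import numpy as np

def z_norm(scores):
    """Per-query z-norm: zero mean, unit variance."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + 1e-10)
```
      </preformat>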
      <p>
        Next, we tried to calibrate the scores using binary
logistic regression, where the input to the logistic regression
was a vector of z-normed scores augmented with different
per-term side-information scores [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], denoted as
z-norm-sideinfo. The best tested side information, which
significantly improved MTWV, was the logarithm of the number
of detections of a particular term. This indicates that z-norm
alone is not sufficient to properly normalize the score
distributions over different queries.
      </p>
      <p>Finally, we tested m-norm (see Section 5), which we found
to be superior to z-norm-sideinfo. Furthermore, additional
side-information-based calibration on top of it brought no
further MTWV improvements.</p>
      <p>In terms of MTWV (UBTWV), our results are: raw 0.000
(0.1012), z-norm 0.0330 (0.1434), z-norm-sideinfo 0.0603
(0.1436), and m-norm 0.0769 (0.1611).</p>
    </sec>
    <sec id="sec-11">
      <title>CONCLUSION</title>
      <p>We successfully built a QbE system making use of a large
number of already trained phoneme posterior estimators, and we
applied unsupervised adaptation of the ANNs. DTW with VAD
applied to the queries (VAD is really important) is still superior
to the AKWS approach. On the other hand, AKWS is an
order of magnitude faster than DTW. Adaptation of the ANN is
also important, so it makes sense to take as many "black-box"
phoneme posterior estimators as possible, label the
target data, and train a new ANN. Finally, we found that m-norm
calibration is promising in the area of high FA rates and
non-posterior scores.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Akbacak</surname>
          </string-name>
          et al.
          <article-title>Rich system combination for keyword spotting in noisy and acoustically heterogenous audio streams</article-title>
          .
          <source>In Proceedings of ICASSP 2013</source>
          , pages
          <fpage>8267</fpage>
          -
          <lpage>8271</lpage>
          .
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          et al.
          <article-title>The spoken web search task</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Grézl</surname>
          </string-name>
          et al.
          <article-title>Hierarchical neural net architectures for feature extraction in ASR</article-title>
          .
          <source>In Proceedings of INTERSPEECH 2010</source>
          , pages
          <fpage>1201</fpage>
          -
          <lpage>1204</lpage>
          . International Speech Communication Association,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Karafiát</surname>
          </string-name>
          et al.
          <article-title>BUT BABEL system for spontaneous Cantonese</article-title>
          .
          <source>In Proceedings of Interspeech 2013</source>
          , pages
          <fpage>2589</fpage>
          -
          <lpage>2593</lpage>
          . International Speech Communication Association,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          et al.
          <article-title>Towards lower error rates in phoneme recognition</article-title>
          .
          <source>In Proceedings of the 7th International Conference on Text, Speech and Dialogue</source>
          , page 8. Springer Verlag,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          et al.
          <article-title>Phoneme based acoustics keyword spotting in informal continuous speech</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          ,
          <volume>3658</volume>
          :
          <fpage>8</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          et al.
          <article-title>BUT2012 approaches for spoken web search - MediaEval 2012</article-title>
          . In MediaEval 2012 Workshop, Pisa, Italy, October 4-5,
          <year>2012</year>
          .
          <source>CEUR Workshop Proceedings</source>
          , Vol. 927.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Veselý</surname>
          </string-name>
          et al.
          <article-title>The language-independent bottleneck features</article-title>
          .
          <source>In Proceedings of IEEE 2012 Workshop on Spoken Language Technology</source>
          , pages
          <fpage>336</fpage>
          -
          <lpage>341</lpage>
          .
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          et al.
          <article-title>CUHK system for the spoken web search task at MediaEval 2012</article-title>
          . In MediaEval 2012 Workshop, Pisa, Italy, October 4-5,
          <year>2012</year>
          .
          <source>CEUR Workshop Proceedings</source>
          , Vol. 927.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>