<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Igor Szöke</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miroslav Skácel</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukáš Burget</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
<p>The primary system we submitted as the required run was composed of 11 subsystems: 3 based on Acoustic Keyword Spotting (AKWS) and 8 on Dynamic Time Warping (DTW). The AKWS subsystems used only phoneme posteriors as input, while the DTW subsystems used both phoneme posteriors and Bottle-Neck (BN) features. The underlying phoneme posterior estimators / bottle-neck feature extractors were both in-language (Czech) and out-of-language (the other 4 languages). We also performed experiments on the T1/T2/T3 types of query, and on system calibration and fusion based on binary logistic regression. Igor Szöke was supported by Grant Agency of the Czech Republic post-doctoral project No. GP202/12/P567.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>MOTIVATION</title>
      <p>
        In comparison to last year [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we decided to use a lower number of systems running in parallel. Our goal was to further investigate the sensitivity of particular approaches to the language / channel mismatch between the query and utterance data. Also, coping with different types of queries was challenging this year [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Similarly to last year, we used systems already available at BUT (so-called Atomic Systems). This led to several inconsistencies, for example in the feature extraction and in the sizes of the Artificial Neural Networks (ANNs).
      </p>
    </sec>
    <sec id="sec-2">
      <title>ATOMIC SYSTEMS</title>
      <p>
        All our subsystems use ANNs to estimate 1) per-frame phone-state probabilities (so-called posterior-grams) and 2) bottle-neck (BN) features. The subsystems based on DTW use the BN features for calculating distances between query and test segment frames. The subsystems based on AKWS use the phone-state posteriors as HMM output probabilities. We reuse ANNs which were trained for other projects as acoustic models for phone or LVCSR recognizers: 3× SpeechDat (Czech, Hungarian and Russian; monolingual LCRC systems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) for phone posterior-grams and 4× GlobalPhone (Czech, Portuguese, Russian, Spanish; monolingual stacked bottle-neck systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) for BN features.
      </p>
      <p>We prefer the SpeechDat posterior-grams to the GlobalPhone posterior-grams for AKWS, because the GlobalPhone posterior-grams have significantly lower accuracy. For the DTW approach, we prefer the GlobalPhone bottle-necks to the GlobalPhone posterior-grams, again because of a significant accuracy deterioration. We observed an even larger deterioration when the GlobalPhone ANNs were adapted to the SWS2013 data in an unsupervised manner (as we did last year, with a positive impact on accuracy). This holds for both posteriors and bottle-necks.</p>
    </sec>
    <sec id="sec-3">
      <title>ACOUSTIC KEYWORD SPOTTING</title>
      <p>
        The AKWS systems follow [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We build an HMM for each query. For each frame, the detection score is calculated as the log-likelihood ratio between 1) staying in a background HMM (a free phoneme loop) and 2) escaping from it through the query HMM. For standard keyword spotting tasks (in-language task and textual input), the query model is built using a pronunciation dictionary. In the SWS task, however, we need to generate the phoneme sequence for each of the query acoustic examples (the query-to-text step). This is achieved by decoding each example using the free phoneme loop. We removed all silence labels (if present).
      </p>
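      <p>A minimal sketch of this scoring scheme follows (illustrative only, not our actual decoder): the free phoneme loop is approximated by the best-scoring phone in every frame, the query HMM has one state per phoneme, and the per-frame detection score is the log-likelihood ratio of completing the query at that frame versus staying in the loop.</p>
      <preformat preformat-type="code">
import numpy as np

def akws_llr(log_post, query_phones):
    """log_post: (T, P) matrix of per-frame phone log-posteriors.
    query_phones: phone indices decoded from one query example.
    Returns a per-frame LLR detection score, shape (T,)."""
    T, P = log_post.shape
    S = len(query_phones)
    # Background model: free phoneme loop, approximated by the best phone
    # in every frame.
    bg = log_post.max(axis=1)
    cum_bg = np.concatenate([[0.0], np.cumsum(bg)])  # cum_bg[t] = frames 0..t-1
    delta = np.full(S, -np.inf)  # best log-score of sitting in query state s
    scores = np.empty(T)
    for t in range(T):
        emit = log_post[t, query_phones]
        stay = delta + emit                                       # s -> s
        enter = np.concatenate([[cum_bg[t]], delta[:-1]]) + emit  # loop -> 0, s-1 -> s
        delta = np.maximum(stay, enter)
        # LLR: "background, then the whole query ending at t" vs. "background only"
        scores[t] = delta[-1] - cum_bg[t + 1]
    return scores
      </preformat>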
    </sec>
    <sec id="sec-4">
      <title>DYNAMIC TIME WARPING</title>
      <p>In our implementation, we follow the standard query-by-example recipe: sub-sequence DTW. A single DTW is run for each combination of query and test segment, where the query is allowed to start at any frame of the test segment. When selecting the locally optimal path, the standard DTW algorithm chooses the transition from the smallest accumulated distance. In our implementation, we instead compare the accumulated distances (including the current local distance) normalized by the corresponding path lengths on-the-fly. This avoids a preference for shorter paths. As the distance metric, we used the Pearson product-moment correlation distance.</p>
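      <p>A compact sketch of this sub-sequence DTW (a simplification assuming (1,0), (0,1) and (1,1) steps): the query may start at any test frame, transitions are selected by comparing the path-length-normalized accumulated distances including the current local distance, and the local distance is the Pearson correlation distance.</p>
      <preformat preformat-type="code">
import numpy as np

def pearson_distance(a, b):
    """1 - Pearson correlation between every frame of a (M, D) and b (N, D)."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def subsequence_dtw(query, test):
    """Best length-normalized distance of query (M, D) against any
    sub-sequence of test (N, D)."""
    dist = pearson_distance(query, test)
    M, N = dist.shape
    acc = np.full((M, N), np.inf)   # accumulated distance
    length = np.zeros((M, N), int)  # path length, for normalization
    acc[0] = dist[0]                # the query may start at any test frame
    length[0] = 1
    for i in range(1, M):
        for j in range(N):
            cands = [(i - 1, j)]    # vertical predecessor
            if j > 0:
                cands += [(i - 1, j - 1), (i, j - 1)]  # diagonal, horizontal
            # choose by normalized accumulated distance incl. current cell
            best = min(cands, key=lambda p: (acc[p] + dist[i, j]) / (length[p] + 1))
            acc[i, j] = acc[best] + dist[i, j]
            length[i, j] = length[best] + 1
    return np.min(acc[M - 1] / length[M - 1])
      </preformat>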
      <p>
        We applied Speech Activity Detection (SAD) to drop the silence frames in queries (see our last year's work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). We also tried to apply the SAD to utterances, but obtained only a tiny improvement, so SAD was not used there. The Hungarian SpeechDat phoneme recognizer was used as the SAD.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Fusion</title>
      <p>
        We were inspired by GTTS [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (concatenation of feature
vectors going into DTW) and CUHK [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (averaging of distance matrices). We found both methods comparable, so we followed the feature vector concatenation approach: we concatenated the Czech, Portuguese, Russian, and Spanish GlobalPhone BN features and made a new Atomic DTW system, the eighth system.
      </p>
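      <p>Both options can be sketched as follows (the variable names are illustrative); we used the first one.</p>
      <preformat preformat-type="code">
import numpy as np

def concat_features(feature_sets):
    """feature_sets: list of (T, D_k) BN feature matrices for the same audio,
    one per language. Returns a (T, sum of D_k) matrix for a single DTW pass."""
    return np.concatenate(feature_sets, axis=1)

def average_distance_matrices(dist_mats):
    """dist_mats: list of (M, N) query-vs-test distance matrices, one per
    language. Returns their element-wise mean (the CUHK-style alternative)."""
    return np.mean(dist_mats, axis=0)

# e.g. one DTW over the concatenated Czech+Portuguese+Russian+Spanish BNs:
# fused = concat_features([bn_cz, bn_pt, bn_ru, bn_es])
      </preformat>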
    </sec>
    <sec id="sec-6">
      <title>SCORE POST-PROCESSING</title>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Results on the eval queries: minCnxe, overall (T1/T2/T3), with the side-info used in calibration.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Approach</th><th>side-info</th><th>eval minCnxe</th></tr>
          </thead>
          <tbody>
            <tr><td>p-bigfusion</td><td>QU</td><td>0.465 (0.310/0.461/0.673)</td></tr>
            <tr><td>g-bigfusionnoside</td><td/><td>0.464 (0.323/0.470/0.660)</td></tr>
            <tr><td>g-best single</td><td>QU LID</td><td>0.528 (0.374/0.546/0.714)</td></tr>
            <tr><td>g-LID</td><td>QU LID</td><td>0.926 (0.897/0.946/0.920)</td></tr>
            <tr><td>AKWS-cz</td><td>QU LID</td><td>0.648 (0.519/0.645/0.848)</td></tr>
            <tr><td>AKWS-T3-cz</td><td>QU LID</td><td>0.674 (0.597/0.694/0.756)</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        For both DTW and AKWS systems, the local maxima
of frame-by-frame detection scores are selected as candidate
detections. For overlapping detections, only the best
scoring ones are preserved. We applied m-norm (developed in
SWS2013 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) to normalize (calibrate) the scores for each query, to allow a single common threshold minimizing the Cnxe metric.
      </p>
      <p>As the task this year was document retrieval rather than keyword spotting, only one score per query–utterance pair, without timing, was requested. We therefore find and return the best score from the set of detections of a query in an utterance.</p>
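      <p>The post-processing chain can be sketched as follows; the per-query normalization shown is a generic mean/variance shift standing in for m-norm, whose exact definition is given in [7].</p>
      <preformat preformat-type="code">
import numpy as np

def local_maxima(scores):
    """Indices t where the frame-level detection score has a local maximum."""
    s = np.asarray(scores)
    return [t for t in range(1, len(s) - 1)
            if s[t] > s[t - 1] and s[t] >= s[t + 1]]

def best_pair_score(detections):
    """detections: (frame, score) candidates of one query in one utterance.
    Overlap pruning is subsumed here: only the single best score is kept,
    since the task requests one score per query-utterance pair."""
    return max(score for _, score in detections)

def mnorm_like(pair_scores):
    """Per-query shift/scale of one query's scores across all utterances
    (a stand-in for m-norm [7]), so one common threshold fits all queries."""
    v = np.array(list(pair_scores.values()))
    mu, sigma = v.mean(), v.std() + 1e-8
    return {utt: (s - mu) / sigma for utt, s in pair_scores.items()}
      </preformat>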
    </sec>
    <sec id="sec-7">
      <title>CALIBRATION</title>
      <p>The post-processed scores were calibrated to respect the
Cnxe scoring metric using binary logistic regression.</p>
      <p>We attached side-info to each score (query–utterance pair). The side-info consists of: the number of phonemes, the log of the number of phonemes, the number of speech frames, the log of the number of speech frames, the average log-posterior of speech frames taken from the SAD, and optionally the LID i-vector score. The side-info was generated for both queries and utterances, so the final “feature vector” for calibration consists of 1 detection score (per query–utterance pair), 5 query side-info values, and 5 utterance side-info values. The parameters (11 linear scales and 1 additive constant) were trained on the development set. We denote these 10 side-info values as QU.</p>
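      <p>A sketch of this calibration step, assuming scikit-learn: the 11-dimensional vector per trial (1 detection score, 5 query and 5 utterance side-info values) is mapped to a calibrated log-odds score by binary logistic regression.</p>
      <preformat preformat-type="code">
import numpy as np
from sklearn.linear_model import LogisticRegression

def side_info(n_phones, speech_frame_logposts):
    """The 5 side-info values described above, for one query or utterance."""
    n_frames = len(speech_frame_logposts)
    return [n_phones, np.log(n_phones),
            n_frames, np.log(n_frames),
            np.mean(speech_frame_logposts)]  # average log-posterior from SAD

def calibrate(scores, q_side, u_side, labels):
    """scores: (N,); q_side, u_side: (N, 5); labels: (N,) hit/miss on the
    development set. Returns calibrated log-odds scores for the same trials."""
    X = np.column_stack([scores, q_side, u_side])  # (N, 11)
    lr = LogisticRegression(C=1e6)                 # (almost) unregularized
    lr.fit(X, labels)                              # 11 scales + 1 constant
    return X @ lr.coef_.ravel() + lr.intercept_[0]
      </preformat>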
      <p>
        The language identification (LID) system is a state-of-the-art system based on i-vectors [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As acoustic features, we used Shifted Delta Cepstra. A Gaussian mixture model with 2048 Gaussians serves as the Universal Background Model for a 600-dimensional, gender-independent i-vector extractor. Our goal here was to calculate the distance (Pearson product-moment correlation distance) between the query and utterance i-vectors. This distance should provide a similarity measure at the level of language (as, for example, Czech queries do not occur in Basque utterances) and was also used as side-info (denoted as LID).
      </p>
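      <p>The LID side-info value then reduces to the Pearson correlation distance between the two i-vectors:</p>
      <preformat preformat-type="code">
import numpy as np

def lid_side_info(q_ivec, u_ivec):
    """Pearson product-moment correlation distance between the query and
    utterance i-vectors (600-dimensional here)."""
    q = q_ivec - q_ivec.mean()
    u = u_ivec - u_ivec.mean()
    return 1.0 - float(q @ u / (np.linalg.norm(q) * np.linalg.norm(u)))
      </preformat>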
    </sec>
    <sec id="sec-8">
      <title>FUSION</title>
      <p>Finally, we applied fusion at the level of calibrated systems, again using binary logistic regression. We took all 11 systems (3 AKWS, 7 DTW, 1 fused DTW) and found the best linear combination of them.</p>
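      <p>A sketch of the fusion step, again assuming scikit-learn; the learned weights give the best linear combination of the 11 calibrated systems.</p>
      <preformat preformat-type="code">
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_systems(dev_scores, dev_labels, eval_scores):
    """dev_scores, eval_scores: (N, 11) calibrated scores of the 11 systems
    per trial. The combination is trained on the development trials and
    applied to the evaluation trials."""
    lr = LogisticRegression(C=1e6).fit(dev_scores, dev_labels)
    return lr.decision_function(eval_scores)
      </preformat>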
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION</title>
      <p>We tried special handling of the T2 and T3 queries to improve the accuracy of our system. However, we concluded that a slight improvement in the accuracy on T2 / T3 queries largely degrades the accuracy on T1 queries, which leads to an overall score degradation. Our conclusion is that it does not make sense to cover T2 queries by a special approach (search algorithm), as these queries are covered well enough by the “softness” of the standard DTW algorithm. We found a tiny improvement of 0.4% on T2 while getting an overall 1% Cnxe deterioration. This T2 improvement was observed with the AKWS approach when we allowed the last phoneme of the query to be any phoneme.</p>
      <p>To improve the accuracy on T3 queries, we split queries longer than 7 phonemes in the middle. Then, we searched for the two resulting sub-queries independently. Finally, we merged the sub-query results, forbidding sub-query overlaps longer than 10 frames. The results of this experiment are in Table 1. System AKWS-cz is the reference system, where we search for T3 queries in the same way as for T1. We implemented the above-mentioned split into sub-queries in system AKWS-T3-cz. We got a 9% improvement on T3, but the overall deterioration is 2.4% of Cnxe on the eval queries.</p>
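      <p>The splitting and merging logic can be sketched as follows (the rule for combining the two sub-query scores is our illustrative assumption):</p>
      <preformat preformat-type="code">
def split_query(phones, max_len=7):
    """Split a query's phone sequence in the middle if it has more than
    max_len phonemes; otherwise keep it whole."""
    if len(phones) > max_len:
        mid = len(phones) // 2
        return [phones[:mid], phones[mid:]]
    return [phones]

def merge_subquery_detections(dets_a, dets_b, max_overlap=10):
    """dets_*: (start_frame, end_frame, score) detections of the two halves.
    Pairs overlapping by more than max_overlap frames are forbidden; the
    merged score is the sum of the sub-query scores (an assumption)."""
    merged = []
    for sa, ea, ca in dets_a:
        for sb, eb, cb in dets_b:
            overlap = min(ea, eb) - max(sa, sb)
            if overlap > max_overlap:
                continue
            merged.append((min(sa, sb), max(ea, eb), ca + cb))
    return merged
      </preformat>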
      <p>We built a QbE system making use of phoneme posteriors and bottle-necks as input features. We found DTW superior to AKWS even in the cross-channel environment (this year's data set). Our conclusion on the different types of query is that it does not make sense to aim at T2 queries, due to the tiny 0.4% improvement on T2 against the significant 1% deterioration of the overall score. The same holds for T3, where the improvement is significant (9%) but the overall deterioration is 2.4%. The T3 queries need more investigation to overcome the overall deterioration.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          et al.
          <article-title>Query by Example search on speech at MediaEval 2014</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Spain, October 16-17,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brummer</surname>
          </string-name>
          et al.
          <article-title>Description and analysis of the Brno276 system for LRE2011</article-title>
          .
          <source>In Proceedings of Odyssey 2012: The Speaker and Language Recognition Workshop</source>
          , pages
          <fpage>216</fpage>
          -
          <lpage>223</lpage>
          . International Speech Communication Association,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Grézl</surname>
          </string-name>
          et al.
          <article-title>Hierarchical neural net architectures for feature extraction in ASR</article-title>
          .
          <source>In Proceedings of INTERSPEECH 2010</source>
          , volume
          <volume>2010</volume>
          , pages
          <fpage>1201</fpage>
          -
          <lpage>1204</lpage>
          . International Speech Communication Association,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          et al.
          <article-title>GTTS systems for the SWS task at MediaEval 2013</article-title>
          .
          <source>In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop</source>
          , volume
          <volume>2013</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          et al.
          <article-title>Towards lower error rates in phoneme recognition</article-title>
          .
          <source>In Proceedings of the 7th International Conference on Text, Speech and Dialogue</source>
          <year>2004</year>
          , page 8. Springer Verlag,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          et al.
          <article-title>Phoneme based acoustics keyword spotting in informal continuous speech</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          ,
          <year>2005</year>
          (
          <volume>3658</volume>
          ):
          <fpage>8</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          et al.
          <article-title>Calibration and fusion of Query-by-Example systems - BUT SWS 2013</article-title>
          .
          <source>In Proceedings of ICASSP 2014</source>
          , pages
          <fpage>7899</fpage>
          -
          <lpage>7903</lpage>
          .
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>CUHK system for the Spoken Web Search task at MediaEval 2012</article-title>
          .
          <source>In Proceedings of the MediaEval 2012 Multimedia Benchmark Workshop</source>
          , volume
          <volume>2012</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>