<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jozef Vavrek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Viszlay</string-name>
          <email>Peter.Viszlay@tuke.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Lojka</string-name>
          <email>Martin.Lojka@tuke.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matúš Pleva</string-name>
          <email>Matus.Pleva@tuke.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jozef Juhár</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory of Speech Technologies in Telecommunications @ Technical University of Košice Park Komenského 13</institution>
          ,
          <addr-line>041 20 Košice</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper presents two approaches to a Query-by-Example (QbE) retrieval system, proposed by the Technical University of Košice (TUKE) for the Query by Example Search on Speech Task (QUESST). Our main interest was in building a QbE system able to retrieve all given queries both with and without the use of external speech resources. We therefore developed a posteriorgram-based keyword matching system that utilizes a novel weighted fast sequential variant of DTW (WFS-DTW) to detect occurrences of each query within a particular utterance file, using two GMM-based approaches to acoustic unit modeling. The first, referred to as the low-resource approach, employs language-dependent phonetic decoders to convert queries and utterances into posteriorgrams. The second, defined as the zero-resource approach, implements a combination of unsupervised segmentation and clustering techniques using only the provided utterance files.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>MOTIVATION</title>
      <p>The motivation for developing our system was to assess
the ability of the proposed WFS-DTW algorithm to detect
various spoken query terms by implementing low-resource and
zero-resource posteriorgram-based matching approaches.</p>
    </sec>
    <sec id="sec-2">
      <title>WFS-DTW SEARCHING ALGORITHM</title>
      <p>
        The searching algorithm for the QUESST task follows the one
used in our paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The proposed solution is a modification of the
segmental DTW algorithm we applied in the Spoken Web Search
task last year [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. There are three main contributions of this
algorithm: 1) a one-step-forward moving strategy, in which each
DTW search is carried out sequentially, block by block, with
block size equal to the length of the query; 2) a linear time-aligned
accumulated distance for speeding up the sequential DTW without
considerable loss in retrieval performance; 3) optimization of
the global minimum over the set of alignment paths by
implementing a weighted cumulative distance (WCD) parameter.
      </p>
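      <p>To make contribution 1) concrete, the following is a minimal sketch of a block-wise sequential DTW search over posteriorgrams. It is an illustration under our own assumptions (a negative-log dot-product local distance, length-normalized path cost, and a one-frame step), not the authors' WFS-DTW implementation, which additionally applies the linear time-aligned accumulated distance and the WCD optimization.</p>
      <preformat>
import numpy as np

def dtw_block(query, block):
    """Plain DTW between a query posteriorgram and one utterance block.
    Local distance: negative log of the frame dot product (assumed)."""
    n, m = len(query), len(block)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = -np.log(np.dot(query[i - 1], block[j - 1]) + 1e-10)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalized path cost

def sequential_search(query, utterance, step=1):
    """One-step-forward strategy: slide a query-sized block over the
    utterance posteriorgram and score every position by DTW."""
    q = len(query)
    scores = [dtw_block(query, utterance[s:s + q])
              for s in range(0, len(utterance) - q + 1, step)]
    return np.asarray(scores)  # local minima mark candidate detections
      </preformat>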
    </sec>
    <sec id="sec-3">
      <title>LOW-RESOURCE APPROACH</title>
      <p>
        The low-resource approach includes four language-dependent
subsystems, each represented by a GMM-based acoustic model.
The acoustic models were trained beforehand using four
databases: two SpeechDat databases (Slovak, 66 h and Czech, 89 h) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
Slovak ParDat1 (40 h) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and English TIMIT (10 h) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>The well-trained models were intended to generate
time-aligned and labelled segments for each utterance through
Viterbi decoding. The phonetic decoder employed a
phone-level vocabulary and a phone network. We found that the
phoneme insertion log probability p in the Viterbi segmentation
has a significant impact on the time alignment. Since the best
results were obtained with p = 0, we used this value in the
whole setup. The time alignments were used to train a new
GMM-based acoustic model using the development data. This
means that each language-dependent model was replaced by
its refined version, which was finally used to generate the
posteriorgrams for utterances and queries.</p>
      <p>Note that we used 39-dimensional MFCC (Mel-Frequency
Cepstral Coefficients) features for the Viterbi segmentation and
GMM training. In the low-resource approach we did not need
any voice activity detector (VAD) because the silent parts of
the audio stream were identified during the Viterbi segmentation.</p>
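      <p>For illustration, a posteriorgram can be generated from a set of per-unit GMMs roughly as follows. This is a sketch assuming uniform unit priors and using scikit-learn's GaussianMixture as a stand-in for the paper's acoustic models; it is not the authors' code.</p>
      <preformat>
import numpy as np
from sklearn.mixture import GaussianMixture  # stand-in GMM implementation

def posteriorgram(mfcc, unit_gmms):
    """mfcc: (T, 39) MFCC matrix; unit_gmms: one fitted GaussianMixture
    per acoustic unit. Returns a (T, U) matrix of unit posteriors."""
    # Per-frame log-likelihood of every acoustic unit.
    loglik = np.stack([g.score_samples(mfcc) for g in unit_gmms], axis=1)
    loglik -= loglik.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(loglik)
    return post / post.sum(axis=1, keepdims=True)  # normalize per frame
      </preformat>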
    </sec>
    <sec id="sec-4">
      <title>ZERO-RESOURCE APPROACH</title>
      <p>In keeping with the zero-resource approach, we did not
assume any prior knowledge of the acoustic units or the
pronunciation lexicon. In order to train the acoustic models, it
was first necessary to identify the acoustic speech units in
the audio data automatically. In this work, we utilized four
different zero-resource approaches to address this problem.</p>
      <p>
        Type 1: This approach uses a PCA-based VAD to discriminate
the voice-active segments from the silent ones [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The
initial feature selection, based on simple PCA (principal
component analysis) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], is carried out after extracting the first 13
MFCCs. Only those speech-active feature vectors are
selected whose variance achieves values greater than 90% along
the first principal component. Then, K-means clustering
with K = 75 clusters and a correlation distance metric is
computed on the reduced data. The clustering starts by
selecting K points uniformly. Finally, speech segmentation
is performed by computing the squared Euclidean distance
between the feature vectors and the K mean vectors, where the
label of the mean vector with minimum distance is assigned
in collaboration with the VAD.
      </p>
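      <p>A rough sketch of this Type 1 pipeline is given below. It reflects our reading of the steps and simplifies two details: scikit-learn's KMeans clusters with Euclidean distance rather than the correlation metric named above, and the 90% variance criterion is approximated by a percentile threshold on the first principal component.</p>
      <preformat>
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def type1_segmentation(mfcc13, k=75, quantile=90):
    """mfcc13: (T, 13) matrix of the first 13 MFCCs per frame."""
    # Project frames onto the first principal component.
    pc1 = PCA(n_components=1).fit_transform(mfcc13).ravel()
    # Keep high-variance frames as speech-active (assumed VAD criterion).
    active = np.abs(pc1) ** 2 > np.percentile(np.abs(pc1) ** 2, quantile)
    # K-means with K = 75, started from uniformly chosen data points.
    km = KMeans(n_clusters=k, init="random", n_init=1).fit(mfcc13[active])
    # Label every frame by the nearest mean (squared Euclidean distance).
    d2 = ((mfcc13[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1), active  # frame labels and VAD mask
      </preformat>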
      <p>Type 2: The Type 2 approach follows directly from
Type 1 and is further extended by Viterbi segmentation and
new GMM training. These two steps are identical to those
already described in Section 3. The main difference is that
the acoustic model from Type 1 is used to generate the
time alignments through Viterbi segmentation.</p>
      <p>
        Type 3: The third approach is based on the well-known
flat-start training procedure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It does not need any
segmentation or clustering because the utterances are uniformly
segmented using Baum-Welch embedded re-estimation.
Therefore, an alternative GMM initialization strategy is
applied, where all phone models are initialized identically, with
state means and variances equal to the global mean and
variance. The phone models are then moved straight to
embedded training and simultaneously updated and expanded to
higher-order GMs (Gaussian Mixtures) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The key element in
flat-start training is the phone-level transcription, obtained
from phone-based recognition using the acoustic model
acquired from the Type 1 zero-resource approach.
      </p>
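      <p>The flat-start initialization itself is compact; a minimal sketch with illustrative names, covering only the single-Gaussian starting point before mixture expansion:</p>
      <preformat>
import numpy as np

def flat_start(features, phone_list, n_states=3):
    """HTK-style flat start: every state of every phone model starts
    with the global mean and variance of the training features;
    embedded Baum-Welch re-estimation then differentiates the models
    using only a phone-level transcription."""
    mu, var = features.mean(axis=0), features.var(axis=0)
    return {phone: [{"mean": mu.copy(), "var": var.copy()}
                    for _ in range(n_states)]
            for phone in phone_list}
      </preformat>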
      <p>Type 4: The Type 4 approach implements GMM-based
segmentation and ergodic HMM (EHMM) training. First, an
unsupervised GMM training is performed on the whole database,
where each acoustic unit is represented by one GM. Each
GM is then associated with one of the 64 states in the EHMM,
and new GMs for each acoustic unit are trained iteratively.</p>
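      <p>One way to approximate this procedure with an off-the-shelf toolkit is sketched below; the paper does not name its implementation, so hmmlearn and the mixture size are our assumptions.</p>
      <preformat>
import numpy as np
from hmmlearn.hmm import GMMHMM  # assumed toolkit, not the authors' code

def train_ehmm(mfcc, lengths, n_units=64, n_mix=4, n_iter=10):
    """mfcc: concatenated (sum(lengths), 39) frames of all utterances;
    lengths: per-utterance frame counts. A fully connected (ergodic)
    HMM with one GMM state per acoustic unit; EM iteratively refines
    the per-unit mixtures and the transitions."""
    ehmm = GMMHMM(n_components=n_units, n_mix=n_mix,
                  covariance_type="diag", n_iter=n_iter)
    ehmm.fit(mfcc, lengths)
    return ehmm  # ehmm.predict(mfcc) yields per-frame unit labels
      </preformat>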
      <p>Note that we used conventional 39-dimensional MFCCs
for all of the zero-resource processing (except Type 1). We
did not use any VAD here (except in Type 1) because the
&lt;sil&gt; labels were available from the Viterbi segmentation.</p>
    </sec>
    <sec id="sec-5">
      <title>POST-PROCESSING: SCORE NORMAL</title>
    </sec>
    <sec id="sec-6">
      <title>IZATION AND FUSION</title>
      <p>
        The score parameter was represented by the WCD, normalized by
a 0/1 scaling factor, similarly to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This step
helped us to unify the score ranges for the first 500 detection
candidates per query. Then the score fusion of the four
different subsystems was carried out, employing a simple
max-score merging strategy, similar to that of Anguera et al.
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Detection candidates from each individual subsystem
were merged together, keeping the one with the highest score
in case of overlap. The merged candidates for each query were
subsequently normalized by z-normalization and sorted
according to the score value. The final set was obtained by
keeping the first 45-150 candidates, according to the length of the
query (the shorter the query, the lower the number of candidates).
      </p>
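      <p>A minimal sketch of the merging and normalization steps, under our reading of the procedure (the per-query truncation to 45-150 candidates is omitted; names are illustrative):</p>
      <preformat>
import numpy as np

def max_score_fusion(candidates):
    """candidates: list of (utt_id, start, end, score) tuples pooled
    from all subsystems. Overlapping detections within the same
    utterance keep only the highest-scoring one; the survivors are
    then z-normalized and sorted by score."""
    merged = []
    for c in sorted(candidates, key=lambda c: -c[3]):  # best first
        overlaps = any(c[0] == m[0] and m[2] > c[1] and c[2] > m[1]
                       for m in merged)
        if not overlaps:
            merged.append(c)
    scores = np.array([m[3] for m in merged])
    z = (scores - scores.mean()) / (scores.std() + 1e-10)  # z-norm
    order = np.argsort(-z)
    return [merged[i] for i in order]
      </preformat>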
    </sec>
    <sec id="sec-7">
      <title>RESULTS AND CONCLUSION</title>
      <p>
        We submitted four runs obtained from the low-resource
(primary) and zero-resource (general) systems for the QUESST 2014
task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The primary systems employ language-dependent
acoustic modeling using Viterbi segmentation with 128 GMs
(ParDat1, TIMIT) and 256 GMs (SpeechDat SK, CZ). The
general systems use 32 GMs for Types 1, 2 and 3 and 64 GMs for
Type 4. A best-one-wins strategy was used for the first (on-time)
runs: only the subsystem with the best performance
was submitted, namely p-low using SpeechDat SK and the
g-zero Type 2 subsystem. The late submissions include a
max-score merging fusion of the four subsystems for both the primary
and general approaches. The results in Tab. 1 show that there
are still big differences in performance between the p-low and
g-zero approaches, even when the score fusion technique is
applied. Moreover, there is also a considerable gap between
the actual and minimum Cnxe, despite the fact that the actual and
maximum TWV are perfectly calibrated. Therefore, improved
calibration/fusion models based on affine transformation and
linear regression will be investigated in the future.
      </p>
      <p>The indexing was done using two IBM x3650 servers (Intel E5530 @
2.4 GHz, 8 cores), 28 GB RAM, under Debian OS. The searching
algorithm was run on a 52-node IBM dx360 M3 cluster (Intel
E5645 @ 2.4 GHz, 624 cores), 48 GB RAM per node, running
Scientific Linux 6 and Torque (see Tab. 2).</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>This publication is the result of the Project
implementation: University Science Park TECHNICOM for Innovation
Applications Supported by Knowledge Technology, ITMS:
26220220182, supported by the Research &amp; Development
Operational Programme funded by the ERDF (100%).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          et al.
          <article-title>The Telefonica Research Spoken Web Search System for MediaEval 2013</article-title>
          .
          <source>In Working Notes Proc. of the MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          .
          <article-title>Query by Example Search on Speech at Mediaeval 2014</article-title>
          .
          <source>In Working Notes Proc. of the MediaEval 2014 Workshop</source>
          , Barcelona, Spain, 16-17 October
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Darjaa</surname>
          </string-name>
          et al.
          <article-title>Rule-based Triphone Mapping for Acoustic Modeling in Automatic Speech Recognition</article-title>
          .
          <source>In Proc. of the 14th Intl. Conf. on Text, Speech and Dialogue</source>
          ,
          <source>TSD'11</source>
          , pages
          <fpage>268</fpage>
          -
          <lpage>275</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Garofolo</surname>
          </string-name>
          et al.
          <source>TIMIT Acoustic-Phonetic Continuous Speech Corpus</source>
          ,
          <year>1993</year>
          . Linguistic Data Consortium, Philadelphia.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Juhar</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Viszlay</surname>
          </string-name>
          .
          <article-title>Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition</article-title>
          .
          <source>In Modern Speech Recognition Approaches with Case Studies</source>
          , pages
          <fpage>131</fpage>
          -
          <lpage>154</lpage>
          . InTech Open Access,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>van den Heuvel</surname>
          </string-name>
          et al.
          <article-title>SpeechDat-E: Five Eastern European Speech Databases for Voice-Operated Teleservices Completed</article-title>
          .
          <source>In Proc. of INTERSPEECH</source>
          , pages
          <fpage>2059</fpage>
          -
          <lpage>2062</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vavrek</surname>
          </string-name>
          et al.
          <article-title>TUKE at MediaEval 2013 Spoken Web Search Task</article-title>
          .
          <source>In Working Notes Proc. of the MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vavrek</surname>
          </string-name>
          et al.
          <article-title>Query-by-Example Retrieval via Fast Sequential Dynamic Time Warping Algorithm</article-title>
          .
          <source>In TSP 2014</source>
          , Berlin, DE, pages
          <fpage>469</fpage>
          -
          <lpage>473</lpage>
          . IEEE,
          <year>July 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Young</surname>
          </string-name>
          et al.
          <source>The HTK Book (for HTK Version 3.4)</source>
          . Cambridge University,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>