<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jozef Vavrek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Viszlay</string-name>
          <email>Peter.Viszlay@tuke.sk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Lojka</string-name>
          <email>Martin.Lojka@tuke.sk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matúš Pleva</string-name>
          <email>Matus.Pleva@tuke.sk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jozef Juhár</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milan Rusko</string-name>
          <email>Milan.Rusko@savba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Informatics, Slovak Academy of Sciences</institution>
          ,
          <addr-line>Dúbravská cesta 9, 845 07 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratory of Speech Technologies in Telecommunications, Technical University of Košice</institution>
          ,
          <addr-line>Park Komenského 13, 041 20 Košice</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present our retrieval system for the QUery by Example Search on Speech Task (QUESST), comprising a posteriorgram-based modeling approach along with the weighted fast sequential dynamic time warping algorithm (WFS-DTW). This year, our main effort was directed toward developing a language-dependent keyword matching system that utilizes all available information about the spoken languages, considering all queries and utterance files. Although the retrieval algorithm is the same as the one we used in the previous year, the main novelty lies in the way the information about all languages spoken in the search database is utilized. Two low-resource systems using language-dependent acoustic unit modeling (AUM) approaches were submitted. The first one, called supervised, employs four well-trained phonetic decoders using acoustic models trained on time-aligned and annotated speech. The second one, defined as unsupervised, uses blind phonetic segmentation for the specific language, where the information about the spoken language is extracted from the MediaEval 2013 and MediaEval 2014 databases. Considering the influence on the overall retrieval performance, the acoustic model adaptation to the specific language through a retraining procedure was investigated for both approaches as well.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>MOTIVATION</title>
      <p>Challenging acoustic conditions and different types of
queries led us to explore the area of language adaptation in
query-by-example (QbE) retrieval. Therefore, our
intention was to build a QbE retrieval system using all the
available acoustic models trained solely on the languages present
in the provided database.</p>
    </sec>
    <sec id="sec-2">
      <title>SUPERVISED AUM APPROACH</title>
      <p>The low-resource approach allowed us to use external
resources (not related to the QUESST task) for AUM and for
building acoustic models (AMs) for the target languages in the
provided database. We developed four language-dependent
(LD) speech recognition systems, each represented by a
specific LD phonetic decoder and by an external well-trained
LD phoneme-based GMM (Gaussian mixture model) trained
with the corresponding phone-level transcription.</p>
      <p>
        Four monolingual annotated datasets were used for
acoustic model training: Slovak Speechdat (66 hours of read speech,
54 phonemes) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Czech Speechdat (89 hours of read speech,
42 phonemes) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Romanian anonymous speech corpus<sup>1</sup> (4.6
hours of read speech, 28 phonemes) and Portuguese (3 hours
of BN recordings from COST278 DB [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], 1 hour of Laps
Benchmark corpus from the Fala Brasil project<sup>2</sup>, 34 phonemes).
      </p>
      <p>The phonetic decoders and the LD AMs were intended to
perform phonetic transcription and time alignment of the search
data utilizing the Viterbi algorithm. Each decoder employed
a phone-level vocabulary and a phone network. The
time-aligned utterances were used in the supervised training of the
final multilingual GMM. In the presented work, we exploited
two different ways of building multilingual GMMs.</p>
      <p>
        The first one is oriented toward training a new GMM from
scratch using the utterances and the time alignments needed
to initialize the GMM. The initialized GMMs were then
simultaneously updated and expanded to higher numbers of
mixtures (up to 1024) using the Baum-Welch estimation
procedure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this case, the external AM operates as an initial
AM needed to bootstrap the recognition system, which is
then supposed to proceed without any external input.
      </p>
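      <p>As a concrete illustration of this mixture-growing scheme, the following sketch doubles the number of GMM components up to a given maximum and re-estimates the model after each split. It is a minimal sketch assuming Python with scikit-learn, whose EM-based GaussianMixture stands in for the HTK-style Baum-Welch re-estimation; the function name and the mean-perturbation split are illustrative assumptions, not the authors' exact tooling.</p>
      <preformat>
# Minimal sketch: grow a GMM from 1 to max_mixtures components by
# splitting means and re-estimating with EM (a stand-in for Baum-Welch).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_growing_gmm(features, max_mixtures=1024):
    """features: (n_frames, 39) MFCC matrix from time-aligned utterances."""
    n_mix = 1
    gmm = GaussianMixture(n_components=n_mix, covariance_type='diag')
    gmm.fit(features)
    while max_mixtures >= n_mix * 2:
        n_mix *= 2
        # Split every component by perturbing its mean slightly,
        # then run EM again with the doubled component set as init.
        means = np.concatenate([gmm.means_ * 0.999, gmm.means_ * 1.001])
        gmm = GaussianMixture(n_components=n_mix, covariance_type='diag',
                              means_init=means)
        gmm.fit(features)
    return gmm
      </preformat>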
      <p>The second, improved training scheme is related to AM
retraining<sup>3</sup>. The main idea is to iteratively re-estimate the
acoustic likelihoods of the well-trained AM, using the
utterances and the time alignments described above. We always
performed three re-estimation cycles in the retraining to
achieve convergence of the estimation. We found that
retraining yields higher precision than standard
training from scratch. The newly prepared language-dependent
GMMs were used to generate posteriorgrams, which were
finally fed to the DTW-based search. The low-resource
acoustic unit modeling is conceptually illustrated in Fig. 1. In
the whole experimental setup, we used the standard 39-dim.
MFCCs (Mel-Frequency Cepstral Coefficients).</p>
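      <p>For clarity, the posteriorgram generation step can be sketched as follows: each 39-dimensional MFCC frame is mapped to a vector of posterior probabilities over the GMM components, and the frame-wise vectors are stacked into a matrix that is fed to the DTW-based search. The snippet assumes Python with a scikit-learn GaussianMixture such as the one trained above; all variable names are illustrative.</p>
      <preformat>
# Minimal sketch: turn a sequence of MFCC frames into a posteriorgram.
def posteriorgram(gmm, mfcc_frames):
    """Return an (n_frames, n_components) matrix of component posteriors."""
    # predict_proba yields P(component | frame) under the trained GMM.
    return gmm.predict_proba(mfcc_frames)

# query_post = posteriorgram(gmm, query_mfcc)  # compared against
# utt_post   = posteriorgram(gmm, utt_mfcc)    # the utterance via WFS-DTW
      </preformat>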
      <sec id="sec-2-1">
        <title>1http://rasc.racai.ro 2http://www.laps.ufpa.br/falabrasil/ 3characteristic for late submission</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>UNSUPERVISED AUM APPROACH</title>
      <p>The multilinguality problem and the missing knowledge about
acoustic units led us to employ two different unsupervised
acoustic modeling approaches. For both types, we extracted
additional acoustic information about the six spoken
languages from the MediaEval 2013 and MediaEval 2014 databases
according to the available language tags.</p>
      <p>
        The first type is focused on unsupervised building of an
acoustic model from unlabelled speech data. We re-employed our
well-established procedures from the previous year and
built an acoustic model with up to 1024 mixtures for each
language. This process included PCA-based voice activity
detection [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], feature extraction and selection [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], K-means
clustering (K = 50), Euclidean segmentation and GMM
training [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This concept is depicted in Fig. 2
with a dashed line. Each LD AM was intended to generate
LD posteriorgrams, while the scores from all
subsystems were finally fused together.
      </p>
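      <p>The unsupervised pipeline can be summarized in a short sketch: K-means (K = 50) assigns pseudo acoustic-unit labels to frames, contiguous runs of one label form segments, and a language-dependent GMM is trained on the resulting data. This is a simplified sketch in Python with scikit-learn; the VAD and feature-selection stages are omitted, and the run-length grouping below is an illustrative stand-in for the exact Euclidean segmentation, not the authors' implementation.</p>
      <preformat>
# Minimal sketch of the unsupervised AUM pipeline (per language).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def unsupervised_am(features, n_units=50, n_mixtures=64):
    # 1) Cluster frames into K pseudo acoustic units.
    labels = KMeans(n_clusters=n_units, n_init=10).fit_predict(features)
    # 2) Merge consecutive frames sharing a label into segments.
    boundaries = np.flatnonzero(np.diff(labels)) + 1
    segments = np.split(features, boundaries)
    # 3) Train the language-dependent GMM used to emit posteriorgrams.
    gmm = GaussianMixture(n_components=n_mixtures, covariance_type='diag')
    gmm.fit(features)
    return labels, segments, gmm
      </preformat>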
      <p>The second, advanced acoustic modeling technique used
language adaptation<sup>4</sup> in the acoustic (phonetic) sense, performed
through a retraining procedure similar to that used in
low-resource modeling. The main idea is to use the already
prepared LD AMs and feed them to phonetic decoding of the search
data in order to obtain LD time alignments. It was therefore
necessary to build an initial multilingual AM intended
for adaptation, utilizing the same, already mentioned
unsupervised segmentation and GMM training. The multilingual
AM is then iteratively retrained on the LD time-aligned
utterances using the search data (Fig. 2). Compared to the
low-resource retraining, here we retrained the multilingual
GMM instead of the language-specific GMM. The resulting
six language-adapted GMMs match the probability
distributions of the acoustic units of the specific language
as closely as possible.</p>
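      <p>The retraining idea admits a compact sketch: the multilingual GMM is re-estimated on one language's time-aligned search data, warm-starting each cycle from the previous estimate. The snippet below assumes Python with scikit-learn, where warm_start approximates the iterative re-estimation; three cycles follow the convergence observation made for the low-resource retraining, and all names are illustrative.</p>
      <preformat>
# Minimal sketch: adapt a multilingual GMM to one language by iterative
# re-estimation on that language's time-aligned frames.
from sklearn.mixture import GaussianMixture

def retrain(initial_gmm, ld_frames, cycles=3):
    gmm = GaussianMixture(n_components=initial_gmm.n_components,
                          covariance_type='diag', warm_start=True,
                          max_iter=20, means_init=initial_gmm.means_)
    for _ in range(cycles):
        # warm_start=True makes each fit() resume from the previous
        # parameters instead of re-initializing them.
        gmm.fit(ld_frames)
    return gmm
      </preformat>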
    </sec>
    <sec id="sec-4">
      <title>4. POST-PROCESSING: SCORE NORMAL</title>
    </sec>
    <sec id="sec-5">
      <title>IZATION AND FUSION</title>
      <p>
        The average cumulative distance (ACD) parameter,
represented by the mean value of the cumulative-distance matrix elements
within each warping region and multiplied by a factor of 0.1, was
used as the score parameter. Scaling the ACD parameter into
the range 0-1 helped us to unify the score ranges for the first
500 detection candidates per query. Then, score
fusion over the different subsystems was carried out, employing
a max-score merging strategy and z-normalization, similarly
to last year [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The final set was obtained by keeping
all fused detections for each query.
      </p>
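      <p>The post-processing chain can be sketched in a few lines: per-query min-max scaling of the ACD scores into the 0-1 range, z-normalization per subsystem, and max-score merging across subsystems. This is a minimal sketch in Python with NumPy, assuming each subsystem delivers a score vector over the same candidate list; the data layout is an illustrative assumption.</p>
      <preformat>
# Minimal sketch of score normalization and fusion for one query.
import numpy as np

def minmax_scale(scores):
    """Scale one query's candidate scores into the 0-1 range."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def znorm(scores):
    return (scores - scores.mean()) / scores.std()

def fuse_max(per_system_scores):
    """Max-score merging over subsystems for one query's candidates."""
    return np.max(np.vstack(per_system_scores), axis=0)

# For each query: take the first 500 detection candidates per subsystem,
# scale and z-normalize, then fuse:
# fused = fuse_max([znorm(minmax_scale(s)) for s in subsystem_scores])
      </preformat>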
      <sec id="sec-5-1">
        <title>4characteristic for late submission</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND CONCLUSION</title>
      <p>
        We submitted four runs obtained from the supervised
(primary) and unsupervised (general) systems, including late
submissions, for the QUESST 2015 task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We did not perform
an evaluation for each individual query type T1/T2/T3,
but concentrated on the overall detection performance. The
maximum number of Gaussian mixtures (GMs) we
employed in both primary and general subsystems was 256 for
the supervised and 64 for the unsupervised AUM. A higher number of
GMs did not bring any improvement.
      </p>
      <p>
        The results obtained from all examined systems fell far
short of our expectations (Tab. 1). This can be explained by the
quality of the audio data, which were recorded in degraded
acoustic conditions and influenced by background noise. The
overall detection accuracy did not increase even though we
examined various speech enhancement techniques
(DC offset removal, spectral subtraction, minimum mean
squared error and Wiener filtering). Even the bottle-neck
features developed at Brno University of Technology (BUT)
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] did not work well for our system.
      </p>
      <p>The supervised AUM approach shows slightly better values of
C<sub>nxe</sub> and TWV in comparison with the unsupervised AUM. The
reason for the relatively high performance of the general approach is
the AM adaptation employed in unsupervised AUM, where
all spoken languages are covered. No significant
improvement can be observed for the late-submission systems, which
represent the retraining procedure employed in AUM. However, the
retraining process did not perform well in the case of
supervised AUM for the eval query set.</p>
      <p>
        The robust statistical model-based speech enhancement
methods embedded in the AUM, as well as HMM-based speech
segmentation, will be investigated in the future. The
processing load (PL) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], for all systems on the development
query set, is shown in Tab. 2. Considering the same
search set, the processing load is nearly identical for the dev
and eval queries. The unsupervised AUM
has the advantage of fast processing, mainly due to the separate
segmentation and acoustic modeling for each language.
      </p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This publication is the result of the Project
implementation: University Science Park TECHNICOM for Innovation
Applications Supported by Knowledge Technology, ITMS:
26220220182, supported by the Research &amp; Development
Operational Programme funded by the ERDF (100%).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Grezl</surname>
          </string-name>
          , M. Karafiat, S. Kontar, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Cernocky</surname>
          </string-name>
          .
          <article-title>Probabilistic and bottle-neck features for LVCSR of meetings</article-title>
          .
          <source>In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP</source>
          <year>2007</year>
          ), pages
          <fpage>757</fpage>
          -
          <lpage>760</lpage>
          .
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Juhar</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Viszlay</surname>
          </string-name>
          .
          <article-title>Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition</article-title>
          .
          <source>In Modern Speech Recognition Approaches with Case Studies</source>
          , pages
          <fpage>131</fpage>
          -
          <lpage>154</lpage>
          . InTech Open Access,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          .
          <article-title>MediaEval 2013 spoken web search task: system performance measures</article-title>
          .
          <source>Technical report</source>
          , Software Technologies Working Group (GTTS, http://gtts.ehu.es),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Proenca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lojka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiong</surname>
          </string-name>
          .
          <article-title>Query by Example Search on Speech at Mediaeval 2015</article-title>
          .
          <source>In Working Notes Proc. of the MediaEval 2015 Workshop</source>
          , Wurzen, Germany, 14-15
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>van den Heuvel</surname>
          </string-name>
          et al.
          <article-title>SpeechDat-E: five Eastern European speech databases for voice-operated teleservices completed</article-title>
          .
          <source>In Proc. of INTERSPEECH</source>
          , pages
          <fpage>2059</fpage>
          -
          <lpage>2062</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vandecatseye</surname>
          </string-name>
          et al.
          <article-title>The COST278 pan-european broadcast news database</article-title>
          .
          <source>In Proc. of the 4th Intl. Conf. on Language Resources And Evaluation, LREC'04</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vavrek</surname>
          </string-name>
          et al.
          <article-title>Query-by-Example Retrieval via Fast Sequential Dynamic Time Warping Algorithm</article-title>
          .
          <source>In TSP 2014</source>
          , Berlin, DE, pages
          <fpage>469</fpage>
          -
          <lpage>473</lpage>
          . IEEE,
          <year>July 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vavrek</surname>
          </string-name>
          et al.
          <article-title>TUKE System for MediaEval 2014 QUESST</article-title>
          .
          <source>In Working Notes Proc. of the MediaEval 2014 Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Young</surname>
          </string-name>
          et al.
          <source>The HTK Book (for HTK Version 3.4)</source>
          . Cambridge University,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>