<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GTM-UVigo Systems for the Query-by-Example Search on Speech Task at MediaEval 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paula Lopez-Otero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Docio-Fernandez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmen Garcia-Mateo</string-name>
          <email>carmen@gts.uvigo.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AtlantTIC Research Center E.E. Telecomunicación</institution>
          ,
          <addr-line>Campus Universitario S/N, 36310 Vigo</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>In this paper, we present the systems developed by the GTM-UVigo team for the query-by-example search on speech task (QUESST) at MediaEval 2015. The systems consist of a fusion of 11 dynamic time warping based systems that use phoneme posteriorgrams for speech representation; the primary system introduces a technique to select the most relevant phonetic units of each phoneme decoder, leading to an improvement of the search results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The query-by-example search on speech task (QUESST) aims at searching for audio content within audio content using an audio query [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], with a special focus on low-resource languages. This paper describes the systems developed by the GTM-UVigo team to address this task; the code of the GTM-UVigo systems will be released at https://github.com/gtm-uvigo/MediaEval_QUESST2015.
      </p>
    </sec>
    <sec id="sec-1b">
      <title>2. GTM-UVIGO SYSTEM DESCRIPTION</title>
      <p>The GTM-UVigo systems consist of the fusion of 11 individual systems that represent the documents and queries by means of phoneme posteriorgrams; subsequence dynamic time warping (S-DTW) is then used to perform the search. The primary system features a phonetic unit selection strategy, which is briefly described in this section.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1 Phoneme Posteriorgrams</title>
      <p>
        Three architectures were used to obtain phoneme posteriorgrams (a minimal sketch of the resulting representation follows this list):
        • lstm: a context-independent phone recognizer based on a long short-term memory (LSTM) neural network was trained using the KALDI toolkit [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A 2-layer LSTM was used; the input of the first layer consists of 40 log filter-bank energies augmented with 3 pitch-related features [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and the output layer dimension was the number of context-independent phone units.
        • dnn: a deep neural network (DNN)-based context-dependent phone recognizer was trained using the KALDI toolkit following Karel Veselý's DNN training implementation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The network has 6 hidden layers, each with 2048 units, and it was trained on LDA-STC-fMLLR features obtained from auxiliary Gaussian mixture models (GMM) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The dimension of the input layer was 440 and the output layer dimension was the number of context-dependent states.
        • traps: the phone decoder based on long temporal context developed at the Brno University of Technology (BUT) was used [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
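      <p>As a minimal illustrative sketch (not the authors' released code), a phoneme posteriorgram is simply a frames-by-units matrix of per-frame phoneme posterior probabilities; the snippet below builds one from hypothetical raw decoder activations via a softmax:</p>
      <preformat>
import numpy as np

def to_posteriorgram(activations):
    """Convert raw per-frame decoder outputs (frames x U) to posteriors."""
    shifted = activations - activations.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)  # each row sums to 1

# Example: 100 frames, U = 40 phonetic units.
post = to_posteriorgram(np.random.randn(100, 40))
assert np.allclose(post.sum(axis=1), 1.0)
      </preformat>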
      <p>Eleven models, summarized in Table 1, were trained using data in six languages: Galician (GA), Spanish (ES), English (EN), Czech (CZ), Hungarian (HU) and Russian (RU).</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Databases used for training.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Database</th><th>Duration (h)</th></tr>
          </thead>
          <tbody>
            <tr><td>Transcrigal [<xref ref-type="bibr" rid="ref3">3</xref>]</td><td>35</td></tr>
            <tr><td>TC-STAR [<xref ref-type="bibr" rid="ref2">2</xref>]</td><td>78</td></tr>
            <tr><td>LibriSpeech [<xref ref-type="bibr" rid="ref8">8</xref>]</td><td>100</td></tr>
            <tr><td>Vystadial 2013 [<xref ref-type="bibr" rid="ref6">6</xref>]</td><td>15</td></tr>
            <tr><td>Speech-Dat</td><td>n/a</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-2b">
      <title>2.2 Dynamic Time Warping Strategy</title>
      <p>
        The search for the spoken queries within the audio documents is performed by means of S-DTW [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. First, a cost matrix M ∈ ℝ<sup>n×m</sup> is defined, where the rows and the columns correspond to the frames of the query Q and the document D, respectively:
      </p>
      <disp-formula id="eq1">
        <tex-math><![CDATA[
          M_{i,j} = \begin{cases}
            c(q_i, d_j) & \text{if } i = 0 \\
            c(q_i, d_j) + M_{i-1,0} & \text{if } i > 0,\, j = 0 \\
            c(q_i, d_j) + M^{*}(i,j) & \text{otherwise}
          \end{cases} \tag{1}
        ]]></tex-math>
      </disp-formula>
      <p>where c(q<sub>i</sub>, d<sub>j</sub>) represents the cost between query vector q<sub>i</sub> and document vector d<sub>j</sub>, both of dimension U, and</p>
      <disp-formula id="eq2">
        <tex-math><![CDATA[
          M^{*}(i,j) = \min\left(M_{i-1,j},\, M_{i-1,j-1},\, M_{i,j-1}\right) \tag{2}
        ]]></tex-math>
      </disp-formula>
      <p>
        Pearson's correlation coefficient r is used as the distance metric [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]:
      </p>
      <disp-formula id="eq3">
        <tex-math><![CDATA[
          r(q_i, d_j) = \frac{U (q_i \cdot d_j) - \|q_i\| \|d_j\|}
          {\sqrt{\left(U \|q_i^2\| - \|q_i\|^2\right)\left(U \|d_j^2\| - \|d_j\|^2\right)}} \tag{3}
        ]]></tex-math>
      </disp-formula>
      <p>In order to use r as a cost function, it is linearly mapped to the range [0, 1], where 0 corresponds to correlations equal to 1 and 1 corresponds to correlations equal to -1.</p>
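      <p>A sketch of this frame-level cost follows, assuming that ||·|| denotes the sum of the (non-negative) posterior components and ||q<sub>i</sub><sup>2</sup>|| the sum of their squares; the helper and its name are illustrative, not the authors' code:</p>
      <preformat>
import numpy as np

def correlation_cost(q, d):
    """Cost of Eq. (3): Pearson r mapped linearly so r = 1 -> 0 and r = -1 -> 1."""
    U = q.shape[0]
    num = U * np.dot(q, d) - q.sum() * d.sum()
    den = np.sqrt((U * np.dot(q, q) - q.sum() ** 2) *
                  (U * np.dot(d, d) - d.sum() ** 2))
    r = num / den
    return (1.0 - r) / 2.0
      </preformat>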
      <p>In order to detect n<sub>c</sub> candidate matches of a query in a spoken document, every time a candidate match ending at frame b* is detected, M(n, b*) is set to ∞ so that this match is ignored when searching for the next one.</p>
    </sec>
    <sec id="sec-3">
      <title>2.3 Phoneme Unit Selection</title>
      <p>A technique to select the most relevant phonemes among the phonetic units of the different decoders was used in the primary system. Given the best alignment path P(Q, D) of length K between a query and a matching document, the correlation and the cost at each step of the path can be decomposed so that there is a different term for each phonetic unit u:</p>
      <disp-formula id="eq4">
        <tex-math><![CDATA[
          r(q_i, d_j, u) = \frac{U\, q_{i,u}\, d_{j,u} - \frac{1}{U} \|q_i\| \|d_j\|}
          {\sqrt{\left(U \|q_i^2\| - \|q_i\|^2\right)\left(U \|d_j^2\| - \|d_j\|^2\right)}} \tag{4}
        ]]></tex-math>
      </disp-formula>
      <p>In this way, the cost accumulated by each phonetic unit along the best alignment path can be computed:</p>
      <disp-formula id="eq5">
        <tex-math><![CDATA[
          R(P(Q,D), u) = \frac{1}{K} \sum_{k=1}^{K} c(q_{i_k}, d_{j_k}, u) \tag{5}
        ]]></tex-math>
      </disp-formula>
      <p>This value R(P(Q, D), u) can be regarded as the relevance of the phonetic unit u (the lower its contribution to the cost, the more relevant the phonetic unit). Hence, the phonetic units can be sorted from most to least relevant in order to keep the most relevant ones and discard those that increase the cost of the best alignment path.</p>
      <p>Using only one alignment path may not provide a good estimate of the relevance of the phonetic units; hence, the relevances obtained for the different query-matching document pairs in the development set were accumulated in order to obtain a robust estimate, as sketched below. The number of relevant phonetic units was selected empirically for each system.</p>
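      <p>A sketch of the relevance accumulation follows; per_unit_cost is a hypothetical helper returning the per-unit decomposition of the cost derived from Eq. (4), and the rest of the names are likewise illustrative:</p>
      <preformat>
import numpy as np

def accumulate_relevance(dev_pairs, per_unit_cost, num_units):
    """Accumulate R(P(Q,D), u) of Eq. (5) over development query/document pairs."""
    R = np.zeros(num_units)
    for query, doc, path in dev_pairs:  # path: list of (i_k, j_k) frame pairs
        K = len(path)
        for i, j in path:
            R += per_unit_cost(query[i], doc[j]) / K  # hypothetical helper
    return R

def select_units(R, num_keep):
    """Keep the units with the lowest accumulated cost (i.e., the most relevant)."""
    return np.argsort(R)[:num_keep]
      </preformat>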
    </sec>
    <sec id="sec-4">
      <title>2.4 Normalization and Fusion</title>
      <p>
        Score normalization and fusion were performed following [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. First, the scores were normalized by the length of the warping path. A binary logistic regression was then used for fusion, as described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>Table 2 shows the results obtained on QUESST 2015 data using the submitted systems. The table shows that the primary system, which features phoneme unit selection, clearly outperforms the contrastive system, suggesting that the proposed technique achieves the expected improvement. It can also be observed that Dev and Eval results are very similar, showing the generalization capability of the systems. The late systems feature z-norm normalization of the query scores, obtaining an improvement with respect to the original submissions, where only path-length normalization was applied. In Table 3, the actCnxe values obtained with and without applying the phoneme unit selection approach in some individual systems are compared.</p>
      <p>
        Table 4 shows the indexing speed factor (ISF), searching speed factor (SSF), peak memory usage for indexing (PMUI) and searching (PMUS), and processing load (PL), computed as described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (these values were measured on 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz, 12 cores/24 threads, 128 GB RAM). ISF and PMUI are rather high because, in the dnn systems, an automatic speech recognition (ASR) system is first applied in order to obtain the input features to the DNN; hence, the peak memory usage is large due to the memory requirements of the language model, and the large computation time is caused by the two recognition steps that are performed to estimate the transformation matrix used to obtain the fMLLR features that are the input to the DNN. In future work, the ASR step of the dnn systems will be replaced with a phonetic network in order to avoid these time- and memory-consuming steps.
      </p>
    </sec>
    <sec id="sec-6">
      <title>4. ACKNOWLEDGEMENTS</title>
      <p>This research was funded by the Spanish Government (‘SpeechTech4All Project’ TEC2012-38939-C03-01), the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and the ‘AtlantTIC Project’ CN2012/160, and also by the Spanish Government and the European Regional Development Fund (ERDF) under the project TACTICA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Akbacak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>van Hout</surname>
          </string-name>
          .
          <article-title>Rich system combination for keyword spotting in noisy and acoustically heterogeneous audio streams</article-title>
          .
          <source>In Proceedings of ICASSP</source>
          , pages
          <fpage>8267</fpage>
          -
          <lpage>8271</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Docio-Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cardenal-Lopez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Garcia-Mateo</surname>
          </string-name>
          .
          <article-title>TC-STAR 2006 automatic speech recognition evaluation: The UVIGO system</article-title>
          .
          <source>In TC-STAR Workshop on Speech-to-Speech Translation</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Garcia-Mateo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dieguez-Tirado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Docio-Fernandez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Cardenal-Lopez</surname>
          </string-name>
          .
          <article-title>Transcrigal: A bilingual system for automatic indexing of broadcast news</article-title>
          .
          <source>In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC)</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ghahremani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>BabaAli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Riedhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Trmal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <article-title>A pitch extraction algorithm tuned for automatic speech recognition</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , Florence, Italy, May 4-9, pages
          <fpage>2494</fpage>
          -
          <lpage>2498</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Speech recognition with deep recurrent neural networks</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , Vancouver, BC, Canada, May 26-31, pages
          <fpage>6645</fpage>
          -
          <lpage>6649</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Korvas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Plátek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dušek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Žilka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Jurčíček</surname>
          </string-name>
          .
          <source>Vystadial 2013 - Czech data</source>
          ,
          <year>2014</year>
          . LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <source>Information Retrieval for Music and Motion</source>
          . Springer-Verlag,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Panayotov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <article-title>LibriSpeech: an ASR corpus based on public domain audio books</article-title>
          .
          <source>In Proceedings of ICASSP</source>
          , pages
          <fpage>99</fpage>
          -
          <lpage>105</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          .
          <article-title>MediaEval 2013 spoken web search task: system performance measures</article-title>
          .
          <source>Technical report</source>
          , Dept. Electricity and Electronics, University of the Basque Country,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bordel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Diez</surname>
          </string-name>
          .
          <article-title>GTTS systems for the SWS task at MediaEval 2013</article-title>
          .
          <source>In Proceedings of the MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          .
          <article-title>Phoneme Recognition based on Long Temporal Context</article-title>
          .
          <source>PhD thesis</source>
          , Brno University of Technology,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grézl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Černocký</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Ondel</surname>
          </string-name>
          .
          <article-title>Calibration and fusion of query-by-example systems - BUT SWS 2013</article-title>
          .
          <source>In Proceedings of ICASSP 2014</source>
          , pages
          <fpage>7899</fpage>
          -
          <lpage>7903</lpage>
          .
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Proenca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lojka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiong</surname>
          </string-name>
          .
          <article-title>Query by example search on speech at MediaEval 2015</article-title>
          .
          <source>In Proceedings of the MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skácel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          .
          <article-title>BUT QUESST2014 system description</article-title>
          .
          <source>In Proceedings of the MediaEval 2014 Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Veselý</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghoshal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          .
          <article-title>Sequence-discriminative training of deep neural networks</article-title>
          .
          <source>In INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association</source>
          , Lyon, France, August 25-29
          , pages
          <fpage>2345</fpage>
          -
          <lpage>2349</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>