<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GTTS-EHU Systems for QUESST at MediaEval 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis J. Rodriguez-Fuentes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amparo Varona</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikel Penagarikano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Germán Bordel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mireia Diez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Software Technologies Working Group (http://gtts.ehu.es, GTTS), University of the Basque Country (UPV/EHU)</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper briefly describes the systems submitted by the Software Technologies Working Group (http://gtts.ehu.es, GTTS) of the University of the Basque Country (UPV/EHU) to the Query-by-Example Search on Speech Task (QUESST) at MediaEval 2014. The GTTS-EHU systems consist of four modules: (1) feature extraction; (2) speech activity detection; (3) DTW-based query matching; and (4) score calibration and fusion. The submitted systems follow the same approach used in our SWS 2013 submissions, with two minor changes needed to address the new task: the search stops at the most likely query detection (no further detections are looked for), and a score is produced for each (query, document) pair. The two approximate matching types introduced in QUESST have not received special treatment. This year, we have only explored the use of reduced feature sets, obtaining worse results but at lower computational cost.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2014 Query-by-Example Search on Speech
Task (QUESST) consists of searching for a spoken query
within a set of spoken documents. For each (query, document)
pair, a score in the range (-∞, +∞) must be produced:
the higher (the more positive) the score, the more likely it is that
the query appears in the document. System performance is
primarily measured in terms of a normalized cross-entropy
cost, Cnxe. Term-Weighted Value metrics (ATWV/MTWV)
are used as secondary metrics, along with the processing
resources (real-time factor and peak memory usage) required
by the submitted systems. For more details on QUESST,
see [
        <xref ref-type="bibr" rid="ref5">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM OVERVIEW</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Feature extraction</title>
      <p>
        The Brno University of Technology (BUT) phone decoders
for Czech, Hungarian and Russian [
        <xref ref-type="bibr" rid="ref9">6</xref>
        ] are applied to
decode both the spoken queries and the audio documents.
BUT decoders are trained on 8 kHz SpeechDat(E) databases
recorded over xed telephone networks, featuring 45, 61 and
52 units for Czech, Hungarian and Russian, respectively
(three of them being non-phonetic units that stand for short
pauses and noises).
      </p>
      <p>Given an input signal of length T, the decoder outputs
the posterior probability p_{i,s}(t) of each state s (1 ≤ s ≤ S) of each
unit i (1 ≤ i ≤ M) at each frame t (1 ≤ t ≤ T),
where M is the number of units and S the number of states
per unit. The posterior probability of each unit i at each
frame t is computed by adding the posteriors of its states:

  p_i(t) = \sum_{s} p_{i,s}(t)    (1)

Finally, the posteriors of the three non-phonetic units are
added and stored as a single non-speech posterior. Thus, the
size of the frame-level feature vectors is 43, 59 and 50 for the
Czech, Hungarian and Russian BUT decoders, respectively.</p>
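      <p>As an illustration, the following Python sketch implements Eq. (1) and the merging of the non-phonetic units. The array shapes, the placement of the non-phonetic units at the end of the inventory, and the function name are assumptions made here for illustration, not the actual BUT decoder interface.</p>
      <preformat>
# A minimal sketch, assuming state posteriors come as a (T, M, S) array and
# that the three non-phonetic units occupy known indices. Not the BUT API.
import numpy as np

def unit_posteriors(state_post, nonphonetic):
    """state_post: (T, M, S) state posteriors p_{i,s}(t); nonphonetic: indices
    of the non-phonetic units. Returns (T, M - len(nonphonetic) + 1) features:
    phonetic unit posteriors followed by one merged non-speech posterior."""
    unit_post = state_post.sum(axis=2)          # Eq. (1): p_i(t) = sum_s p_{i,s}(t)
    phonetic = [i for i in range(unit_post.shape[1]) if i not in nonphonetic]
    nonspeech = unit_post[:, nonphonetic].sum(axis=1, keepdims=True)
    return np.hstack([unit_post[:, phonetic], nonspeech])

# Example: Czech decoder, 45 units, assuming 3 states per unit and the three
# non-phonetic units last -> 43-dimensional frame-level feature vectors
states = np.random.dirichlet(np.ones(45 * 3), size=100).reshape(100, 45, 3)
feats = unit_posteriors(states, nonphonetic=[42, 43, 44])
assert feats.shape == (100, 43)
      </preformat>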
      <p>2.1.1 Reduced feature sets</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">4</xref>
        ], several dimensionality reduction techniques were
successfully applied to phone posterior features to reduce
the computational cost while keeping performance on spoken
language recognition tasks. Following one of the approaches
proposed in [
        <xref ref-type="bibr" rid="ref7">4</xref>
        ], here we define a reduced set of features
by adding the posteriors of phones with the same manner
and place of articulation. This leads to feature sets of size
25, 23 and 21 for the Czech, Hungarian and Russian BUT
decoders, respectively.
      </p>
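      <p>A hypothetical sketch of this reduction is given below. The phone-to-class grouping shown is invented for illustration; the actual manner/place groupings for the Czech, Hungarian and Russian inventories are not detailed in this paper.</p>
      <preformat>
# A sketch under assumed groupings: posteriors of phones sharing the same
# manner and place of articulation are summed into one broad-class posterior.
import numpy as np

def reduce_features(post, groups):
    """post: (T, D) phone posteriors; groups: a partition of the D phone
    indices into articulatory classes. Returns (T, len(groups)) features."""
    return np.stack([post[:, g].sum(axis=1) for g in groups], axis=1)

# Toy example: 6 phones collapsed into 3 broad classes (invented grouping)
groups = [[0, 1], [2, 3, 4], [5]]
reduced = reduce_features(np.random.dirichlet(np.ones(6), size=50), groups)
assert reduced.shape == (50, 3)
      </preformat>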
    </sec>
    <sec id="sec-4">
      <title>2.2 Speech Activity Detection</title>
      <p>Given an audio signal, Speech Activity Detection (SAD) is
performed by discarding those phone posterior feature
vectors for which the non-speech posterior is the highest. The
remaining vectors, along with their corresponding time
offsets, are stored for further use, but the component
corresponding to the non-speech unit is deleted. If the number
of speech vectors is too low (in this evaluation, fewer than 10,
meaning less than 0.1 seconds), the whole signal is discarded
(thus saving time and possibly avoiding many false alarms)
and a floor score (10^{-5}) is output.</p>
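      <p>The SAD rule just described amounts to a few lines of code. The sketch below assumes the merged non-speech posterior is the last component of each feature vector (as in the sketch of Section 2.1); the function name is ours.</p>
      <preformat>
# A sketch of the SAD rule, assuming non-speech is the last feature component.
import numpy as np

MIN_FRAMES = 10       # 10 frames, i.e. 0.1 seconds at a 10 ms frame shift
FLOOR_SCORE = 1e-5    # score output when a signal is discarded

def sad_filter(feats):
    """feats: (T, D) posteriors with non-speech in column D-1. Returns
    (speech_feats, frame_offsets), or None if the signal must be discarded."""
    speech = feats.argmax(axis=1) != feats.shape[1] - 1   # non-speech not highest
    offsets = np.flatnonzero(speech)                      # time offsets kept
    if offsets.size >= MIN_FRAMES:
        return feats[speech, :-1], offsets                # drop non-speech column
    return None   # too little speech: skip matching, emit FLOOR_SCORE
      </preformat>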
    </sec>
    <sec id="sec-5">
      <title>2.3 DTW-based query matching</title>
      <p>Given two SAD-filtered sequences of feature vectors
corresponding to a spoken query q and a spoken document x, the
cosine distance is computed between each pair of vectors
q[i] and x[j] as follows:

  d(q[i], x[j]) = -\log \left( \frac{q[i] \cdot x[j]}{\lVert q[i] \rVert \, \lVert x[j] \rVert} \right)    (2)

Note that d(v, w) ≥ 0, with d(v, w) = 0 if and only if v
and w are perfectly aligned, and d(v, w) = +∞ if and only
if v and w are orthogonal. The distance matrix computed
according to Eq. 2 is further normalized with regard to the
spoken document x, as follows:

  d_{norm}(q[i], x[j]) = \frac{d(q[i], x[j]) - d_{min}(i)}{d_{max}(i) - d_{min}(i)}    (3)

with d_{min}(i) = \min_j d(q[i], x[j]) and d_{max}(i) = \max_j d(q[i], x[j]).
In this way, all matrix values lie between 0 and 1, so that a
perfect match would produce a quasi-diagonal sequence of zeroes.</p>
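      <p>Eqs. (2) and (3) translate directly into vectorized code, as in the following sketch (the small constants guarding log(0) and division by zero are implementation details added here, not taken from the paper).</p>
      <preformat>
# A sketch of Eqs. (2)-(3): -log cosine distance between all frame pairs,
# then per-query-frame min-max normalization so every row lies in [0, 1].
import numpy as np

def norm_distance_matrix(q, x):
    """q: (m, D) query features; x: (n, D) document features -> (m, n) d_norm."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    cos = np.clip(qn @ xn.T, 1e-12, 1.0)        # posteriors are non-negative
    d = -np.log(cos)                            # Eq. (2): 0 aligned, +inf orthogonal
    dmin = d.min(axis=1, keepdims=True)         # d_min(i)
    dmax = d.max(axis=1, keepdims=True)         # d_max(i)
    return (d - dmin) / (dmax - dmin + 1e-12)   # Eq. (3)
      </preformat>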
      <p>The best match of a query q of length m in a spoken
document x of length n is defined as the one minimizing the
average distance along a crossing path of the matrix d_{norm}. A
crossing path starts at any given frame of x, k_1 ∈ [1, n],
then traverses a region of x which is optimally aligned to
q (involving L vector alignments), and ends at a frame k_2 ∈
[k_1, n]. The average distance along this crossing path is:

  d_{avg}(q, x) = \frac{1}{L} \sum_{l=1}^{L} d_{norm}(q[i_l], x[j_l])    (4)

where i_l and j_l are the indices of the vectors of q and x
in alignment l, for l = 1, 2, ..., L. Note that i_1 = 1,
i_L = m, j_1 = k_1 and j_L = k_2. The minimization is
accomplished by means of a dynamic programming
procedure, which is Θ(n · m · d) in time (d: size of the feature
vectors) and Θ(n · m) in space. The detection score is
computed as 1 - d_{avg}(q, x). Once the best match is obtained,
the search procedure stops. As noted above, if either q or x
does not contain enough speech frames, no alignment is performed
and a floor score (10^{-5}) is output. Note that a detection
score must be produced for every pair (q, x).</p>
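      <p>Minimizing the average distance of Eq. (4) exactly is awkward in plain dynamic programming, since the path length L is itself path-dependent. The sketch below uses a common greedy approximation (at each cell, pick the predecessor minimizing the running average); it illustrates the crossing-path idea with free start and end points on the document axis, and is not necessarily the exact procedure used in the submitted systems.</p>
      <preformat>
# A simplified sketch of the crossing-path search over d_norm, with steps
# (1,1), (1,0) and (0,1), free start (row 0) and free end (last row).
import numpy as np

def best_match_score(dnorm):
    """dnorm: (m, n) normalized distances -> detection score 1 - d_avg(q, x)."""
    m, n = dnorm.shape
    cost = np.zeros((m, n))                  # accumulated path cost
    length = np.zeros((m, n), dtype=int)     # accumulated path length L
    cost[0], length[0] = dnorm[0], 1         # path may start at any frame of x
    for i in range(1, m):
        cost[i, 0] = cost[i - 1, 0] + dnorm[i, 0]
        length[i, 0] = length[i - 1, 0] + 1
        for j in range(1, n):
            cands = ((cost[i - 1, j - 1], length[i - 1, j - 1]),   # diagonal
                     (cost[i - 1, j], length[i - 1, j]),           # vertical
                     (cost[i, j - 1], length[i, j - 1]))           # horizontal
            c, l = min(cands, key=lambda cl: (cl[0] + dnorm[i, j]) / (cl[1] + 1))
            cost[i, j], length[i, j] = c + dnorm[i, j], l + 1
    d_avg = (cost[-1] / length[-1]).min()    # path may end at any frame of x
    return 1.0 - d_avg                       # detection score, as in the text
      </preformat>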
    </sec>
    <sec id="sec-6">
      <title>2.4 Score calibration and fusion</title>
      <p>
        First, the so-called q-norm (query normalization) is
applied, so that zero-mean, unit-variance scores are
obtained per query [
        <xref ref-type="bibr" rid="ref4">1</xref>
        ]. Then, if n different systems are fused, since all of them
contain a complete set of scores, each trial yields a set of n
scores which, together with the ground truth (target/non-target
labels), can be used to discriminatively estimate a linear
transformation producing well-calibrated scores; these are then
linearly combined to obtain fused scores. Under this approach,
the Bayes optimal threshold (given by the effective prior:
0.0741 for this evaluation) is applied. The BOSARIS toolkit [
        <xref ref-type="bibr" rid="ref6">3</xref>
        ] is used to estimate and apply the calibration/fusion models.
      </p>
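      <p>The q-norm step is straightforward; a sketch is shown below. The discriminative calibration and fusion themselves rely on the BOSARIS toolkit and are not reproduced here.</p>
      <preformat>
# A sketch of q-norm: per-query zero-mean, unit-variance score normalization.
import numpy as np

def q_norm(scores):
    """scores: (n_queries, n_documents) raw detection scores, one row per query."""
    mu = scores.mean(axis=1, keepdims=True)
    sigma = scores.std(axis=1, keepdims=True)
    return (scores - mu) / (sigma + 1e-12)    # guard against constant rows
      </preformat>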
    </sec>
    <sec id="sec-7">
      <title>3. RESULTS</title>
      <p>Table 1 shows the performance and processing costs of
the GTTS-EHU systems on QUESST 2014. To speed up
computations, experiments with the full and reduced sets of
features were carried out on different machines (see Table 1),
so the reported processing times are not directly comparable.</p>
      <p>Indexing involves just applying the BUT decoders to extract
phone posterior features. The Indexing Speed Factor (ISF), the
Searching Speed Factor (SSF) and the Peak Memory Usage (PMU)
values have been computed as if all the computation were performed
sequentially on a single processor (see [
        <xref ref-type="bibr" rid="ref8">5</xref>
        ]). Calibration and fusion costs have been neglected. The
contrastive systems 1 and 2 (c1 and c2) use the concatenation of
the phone posteriors from the three decoders as features, for the
full and reduced feature sets, respectively.</p>
      <p>The system c3 is the fusion of four subsystems, using the full
sets of phone posteriors for Czech, Hungarian and Russian, and the
concatenation of them, respectively. The system c4 is equivalent to
c3 but uses the reduced sets of features. Finally, the primary
system is the fusion of the eight available subsystems. In all
cases, calibration and fusion parameters have been estimated on the
development set. Note that the primary system yields only slightly
better performance than system c3, meaning that the reduced sets of
features add little information to the full sets. In fact, the full
sets outperform the reduced sets in all cases. As shown in Table 2,
performance degrades strongly from T1 to T2 and (not as much) from
T2 to T3; on the other hand, the non-native English and (to a lesser
extent) the Basque subsets seem problematic. Future work may involve
some kind of language detection and adaptation, plus specific
strategies for matching types T2 and T3.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref4">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J. Rodriguez</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bordel</surname>
          </string-name>
          .
          <article-title>On the calibration and fusion of heterogeneous spoken term detection systems</article-title>
          .
          <source>In Interspeech</source>
          <year>2013</year>
          , Lyon, France,
          <year>August</year>
          25-29
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Szoke, A</article-title>
          . Buzo, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <article-title>Query by Example Search on Speech at Mediaeval 2014</article-title>
          .
          <source>In Working Notes Proceedings of the Mediaeval 2014 Workshop</source>
          , Barcelona, Spain, October
          <volume>16</volume>
          -17
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bru</surname>
          </string-name>
          <article-title>mmer</article-title>
          and E. de Villiers.
          <article-title>The BOSARIS Toolkit User Guide: Theory, Algorithms and Code for Binary Classi er Score Processing</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2011</year>
          . https://sites.google.com/site/bosaristoolkit/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Diez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J. Rodriguez</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bordel</surname>
          </string-name>
          .
          <article-title>Dimensionality reduction of phone log-likelihood ratio features for spoken language recognition</article-title>
          .
          <source>In Interspeech</source>
          <year>2013</year>
          , Lyon, France,
          <year>August</year>
          25-29
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano. MediaEval 2013 Spoken Web</surname>
          </string-name>
          <article-title>Search Task: System Performance Measures</article-title>
          .
          <source>Technical report</source>
          , GTTS, UPV/EHU, May
          <year>2013</year>
          . http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          .
          <article-title>Phoneme recognition based on long temporal context</article-title>
          .
          <source>PhD thesis</source>
          , FIT,
          <string-name>
            <surname>BUT</surname>
          </string-name>
          , Brno, Czech Republic,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>