<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The SPL-IT Query by Example Search on Speech system for MediaEval 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Proença</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arlindo Veiga</string-name>
          <email>aveiga@co.it.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Perdigão</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto de Telecomunicações</institution>
          ,
          <addr-line>Coimbra</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Electrical and Computer Eng. Department, University of Coimbra</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This document briefly describes the system submitted by the Speech Processing Lab of Instituto de Telecomunicações, pole of Coimbra (SPL-IT) to the Query by Example Search on Speech Task (QUESST) of MediaEval 2014. Our approach is based on merging the results of a phoneme recognition system using three different languages. A version of Dynamic Time Warping (DTW) using posteriorgram distances was created to handle some of the peculiar search cases of this task. Our primary submission merges two approaches: a simple DTW for detecting entire queries, and a version where cutting the final portion of queries is allowed. The late submission merges five approaches that account for all the search possibilities described for the task, though improved results were only observed on the evaluation dataset for type 3 queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-2">
      <title>2.1 Phonetic recognizer</title>
      <p>
        We started with our in-house phoneme recognizer for
Portuguese, which is based on Hidden Markov Models, together with our
keyword spotting system [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Since it was hard to obtain
posteriorgrams with this technique, we decided to use an available
external system based on neural networks: the phoneme
recognizer from Brno University of Technology (BUT) [3]. We
used the three available systems for 8 kHz audio, covering three
languages: Czech, Hungarian and Russian. Using different
languages lets us deal with different sets of phonemes, and
hopefully the fusion of the results will better describe the
similarities between what is said in a query and in the searched
audio.
      </p>
      <p>All queries and audio files were run through the 3 systems,
resulting in phoneme-state posteriorgrams (3 states per phoneme).
Leading and trailing silence/noise were cut from the queries, removing the
initial and final frames that had a high probability of
corresponding to silence or noise (the sum of the 3 states of the ‘int’,
‘pau’ and ‘spk’ phones, averaged over the 3 languages, is greater than
50%).</p>
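      <p>A minimal sketch of this trimming step follows (our own illustration, not the authors’ code): it keeps the frames between the first and last frame whose summed silence/noise posterior is at or below the threshold. For simplicity it operates on a single language’s posteriorgram rather than the 3-language average described above; the state indices and the 50% threshold are parameters.</p>

```python
import numpy as np

def trim_silence(post, sil_state_idx, threshold=0.5):
    """Trim leading/trailing frames whose summed silence/noise
    posterior exceeds `threshold`.

    post: (n_frames, n_states) posteriorgram, rows sum to 1.
    sil_state_idx: indices of the states of the silence/noise units.
    """
    sil_prob = post[:, sil_state_idx].sum(axis=1)  # per-frame silence mass
    speech = np.where(sil_prob <= threshold)[0]    # frames kept as speech
    if speech.size == 0:
        return post[:0]                            # all silence: empty query
    return post[speech[0]:speech[-1] + 1]
```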
    </sec>
    <sec id="sec-3">
      <title>2.2 Dynamic Time Warping</title>
      <p>
        We implemented a version of Dynamic Time Warping (DTW)
specific to this challenge. As in [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ], the local distance is based on
the dot product of the query and audio posterior probability vectors.
A back-off of the phoneme probabilities with λ = 10<sup>−4</sup> is
applied, and the negative logarithm is taken of the dot product. This yields
the local distance matrix for the DTW.
      </p>
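      <p>The local distance computation above can be sketched as follows (our own reading of the description: the back-off is interpreted here as mixing each posterior vector with a uniform distribution, which keeps the dot products strictly positive before taking the negative logarithm).</p>

```python
import numpy as np

def local_distance_matrix(query_post, audio_post, lam=1e-4):
    """-log dot-product distance between backed-off posterior vectors.

    query_post: (n_q, n_phones) query posteriorgram.
    audio_post: (n_a, n_phones) audio posteriorgram.
    lam: back-off weight mixing each posterior with a uniform distribution.
    Returns the (n_q, n_a) local distance matrix for the DTW.
    """
    n_phones = query_post.shape[1]
    uniform = 1.0 / n_phones
    q = (1 - lam) * query_post + lam * uniform  # back-off avoids zero products
    a = (1 - lam) * audio_post + lam * uniform
    return -np.log(q @ a.T)
```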
      <p>The beginning and end of a DTW path were not restricted. For
the local path restrictions we tested a small number of alternative
options, but the most versatile was found to be the one that allows
a path to continue via 3 moves to directly adjacent points in
the matrix: horizontal, vertical and diagonal. All 3 types of
movement have equal (unitary) weight. The final path distance is
normalized by the number of query frames. This simple approach
(named A1) outputs the distance of the best path and is the
basis from which the subsequent approaches are devised.</p>
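      <p>The A1 search can be sketched as a standard dynamic program over the local distance matrix (a simplified illustration under the stated assumptions: free start and end on the audio axis, unit-weight horizontal/vertical/diagonal moves, and normalization by the number of query frames; it returns only the best distance, not the path).</p>

```python
import numpy as np

def dtw_best_distance(D):
    """DTW over a local-distance matrix D of shape (n_q, n_a), with
    unrestricted start/end on the audio axis and equal-weight
    horizontal, vertical and diagonal moves.
    Returns the best path cost normalized by the number of query frames."""
    n_q, n_a = D.shape
    acc = np.full((n_q, n_a), np.inf)
    acc[0, :] = D[0, :]                        # path may start at any audio frame
    for i in range(1, n_q):
        for j in range(n_a):
            prev = acc[i - 1, j]               # vertical move
            if j > 0:
                prev = min(prev,
                           acc[i, j - 1],      # horizontal move
                           acc[i - 1, j - 1])  # diagonal move
            acc[i, j] = D[i, j] + prev
    return acc[-1, :].min() / n_q              # free end point, per-frame cost
```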
    </sec>
    <sec id="sec-4">
      <title>2.3 Modifications on the DTW</title>
      <p>To account for the special types of queries indicated in the task,
we developed 4 additional approaches based on altering the
DTW:</p>
      <p>(A2) This approach considers a cut of up to 250 ms at the end
of a query, keeping the non-cut segment above 500 ms (example
in Figure 1).</p>
      <p>Figure 1. Query vs. Audio posterior distance matrix (top) and
the best path from A2 (bottom).</p>
      <p>(A3) This approach considers a cut of up to 250 ms at the beginning
of a query, keeping the remaining segment above 500 ms. To improve
computational speed, we reason that the basic DTW paths that
include the matching query should already be among the lowest-distance
paths for this query-audio pair; therefore we only search for
solutions by backtracking the 5 best total paths.</p>
      <p>(A4) This approach allows just one “jump” in the DTW path
(Figure 2): for each possible path, a jump of up to half the
query’s length is allowed, to cover for possible extra words
between the query’s own words. The jump may not occur in the
initial and final 250 ms of the query, and is not allowed for
queries shorter than 800 ms.</p>
      <p>Figure 2. Query vs. Audio posterior distance matrix (top) and
the best path from A4 (bottom).</p>
      <p>(A5) This approach allows swaps of two path segments (Figure
3). This accounts for re-ordering of query words by backtracking
the candidate DTW paths from the end of query (as for A3) and
finding an alternative path that appears ahead of the initial one,
but which better matches the beginning of the query. This second
path segment can’t start before the end of the first one but can
start later to account for a gap due to an extra word in the middle
of the query. The same limitations as for A4 apply.</p>
      <p>Figure 3. Query vs. Audio posterior distance matrix (top) and
the best path from A5 (bottom).</p>
      <p>All examples in the figures are true cases from the development
dataset that were at first rejected with strategy A1 but were
accepted with one of the other approaches.</p>
    </sec>
    <sec id="sec-5">
      <title>2.4 Fusing</title>
      <p>Since different approaches provide different distance measures
for the same query-audio pair, one could argue that the minimum
of the distances obtained through them would correspond to the best
possible detection. However, tests showed that the minimum was
not the best method, presumably because it increases false alarms through
one of the special approaches. The harmonic mean was found to
be a good compromise, and was employed here to extract a single
distance value from the several approaches.</p>
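      <p>The harmonic-mean fusion of per-approach distances is straightforward (a sketch, assuming strictly positive distance values as produced by the negative-log local distances above):</p>

```python
import numpy as np

def fuse_harmonic(distances):
    """Harmonic mean of the per-approach distances for one
    query-audio pair. `distances` must be strictly positive."""
    d = np.asarray(distances, dtype=float)
    return len(d) / np.sum(1.0 / d)
```

The harmonic mean is dominated by the smallest value but less so than the plain minimum, which matches the compromise described above.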
      <p>Per-query normalization is performed by subtracting the mean
and dividing by the standard deviation of all the results from
query-audio pairs for a given query. This step may skew the
results towards indicating that every query should be found at least once
in the database, but we found the procedure to be highly
beneficial.</p>
      <p>To fuse systems that are based on recognizers of different
languages, we employ the arithmetic mean of the already
normalized values, which was found to be the best method on the
development dataset. Distances are transformed into figures of
merit simply by negating them.</p>
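      <p>The per-query normalization and language fusion described in the two paragraphs above can be sketched as follows (our own illustration; the array layout, one row of distances per language recognizer, is an assumption):</p>

```python
import numpy as np

def score_query(dist_per_lang):
    """Final scores for one query against every audio file.

    dist_per_lang: (n_langs, n_audio) distances, one row per
    language recognizer. Returns scores where higher = better match.
    """
    d = np.asarray(dist_per_lang, dtype=float)
    # per-query z-normalization within each language
    z = (d - d.mean(axis=1, keepdims=True)) / d.std(axis=1, keepdims=True)
    fused = z.mean(axis=0)  # arithmetic mean across languages
    return -fused           # negate: distance -> figure of merit
```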
    </sec>
    <sec id="sec-6">
      <title>2.5 Processing Speed</title>
      <p>The hardware that ran our systems was a CRAY CX1
cluster running Windows Server 2008 HPC, using 16 of its 56
cores (7 nodes, each with dual quad-core Intel Xeon 5520 processors at
2.27 GHz and 24 GB of RAM). Approximately, the Indexing Speed
Factor was 1.4, the Searching Speed Factor was 0.0029 per second and
per language, and Peak Memory was 0.098 GB.</p>
    </sec>
    <sec id="sec-7">
      <title>3. SUBMISSIONS AND RESULTS</title>
      <p>We submitted two systems for evaluation, one primary (on time)
and one late. The primary system is simply a fusion of the A1 and
A2 approaches for the 3 languages. The late system corresponds
to the fusion of all 5 approaches. A summary of the scores
obtained for each of them is shown in Table 1.</p>
      <p>Table 1. Scores of the primary and late systems.</p>
      <p>The late system did not improve the overall results across all matching
query types. By analyzing the results of individual query types, we
found that it did improve slightly for type 3 queries, from a Cnxe of 0.8049
for the primary system to 0.7865 for the late system. This is where
approaches A4 and A5 would be exclusively useful, and the late
system was submitted in order to include and discuss these special
methods, which, at first, significantly increased the number of
false positives in our trials. For each approach alone, the resulting
Cnxe scores on the Eval dataset were: A1: 0.6823, A2: 0.6721,
A3: 0.6947, A4: 0.6957 and A5: 0.6999.</p>
      <p>By analyzing the output of the scoring tool, we also
observed that the Cnxe could have been substantially decreased
(from 0.6797 to 0.5438 on the Dev set), showing that we lacked an
optimization method for it.</p>
    </sec>
    <sec id="sec-8">
      <title>4. Conclusions</title>
      <p>The complicated nature of this year’s task presents an added
difficulty, and we tried a few strategies for dealing with it.
We should definitely have applied methods to optimize Cnxe,
since we know that approaching the attainable MinCnxe would
improve our results by a large margin; this is something to revisit
in future participations. Our main conclusion is that including the
possibility of, e.g., re-ordering of words increases false positives
overall for our approach, and since these special cases are a small
part of the database, results may worsen.</p>
      <p>We would like to thank the BUT group for making
their phoneme recognition system available. Jorge Proença is
supported by FCT grant SFRH/BD/97204/2013.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <article-title>"Query by Example Search on Speech at Mediaeval 2014"</article-title>
          ,
          <source>in Working Notes Proceedings of the Mediaeval 2014 Workshop</source>
          , Barcelona, Spain, October 16-17.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Veiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sá</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Perdigão</surname>
          </string-name>
          .
          <article-title>Acoustic Similarity Scores for Keyword Spotting</article-title>
          .
          <source>In PROPOR</source>
          <year>2014</year>
          , São Carlos, Brazil, October 6-9,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3] Phoneme recognizer based on long temporal context, Brno University of Technology, FIT. http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.J.</given-names>
            <surname>Hazen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.M.</given-names>
            <surname>White</surname>
          </string-name>
          .
          <article-title>Query-by-example spoken term detection using phonetic posteriorgram templates</article-title>
          .
          <source>In ASRU</source>
          <year>2009</year>
          :
          <fpage>421</fpage>
          -
          <lpage>426</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>