<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The SPL-IT-UC Query by Example Search on Speech system for MediaEval 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Proença</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Castela</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Perdigão</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Electrical and Computer Eng. Department, University of Coimbra</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto de Telecomunicações</institution>
          ,
          <addr-line>Coimbra</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This document describes the system built by the SPL-IT-UC team from the Signal Processing Lab of Instituto de Telecomunicações (pole of Coimbra) and University of Coimbra for the Query by Example Search on Speech Task (QUESST) of MediaEval 2015. The submitted system filters considerable background noise by applying spectral subtraction, uses five phonetic recognizers from which posterior probabilities are extracted as features, implements novel modifications of Dynamic Time Warping (DTW) that focus on complex queries, and uses linear calibration and fusion to optimize results. This year's task proved extra challenging in terms of acoustic conditions and match cases, though we observe the best results when merging all complex approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        MediaEval’s challenge of searching audio with audio queries,
QUESST [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], keeps adding relevant new problems to tackle; in-depth
details can be found in the referenced paper. Briefly, this year the
audio can contain significant background or intermittent noise as
well as reverberation, and some queries originate from spontaneous
requests. These conditions come closer to the real-world scenarios
of a query search application, which is one of the underlying
motivations of the challenge.
      </p>
      <p>
        Systems for Query by Example (QbE) search keep improving
with recent advances such as combining spectral acoustic and
temporal acoustic models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], combining a high number of
subsystems using both Acoustic Keyword Spotting (AKWS) and
Dynamic Time Warping (DTW) and using bottleneck features of
neural networks as input [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], new distance normalization
techniques [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and several approaches to system fusion and
calibration [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Some attempts have been made to address
complex query types that were introduced in QUESST 2014, by
segmenting the query in some way such as using a moving
window [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or splitting the query in the middle [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Our approach
is based on modifying the DTW algorithm to allow paths to be
created in ways that conform to the complex types, which has
shown success in improving overall results [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ].
      </p>
      <p>Using bottleneck features could be an important step towards
improving our systems; although we have not yet adopted them, we are
certain that improvements unrelated to the feature extractor remain
to be made. Nevertheless, we did try to improve the system
feature-wise by implementing additional phonetic recognizers.
Compared to last year’s submission, we reduce severe background
noise in the audio by applying spectral subtraction, add two
phonetic recognizers (five in total) from which to extract
posteriorgrams, remove all silence and noise frames from queries,
improve and extend our modifications of Dynamic Time Warping (DTW)
for complex query search (a filler inside a query being the novelty),
and implement better fusion and calibration methods.</p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Noise filtering</title>
      <p>First, we apply a high-pass filter to the audio signals to remove
low-frequency artefacts. Then, to tackle the substantial stationary
background noise present in both queries and reference audio, we
apply spectral subtraction (SS) to noisy signals (it is not applied
to high-SNR signals, where it worsened results). This requires a
careful selection of noise samples from an utterance. For this, we
analyze the log mean energy of the signal, consider only values
above -60 dB, and compute a threshold below which segments longer
than 100 ms are selected as “noise” samples, whose mean spectrum is
then subtracted from the whole signal.</p>
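The noise-sample selection and subtraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the -60 dB floor and the 100 ms minimum run length come from the text, while the frame sizes, the 20th-percentile threshold rule, the spectral floor, and the overlap-add resynthesis are our assumptions.

```python
import numpy as np

def spectral_subtract(x, fs, floor_db=-60.0, min_noise_ms=100, pct=20):
    """Sketch of the paper's SS step: pick low-energy runs longer than
    100 ms as noise samples and subtract their mean magnitude spectrum.
    Threshold rule (20th percentile) and resynthesis are assumptions."""
    n, h = int(0.025 * fs), int(0.010 * fs)       # 25 ms frames, 10 ms hop
    idx = np.arange(0, len(x) - n, h)
    frames = np.stack([x[i:i + n] for i in idx]) * np.hanning(n)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Frame log mean energy; only frames above the -60 dB floor count
    e_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    thr = np.percentile(e_db[e_db > floor_db], pct)   # assumed rule
    quiet = (e_db > floor_db) & (e_db < thr)
    # Keep only contiguous quiet runs longer than min_noise_ms
    min_run, noise, start = max(1, min_noise_ms // 10), np.zeros_like(quiet), None
    for i, q in enumerate(np.append(quiet, False)):
        if q and start is None:
            start = i
        elif not q and start is not None:
            if i - start >= min_run:
                noise[start:i] = True
            start = None
    if noise.any():
        # Subtract the mean noise spectrum, with a small spectral floor
        mag = np.maximum(mag - mag[noise].mean(axis=0), 0.05 * mag)
    # Overlap-add resynthesis
    y = np.zeros(len(x))
    out = np.fft.irfft(mag * np.exp(1j * phase), n=n, axis=1)
    for k, i in enumerate(idx):
        y[i:i + n] += out[k]
    return y
```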
      <p>Using dithering (adding white noise) to counterbalance the
musical-noise effect of SS did not help. Nothing was specifically
done for reverberation or intermittent noise.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Phonetic recognizers</title>
      <p>
        We continue to use an available external tool based on neural
networks and long temporal context, the phoneme recognizers
from Brno University of Technology (BUT) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We used the
three available systems for 8kHz audio, for three languages:
Czech, Hungarian and Russian. Additionally, we trained two new
systems with the same framework: English (using TIMIT and
Resource Management databases) and European Portuguese
(using annotated broadcast news data and a dataset of command
words and sentences). Using different languages implies dealing
with different sets of phonemes, and fusing the results better
describes the similarities between what is said in a query and in
the searched audio. Since no target-language resources are used,
our system qualifies as a low-resource one.
      </p>
      <p>All de-noised queries and audio files were run through the 5
systems, extracting frame-wise state-level posterior probabilities
(with 3 states per phoneme) to be analyzed separately.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Voice Activity Detection</title>
      <p>Silence and noise segments are undesirable for a query search,
so we cut from each query every frame with a high probability of
corresponding to silence or noise: frames where the sum of the 3
state posteriors of the silence and noise phones, averaged over the
5 languages, exceeds a 50% threshold. To account for queries that
may still contain significant noise, this threshold is incrementally
raised whenever the cut is too severe (i.e., the resulting query is
shorter than 500 ms).</p>
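The frame-dropping rule above can be sketched as a short function. This is an illustrative reading of the text, assuming posteriorgrams stored as (frames × states) NumPy arrays, a known set of silence/noise state columns, 10 ms frames, and an assumed 0.05 step for relaxing the threshold:

```python
import numpy as np

def cut_silence(posteriors_per_lang, sil_idx, frame_ms=10, thr=0.5, min_ms=500):
    """Drop frames whose silence/noise state posteriors (summed, then
    averaged over languages) exceed the threshold; relax the threshold
    stepwise while fewer than 500 ms of query remain.
    `posteriors_per_lang`: list of (T, S) arrays, one per language.
    `sil_idx`: columns of the silence/noise states (assumed layout)."""
    sil = np.mean([p[:, sil_idx].sum(axis=1) for p in posteriors_per_lang],
                  axis=0)                     # per-frame silence probability
    keep = sil < thr
    while keep.sum() * frame_ms < min_ms and thr < 1.0:
        thr += 0.05                           # step size is an assumption
        keep = sil < thr
    return keep                               # boolean mask of kept frames
```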
    </sec>
    <sec id="sec-6">
      <title>2.4 Modified Dynamic Time Warping</title>
      <p>
        Every query must then be searched in every reference audio.
We continue to use DTW to find paths in the audio that may be
similar to the query. As in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the local distance is based
on the dot product of the query and audio posterior probability
vectors, with a back-off of 10^-4 and minus the logarithm applied
to the dot product, yielding the local distance matrix of a
query-audio pair, on which DTW is then applied. Removing the
posterior probabilities of silence and noise before the dot product
and renormalizing the speech-phone probabilities to sum to 1 did
not help.
      </p>
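The local distance computation just described reduces to a few lines. A minimal sketch, assuming posteriorgrams as (frames × states) NumPy arrays (the array layout is our assumption; the dot product, the 10^-4 back-off and the minus log come from the text):

```python
import numpy as np

def local_distance(query_post, audio_post, backoff=1e-4):
    """Local distance matrix of a query-audio pair, as in [11]:
    -log of the dot product of posterior vectors, floored by a
    1e-4 back-off so the log stays finite."""
    dots = query_post @ audio_post.T          # (Tq, Ta) similarities
    return -np.log(np.maximum(dots, backoff))
```

For the ML subsystem, the same function would be applied per language and the five resulting matrices averaged before DTW.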
      <p>In addition to the separate searches on the distance matrices
from the posteriorgrams of the 5 languages, we add a 6th
“language”/subsystem (called ML, for multi-language) whose
distance matrix is the average of the 5 matrices.</p>
      <p>
        We employ several modifications to the DTW algorithm to
allow intricate paths to be constructed that can correspond
logically to the complex match scenarios of query and audio. The
basic approach (named A1) outputs the best path (minimal
normalized distance) by using the complete query and allows 3
movements in the distance matrix with equal weight: horizontal,
vertical and diagonal. As in previous work [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], 4 modifications
are made that do not require repeating DTW with segmentations
of the query:
(A2) Considering cuts at the end of query for lexical variations;
(A3) Considering cuts at the beginning of the query;
(A4) Allowing one ‘jump’ for extra content in the audio;
(A5) Allowing word-reordering, where an initial part of the
query may be found ahead of the last.
      </p>
      <p>Additionally, we designed an extra approach to address the case
of type 3 queries where small fillers or hesitations in the query
may exist:</p>
      <p>(A6) Allowing one ‘jump’ along the query, of maximum 33%
of query duration.</p>
      <p>Each distance is normalized by the number of movements
(mean distance along the path).</p>
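A minimal sketch of the basic A1 search over a local-distance matrix: a subsequence DTW whose path may start and end anywhere in the audio, with equal-weight horizontal, vertical and diagonal moves, and the final score normalized by the path length. Tie handling and the exact recursion details are our assumptions; A2-A6 would add extra transitions (cuts, jumps, reordering) on top of this skeleton.

```python
import numpy as np

def subsequence_dtw(dist):
    """Basic search (A1): rows of `dist` index query frames, columns
    index audio frames. Returns (mean path distance, end column)."""
    Tq, Ta = dist.shape
    acc = np.full((Tq, Ta), np.inf)       # accumulated distance
    cnt = np.ones((Tq, Ta), dtype=int)    # number of cells on the path
    acc[0] = dist[0]                      # free start anywhere in the audio
    for i in range(1, Tq):
        for j in range(Ta):
            best_a, best_c = acc[i - 1, j], cnt[i - 1, j]          # vertical
            if j > 0:
                if acc[i - 1, j - 1] < best_a:                     # diagonal
                    best_a, best_c = acc[i - 1, j - 1], cnt[i - 1, j - 1]
                if acc[i, j - 1] < best_a:                         # horizontal
                    best_a, best_c = acc[i, j - 1], cnt[i, j - 1]
            acc[i, j] = best_a + dist[i, j]
            cnt[i, j] = best_c + 1
    scores = acc[-1] / cnt[-1]            # mean distance along each path
    end = int(np.argmin(scores))          # free end anywhere in the audio
    return scores[end], end
```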
    </sec>
    <sec id="sec-7">
      <title>2.5 Fusion and Calibration</title>
      <p>At this stage, we have distance values for each audio-query pair
for 6 languages and 6 DTW strategies (36 vectors). First, the
distribution is modified per query and per strategy. While deciding
on a maximum distance value to assign to unsearched cases (such as
an audio file being too short for a long query), we found that
drastically truncating large distances (lowering them to a common
value) improved results. Surprisingly, changing all upper distance
values (those larger than the mean) to the mean of the distribution
worked best overall. We reason that, since many ground-truth
matches have very high distances (false negatives), lowering these
values improves the Cnxe metric more than lowering the value of
true negatives worsens it. The next step is to normalize per query
by subtracting the new mean and dividing by the new standard
deviation. Distances are then converted to figures of merit by
negating them (taking the symmetric value).</p>
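The per-query post-processing just described (truncation to the mean, z-normalization with the new statistics, negation) amounts to a few lines, sketched here for one query's distance vector:

```python
import numpy as np

def query_normalize(d):
    """Per-query score shaping: truncate distances above the mean down
    to the mean, z-normalize with the new statistics, and negate so
    that higher values mean better matches."""
    d = np.minimum(d, d.mean())               # truncate the upper tail
    d = (d - d.mean()) / (d.std() + 1e-12)    # z-normalize (new stats)
    return -d                                 # figure of merit
```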
      <p>
        To fuse results of different strategies and languages we have
two separate approaches/systems, both using weighted linear
fusion and transformation trained with the Bosaris toolkit [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
calibrating for the Cnxe metric by taking into account the prior
defined by the task:
- Fusion of all approaches and all languages (36 vectors).
- Fusion of the harmonic mean (Hmean) of the 6 strategies for
each language (6 vectors, one per language). This is done to
counter possible overfitting to the training data from weighing 36
vectors, since only the languages are weighed.
      </p>
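The Hmean reduction from 36 inputs to 6 can be sketched as below. We assume scores arranged as a (strategies × pairs) array and positive-valued at the point where the harmonic mean is taken (the exact pipeline stage is not stated in the text):

```python
import numpy as np

def hmean(scores):
    """Harmonic mean across the 6 DTW strategies (axis 0) for one
    language, yielding one fused vector per language."""
    return scores.shape[0] / np.sum(1.0 / scores, axis=0)
```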
      <p>Each fusion yields a final result vector, with one value per
audio-query pair, for the development and test data.</p>
      <p>Additionally, we provide side-info based on query and audio,
added as extra vectors for all fusions. The 7 extra side-info vectors
are: mean of distances per query before truncation and
normalization from the best approach and language (the highest
weighted from fusion of all); query size in frames and log of query
size; 4 vectors of SNR values (original SNR of query and of
audio, post spectral subtraction SNR of query and of audio).</p>
    </sec>
    <sec id="sec-8">
      <title>2.6 Processing Speed</title>
      <p>The hardware that processed our systems was a CRAY CX1
cluster running Windows Server 2008 HPC, using 16 of its 56
cores (7 nodes, each with two quad-core Intel Xeon 5520 CPUs at
2.27 GHz and 24 GB of RAM).</p>
      <p>Approximately, the Indexing Speed Factor was 2.14, the Searching
Speed Factor was 0.0034 per second, and Peak Memory was 120 MB.</p>
    </sec>
    <sec id="sec-9">
      <title>3. SUBMISSIONS AND RESULTS</title>
      <p>We submitted 4 systems for evaluation: fusion of all approaches
and languages with and without side-info; fusion of harmonic
mean with and without side-info. Table 1 summarizes the results
of the Cnxe metric for the 4 systems.</p>
      <p>It can be seen that the best result on the development set came
from our primary system, which fused all languages and approaches
plus the side-info. As suspected, the weighted combination applied
to the Eval set may be overfitted to the Dev set, as the best
result on Eval was obtained using the Hmean of approaches. The
considered side-info always helped.</p>
      <p>Analyzing the best Eval result per query type (All: 0.7842,
T1: 0.7107, T2: 0.8147, T3: 0.8115), the exact matches of type 1
are the easiest to detect compared to the other types.</p>
      <p>Fusing each individual DTW approach on its own, the
results on the Dev set are: A1: 0.8041, A2: 0.7978, A3: 0.8335,
A4: 0.8137, A5: 0.8184, A6: 0.8460. The best performing single
strategy is allowing cuts at the end of the query (A2), which may
help in all cases due to co-articulation or intonation. The new
strategy of allowing a jump inside the query (A6) performs badly and
should be reviewed. In fact, a filler in a query may be an
extension of an existing phone, which leads to a straight path
rather than a jump.</p>
      <p>Below, we also report the improvements from some steps of our
system on the Dev set (although the comparisons may not be against
the final approach). Using spectral subtraction improved Cnxe from
0.8368 to 0.8130. Fusing the 5 languages gave 0.7971 Cnxe; using
only the mean distance matrix ML, 0.8136; using the 5 languages
plus ML, 0.7873. Using per-query truncation to the mean gave
0.7873 Cnxe, against 0.7939 without truncation.</p>
    </sec>
    <sec id="sec-10">
      <title>4. CONCLUSIONS</title>
      <p>Several steps were explored to improve on the results of existing
methods. The main contributions came from: a careful spectral
subtraction to diminish background noise, which greatly
influences the output of the phonetic recognizers; using the average
distance matrix of all languages as a 6th subsystem for fusion;
including side-info about query and audio; and per-query truncation
of large distances.</p>
      <p>Including a DTW strategy that considers gaps in the query did not
prove very successful. This may be because its target cases are too
few in the dataset, and because some fillers in queries are
extensions of existing phones rather than unrelated hesitations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Proença</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lojka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>"Query by Example Search on Speech at Mediaeval 2015"</article-title>
          ,
          <source>in Working Notes Proceedings of the Mediaeval 2015 Workshop</source>
          , Wurzen, Germany, September 14-15
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gracia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Binefa</surname>
          </string-name>
          , “
          <article-title>Combining temporal and spectral information for Query-by-Example Spoken Term Detection,”</article-title>
          <source>in Proc. European Signal Processing Conference (EUSIPCO)</source>
          , Lisbon, Portugal,
          <year>2014</year>
          , pp.
          <fpage>1487</fpage>
          -
          <lpage>1491</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grezl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Cernocky</surname>
          </string-name>
          , and L. Ondel, “
          <article-title>Calibration and fusion of query-by-example systems - BUT SWS 2013</article-title>
          ,” in
          <source>Proc IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , Florence, Italy,
          <year>2014</year>
          , pp.
          <fpage>7849</fpage>
          -
          <lpage>7853</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          , G. Bordel, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Diez</surname>
          </string-name>
          , “
          <article-title>High-performance query-by-example spoken term detection on the SWS 2013 evaluation,”</article-title>
          <source>in Proc IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , Florence, Italy,
          <year>2014</year>
          , pp.
          <fpage>7819</fpage>
          -
          <lpage>7823</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diez</surname>
          </string-name>
          , and G. Bordel, “
          <article-title>On the calibration and fusion of heterogeneous spoken term detection systems,”</article-title>
          <source>in Proc. Interspeech</source>
          <year>2013</year>
          , Lyon, France,
          <year>2013</year>
          , pp.
          <fpage>20</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al., “
          <article-title>The NNI Query-by-Example System for MediaEval 2014”</article-title>
          ,
          <source>in Working Notes Proceedings of the Mediaeval 2014 Workshop</source>
          , Barcelona, Spain, October 16-17
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skácel</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          , “
          <article-title>BUT QUESST 2014 system description”</article-title>
          ,
          <source>in Working Notes Proceedings of the Mediaeval 2014 Workshop</source>
          , Barcelona, Spain, October 16-17
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Proença</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veiga</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Perdigão</surname>
          </string-name>
          , “
          <article-title>The SPL-IT Query by Example Search on Speech system for MediaEval 2014”</article-title>
          ,
          <source>in Working Notes Proceedings of the Mediaeval 2014 Workshop</source>
          , Barcelona, Spain, October 16-17
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Proença</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veiga</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Perdigão</surname>
          </string-name>
          ,
          <article-title>"Query by Example Search with Segmented Dynamic Time Warping for NonExact Spoken Queries"</article-title>
          ,
          <source>Proc European Signal Processing Conf. - EUSIPCO</source>
          , Nice, France,
          August
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <article-title>Phoneme recognizer based on long temporal context</article-title>
          , Brno University of Technology, FIT, http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.J.</given-names>
            <surname>Hazen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.M.</given-names>
            <surname>White</surname>
          </string-name>
          , “
          <article-title>Query-by-example spoken term detection using phonetic posteriorgram templates”</article-title>
          ,
          <source>In ASRU</source>
          <year>2009</year>
          :
          <fpage>421</fpage>
          -
          <lpage>426</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brummer</surname>
          </string-name>
          , and E. de Villiers, “
          <article-title>The BOSARIS Toolkit User Guide: Theory, Algorithms</article-title>
          and Code for Binary Classifier Score Processing,”
          <source>Technical report</source>
          ,
          <year>2011</year>
          . https://sites.google.com/site/bosaristoolkit/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>