The SPL-IT-UC Query by Example Search on Speech
                   system for MediaEval 2015
                               Jorge Proença1,2, Luis Castela2, Fernando Perdigão1,2
                                       1 Instituto de Telecomunicações, Coimbra, Portugal
                         2 Electrical and Computer Eng. Department, University of Coimbra, Portugal

                                                      {jproenca, fp}@co.it.pt

ABSTRACT
This document describes the system built by the SPL-IT-UC team
                                                                      2. SYSTEM DESCRIPTION
from the Signal Processing Lab of Instituto de Telecomunicações       2.1 Noise filtering
(pole of Coimbra) and University of Coimbra for the Query by             First, we apply a high pass filter to the audio signals to remove
Example Search on Speech Task (QUESST) of MediaEval 2015.             low frequency artefacts. Then, to tackle the existence of
The submitted system filters considerable background noise by         substantial stationary background noise in both queries and
applying spectral subtraction, uses five phonetic recognizers from    reference audio, we apply spectral subtraction (SS) to noisy
which posterior probabilities are extracted as features, implements   signals (not performed for high SNR signals, which was
novel modifications of Dynamic Time Warping (DTW) that focus          worsening results). This implies a careful selection of samples of
on complex queries, and uses linear calibration and fusion to         noise from an utterance. For this, we analyze the log mean Energy
optimize results. This year’s task proved extra challenging in        of the signal, consider only values above -60dB and calculate a
terms of acoustic conditions and match cases, though we observe       threshold below which segments of more than 100ms are selected
the best results when merging all complex approaches.                 as “noise” samples, whose mean spectrum will be subtracted from
                                                                      the whole signal.
1. INTRODUCTION                                                          Using dithering (white noise) to counterbalance the musical
   MediaEval’s challenge for audio query search on audio,             noise effect due to SS didn’t help. Nothing was specifically
QUESST [1], keeps adding new relevant problems to tackle, and         performed for reverberation or intermittent noise.
in depth details can be consulted in the referenced paper. Briefly,   2.2 Phonetic recognizers
this year, the audio can have significant background or                  We continue to use an available external tool based on neural
intermittent noise as well as reverberation, and there are queries    networks and long temporal context, the phoneme recognizers
that originate from spontaneous requests. These conditions further    from Brno University of Technology (BUT) [10]. We used the
approach real case scenarios of a query search application, which     three available systems for 8kHz audio, for three languages:
is one of the underlying motivations of the challenge.                Czech, Hungarian and Russian. Additionally, we trained two new
   Systems for Query by Example (QbE) search keep improving           systems with the same framework: English (using TIMIT and
with recent advances such as combining spectral acoustic and          Resource Management databases) and European Portuguese
temporal acoustic models [2], combining a high number of              (using annotated broadcast news data and a dataset of command
subsystems using both Acoustic Keyword Spotting (AKWS) and            words and sentences). Using different languages implies dealing
Dynamic Time Warping (DTW) and using bottleneck features of           with different sets of phonemes, and the fusion of the results will
neural networks as input [3], new distance normalization              better describe the similarities between what is said in a query and
techniques [4] and several approaches to system fusion and            the searched audio. This makes our system a low-resource one.
calibration [5]. Some attempts have been made to address                 All de-noised queries and audio files were run through the 5
complex query types that were introduced in QUESST 2014, by           systems, extracting frame-wise state-level posterior probabilities
segmenting the query in some way such as using a moving               (with 3 states per phoneme) to be analyzed separately.
window [6] or splitting the query in the middle [7]. Our approach
is based on modifying the DTW algorithm to allow paths to be          2.3 Voice Activity Detection
created in ways that conform to the complex types, which has             Silence or noise segments are undesirable for a query search,
shown success in improving overall results [8, 9].                    and were cut on queries from all frames that had a high
   Using bottleneck features could be an important step to improve    probability of corresponding to silence or noise, if the sum of the
our systems, and although we did not consider them yet, we are        3 state posteriors of silence or noise phones is greater than a 50%
certain that there are still improvements to be made not related to   threshold for the average of the 5 languages. To account for
the feature extractor. Nevertheless, we tried to improve the system   queries that may still have significant noise, this threshold is
feature-wise by implementing additional phonetic recognizers.         incrementally raised if the previous cut is too severe (the obtained
Compared to last year’s submission, we reduce severe background       query having less than 500ms).
noise in the audio by applying spectral subtraction, add two          2.4 Modified Dynamic Time Warping
phonetic recognizers (five in total) from which to extract              Every query must then be searched on every reference audio.
posteriorgrams, remove all silence and noise frames from queries,     We continue to use DTW as a way to find paths on the audio that
improve and add modifications to Dynamic Time Warping (DTW)           may be similar to the query. As in [11], the local distance is based
for complex query search (filler inside a query being the novelty),   on the dot product of query and audio posterior probability
and implement better fusion and calibration methods.                  vectors, with a back-off of 10-4 and minus log applied to the dot
 Copyright is held by the author/owner(s).
                                                                      product, resulting in the local distance matrix of a query-audio
 MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany      pair, ready for DTW to be applied. Before dot product, removing
the posterior probabilities of silence and noise altogether, and        2.6 Processing Speed
normalizing the probabilities of speech phones to 1 did not help.         The hardware that processed our systems was the CRAY CX1
   In addition to separate searches on distance matrices from           Cluster, running windows server 2008 HPC, and using 16 of 56
posteriorgrams of 5 languages, we add a 6th “language”/sub-             cores (7 nodes with double Intel Xeon 5520 2.27GHz quad-core
system (called ML for multi-language) whose distance matrix is          and 24GB RAM per node).
the average of the 5 matrices.                                            Approximately, the Indexing Speed Factor was 2.14, Searching
   We employ several modifications to the DTW algorithm to              Speed Factor was 0.0034 per sec, and Peak Memory was 120MB.
allow intricate paths to be constructed that can correspond
logically to the complex match scenarios of query and audio. The        3. SUBMISSIONS AND RESULTS
basic approach (named A1) outputs the best path (minimal                  We submitted 4 systems for evaluation: fusion of all approaches
normalized distance) by using the complete query and allows 3           and languages with and without side-info; fusion of harmonic
movements in the distance matrix with equal weight: horizontal,         mean with and without side-info. Table 1 summarizes the results
vertical and diagonal. As in previous work [8, 9], 4 modifications      of the Cnxe metric for the 4 systems.
are made that do not require repeating DTW with segmentations
of the query:                                                           Table 1. Results of Cnxe (and MinCnxe) for development and
   (A2) Considering cuts at the end of query for lexical variations;                         evaluation datasets.
   (A3) Considering cuts at the beginning of the query;
                                                                         Fusion Systems        Dev: Cnxe, MinCnxe    Eval: Cnxe, MinCnxe
   (A4) Allowing one ‘jump’ for extra content in the audio;
   (A5) Allowing word-reordering, where an initial part of the           All + side-info         0.7782, 0.7716         0.7866, 0.7809
query may be found ahead of the last.                                    Hmean + side-info       0.7862, 0.7800         0.7842, 0.7786
   Additionally, we designed an extra approach to address the case
                                                                         All, no side            0.7873, 0.7816         0.7930, 0.7875
of type 3 queries where small fillers or hesitations in the query
may exist:                                                               Hmean, no side          0.7957, 0.7893         0.7914, 0.7865
   (A6) Allowing one ‘jump’ along the query, of maximum 33%
of query duration.                                                         It can be seen that the best result for the development set was
   Each distance is normalized per the number of movements              our primary system that fused all languages and approaches plus
(mean distance of the path).                                            some side-info. As suspected, the weighted combination applied
                                                                        to the Eval set may be too over fitted to the Dev set, as the best
2.5 Fusion and Calibration                                              result on Eval was using the Hmean of approaches. The
   At this stage, we have distance values for each audio-query pair
                                                                        considered side-info always helped.
for 6 languages and 6 DTW strategies (36 vectors). First,
                                                                           Analyzing the best result on Eval per query type (All-0.7842,
modifications are performed on the distribution per query per
                                                                        T1-0.7107, T2-0.8147, T3-0.8115), the exact matches of type 1
strategy. While deciding on a maximum distance value to give to
                                                                        are the easiest to detect compared to other types.
unsearched cases (such as an audio being too short for a long
                                                                           Using only each individual DTW approach and fusing, the
query), we found that drastically truncating large distances
                                                                        results on the Dev set are: A1: 0.8041, A2: 0.7978, A3: 0.8335,
(lowering to the same value) improved results. Surprisingly,
                                                                        A4: 0.8137, A5: 0.8184, A6: 0.8460. The overall best performing
changing all upper distance values (larger than the mean) to the
                                                                        strategy is allowing cuts at the end of the query (A2), which may
mean of the distribution was the overall best. We reason that since
                                                                        help in all cases due to co-articulation or intonation. The new
there are a lot of ground truth matches with very high distances
                                                                        strategy of allowing a jump in query (A6) performs badly and
(false negatives), lowering these values improves the Cnxe metric
                                                                        should be reviewed. Actually, a filler in a query may be an
more than lowering the value of true negatives worsens it. The
                                                                        extension of an existing phone, which leads to a straight path and
next step is to normalize per query by subtracting the new mean
                                                                        not a jump.
and dividing by the new standard deviation. Distances are
                                                                           Below, we also report the improvements of some steps of our
transformed to figures-of-merit by taking the symmetrical value.
                                                                        system on the Dev set (although the comparison may not be to the
   To fuse results of different strategies and languages we have
                                                                        final approach). Using Spectral Subtraction resulted in 0.8130
two separate approaches/systems, both using weighted linear
                                                                        Cnxe from 0.8368. Fusing 5 languages - 0.7971 Cnxe, using only
fusion and transformation trained with the Bosaris toolkit [12],
                                                                        the mean distance matrix ML - 0.8136, using 5 langs and ML -
calibrating for the Cnxe metric by taking into account the prior
                                                                        0.7873. Using per query truncation to the mean – 0. 7873 Cnxe,
defined by the task:
                                                                        without truncation – 0.7939.
   - Fusion of all approaches and all languages (36 vectors).
   - Fusion of the Harmonic mean (Hmean) of the 6 strategies for        4. CONCLUSIONS
a given language (6 vectors, one per language). This is done to            Several steps were explored to improve the results of existing
possibly counter overfitting to the training data from weighing 36      methods, and the main contributions came from: a careful Spectral
vectors, and only languages are weighed.                                Subtraction to diminish background noise which greatly
   From a fusion, final result vectors with only one value per          influences the output of phonetic recognizers; using the average
audio-query pair are obtained for development and test data.            distance matrix of all languages as a 6th sub-system for fusion;
   Additionally, we provide side-info based on query and audio,         including side-info of query and audio; and per-query truncation
added as extra vectors for all fusions. The 7 extra side-info vectors   of large distances.
are: mean of distances per query before truncation and                     Including a DTW strategy that considers gaps in query did not
normalization from the best approach and language (the highest          prove very successful. This may be due to its target cases being
weighted from fusion of all); query size in frames and log of query     too few in the dataset, and even some fillers in query being
size; 4 vectors of SNR values (original SNR of query and of             extensions and not unrelated hesitations.
audio, post spectral subtraction SNR of query and of audio).
5. REFERENCES                                                     [6] P. Yang, et al., “The NNI Query-by-Example System for
[1] I. Szöke, L.J. Rodriguez-Fuentes, A. Buzo, X. Anguera, F.         MediaEval 2014”, in Working Notes Proceedings of the
    Metze, J. Proença, M. Lojka, and X. Xiong, "Query by              Mediaeval 2014 Workshop, Barcelona, Spain, October 16-17
    Example Search on Speech at Mediaeval 2015", in Working       [7] I. Szöke, M. Skácel and L. Burget, “BUT QUESST 2014
    Notes Proceedings of the Mediaeval 2015 Workshop,                 system description”, in Working Notes Proceedings of the
    Wurzen, Germany, September 14-15                                  Mediaeval 2014 Workshop, Barcelona, Spain, October 16-17
[2] C. Gracia, X. Anguera, and X. Binefa, “Combining temporal     [8] J. Proença, A. Veiga and F. Perdigão, “The SPL-IT Query by
    and spectral information for Query-by-Example Spoken              Example Search on Speech system for MediaEval 2014”, in
    Term Detection,” in Proc. European Signal Processing              Working Notes Proceedings of the Mediaeval 2014
    Conference (EUSIPCO), Lisbon, Portugal, 2014, pp. 1487-           Workshop, Barcelona, Spain, October 16-17
    1491.                                                         [9] J. Proença, A. Veiga and F. Perdigão, "Query by Example
[3] I. Szöke, L. Burget, F. Grezl, J.H. Cernocky, and L. Ondel,       Search with Segmented Dynamic Time Warping for Non-
    “Calibration and fusion of query-by-example systems—BUT           Exact Spoken Queries", Proc European Signal Processing
    SWS 2013,” in Proc IEEE International Conference on               Conf. - EUSIPCO, Nice, France, August, 2015.
    Acoustics, Speech and Signal Processing (ICASSP),             [10] Phoneme recognizer based on long temporal context, Brno
    Florence, Italy, 2014, pp. 7849-7853.                              University of Technology, FIT, http://speech.fit.vutbr.cz/
[4] L.J. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G.             software/phoneme-recognizer-based-long-temporal-context
    Bordel, and M. Diez, “High-performance query-by-example       [11] T.J. Hazen, W.Shen and C.M. White, “Query-by-example
    spoken term detection on the SWS 2013 evaluation,” in Proc         spoken term detection using phonetic posteriorgram
    IEEE International Conference on Acoustics, Speech and             templates”, In ASRU 2009: 421-426.
    Signal Processing (ICASSP), Florence, Italy, 2014, pp.
    7819-7823.                                                    [12] N. Brummer, and E. de Villiers, “The BOSARIS Toolkit
                                                                       User Guide: Theory, Algorithms and Code for Binary
[5] A. Abad, L.J. Rodriguez Fuentes, M. Penagarikano, A.               Classifer Score Processing,” Technical report, 2011. https://
    Varona, M. Diez, and G. Bordel, “On the calibration and            sites.google.com/site/bosaristoolkit/
    fusion of heterogeneous spoken term detection systems,” in
    Proc. Interspeech 2013, Lyon, France, 2013, pp. 20-24.