The IIT-B Query-by-Example System for MediaEval 2015

Hitesh Tulsiani, Preeti Rao
Department of Electrical Engineering, Indian Institute of Technology Bombay, India
{hitesh26, prao}@ee.iitb.ac.in

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT

This paper describes the system developed at I.I.T. Bombay for the Query-by-Example Search on Speech Task (QUESST) within the MediaEval 2015 evaluation framework. Our system preprocesses the data to remove noise and performs subsequence DTW on posterior/bottleneck features obtained using four phone recognition systems to detect the queries. Scores from these subsystems are fused to obtain a single score per query-utterance pair, which is then calibrated with respect to the cross-entropy evaluation metric.

1. INTRODUCTION

The goal of the QUESST task within the MediaEval 2015 framework is to determine the presence of a spoken query in an unlabelled speech data set by building a language-independent system. In this year's QUESST task, the data consisted of about 18 hours of noisy audio from 7 different languages. More details about the task can be found in [1].

To minimize the effect of noise, we preprocess the data (both the queries and the utterances) and follow this with speech activity detection to remove silence frames. Our approach to the task is inspired by Hazen et al. [2]. A block diagram of our system, itself inspired by [3], is shown in Figure 1.

[Figure 1: Block diagram of the IIT-B system]

2. SYSTEM DESCRIPTION

2.1 Preprocessing - Noise Removal

We use spectral subtraction to remove noise from the audio. The power spectral density (PSD) of the noise is estimated using the minimum-statistics technique described by R. Martin [4]. This technique assumes that during speech pauses, or within brief periods between words, the speech energy is close to zero. Thus, by tracking the minimum power within a finite window large enough to bridge high-power speech segments, the noise floor can be estimated. We then remove the silence at the start and end of each utterance using a simple energy-based speech activity detector.

2.2 Subsystems

We make use of four subsystems:

1. Two DNN-based phone recognisers (Hungarian and Russian) trained on the SpeechDat-E corpus by Brno University of Technology (BUT) [5]. These are used to extract posterior and bottleneck features.

2. A phone recogniser trained on a Hindi database [6] (referred to as the TIFR phone recogniser from here on). The TIFR phone recogniser is MLP based and is trained using 39-dimensional MFCC features. It has a single hidden layer with 700 neurons and 36 output neurons. We extract phone posteriors using the TIFR phone recogniser.

3. A 64-component GMM system trained in an unsupervised manner on the QUESST 2015 database using 36-dimensional MFCC features [7] (the energy of the audio was not used as a feature because large energy variations were observed across utterances). We use this system to extract Gaussian posteriorgrams.

Query Type |         eval            |          dev
           | actCnxe/minCnxe | ATWV/MTWV      | actCnxe/minCnxe | ATWV/MTWV
T1         | 0.9330/0.9117   |  0.0531/0.0661 | 0.8971/0.8680   | 0.1434/0.1449
T2         | 0.9852/0.9637   | -0.0099/0.0178 | 0.9214/0.9113   | 0.0492/0.0528
T3         | 0.9313/0.9109   |  0.0525/0.0627 | 0.9348/0.9210   | 0.0454/0.0461
overall    | 0.9536/0.9364   |  0.0254/0.0421 | 0.9213/0.9082   | 0.0812/0.0816

Table 1: Overall and per-query-type (T1/T2/T3) summary of results on the evaluation and development datasets.

2.3 DTW

We use the standard subsequence DTW as implemented in [3]. The query is allowed to start at any frame of the test utterance, and the locally optimal detection is the one that has the smallest accumulated distance. Also, to avoid a preference for shorter paths, accumulated distances are normalized by the corresponding detected path lengths.
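The subsequence search just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toolkit of [3] is replaced by a plain NumPy dynamic program, a cosine-style distance stands in for the correlation/inner-product measures used in the actual system, and no band constraints are applied.

```python
import numpy as np

def subsequence_dtw(query, utt):
    """Subsequence DTW: the query may start and end at any utterance frame.

    query, utt: (n_frames, n_dims) arrays of per-frame features.
    Returns (normalized_cost, start, end) of the best matching segment,
    where the accumulated cost is divided by the warping-path length
    to avoid a bias towards shorter paths.
    """
    n, m = len(query), len(utt)
    # Local distance: 1 - cosine similarity (a stand-in for the
    # correlation / inner-product distances of the actual system).
    qn = query / np.linalg.norm(query, axis=1, keepdims=True)
    un = utt / np.linalg.norm(utt, axis=1, keepdims=True)
    dist = 1.0 - qn @ un.T                    # (n, m) local distances

    acc = np.full((n, m), np.inf)             # accumulated cost
    length = np.zeros((n, m), dtype=int)      # warping-path length
    start = np.zeros((n, m), dtype=int)       # start frame of each path
    acc[0] = dist[0]                          # query may start anywhere
    length[0] = 1
    start[0] = np.arange(m)
    for i in range(1, n):
        for j in range(m):
            # Predecessors: (i-1, j), (i-1, j-1), (i, j-1).
            cands = [(acc[i - 1, j], length[i - 1, j], start[i - 1, j])]
            if j > 0:
                cands.append((acc[i - 1, j - 1], length[i - 1, j - 1],
                              start[i - 1, j - 1]))
                cands.append((acc[i, j - 1], length[i, j - 1],
                              start[i, j - 1]))
            c, l, s = min(cands, key=lambda t: t[0])
            acc[i, j] = c + dist[i, j]
            length[i, j] = l + 1
            start[i, j] = s
    norm = acc[-1] / length[-1]               # path-length normalization
    end = int(np.argmin(norm))                # query may end anywhere
    return float(norm[end]), int(start[-1, end]), end
```

The path-length normalization in the last step corresponds to the normalization by detected path length mentioned above; without it, the minimum over end frames would systematically favour short warping paths.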
For the distance measure, we used the Pearson product-moment correlation for bottleneck features (BUT Hungarian and Russian) and the inner product for posteriors (BUT Hungarian and Russian, TIFR, 64-GMM). A filtering step is then applied to remove detected candidates whose duration is very large or very small compared to the query length.

2.4 Fusion and Calibration

Our approach is most similar to the discriminative fusion approach proposed by A. Abad et al. [8]. Scores are first normalized to zero mean and unit variance per query to allow the use of a single threshold. The detections are then aligned, and only those detections for which at least half of the systems overlap in time are retained (majority voting), to reduce false alarms. This leaves us with multiple detections of a query in an utterance, so for each query-utterance pair we obtain multiple score vectors (a score vector is the collection of scores from all the subsystems for a possible detection of the query in an utterance). Our score vector has six elements (BUT Hungarian posterior and bottleneck, BUT Russian posterior and bottleneck, TIFR posterior, GMM posterior).

Since the task requires only one score per query-utterance pair, we determine the best score vector per query-utterance pair using a two-step procedure:

1. The first step is inspired by Hazen et al. [2]. Scores S(X|Ki) from the various subsystems are combined according to the equation:

   S(X|K1 K2 ... KN) = -(1/α) log( (1/N) Σ_{i=1}^{N} exp(-α S(X|Ki)) )    (1)

   where varying α between 0 and 1 changes the averaging function from the geometric mean to the arithmetic mean (we have used α = 1).

2. In the second step, we use the combined score obtained in the first step to determine the best candidate for an utterance. We retain the individual scores of the subsystems along with the combined score (obtained using Equation 1) corresponding to the best detected candidate, thus giving us one score vector per query-utterance pair.

All of these score vectors (corresponding to different query-utterance pairs) are then used to train a binary logistic classifier [3], which gives us the fused score for each query-utterance pair. The fused scores are then calibrated with respect to the cross-entropy evaluation metric to give us a log-likelihood score.

3. RESULTS AND DISCUSSION

Table 1 shows our results for the development and evaluation queries. Probably due to the high amount of noise (and reverberation) in the dataset, the overall cross-entropy score is poor even after noise removal. Looking at the scores for each query type, our system clearly works best for the T1 query type. This can be attributed to the fact that we did not take any special steps to counter the T2 and T3 query types, such as word-level reordering (for T2 queries) and partial matching (for T3 queries). Also, we did not calibrate our scores for Term Weighted Values (TWV), resulting in very low ATWV/MTWV scores.

We observed that after subsequence DTW, many possible detections (candidates) were found for a query in an utterance. This clearly suggests that the posterior and bottleneck features used were not robust enough for the given noisy and multilingual data. Also, we rely heavily on the first step of fusion, which is simply the arithmetic mean of the subsystem scores (since α = 1), to detect the best candidate for a given query-utterance pair. A high score from even one of the subsystems can therefore bias the combined score (obtained after Step 1 of fusion) towards it, leading to the selection of that candidate over other candidates with moderate scores from all the systems.

Our experiments were run on a computer with an Intel i7-4790 CPU (3.60 GHz, 8 cores) and 16 GB of RAM. For searching, the posteriorgrams for a query-utterance pair were loaded into memory, which caused high memory usage for longer utterances (peak usage of around 15 GB). It took around 80 hours to search approximately 475 seconds of query in the 18 hours of audio database per subsystem, leading to an SSF of 0.0093 per second.

4. CONCLUSION

We have described the system developed at IIT-B for the QUESST task. To combat the effect of noise in the data, we used spectral subtraction. Spectral subtraction reduces noise but is also known to create artifacts in speech, so the posterior/bottleneck features were not robust enough for the given noisy and multilingual data. It would be interesting to study the performance of our system without noise suppression. The main novelty of our work is a two-step fusion approach in which the first step decides the best candidate for a query-utterance pair and the second step trains a logistic regression classifier. The effect of the first step of fusion, for different values of α, on the cross-entropy score remains to be investigated.

5. REFERENCES

[1] I. Szőke, L. J. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proenca, M. Lojka, and X. Xiong. Query by example search on speech at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 14-15 September 2015.
[2] T. Hazen, W. Shen, and C. White. Query-by-example spoken term detection using phonetic posteriorgram templates. In Proc. ASRU, 2009.
[3] I. Szőke, L. Burget, F. Grézl, J. Černocký, and L. Ondel. Calibration and fusion of query-by-example systems - BUT SWS 2013. In Proc. ICASSP, pages 7899-7903, 2014.
[4] R. Martin. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on Speech and Audio Processing, 9(5):504-512, 2001.
[5] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky. Probabilistic and bottle-neck features for LVCSR of meetings. In Proc. ICASSP, pages 757-760, 2007.
[6] V. Chourasia, K. Samudravijaya, and M. Chandwani. Phonetically rich Hindi sentence corpus for creation of speech database. In Proc. O-COCOSDA, pages 132-137, 2005.
[7] Y. Zhang and J. Glass. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In Proc. ASRU, pages 398-403, 2009.
[8] A. Abad, L. Rodríguez-Fuentes, M. Penagarikano, A. Varona, and G. Bordel. On the calibration and fusion of heterogeneous spoken term detection systems. In Proc. INTERSPEECH, pages 20-24, 2013.
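As a closing illustration, the score combination of Equation (1) in Section 2.4 can be sketched as follows. This is a minimal sketch, not the authors' code: the per-query normalization, detection alignment, and candidate selection surrounding this step are omitted, and a log-sum-exp rearrangement is used for numerical stability.

```python
import numpy as np

def combine_scores(scores, alpha=1.0):
    """Combine per-subsystem scores S(X|Ki) as in Eq. (1):

        S = -(1/alpha) * log( (1/N) * sum_i exp(-alpha * S_i) )

    As alpha -> 0 this approaches the arithmetic mean of the scores;
    large alpha increasingly emphasizes the smallest S_i.
    """
    s = np.asarray(scores, dtype=float)
    n = s.size
    # Log-sum-exp trick: shift by the max of the exponents so that
    # np.exp never overflows for large alpha or large scores.
    m = (-alpha * s).max()
    return float(-(np.log(np.exp(-alpha * s - m).sum() / n) + m) / alpha)
```

With the paper's setting alpha = 1 and z-normalized scores, this behaves close to a plain average of the six subsystem scores, which is consistent with the sensitivity to a single high-scoring subsystem discussed in Section 3.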