The NNI Query-by-Example System for MediaEval 2015*

Jingyong Hou1, Van Tung Pham2, Cheung-Chi Leung3, Lei Wang3, Haihua Xu2, Hang Lv1, Lei Xie1, Zhonghua Fu1, Chongjia Ni3, Xiong Xiao2, Hongjie Chen1, Shaofei Zhang1, Sining Sun1, Yougen Yuan1, Pengcheng Li1, Tin Lay Nwe3, Sunil Sivadas3, Bin Ma3, Eng Siong Chng2, Haizhou Li2,3

1 School of Computer Science, Northwestern Polytechnical University (NWPU), Xi'an, China
2 Nanyang Technological University (NTU), Singapore
3 Institute for Infocomm Research (I2R), A*STAR, Singapore

jyhou@nwpu-aslp.org, VANTUNG001@e.ntu.edu.sg, ccleung@i2r.a-star.edu.sg, lxie@nwpu.edu.cn

* This work was partially supported by the National Natural Science Foundation of China (61175018 and 61571363).
Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
This paper describes the system developed by the NNI team for the Query-by-Example Search on Speech Task (QUESST) in the MediaEval 2015 evaluation. Our submitted system mainly used bottleneck features and stacked bottleneck features (BNF/SBNF) trained from various resources. We investigated noise robustness techniques to deal with this year's noisy data. The submitted system obtained an actual normalized cross entropy (actCnxe) of 0.761 and an actual Term Weighted Value (actTWV) of 0.270 on all types of queries of the evaluation data.

1. INTRODUCTION
This year's data is more challenging in terms of acoustic and noise conditions [1]. Noise robustness techniques, including adding noise to the training data of tokenizers and a speech enhancement method, were investigated to deal with the noisy data. As in last year's evaluation, our submitted system involves dynamic time warping (DTW) and symbolic search (SS) based approaches. This year, the final submitted system was obtained by fusing 66 systems from our 3 groups: 15 DTW systems from NWPU (selected from 26 original systems using the FoCal toolkit [2]), 39 DTW systems from I2R, and 8 DTW and 4 SS systems from NTU. Moreover, various voice activity detection (VAD) methods were used in the DTW systems.

2. ADDING NOISE TO TRAINING DATA
To reduce the mismatch between the training data of the tokenizers and this year's development and test data, noise was added to the training data. We used two methods to obtain two sets of noise from the development data. The method used to obtain the first set of noise (noise1) is summarized as follows [3, 4, 5] (a code sketch follows at the end of this section):

- Perform voiced/unvoiced detection on the development data and obtain noise segments from each utterance.
- Estimate the noise power spectrum of each utterance, generate a minimum phase signal according to the power spectrum, and design the corresponding minimum phase filter.
- Use the EM algorithm to estimate the parameters of the noise amplitude distribution (we empirically chose a Gaussian mixture and set the number of components to 2).
- Generate random white noise with the target noise amplitude distribution.
- Filter the random white noise with the minimum phase filter.

The second set of noise (noise2) was also estimated from the development data, using the method in [6]. The time domain noise was reconstructed by inverse short-time Fourier transform of the estimated instantaneous noise spectrum. Please refer to [7, 8] for details.

When noise was added, we ensured that the signal-to-noise ratio (SNR) distribution of the resultant training data was similar to that of this year's development data. Moreover, since not all of this year's utterances were highly noisy or reverberated, we added noise to only a randomly selected 50 percent of the training data.
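The following is a minimal sketch of the noise1 pipeline, assuming the noise-only samples have already been isolated by the voiced/unvoiced detection step. The function names, the FFT length, and the real-cepstrum construction of the minimum phase filter are illustrative choices rather than our exact implementation.

    import numpy as np
    from numpy.fft import fft, ifft
    from sklearn.mixture import GaussianMixture

    def minimum_phase_filter(power_spectrum):
        """Build a minimum phase impulse response whose magnitude response
        matches the given noise power spectrum (real-cepstrum method)."""
        log_mag = 0.5 * np.log(np.maximum(power_spectrum, 1e-12))
        cep = np.real(ifft(log_mag))
        n = len(cep)
        # Fold the anti-causal part of the cepstrum onto the causal part.
        win = np.zeros(n)
        win[0] = 1.0
        win[1:(n + 1) // 2] = 2.0
        if n % 2 == 0:
            win[n // 2] = 1.0
        return np.real(ifft(np.exp(fft(cep * win))))  # impulse response

    def synthesize_noise(noise_samples, num_out, num_components=2):
        """EM-fit a 2-component Gaussian mixture to the noise amplitudes,
        draw white noise from it, then shape it with the minimum phase filter."""
        gmm = GaussianMixture(n_components=num_components).fit(
            noise_samples.reshape(-1, 1))
        white, _ = gmm.sample(num_out)  # i.i.d. draws, grouped by component
        # sklearn groups samples by component, so shuffle to make the sequence white.
        white = np.random.permutation(white.ravel())
        psd = np.abs(fft(noise_samples, n=1024)) ** 2 / len(noise_samples)
        h = minimum_phase_filter(psd)
        return np.convolve(white, h)[:num_out]  # spectrally shaped noise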
3. SPEECH ENHANCEMENT
A Wiener filter [9] was used to reduce the noise in the data. The noise was reduced in the time domain, and the enhanced data was used for VAD and feature extraction. Initial results (detailed in Section 8) showed that the enhanced data led to better DTW performance for some tokenizers.

4. VOICE ACTIVITY DETECTION
For exact matching DTW systems, we used two voice activity detectors: a frequency band energy based VAD [10] (VAD1) and a statistical model based VAD [11] (VAD2), because we found that they performed best on different types of queries. For phoneme-sequence based approximate matching DTW systems (detailed in Section 5) with phoneme posterior features, we used their single-best decoding hypotheses to perform VAD and to obtain phoneme boundary information. For the phoneme-sequence approximate matching DTW system with SBNF, we simply borrowed the single-best decoding hypothesis of a phoneme recognizer to perform VAD and to obtain the phoneme boundary information.

5. DTW SEARCH
Exact matching and approximate matching DTW systems were developed to deal with different types of queries. An exact matching system matched each query with a subsequence of a test utterance using DTW [12, 13]. It found a path on the cosine distance matrix between the speech features of the query and the test utterance, and output the similarity score between the query and the matched subsequence of the test utterance (see the sketch at the end of this section).

We used two kinds of approximate matching DTW systems, fixed-window [12, 14] and phoneme-sequence [15], to deal with type 2 and type 3 queries. In fixed-window approximate matching systems, when the window was shifted, the corresponding segment of the query was matched with a test utterance. The highest similarity score over all query segments was used as the score of the query-utterance pair. The window sizes were set between 70 and 90 frames, and the window shifts between 5 and 10 frames. In phoneme-sequence approximate matching systems, the size of the window was determined by the phoneme boundary information derived from phoneme recognizers. The window size was set to 8 phonemes, as this provided the best results on the development data.
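A minimal sketch of the subsequence DTW matching described above (cosine distance matrix, path free to start and end at any utterance frame); this is an illustrative re-implementation, not the submitted system's code, and the query-length normalization of the path cost is a simplification.

    import numpy as np

    def subsequence_dtw_score(query, utterance):
        """query: (m, d), utterance: (n, d) feature matrices (e.g. BNF/SBNF).
        Returns a similarity score for the best-matching subsequence."""
        q = query / np.linalg.norm(query, axis=1, keepdims=True)
        u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
        dist = 1.0 - q @ u.T                  # (m, n) cosine distance matrix
        m, n = dist.shape
        acc = np.full((m, n), np.inf)
        acc[0, :] = dist[0, :]                # path may start at any utterance frame
        for i in range(1, m):
            for j in range(n):
                best_prev = acc[i - 1, j]     # vertical step
                if j > 0:
                    best_prev = min(best_prev,
                                    acc[i - 1, j - 1],   # diagonal step
                                    acc[i, j - 1])       # horizontal step
                acc[i, j] = dist[i, j] + best_prev
        # Path may end at any utterance frame; normalize the accumulated
        # distance by the query length and convert it to a similarity score.
        return 1.0 - acc[-1, :].min() / m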
6. SYMBOLIC SEARCH
Weighted finite state transducer (WFST) based symbolic search systems were used as last year [12]. Phoneme-sequence approximate matching [14] was used to facilitate type 2 and type 3 queries and to reduce the miss rate. A sequence length of 6 phonemes was chosen, as it provided the best matching results on the development data.

7. TOKENIZERS AND SYSTEMS
Spectral features, phoneme-state posterior features and BNF/SBNF were used in our DTW systems.

NWPU extracted truncated PLP [16] (a1), posterior features from the 3 BUT phoneme recognizers [17] (Czech, Hungarian and Russian; a2-a4), 3 sets of SBNF trained from the English Switchboard corpus (SWBD) (one monophone-state set using the original training data, and two triphone-state sets with noise1 and noise2 added to the training data respectively; a5-a7), and 1 set of triphone-state SBNF (a8) trained from the SEAME corpus [18].

I2R extracted 4 sets of BNF (b1-b4) and 4 sets of SBNF (b5-b8) trained from four LDC corpora (SWBD, Fisher Spanish, HKUST Mandarin and CallHome Egyptian), and 5 sets of BNF (b9-b13; 4 language-dependent and one language-independent [19]) trained from 4 development languages in the OpenKWS evaluation [20].

NTU extracted 3 sets of BNF (c1-c3) trained from SWBD (one triphone-state set using the original training data, and two triphone-state sets with Noisex92 [21] added to the training data once and twice respectively), and 1 set of BNF (c4) trained from the 6 development languages in the OpenKWS evaluation.

NWPU's 26 DTW systems consisted of 9 exact matching systems (using a1-a8, c4) and 4 phoneme-sequence approximate matching systems (using a2-a4, a6). The remaining 13 systems were exactly the same as the first 13, except that the enhanced data was used for VAD and feature extraction.

I2R's 39 DTW systems consisted of 13 exact matching systems (using b1-b13) and 13 fixed-window approximate matching systems (using b1-b13) with VAD1, and 13 exact matching systems (using b1-b13) with VAD2.

NTU's 12 systems consisted of 4 exact matching (using c1-c4) and 4 fixed-window approximate matching (using c1-c4) DTW systems with VAD1, and 4 phoneme-sequence approximate matching SS systems with 4 acoustic models trained from SWBD and a Malay speech corpus [22].

The scores of all systems in each group were first fused into a single system internally, and the 3 resultant systems were then fused to obtain the final submitted system. In each fusion step, scores were first normalized to zero mean and unit variance, and then fused with the FoCal toolkit [2].
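A minimal sketch of one such fusion step: per-system z-normalization followed by a weighted linear combination. In our system the weights were trained with the FoCal toolkit [2] (linear logistic regression); here they are simply assumed to be given, and the function names are illustrative.

    import numpy as np

    def znorm(scores):
        """Normalize one system's scores to zero mean and unit variance."""
        return (scores - scores.mean()) / scores.std()

    def fuse(system_scores, weights, offset=0.0):
        """system_scores: list of score arrays, one per system, in the same
        trial order. Returns the fused score for every trial."""
        normed = np.stack([znorm(s) for s in system_scores])  # (num_systems, num_trials)
        return offset + np.asarray(weights) @ normed

    # Usage with hypothetical weights: fused = fuse([dtw_scores, ss_scores], [0.7, 0.3])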
8. RESULTS AND CONCLUSION
Table 1 shows the performance gain of an exact matching DTW system on the development set when noise1 and noise2 were added to the SWBD data for training triphone SBNF. The results show that adding noise to the training data gives a 1.8% relative improvement on all query types and a 3.8% relative improvement on type 1 queries in minCnxe.

Table 1: Performance gain of an exact matching DTW system on the development set when different data is used to train the tokenizer (s1: original SWBD data; s2: noise1 added; s3: noise2 added). The tokenizer is used to extract triphone-state SBNF. Each cell: minCnxe, maxTWV.

          All           T1            T2            T3
    s1  0.891, 0.111  0.762, 0.227  0.934, 0.024  0.918, 0.093
    s2  0.875, 0.133  0.733, 0.258  0.925, 0.041  0.901, 0.101
    s3  0.877, 0.132  0.735, 0.270  0.923, 0.038  0.907, 0.095

When the enhanced data was used to extract the SWBD monophone SBNF and the BUT Czech and Hungarian phoneme-state posterior features for our DTW systems, we observed relative improvements of 1.9-3.1% on all query types and of 2.7-6.3% on type 1 queries in minCnxe.

Table 2 shows the performance of our final submitted system on this year's data. In the intra-group fusion, each group obtained performance gains by fusing exact matching and approximate matching systems, and by fusing systems using different speech preprocessing techniques and different tokenizers. Compared with our single best exact matching DTW system (s2 in Table 1), system fusion brings around a 13.5% relative improvement in minCnxe on the development data (all query types).

Table 2: Performance on different types of queries in the development and evaluation datasets. Each cell: All (T1, T2, T3).

               dev                          eval
    actCnxe  0.773 (0.629, 0.813, 0.829)  0.761 (0.609, 0.854, 0.783)
    minCnxe  0.757 (0.601, 0.793, 0.810)  0.747 (0.577, 0.831, 0.769)
    actTWV   0.286 (0.439, 0.203, 0.200)  0.270 (0.436, 0.189, 0.203)
    maxTWV   0.286 (0.447, 0.208, 0.205)  0.274 (0.444, 0.194, 0.215)

The peak memory usage (PMU) of all DTW systems is 1.45GB when 1 set of 30-dimensional SBNF is loaded, and the searching speed factor (SSF) of each DTW system is around 0.0044. The PMU of all SS systems is 45GB, and the SSF of each SS system is around 0.0012.

We adopted noise robustness techniques to deal with the noise conditions of the data, which led to better search performance. We also obtained performance gains by fusing systems using different tokenizers, different VADs and different search algorithms.
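As a quick sanity check, the relative minCnxe gains quoted above can be reproduced from the numbers in Tables 1 and 2 (lower minCnxe is better):

    def rel_gain(before, after):
        return 100.0 * (before - after) / before

    print(rel_gain(0.891, 0.875))  # noise1 vs. original SWBD, all types: ~1.8%
    print(rel_gain(0.762, 0.733))  # noise1 vs. original SWBD, type 1:   ~3.8%
    print(rel_gain(0.875, 0.757))  # dev fusion vs. best single system (s2): ~13.5%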
9. REFERENCES
[1] I. Szoke, L. J. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proenca, M. Lojka, and X. Xiao, "Query by example search on speech at MediaEval 2015," in Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, Sept. 14-15, 2015.
[2] N. Brümmer, "FoCal: Toolkit for evaluation, fusion and calibration of statistical pattern recognizers," https://sites.google.com/site/nikobrummer/focal.
[3] W. Yao and T. Yao, "Analyzing classical spectral estimation by MATLAB," Journal of Huazhong University of Science and Technology, vol. 4, p. 021, 2000.
[4] M. H. Hayes, J. S. Lim, and A. V. Oppenheim, "Signal reconstruction from phase or magnitude," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 6, pp. 672-680, 1980.
[5] M. H. Gruber, "Statistical digital signal processing and modeling," Technometrics, vol. 39, no. 3, pp. 335-336, 1997.
[6] J. Chen, J. Benesty, Y. Huang, and S. Doclo, "New insights into the noise reduction Wiener filter," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1218-1234, 2006.
[7] J. Chen, Y. Huang, and J. Benesty, "Filtering techniques for noise reduction and speech enhancement," in Adaptive Signal Processing. Springer, 2003, pp. 129-154.
[8] E. J. Diethorn, "Subband noise reduction methods for speech enhancement," in Audio Signal Processing for Next-Generation Multimedia Communication Systems. Springer, 2004, pp. 91-115.
[9] J. Chen, J. Benesty, Y. Huang, and T. Gaensler, "On single-channel noise reduction in the time domain," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 277-280.
[10] E. Cornu, H. Sheikhzadeh, R. L. Brennan, H. R. Abutalebi, E. C. Tam, P. Iles, and K. W. Wong, "ETSI AMR-2 VAD: Evaluation and ultra low-resource implementation," in Proc. IEEE International Conference on Multimedia and Expo (ICME), vol. 2, 2003, p. II-841.
[11] M. Huijbregts and F. De Jong, "Robust speech/non-speech classification in heterogeneous multimedia content," Speech Communication, vol. 53, no. 2, pp. 143-153, 2011.
[12] P. Yang, H. Xu, X. Xiao, L. Xie, C.-C. Leung, H. Chen, J. Yu, H. Lv, L. Wang, S. J. Leow et al., "The NNI query-by-example system for MediaEval 2014," in Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, Oct. 16-17, 2014.
[13] A. Muscariello, G. Gravier, and F. Bimbot, "Audio keyword extraction by unsupervised word discovery," in Proc. INTERSPEECH, 2009.
[14] H. Xu, P. Yang, X. Xiao, L. Xie, C.-C. Leung, H. Chen, J. Yu, H. Lv, L. Wang, S. J. Leow et al., "Language independent query-by-example spoken term detection using n-best phone sequences and partial matching," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5191-5195.
[15] J. Hou, L. Xie, P. Yang, X. Xiao, C.-C. Leung, H. Xu, L. Wang, H. Lv, B. Ma, E. S. Chng, and H. Li, "Spoken term detection technology based on DTW (to be published)," Journal of Tsinghua University (Sci and Tech), 2015.
[16] A. Jansen, E. Dupoux, S. Goldwater, M. Johnson, S. Khudanpur, K. Church, N. Feldman, H. Hermansky, F. Metze, and R. Rose, "A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8111-8115.
[17] P. Schwarz, P. Matejka, and J. Cernocky, "Hierarchical structures of neural networks for phoneme recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006, pp. 325-328.
[18] D. C. Lyu, T. P. Tan, E. Chng, and H. Li, "SEAME: A Mandarin-English code-switching speech corpus in South-East Asia," in Proc. INTERSPEECH, 2010.
[19] K. Vesely, M. Karafiát, F. Grezl, M. Janda, and E. Egorova, "The language-independent bottleneck features," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2012, pp. 336-341.
[20] "Open keyword search 2015 evaluation," http://www.nist.gov/itl/iad/mig/openkws15.cfm.
[21] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, 1993.
[22] T. Tan, X. Xiao, E. K. Tang, E. S. Chng, and H. Li, "MASS: A Malay language LVCSR corpus resource," in Proc. Oriental COCOSDA International Conference on Speech Database and Assessments, 2009, pp. 25-30.