The NNI Query-by-Example System for MediaEval 2015*

Jingyong Hou1, Van Tung Pham2, Cheung-Chi Leung3, Lei Wang3, Haihua Xu2, Hang Lv1, Lei Xie1, Zhonghua Fu1, Chongjia Ni3, Xiong Xiao2, Hongjie Chen1, Shaofei Zhang1, Sining Sun1, Yougen Yuan1, Pengcheng Li1, Tin Lay Nwe3, Sunil Sivadas3, Bin Ma3, Eng Siong Chng2, Haizhou Li2,3

1 School of Computer Science, Northwestern Polytechnical University (NWPU), Xi'an, China
2 Nanyang Technological University (NTU), Singapore
3 Institute for Infocomm Research (I2R), A*STAR, Singapore

jyhou@nwpu-aslp.org, VANTUNG001@e.ntu.edu.sg, ccleung@i2r.a-star.edu.sg, lxie@nwpu.edu.cn

* This work was partially supported by the National Natural Science Foundation of China (61175018 and 61571363).
Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
This paper describes the system developed by the NNI team for the Query-by-Example Search on Speech Task (QUESST) in the MediaEval 2015 evaluation. Our submitted system mainly used bottleneck features and stacked bottleneck features (BNF/SBNF) trained from various resources. We investigated noise robustness techniques to deal with this year's noisy data. The submitted system obtained an actual normalized cross entropy (actCnxe) of 0.761 and an actual Term Weighted Value (actTWV) of 0.270 on all types of queries of the evaluation data.

1. INTRODUCTION
This year's data is more challenging in terms of acoustic and noise conditions [1]. Noise robustness techniques, including adding noise to the training data of tokenizers and a speech enhancement method, were investigated to deal with the noisy data. As in last year's evaluation, our submitted system involves dynamic time warping (DTW) and symbolic search (SS) based approaches. This year, the final submitted system was obtained by fusing 66 systems from our 3 groups: 15 DTW systems from NWPU (selected from 26 original systems using the FoCal toolkit [2]), 39 DTW systems from I2R, and 8 DTW and 4 SS systems from NTU. Moreover, various voice activity detection (VAD) methods were used in the DTW systems.

2. ADDING NOISE TO TRAINING DATA
To reduce the mismatch between the training data of the tokenizers and this year's development and test data, noise was added to the training data. We used two methods to obtain two sets of noise from the development data. The method used to obtain the first set of noise (noise1) is summarized as follows [3, 4, 5] (a code sketch follows at the end of this section):

- Perform voiced/unvoiced detection on the development data and obtain noise segments from each utterance.
- Estimate the noise power spectrum of each utterance, generate a minimum phase signal according to the power spectrum, and design the corresponding minimum phase filter.
- Use the EM algorithm to estimate the parameters of the noise amplitude distribution (we empirically chose a Gaussian mixture and set the number of components to 2).
- Generate random white noise with the target noise amplitude distribution.
- Filter the random white noise with the minimum phase filter.

The second set of noise (noise2) was also estimated from the development data, using the method in [6]. The time domain noise was reconstructed by inverse short-time Fourier transform of the estimated instantaneous noise spectrum. Please refer to [7, 8] for details.

When noise was added, we ensured that the signal-to-noise ratio (SNR) distribution of the resultant training data was similar to that of this year's development data. Moreover, since not all of this year's utterances were highly noisy or reverberated, we added noise to only a randomly selected 50 percent of the training data.
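The following is a minimal sketch of the noise1 pipeline, assuming the noise-only samples have already been isolated by the voiced/unvoiced detection step. The function names, the FFT length, and the real-cepstrum construction of the minimum phase filter are illustrative choices rather than our exact implementation.

    import numpy as np
    from numpy.fft import fft, ifft
    from sklearn.mixture import GaussianMixture

    def minimum_phase_filter(power_spectrum):
        """Build a minimum phase impulse response whose magnitude response
        matches the given noise power spectrum (real-cepstrum method)."""
        log_mag = 0.5 * np.log(np.maximum(power_spectrum, 1e-12))
        cep = np.real(ifft(log_mag))
        n = len(cep)
        # Fold the anti-causal part of the cepstrum onto the causal part.
        win = np.zeros(n)
        win[0] = 1.0
        win[1:(n + 1) // 2] = 2.0
        if n % 2 == 0:
            win[n // 2] = 1.0
        return np.real(ifft(np.exp(fft(cep * win))))  # impulse response

    def synthesize_noise(noise_samples, num_out, num_components=2):
        """EM-fit a 2-component Gaussian mixture to the noise amplitudes,
        draw white noise from it, then shape it with the minimum phase filter."""
        gmm = GaussianMixture(n_components=num_components).fit(
            noise_samples.reshape(-1, 1))
        white, _ = gmm.sample(num_out)  # i.i.d. draws, grouped by component
        # sklearn groups samples by component, so shuffle to make the sequence white.
        white = np.random.permutation(white.ravel())
        psd = np.abs(fft(noise_samples, n=1024)) ** 2 / len(noise_samples)
        h = minimum_phase_filter(psd)
        return np.convolve(white, h)[:num_out]  # spectrally shaped noise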
3. SPEECH ENHANCEMENT
A Wiener filter [9] was used to reduce the noise in the data. The noise was reduced in the time domain, and the enhanced data was used for VAD and feature extraction. Initial results (detailed in Section 8) showed that the enhanced data led to better DTW performance for some tokenizers.

4. VOICE ACTIVITY DETECTION
For exact matching DTW systems, we used two voice activity detectors: a frequency band energy based VAD [10] (VAD1) and a statistical model based VAD [11] (VAD2), because we found that they performed best on different types of queries. For phoneme-sequence based approximate matching DTW systems (detailed in Section 5) with phoneme posterior features, we used their single-best decoding hypotheses to perform VAD and to obtain phoneme boundary information. For the phoneme-sequence approximate matching DTW system with SBNF, we simply borrowed the single-best decoding hypothesis of a phoneme recognizer to perform VAD and to obtain the phoneme boundary information.

5. DTW SEARCH
Exact matching and approximate matching DTW systems were developed to deal with different types of queries. An exact matching system matched each query with a subsequence of a test utterance using DTW [12, 13]. It found a path on the cosine distance matrix between the speech features of the query and the test utterance, and output the similarity score between the query and the matched subsequence of the test utterance (see the sketch at the end of this section).

We used two kinds of approximate matching DTW systems, fixed-window [12, 14] and phoneme-sequence [15], to deal with type 2 and type 3 queries. In fixed-window approximate matching systems, when the window was shifted, the corresponding segment of the query was matched with a test utterance. The highest similarity score over all query segments was used as the score of the query-utterance pair. The window sizes were set between 70 and 90 frames, and the window shifts between 5 and 10 frames. In phoneme-sequence approximate matching systems, the size of the window was determined by the phoneme boundary information derived from phoneme recognizers. The window size was set to 8 phonemes, as this provided the best results on the development data.
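A minimal sketch of the subsequence DTW matching described above (cosine distance matrix, path free to start and end at any utterance frame); this is an illustrative re-implementation, not the submitted system's code, and the query-length normalization of the path cost is a simplification.

    import numpy as np

    def subsequence_dtw_score(query, utterance):
        """query: (m, d), utterance: (n, d) feature matrices (e.g. BNF/SBNF).
        Returns a similarity score for the best-matching subsequence."""
        q = query / np.linalg.norm(query, axis=1, keepdims=True)
        u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
        dist = 1.0 - q @ u.T                  # (m, n) cosine distance matrix
        m, n = dist.shape
        acc = np.full((m, n), np.inf)
        acc[0, :] = dist[0, :]                # path may start at any utterance frame
        for i in range(1, m):
            for j in range(n):
                best_prev = acc[i - 1, j]     # vertical step
                if j > 0:
                    best_prev = min(best_prev,
                                    acc[i - 1, j - 1],   # diagonal step
                                    acc[i, j - 1])       # horizontal step
                acc[i, j] = dist[i, j] + best_prev
        # Path may end at any utterance frame; normalize the accumulated
        # distance by the query length and convert it to a similarity score.
        return 1.0 - acc[-1, :].min() / m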
6. SYMBOLIC SEARCH
Weighted finite state transducer (WFST) based symbolic search systems were used as last year [12]. Phoneme-sequence approximate matching [14] was used to facilitate type 2 and type 3 queries and to reduce the miss rate. A sequence length of 6 phonemes was chosen, as it provided the best matching results on the development data.

7. TOKENIZERS AND SYSTEMS
Spectral features, phoneme-state posterior features and BNF/SBNF were used in our DTW systems.

NWPU extracted truncated PLP [16] (a1), posterior features from the 3 BUT phoneme recognizers [17] (Czech, Hungarian and Russian; a2-a4), 3 sets of SBNF trained from the English Switchboard corpus (SWBD) (one monophone-state set using the original training data, and two triphone-state sets with noise1 and noise2 added to the training data respectively; a5-a7), and 1 set of triphone-state SBNF (a8) trained from the SEAME corpus [18].

I2R extracted 4 sets of BNF (b1-b4) and 4 sets of SBNF (b5-b8) trained from four LDC corpora (SWBD, Fisher Spanish, HKUST Mandarin and CallHome Egyptian), and 5 sets of BNF (b9-b13; 4 language-dependent and one language-independent [19]) trained from 4 development languages in the OpenKWS evaluation [20].

NTU extracted 3 sets of BNF (c1-c3) trained from SWBD (one triphone-state set using the original training data, and two triphone-state sets with Noisex92 [21] added to the training data once and twice respectively), and 1 set of BNF (c4) trained from the 6 development languages in the OpenKWS evaluation.

NWPU's 26 DTW systems consisted of 9 exact matching systems (using a1-a8, c4) and 4 phoneme-sequence approximate matching systems (using a2-a4, a6). The remaining 13 systems were exactly the same as the first 13, except that the enhanced data was used for VAD and feature extraction.

I2R's 39 DTW systems consisted of 13 exact matching systems (using b1-b13) and 13 fixed-window approximate matching systems (using b1-b13) with VAD1, and 13 exact matching systems (using b1-b13) with VAD2.

NTU's 12 systems consisted of 4 exact matching (using c1-c4) and 4 fixed-window approximate matching (using c1-c4) DTW systems with VAD1, and 4 phoneme-sequence approximate matching SS systems with 4 acoustic models trained from SWBD and a Malay speech corpus [22].

The scores of all systems in each group were first fused into a single system internally, and the 3 resultant systems were then fused to obtain the final submitted system. In each fusion step, scores were first normalized to zero mean and unit variance, and then fused with the FoCal toolkit [2].
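A minimal sketch of one such fusion step: per-system z-normalization followed by a weighted linear combination. In our system the weights were trained with the FoCal toolkit [2] (linear logistic regression); here they are simply assumed to be given, and the function names are illustrative.

    import numpy as np

    def znorm(scores):
        """Normalize one system's scores to zero mean and unit variance."""
        return (scores - scores.mean()) / scores.std()

    def fuse(system_scores, weights, offset=0.0):
        """system_scores: list of score arrays, one per system, in the same
        trial order. Returns the fused score for every trial."""
        normed = np.stack([znorm(s) for s in system_scores])  # (num_systems, num_trials)
        return offset + np.asarray(weights) @ normed

    # Usage with hypothetical weights: fused = fuse([dtw_scores, ss_scores], [0.7, 0.3])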
8. RESULTS AND CONCLUSION
Table 1 shows the performance gain of an exact matching DTW system on the development set when noise1 and noise2 were added to the SWBD data for training triphone SBNF. The results show that adding noise to the training data gives a 1.8% relative improvement on all query types and a 3.8% relative improvement on type 1 queries in minCnxe.

Table 1: Performance gain of an exact matching DTW system on the development set when different data is used to train the tokenizer (s1: original SWBD data; s2: noise1 added; s3: noise2 added). The tokenizer is used to extract triphone-state SBNF. Each cell: minCnxe, maxTWV.

          All           T1            T2            T3
    s1  0.891, 0.111  0.762, 0.227  0.934, 0.024  0.918, 0.093
    s2  0.875, 0.133  0.733, 0.258  0.925, 0.041  0.901, 0.101
    s3  0.877, 0.132  0.735, 0.270  0.923, 0.038  0.907, 0.095

When the enhanced data was used to extract the SWBD monophone SBNF and the BUT Czech and Hungarian phoneme-state posterior features for our DTW systems, we observed relative improvements of 1.9-3.1% on all query types and of 2.7-6.3% on type 1 queries in minCnxe.

Table 2 shows the performance of our final submitted system on this year's data. In the intra-group fusion, each group obtained performance gains by fusing exact matching and approximate matching systems, and by fusing systems using different speech preprocessing techniques and different tokenizers. Compared with our single best exact matching DTW system (s2 in Table 1), system fusion brings around a 13.5% relative improvement in minCnxe on the development data (all query types).

Table 2: Performance on different types of queries in the development and evaluation datasets. Each cell: All (T1, T2, T3).

               dev                          eval
    actCnxe  0.773 (0.629, 0.813, 0.829)  0.761 (0.609, 0.854, 0.783)
    minCnxe  0.757 (0.601, 0.793, 0.810)  0.747 (0.577, 0.831, 0.769)
    actTWV   0.286 (0.439, 0.203, 0.200)  0.270 (0.436, 0.189, 0.203)
    maxTWV   0.286 (0.447, 0.208, 0.205)  0.274 (0.444, 0.194, 0.215)

The peak memory usage (PMU) of all DTW systems is 1.45GB when 1 set of 30-dimensional SBNF is loaded, and the searching speed factor (SSF) of each DTW system is around 0.0044. The PMU of all SS systems is 45GB, and the SSF of each SS system is around 0.0012.

We adopted noise robustness techniques to deal with the noise conditions of the data, which led to better search performance. We also obtained performance gains by fusing systems using different tokenizers, different VADs and different search algorithms.
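As a quick sanity check, the relative minCnxe gains quoted above can be reproduced from the numbers in Tables 1 and 2 (lower minCnxe is better):

    def rel_gain(before, after):
        return 100.0 * (before - after) / before

    print(rel_gain(0.891, 0.875))  # noise1 vs. original SWBD, all types: ~1.8%
    print(rel_gain(0.762, 0.733))  # noise1 vs. original SWBD, type 1:   ~3.8%
    print(rel_gain(0.875, 0.757))  # dev fusion vs. best single system (s2): ~13.5%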
9. REFERENCES
[1] I. Szoke, L. J. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proenca, M. Lojka, and X. Xiao, "Query by example search on speech at MediaEval 2015," in Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, Sept. 14-15, 2015.
[2] N. Brümmer, "FoCal: Toolkit for evaluation, fusion and calibration of statistical pattern recognizers," https://sites.google.com/site/nikobrummer/focal.
[3] W. Yao and T. Yao, "Analyzing classical spectral estimation by MATLAB," Journal of Huazhong University of Science and Technology, vol. 4, p. 021, 2000.
[4] M. H. Hayes, J. S. Lim, and A. V. Oppenheim, "Signal reconstruction from phase or magnitude," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 6, pp. 672-680, 1980.
[5] M. H. Gruber, "Statistical digital signal processing and modeling," Technometrics, vol. 39, no. 3, pp. 335-336, 1997.
[6] J. Chen, J. Benesty, Y. Huang, and S. Doclo, "New insights into the noise reduction Wiener filter," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1218-1234, 2006.
[7] J. Chen, Y. Huang, and J. Benesty, "Filtering techniques for noise reduction and speech enhancement," in Adaptive Signal Processing. Springer, 2003, pp. 129-154.
[8] E. J. Diethorn, "Subband noise reduction methods for speech enhancement," in Audio Signal Processing for Next-Generation Multimedia Communication Systems. Springer, 2004, pp. 91-115.
[9] J. Chen, J. Benesty, Y. Huang, and T. Gaensler, "On single-channel noise reduction in the time domain," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 277-280.
[10] E. Cornu, H. Sheikhzadeh, R. L. Brennan, H. R. Abutalebi, E. C. Tam, P. Iles, and K. W. Wong, "ETSI AMR-2 VAD: Evaluation and ultra low-resource implementation," in Proc. IEEE International Conference on Multimedia and Expo (ICME), vol. 2, 2003, p. II-841.
[11] M. Huijbregts and F. De Jong, "Robust speech/non-speech classification in heterogeneous multimedia content," Speech Communication, vol. 53, no. 2, pp. 143-153, 2011.
[12] P. Yang, H. Xu, X. Xiao, L. Xie, C.-C. Leung, H. Chen, J. Yu, H. Lv, L. Wang, S. J. Leow et al., "The NNI query-by-example system for MediaEval 2014," in Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, Oct. 16-17, 2014.
[13] A. Muscariello, G. Gravier, and F. Bimbot, "Audio keyword extraction by unsupervised word discovery," in Proc. INTERSPEECH, 2009.
[14] H. Xu, P. Yang, X. Xiao, L. Xie, C.-C. Leung, H. Chen, J. Yu, H. Lv, L. Wang, S. J. Leow et al., "Language independent query-by-example spoken term detection using n-best phone sequences and partial matching," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5191-5195.
[15] J. Hou, L. Xie, P. Yang, X. Xiao, C.-C. Leung, H. Xu, L. Wang, H. Lv, B. Ma, E. S. Chng, and H. Li, "Spoken term detection technology based on DTW (to be published)," Journal of Tsinghua University (Sci and Tech), 2015.
[16] A. Jansen, E. Dupoux, S. Goldwater, M. Johnson, S. Khudanpur, K. Church, N. Feldman, H. Hermansky, F. Metze, and R. Rose, "A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8111-8115.
[17] P. Schwarz, P. Matejka, and J. Cernocky, "Hierarchical structures of neural networks for phoneme recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006, pp. 325-328.
[18] D. C. Lyu, T. P. Tan, E. Chng, and H. Li, "SEAME: A Mandarin-English code-switching speech corpus in South-East Asia," in Proc. INTERSPEECH, 2010.
[19] K. Vesely, M. Karafiát, F. Grezl, M. Janda, and E. Egorova, "The language-independent bottleneck features," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2012, pp. 336-341.
[20] "Open keyword search 2015 evaluation," http://www.nist.gov/itl/iad/mig/openkws15.cfm.
[21] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, 1993.
[22] T. Tan, X. Xiao, E. K. Tang, E. S. Chng, and H. Li, "MASS: A Malay language LVCSR corpus resource," in Proc. Oriental COCOSDA International Conference on Speech Database and Assessments, 2009, pp. 25-30.