<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The NNI Query-by-Example System for MediaEval 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jingyong Hou</string-name>
          <email>jyhou@nwpu-aslp.org</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Van Tung Pham</string-name>
          <email>VANTUNG001@e.ntu.edu.sg</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheung-Chi Leung</string-name>
          <email>ccleung@i2r.a-star.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haihua Xu</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hang Lv</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Xie</string-name>
          <email>lxie@nwpu.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhonghua Fu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chongjia Ni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiong Xiao</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongjie Chen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shaofei Zhang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sining Sun</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yougen Yuan</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pengcheng Li</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tin Lay Nwe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sunil Sivadas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bin Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eng Siong Chng</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haizhou Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Infocomm Research (I</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>STAR</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer Science, Northwestern Polytechnical University (NWPU)</institution>
          ,
          <addr-line>Xi'an</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the system developed by the NNI team for the Query-by-Example Search on Speech Task (QUESST) in the MediaEval 2015 evaluation. Our submitted system mainly used bottleneck features/stacked bottleneck features (BNF/SBNF) trained from various resources. We investigated noise robustness techniques to deal with this year's noisy data. The submitted system obtained an actual normalized cross entropy (actCnxe) of 0.761 and an actual Term-Weighted Value (actTWV) of 0.270 on all types of queries of the evaluation data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        This year's data is more challenging in terms of acoustic
and noise conditions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Noise robustness techniques,
including adding noise to the training data of tokenizers and a
speech enhancement method, were investigated to deal with
the noisy data. Our submitted system involves dynamic
time warping (DTW) and symbolic search (SS) based
approaches, as last year. This year, the final submitted system
was obtained by fusing 66 systems from our 3 groups,
including 15 DTW systems (selected from 26 original systems
using FoCal toolkit [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) from NWPU, 39 DTW systems from
I2R, and 8 DTW and 4 SS systems from NTU. Moreover,
various voice activity detection (VAD) methods were used
in the DTW systems.
      </p>
      <p>This work was partially supported by the National Natural
Science Foundation of China (61175018 and 61571363).</p>
    </sec>
    <sec id="sec-2">
      <title>2. ADDING NOISE TO TRAINING DATA</title>
      <p>
        To reduce the mismatch problem between the training
data of tokenizers and this year's development and test data,
noise was added to the training data. We used two
methods to obtain two sets of noise from the development data.
The method used to obtain the first set of noise (noise1) is
summarized as follows [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ] (a code sketch follows the steps):
      </p>
      <p>Perform voiced/unvoiced detection on the development
data and obtain segments of noise from each utterance.
Estimate the noise power spectrum of each utterance,
generate a minimum phase signal according to the
power spectrum of each sentence, and design the
minimum phase filter.</p>
      <p>Use the EM algorithm to estimate the parameters of the
noise amplitude distribution (we empirically selected a
Gaussian distribution and set the number of Gaussian
mixtures to 2).</p>
      <p>Generate random white noise with the target noise
amplitude distribution.</p>
      <p>Filter the random white noise using the minimum phase
filter.</p>
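      <p>As an illustration, the noise1 pipeline can be sketched as follows. This is a minimal sketch assuming numpy, scipy and scikit-learn; the voiced/unvoiced detector and the exact filter design of [3, 4, 5] are not reproduced, and the function names are ours.</p>
      <preformat>
import numpy as np
from scipy.signal import welch, lfilter
from sklearn.mixture import GaussianMixture

def min_phase_fir(power_spectrum, n_taps=512):
    # Real-cepstrum (homomorphic) construction of a minimum-phase FIR
    # filter whose magnitude response follows sqrt(power_spectrum).
    mag = np.sqrt(np.maximum(power_spectrum, 1e-12))
    spec = np.concatenate([mag, mag[-2:0:-1]])      # full symmetric spectrum
    cep = np.fft.ifft(np.log(spec)).real
    n = len(cep)
    w = np.zeros(n)
    w[0] = 1.0
    w[1:n // 2] = 2.0                               # fold to minimum phase
    w[n // 2] = 1.0
    h = np.fft.ifft(np.exp(np.fft.fft(w * cep))).real
    return h[:n_taps]

def make_noise1(noise_segments, n_samples):
    samples = np.concatenate(noise_segments)
    # Noise power spectrum from the detected non-speech segments.
    _, psd = welch(samples, nperseg=512)
    h = min_phase_fir(psd)
    # 2-component Gaussian mixture on the noise amplitudes, as in the text.
    gmm = GaussianMixture(n_components=2).fit(samples.reshape(-1, 1))
    white, _ = gmm.sample(n_samples)
    white = np.random.permutation(white.ravel())    # remove sample ordering
    # Shape the white noise with the minimum phase filter.
    return lfilter(h, 1.0, white)
      </preformat>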
      <p>
        The second set of noise (noise2) was also estimated from
the development data by using a method in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The time
domain noise was reconstructed by inverse short-time
Fourier transform of the estimated instantaneous noise spectrum.
Please refer to [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] for details.
      </p>
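      <p>A minimal sketch of the noise2 reconstruction, assuming the instantaneous noise spectrum (a complex STFT-domain estimate obtained with the method of [6], which we do not reproduce here) is already available:</p>
      <preformat>
import numpy as np
from scipy.signal import istft

def reconstruct_noise2(noise_stft, fs, nperseg=512, noverlap=384):
    # Rebuild a time-domain noise signal by inverse short-time Fourier
    # transform of the estimated instantaneous noise spectrum.
    _, noise = istft(noise_stft, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return noise
      </preformat>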
      <p>When noise was added, we ensured that the
signal-to-noise ratio (SNR) distribution of the resultant training
data was similar to that of this year's development data.
Moreover, since not all the utterances in this year's data were
highly noisy or reverberant, we added noise to only a randomly
selected 50% of the training data.</p>
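      <p>A sketch of the mixing step, under our own simplifying assumptions: target SNRs are drawn from an empirical sample of development-data SNRs (dev_snrs, assumed given), and the noise is looped or trimmed to the utterance length.</p>
      <preformat>
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    # Scale the noise so the mixture has the requested SNR, then add it.
    noise = np.resize(noise, speech.shape)          # loop/trim to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def corrupt_half(utterances, noise, dev_snrs, seed=0):
    # Add noise to a randomly selected 50% of the training utterances,
    # with target SNRs drawn from the development-data SNR sample.
    rng = np.random.default_rng(seed)
    out = []
    for u in utterances:
        if rng.random() > 0.5:
            out.append(add_noise_at_snr(u, noise, rng.choice(dev_snrs)))
        else:
            out.append(u)
    return out
      </preformat>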
    </sec>
    <sec id="sec-3">
      <title>3. SPEECH ENHANCEMENT</title>
      <p>
        A Wiener filter [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] was used to reduce the noise in the data.
The noise was reduced in the time domain and the enhanced
data was used for VAD and feature extraction. Initial results
(detailed in section 8) showed that the enhanced data led to
better DTW performance for some tokenizers.
      </p>
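      <p>The submitted system used the time-domain Wiener filter of [9]. As a stand-in illustration only, the sketch below applies the common frequency-domain Wiener gain G = SNR / (1 + SNR), with the noise PSD estimated from the first few frames; frame sizes are our own assumptions.</p>
      <preformat>
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(x, fs, n_noise_frames=10, nperseg=512):
    # Wiener gain G = SNR / (1 + SNR) applied per time-frequency bin;
    # the noise PSD is estimated from the first n_noise_frames frames.
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    noise_psd = np.mean(np.abs(X[:, :n_noise_frames]) ** 2,
                        axis=1, keepdims=True)
    snr = np.maximum(np.abs(X) ** 2 / (noise_psd + 1e-12) - 1.0, 0.0)
    gain = snr / (1.0 + snr)
    _, y = istft(gain * X, fs=fs, nperseg=nperseg)
    return y
      </preformat>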
    </sec>
    <sec id="sec-4">
      <title>4. VOICE ACTIVITY DETECTION</title>
      <p>
        For exact matching DTW systems, we used two voice
activity detectors (VADs), including a frequency band energy
based VAD [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (VAD1) and a statistical model based VAD
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] (VAD2), because we found that they performed the best
on different types of queries. For phoneme-sequence based
approximate matching DTW systems (detailed in section 5)
with phoneme posterior features, we used their single-best
decoding hypotheses to perform VAD and obtain phoneme
boundary information. For phoneme-sequence
approximate matching DTW systems with SBNF, we simply
borrowed the single-best decoding hypothesis of a phoneme
recognizer to perform VAD and obtain the phoneme
boundary information.
      </p>
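      <p>As a toy illustration of energy-based voice activity detection in the spirit of VAD1 (the actual detectors are described in [10] and [11]; the threshold, margin and frame sizes below are our own assumptions):</p>
      <preformat>
import numpy as np

def energy_vad(x, frame_len=400, hop=160, margin_db=9.0):
    # Frames whose log energy exceeds the utterance's rough noise floor
    # by margin_db are marked as speech (assumes len(x) >= frame_len).
    n_frames = max((len(x) - frame_len) // hop + 1, 1)
    frames = np.stack([x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    log_e = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    floor = np.percentile(log_e, 10)                # rough noise floor
    return log_e > floor + margin_db                # boolean speech mask
      </preformat>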
    </sec>
    <sec id="sec-4b">
      <title>5. DTW SEARCH</title>
      <p>
        Exact matching and approximate matching DTW systems
were developed to deal with di erent types of queries. An
exact matching system matched each query with a
subsequence of a test utterance using DTW [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. It found a
path on the cosine distance matrix between the speech features of
the query and the test utterance. The system output the
similarity score between the query and the matched
subsequence of the test utterance.
      </p>
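      <p>A minimal numpy sketch of the exact matching step: subsequence DTW over the cosine distance matrix, with free start and end points along the utterance axis. The length normalization and the negated-cost similarity are our own choices for illustration.</p>
      <preformat>
import numpy as np

def subsequence_dtw(query, utt):
    # Subsequence DTW on the cosine-distance matrix of query features
    # (Q x D) against utterance features (T x D). The warping path may
    # start and end anywhere along the utterance axis.
    qn = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-12)
    un = utt / (np.linalg.norm(utt, axis=1, keepdims=True) + 1e-12)
    dist = 1.0 - qn @ un.T                          # cosine distance matrix
    Q, T = dist.shape
    acc = np.full((Q, T), np.inf)
    acc[0] = dist[0]                                # free start in the utterance
    for i in range(1, Q):
        acc[i, 0] = dist[i, 0] + acc[i - 1, 0]
        for j in range(1, T):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],
                                         acc[i, j - 1],
                                         acc[i - 1, j - 1])
    # Free end: best length-normalized cost, negated as a similarity score.
    return -np.min(acc[-1]) / Q
      </preformat>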
      <p>
        We used two different kinds of approximate matching DTW
systems in total, fixed-window [
        <xref ref-type="bibr" rid="ref12 ref14">12, 14</xref>
        ] and
phoneme-sequence [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] approximate matching systems, to deal with
type 2 and type 3 queries. In fixed-window approximate
matching systems, as the window was shifted, the
corresponding segment of the query was matched against a test
utterance. The highest similarity score over all query segments
for a test utterance was used as the score
of the query-utterance pair. The window sizes
were set between 70 and 90 frames and the window shifts
were set between 5 and 10 frames. In phoneme-sequence
approximate matching systems, the size of the window was
determined by the phoneme boundary information derived
from phoneme recognizers. The window size was set to 8
phonemes, as it provided the best results on the development
data.
      </p>
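      <p>A sketch of fixed-window approximate matching, reusing the subsequence_dtw function from the previous sketch; the window size (80 frames) and shift (8 frames) are illustrative values picked from the ranges reported above.</p>
      <preformat>
# Reuses the subsequence_dtw function from the previous sketch.

def fixed_window_score(query, utt, win=80, shift=8):
    # Slide a fixed window over the query; match each query segment
    # against the utterance and keep the best segment score as the
    # score of the query-utterance pair.
    scores = []
    for start in range(0, max(len(query) - win + 1, 1), shift):
        scores.append(subsequence_dtw(query[start : start + win], utt))
    return max(scores)
      </preformat>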
    </sec>
    <sec id="sec-5">
      <title>6. SYMBOLIC SEARCH</title>
      <p>
        Weighted finite state transducer (WFST) based
symbolic search systems were used, as last year [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
Phoneme-sequence approximate matching [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] was used to facilitate
type 2 and type 3 queries, and to reduce the miss rate. A
sequence length of 6 phonemes was chosen, as it provided the
best matching results on the development data.
      </p>
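      <p>The submitted systems compiled queries and utterance hypotheses into WFSTs [12]; the sketch below only illustrates the 6-phoneme partial matching idea on single-best phoneme strings, scored by edit distance. It is a simplification for exposition, not the actual WFST search.</p>
      <preformat>
def edit_distance(a, b):
    # Standard Levenshtein distance between two phoneme sequences.
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cost = 0 if pa == pb else 1
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def symbolic_score(query_phones, utt_phones, n=6):
    # Partial matching with length-6 phoneme sequences: slide a window
    # over the query's phoneme string and find its closest substring in
    # the utterance; a smaller edit distance means a better match.
    best = float("inf")
    for i in range(max(len(query_phones) - n + 1, 1)):
        piece = query_phones[i : i + n]
        for j in range(max(len(utt_phones) - n + 1, 1)):
            best = min(best, edit_distance(piece, utt_phones[j : j + n]))
    return -best
      </preformat>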
    </sec>
    <sec id="sec-6">
      <title>7. TOKENIZERS AND SYSTEMS</title>
      <p>Spectral features, phoneme-state posterior features and
BNF/SBNF were used in our DTW systems.</p>
      <p>
        NWPU extracted truncated PLP [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] (a1), posterior
features from 3 BUT phoneme recognizers [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] (Czech,
Hungarian and Russian; a2-a4), 3 sets of SBNF (one
monophone-state set trained on the original training data, and two
triphone-state sets with noise1 and noise2 added to the training data,
respectively; a5-a7) trained from the English Switchboard corpus
(SWBD), and 1 set of triphone-state SBNF (a8) trained
from the SEAME corpus [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        I2R extracted 4 sets of BNF (b1-b4) and 4 sets of
SBNF (b5-b8) trained from four LDC corpora (SWBD,
Fisher Spanish, HKUST Mandarin and CallHome Egyptian),
and 5 sets of BNF (b9-b13) (4 language-dependent and 1
language-independent [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]) trained from 4 development
languages in the OpenKWS evaluation [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        NTU extracted 3 sets of BNF (c1-c3) from SWBD (one
triphone-state set trained on the original training data, and two
triphone-state sets with Noisex92 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] noise added to the training data once
and twice, respectively), and 1 set of BNF (c4)
trained from the 6 development languages in the OpenKWS
evaluation.
      </p>
      <p>NWPU's 26 DTW systems consisted of 9 exact
matching systems (using a1-a8, c4) and 4 phoneme-sequence
approximate matching systems (using a2-a4, a6). The remaining 13
systems were exactly the same as the previous 13 systems,
except that the enhanced data was used for VAD and feature
extraction.
I2R's 39 DTW systems consisted of 13 exact matching
systems (using b1-b13) and 13 fixed-window approximate
matching systems (using b1-b13) with VAD1, and 13 exact
matching systems (using b1-b13) with VAD2.</p>
      <p>
        NTU's 12 systems consisted of 4 exact matching (using
c1-c4) and 4 fixed-window approximate matching (using c1-c4)
DTW systems with VAD1, and 4 phoneme-sequence
approximate matching SS systems with 4 acoustic models trained
from SWBD and a Malay speech corpus [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        The scores of all systems in each group were fused into a
single system internally, and the 3 resultant systems were
further fused to obtain the final submitted system. In each
fusion step, scores were first normalized to zero mean and
unit variance, and then fused with the FoCal toolkit [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
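      <p>A sketch of one fusion step under the stated scheme: each system's scores are normalized to zero mean and unit variance, then combined linearly. FoCal [2] learns the combination weights and offset by logistic regression on the development data, so they are assumed given here.</p>
      <preformat>
import numpy as np

def znorm(scores):
    # Normalize one system's detection scores to zero mean, unit variance.
    return (scores - np.mean(scores)) / (np.std(scores) + 1e-12)

def fuse(system_scores, weights, offset=0.0):
    # Linear score-level fusion of K systems over the same N trials:
    # system_scores is a list of K length-N arrays, weights has length K.
    normed = np.stack([znorm(s) for s in system_scores])
    return offset + np.dot(weights, normed)
      </preformat>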
    </sec>
    <sec id="sec-7">
      <title>8. RESULTS AND CONCLUSION</title>
      <p>Table 1 shows the performance gain of an exact matching
DTW system on the development set when noise1 and noise2
were added to the SWBD data for training triphone SBNF.
The results show that adding the noise to the training data
gives 1.8% relative improvement on all query types and 3.8%
relative improvement on type 1 queries in minCnxe.</p>
      <p>When the enhanced data was used to extract SWBD
monophone SBNF and BUT Czech and Hungarian phoneme-state
posterior features for our DTW systems, we observed
relative improvements of 1.9-3.1% on all query types and relative
improvements of 2.7-6.3% on type 1 queries in minCnxe.</p>
      <p>Table 2 shows the performance of our final submitted
system on this year's data. In the intra-group fusion, each
group obtained performance gains by fusing exact
matching and approximate matching systems, and by fusing systems
using different speech preprocessing techniques and different
tokenizers. Compared with our single best exact matching
DTW system (s2 in Table 1), system fusion brings around
13.5% relative improvement in minCnxe on the development
data (all query types).</p>
      <p>The peak memory usage (PMU) of all DTW systems is
1.45GB when 1 set of 30-dimensional SBNF is loaded, and
the searching speed factor (SSF) is around 0.0044 for each
DTW system. The PMU of all SS systems is 45GB, and the
SSF is around 0.0012 for each SS system.</p>
      <p>We adopted noise robustness techniques to deal with the
noise conditions of the data, which led to better search
performance. We also obtained performance gains by fusing
systems using different tokenizers, different VADs and
different search algorithms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Proenca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lojka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , "
          <article-title>Query by example search on speech at mediaeval 2015,"</article-title>
          <source>Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          , Sept.
          <fpage>14</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>2015</year>
          , Wurzen, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brummer</surname>
          </string-name>
          , "
          <article-title>FoCal: Toolkit for Evaluation, Fusion and Calibration of statistical pattern recognizers,"</article-title>
          https://sites.google.com/site/nikobrummer/focal.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yao</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Yao</surname>
          </string-name>
          , "
          <article-title>Analyzing classical spectral estimation by MATLAB,"</article-title>
          <source>Journal of Huazhong University of Science and Technology</source>
          , vol.
          <volume>4</volume>
          , p.
          <fpage>021</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Oppenheim</surname>
          </string-name>
          , "
          <article-title>Signal reconstruction from phase or magnitude,"</article-title>
          <source>IEEE Transactions on Acoustics, Speech and Signal Processing</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>672</fpage>
          -
          <lpage>680</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Gruber</surname>
          </string-name>
          , "
          <article-title>Statistical digital signal processing and modeling,"</article-title>
          <source>Technometrics</source>
          , vol.
          <volume>39</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>336</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benesty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Doclo</surname>
          </string-name>
          , "
          <article-title>New insights into the noise reduction Wiener filter,"</article-title>
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          , vol.
          <volume>14</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>1218</fpage>
          -
          <lpage>1234</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Benesty</surname>
          </string-name>
          , "
          <article-title>Filtering techniques for noise reduction and speech enhancement,"</article-title>
          <source>in Adaptive Signal Processing</source>
          . Springer,
          <year>2003</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Diethorn</surname>
          </string-name>
          , "
          <article-title>Subband noise reduction methods for speech enhancement," in Audio Signal Processing for Next-Generation Multimedia Communication Systems</article-title>
          . Springer,
          <year>2004</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benesty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and T. Gaensler, "
          <article-title>On single-channel noise reduction in the time domain,"</article-title>
          <source>in Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <source>2011 IEEE International Conference on. IEEE</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>277</fpage>
          -
          <lpage>280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Cornu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sheikhzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Brennan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Abutalebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Tam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Iles</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Wong</surname>
          </string-name>
          , "
          <article-title>ETSI AMR-2 VAD: evaluation and ultra low-resource implementation,"</article-title>
          <source>in Multimedia and Expo, 2003 (ICME '03), 2003 International Conference on</source>
          , vol.
          <volume>2</volume>
          . IEEE,
          <year>2003</year>
          , pp. II-841.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Huijbregts</surname>
          </string-name>
          and F. de Jong, "
          <article-title>Robust speech/non-speech classification in heterogeneous multimedia content,"</article-title>
          <source>Speech Communication</source>
          , vol.
          <volume>53</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>153</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-C. Leung</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S. J.</given-names>
          </string-name>
          <string-name>
            <surname>Leow</surname>
          </string-name>
          et al., "
          <article-title>The NNI query-by-example system for MediaEval 2014,"</article-title>
          <source>Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Spain, Oct.
          <fpage>16</fpage>
          -
          <lpage>17</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Muscariello</surname>
          </string-name>
          , G. Gravier, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Bimbot</surname>
          </string-name>
          , "
          <article-title>Audio keyword extraction by unsupervised word discovery,"</article-title>
          <source>in INTERSPEECH 2009: 10th Annual Conference of the International Speech Communication Association</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-C. Leung</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S. J.</given-names>
          </string-name>
          <string-name>
            <surname>Leow</surname>
          </string-name>
          et al., "
          <article-title>Language independent query-by-example spoken term detection using n-best phone sequences and partial matching,"</article-title>
          <source>in Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <source>2015 IEEE International Conference on. IEEE</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>5191</fpage>
          -
          <lpage>5195</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-C. Leung</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>E. S.</given-names>
          </string-name>
          <string-name>
            <surname>Chng</surname>
            , and
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          , "
          <article-title>Spoken term detection technology based on DTW (to be published),"</article-title>
          <source>Journal of Tsinghua University (Sci and Tech)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dupoux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goldwater</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , S. Khudanpur,
          <string-name>
            <given-names>K.</given-names>
            <surname>Church</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hermansky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rose</surname>
          </string-name>
          , "
          <article-title>A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition,"</article-title>
          <source>in Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <source>2013 IEEE International Conference on. IEEE</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>8111</fpage>
          -
          <lpage>8115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Matejka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Cernocky</surname>
          </string-name>
          , "
          <article-title>Hierarchical structures of neural networks for phoneme recognition,"</article-title>
          <source>in Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <source>2006 IEEE International Conference on. IEEE</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , "
          <article-title>SEAME: a Mandarin-English code-switching speech corpus in South-East Asia."</article-title>
          <source>INTERSPEECH 2010: 11th Annual Conference of the International Speech Communication Association</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Vesely</surname>
          </string-name>
          , M. Karafiat, F. Grezl,
          <string-name>
            <given-names>M.</given-names>
            <surname>Janda</surname>
          </string-name>
          , and E. Egorova, "
          <article-title>The language-independent bottleneck features,"</article-title>
          <source>in Spoken Language Technology Workshop (SLT), 2012 IEEE</source>
          . IEEE,
          <year>2012</year>
          , pp.
          <fpage>336</fpage>
          -
          <lpage>341</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] "
          <article-title>Open keyword search 2015 evaluation,"</article-title>
          http://www.nist.gov/itl/iad/mig/openkws15.cfm.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Varga</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Steeneken</surname>
          </string-name>
          , "
          <article-title>Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,"</article-title>
          <source>Speech Communication</source>
          , vol.
          <volume>12</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>251</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Chng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , "
          <article-title>MASS: A Malay language LVCSR corpus resource,"</article-title>
          <source>in Speech Database and Assessments</source>
          ,
          <source>2009 Oriental COCOSDA International Conference on. IEEE</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>