The IIT-B Query-by-Example System for MediaEval 2015

Hitesh Tulsiani, Preeti Rao
Department of Electrical Engineering, Indian Institute of Technology Bombay, India
{hitesh26, prao}@ee.iitb.ac.in

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT

This paper describes the system developed at I.I.T. Bombay for the Query-by-Example Search on Speech Task (QUESST) within the MediaEval 2015 evaluation framework. Our system preprocesses the data to remove noise and performs subsequence DTW on posterior/bottleneck features obtained using four phone recognition systems to detect the queries. Scores from these subsystems are fused to obtain a single score per query-utterance pair, which is then calibrated with respect to the cross-entropy evaluation metric.

1. INTRODUCTION

The goal of the QUESST task within the MediaEval 2015 framework is to determine the presence of a spoken query in an unlabelled speech data set by building a language-independent system. In this year's QUESST task, the data consisted of about 18 hours of noisy audio from 7 different languages. More details about the task can be found in [1].

To minimize the effect of noise, we preprocess the data (both the queries and the utterances) and follow this with speech activity detection to remove silence frames. Our approach to the task is inspired by Hazen et al. [2]. A block diagram of our system, itself inspired by [3], is shown in Figure 1.

[Figure 1: Block diagram of the IIT-B system]

2. SYSTEM DESCRIPTION

2.1 Preprocessing - Noise Removal

We use spectral subtraction to remove noise from the audio. The power spectral density (PSD) of the noise is estimated using the minimum-statistics technique described by R. Martin [4]. This technique assumes that during speech pauses, or within brief periods between words, the speech energy is close to zero. Thus, by tracking the minimum power within a finite window large enough to bridge high-power speech segments, the noise floor can be estimated. We then remove the silence at the start and end of each utterance using a simple energy-based speech activity detector.

2.2 Subsystems

We make use of four subsystems:

1. Two DNN-based phone recognisers (Hungarian and Russian) trained on the SpeechDat-E corpus by Brno University of Technology (BUT) [5]. These are used to extract posterior and bottleneck features.

2. A phone recogniser trained on a Hindi database [6] (referred to as the TIFR phone recogniser from here on). The TIFR phone recogniser is MLP based and is trained using 39-dimensional MFCC features. It has a single hidden layer with 700 neurons and 36 output neurons. We extract phone posteriors using the TIFR phone recogniser.

3. A 64-component GMM system trained in an unsupervised manner on the QUESST 2015 database using 36-dimensional MFCC features [7] (the energy of the audio was not used as a feature because large energy variations were observed across utterances). We use this system to extract Gaussian posteriorgrams.

Query Type |         eval            |          dev
           | actCnxe/minCnxe | ATWV/MTWV      | actCnxe/minCnxe | ATWV/MTWV
T1         | 0.9330/0.9117   |  0.0531/0.0661 | 0.8971/0.8680   | 0.1434/0.1449
T2         | 0.9852/0.9637   | -0.0099/0.0178 | 0.9214/0.9113   | 0.0492/0.0528
T3         | 0.9313/0.9109   |  0.0525/0.0627 | 0.9348/0.9210   | 0.0454/0.0461
overall    | 0.9536/0.9364   |  0.0254/0.0421 | 0.9213/0.9082   | 0.0812/0.0816

Table 1: Overall and per-query-type (T1/T2/T3) summary of results on the evaluation and development datasets.

2.3 DTW

We use the standard subsequence DTW as implemented in [3]. The query is allowed to start at any frame of the test utterance, and the locally optimal detection is the one that has the smallest accumulated distance. Also, to avoid a preference for shorter paths, accumulated distances are normalized by the corresponding detected path lengths.
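The subsequence search just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toolkit of [3] is replaced by a plain NumPy dynamic program, a cosine-style distance stands in for the correlation/inner-product measures used in the actual system, and no band constraints are applied.

```python
import numpy as np

def subsequence_dtw(query, utt):
    """Subsequence DTW: the query may start and end at any utterance frame.

    query, utt: (n_frames, n_dims) arrays of per-frame features.
    Returns (normalized_cost, start, end) of the best matching segment,
    where the accumulated cost is divided by the warping-path length
    to avoid a bias towards shorter paths.
    """
    n, m = len(query), len(utt)
    # Local distance: 1 - cosine similarity (a stand-in for the
    # correlation / inner-product distances of the actual system).
    qn = query / np.linalg.norm(query, axis=1, keepdims=True)
    un = utt / np.linalg.norm(utt, axis=1, keepdims=True)
    dist = 1.0 - qn @ un.T                    # (n, m) local distances

    acc = np.full((n, m), np.inf)             # accumulated cost
    length = np.zeros((n, m), dtype=int)      # warping-path length
    start = np.zeros((n, m), dtype=int)       # start frame of each path
    acc[0] = dist[0]                          # query may start anywhere
    length[0] = 1
    start[0] = np.arange(m)
    for i in range(1, n):
        for j in range(m):
            # Predecessors: (i-1, j), (i-1, j-1), (i, j-1).
            cands = [(acc[i - 1, j], length[i - 1, j], start[i - 1, j])]
            if j > 0:
                cands.append((acc[i - 1, j - 1], length[i - 1, j - 1],
                              start[i - 1, j - 1]))
                cands.append((acc[i, j - 1], length[i, j - 1],
                              start[i, j - 1]))
            c, l, s = min(cands, key=lambda t: t[0])
            acc[i, j] = c + dist[i, j]
            length[i, j] = l + 1
            start[i, j] = s
    norm = acc[-1] / length[-1]               # path-length normalization
    end = int(np.argmin(norm))                # query may end anywhere
    return float(norm[end]), int(start[-1, end]), end
```

The path-length normalization in the last step corresponds to the normalization by detected path length mentioned above; without it, the minimum over end frames would systematically favour short warping paths.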
For the distance measure, we used the Pearson product-moment correlation for bottleneck features (BUT Hungarian and Russian) and the inner product for posteriors (BUT Hungarian and Russian, TIFR, 64-GMM). A filtering step is then applied to remove detected candidates whose duration is very large or very small compared to the query length.

2.4 Fusion and Calibration

Our approach is most similar to the discriminative fusion approach proposed by A. Abad et al. [8]. Scores are first normalized to zero mean and unit variance per query to allow the use of a single threshold. The detections are then aligned, and only those detections for which at least half of the systems overlap in time are retained (majority voting), to reduce false alarms. This leaves us with multiple detections of a query in an utterance, so for each query-utterance pair we obtain multiple score vectors (a score vector is the collection of scores from all the subsystems for a possible detection of the query in an utterance). Our score vector has six elements (BUT Hungarian posterior and bottleneck, BUT Russian posterior and bottleneck, TIFR posterior, GMM posterior).

Since the task requires only one score per query-utterance pair, we determine the best score vector per query-utterance pair using a two-step procedure:

1. The first step is inspired by Hazen et al. [2]. Scores S(X|Ki) from the various subsystems are combined according to the equation:

   S(X|K1 K2 ... KN) = -(1/α) log( (1/N) Σ_{i=1}^{N} exp(-α S(X|Ki)) )    (1)

   where varying α between 0 and 1 changes the averaging function from the geometric mean to the arithmetic mean (we have used α = 1).

2. In the second step, we use the combined score obtained in the first step to determine the best candidate for an utterance. We retain the individual scores of the subsystems along with the combined score (obtained using Equation 1) corresponding to the best detected candidate, thus giving us one score vector per query-utterance pair.

All of these score vectors (corresponding to different query-utterance pairs) are then used to train a binary logistic classifier [3], which gives us the fused score for each query-utterance pair. The fused scores are then calibrated with respect to the cross-entropy evaluation metric to give us a log-likelihood score.

3. RESULTS AND DISCUSSION

Table 1 shows our results for the development and evaluation queries. Probably due to the high amount of noise (and reverberation) in the dataset, the overall cross-entropy score is poor even after noise removal. Looking at the scores for each query type, our system clearly works best for the T1 query type. This can be attributed to the fact that we did not take any special steps to counter the T2 and T3 query types, such as word-level reordering (for T2 queries) and partial matching (for T3 queries). Also, we did not calibrate our scores for Term Weighted Values (TWV), resulting in very low ATWV/MTWV scores.

We observed that after subsequence DTW, many possible detections (candidates) were found for a query in an utterance. This clearly suggests that the posterior and bottleneck features used were not robust enough for the given noisy and multilingual data. Also, we rely heavily on the first step of fusion, which is simply the arithmetic mean of the subsystem scores (since α = 1), to detect the best candidate for a given query-utterance pair. A high score from even one of the subsystems can therefore bias the combined score (obtained after Step 1 of fusion) towards it, leading to the selection of that candidate over other candidates with moderate scores from all the systems.

Our experiments were run on a computer with an Intel i7-4790 CPU (3.60 GHz, 8 cores) and 16 GB of RAM. For searching, the posteriorgrams for a query-utterance pair were loaded into memory, which caused high memory usage for longer utterances (peak usage of around 15 GB). It took around 80 hours to search approximately 475 seconds of query in the 18 hours of audio database per subsystem, leading to an SSF of 0.0093 per second.

4. CONCLUSION

We have described the system developed at IIT-B for the QUESST task. To combat the effect of noise in the data, we used spectral subtraction. Spectral subtraction reduces noise but is also known to create artifacts in speech, so the posterior/bottleneck features were not robust enough for the given noisy and multilingual data. It would be interesting to study the performance of our system without noise suppression. The main novelty of our work is a two-step fusion approach in which the first step decides the best candidate for a query-utterance pair and the second step trains a logistic regression classifier. The effect of the first step of fusion, for different values of α, on the cross-entropy score remains to be investigated.

5. REFERENCES

[1] I. Szőke, L. J. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proenca, M. Lojka, and X. Xiong. Query by example search on speech at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 14-15 September 2015.
[2] T. Hazen, W. Shen, and C. White. Query-by-example spoken term detection using phonetic posteriorgram templates. In Proc. ASRU, 2009.
[3] I. Szőke, L. Burget, F. Grézl, J. Černocký, and L. Ondel. Calibration and fusion of query-by-example systems - BUT SWS 2013. In Proc. ICASSP, pages 7899-7903, 2014.
[4] R. Martin. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on Speech and Audio Processing, 9(5):504-512, 2001.
[5] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky. Probabilistic and bottle-neck features for LVCSR of meetings. In Proc. ICASSP, pages 757-760, 2007.
[6] V. Chourasia, K. Samudravijaya, and M. Chandwani. Phonetically rich Hindi sentence corpus for creation of speech database. In Proc. O-COCOSDA, pages 132-137, 2005.
[7] Y. Zhang and J. Glass. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In Proc. ASRU, pages 398-403, 2009.
[8] A. Abad, L. Rodríguez-Fuentes, M. Penagarikano, A. Varona, and G. Bordel. On the calibration and fusion of heterogeneous spoken term detection systems. In Proc. INTERSPEECH, pages 20-24, 2013.
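As a closing illustration, the score combination of Equation (1) in Section 2.4 can be sketched as follows. This is a minimal sketch, not the authors' code: the per-query normalization, detection alignment, and candidate selection surrounding this step are omitted, and a log-sum-exp rearrangement is used for numerical stability.

```python
import numpy as np

def combine_scores(scores, alpha=1.0):
    """Combine per-subsystem scores S(X|Ki) as in Eq. (1):

        S = -(1/alpha) * log( (1/N) * sum_i exp(-alpha * S_i) )

    As alpha -> 0 this approaches the arithmetic mean of the scores;
    large alpha increasingly emphasizes the smallest S_i.
    """
    s = np.asarray(scores, dtype=float)
    n = s.size
    # Log-sum-exp trick: shift by the max of the exponents so that
    # np.exp never overflows for large alpha or large scores.
    m = (-alpha * s).max()
    return float(-(np.log(np.exp(-alpha * s - m).sum() / n) + m) / alpha)
```

With the paper's setting alpha = 1 and z-normalized scores, this behaves close to a plain average of the six subsystem scores, which is consistent with the sensitivity to a single high-scoring subsystem discussed in Section 3.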