The SPL-IT-UC Query by Example Search on Speech system for MediaEval 2015 Jorge Proença1,2, Luis Castela2, Fernando Perdigão1,2 1 Instituto de Telecomunicações, Coimbra, Portugal 2 Electrical and Computer Eng. Department, University of Coimbra, Portugal {jproenca, fp}@co.it.pt ABSTRACT This document describes the system built by the SPL-IT-UC team 2. SYSTEM DESCRIPTION from the Signal Processing Lab of Instituto de Telecomunicações 2.1 Noise filtering (pole of Coimbra) and University of Coimbra for the Query by First, we apply a high pass filter to the audio signals to remove Example Search on Speech Task (QUESST) of MediaEval 2015. low frequency artefacts. Then, to tackle the existence of The submitted system filters considerable background noise by substantial stationary background noise in both queries and applying spectral subtraction, uses five phonetic recognizers from reference audio, we apply spectral subtraction (SS) to noisy which posterior probabilities are extracted as features, implements signals (not performed for high SNR signals, which was novel modifications of Dynamic Time Warping (DTW) that focus worsening results). This implies a careful selection of samples of on complex queries, and uses linear calibration and fusion to noise from an utterance. For this, we analyze the log mean Energy optimize results. This year’s task proved extra challenging in of the signal, consider only values above -60dB and calculate a terms of acoustic conditions and match cases, though we observe threshold below which segments of more than 100ms are selected the best results when merging all complex approaches. as “noise” samples, whose mean spectrum will be subtracted from the whole signal. 1. INTRODUCTION Using dithering (white noise) to counterbalance the musical MediaEval’s challenge for audio query search on audio, noise effect due to SS didn’t help. Nothing was specifically QUESST [1], keeps adding new relevant problems to tackle, and performed for reverberation or intermittent noise. in depth details can be consulted in the referenced paper. Briefly, 2.2 Phonetic recognizers this year, the audio can have significant background or We continue to use an available external tool based on neural intermittent noise as well as reverberation, and there are queries networks and long temporal context, the phoneme recognizers that originate from spontaneous requests. These conditions further from Brno University of Technology (BUT) [10]. We used the approach real case scenarios of a query search application, which three available systems for 8kHz audio, for three languages: is one of the underlying motivations of the challenge. Czech, Hungarian and Russian. Additionally, we trained two new Systems for Query by Example (QbE) search keep improving systems with the same framework: English (using TIMIT and with recent advances such as combining spectral acoustic and Resource Management databases) and European Portuguese temporal acoustic models [2], combining a high number of (using annotated broadcast news data and a dataset of command subsystems using both Acoustic Keyword Spotting (AKWS) and words and sentences). Using different languages implies dealing Dynamic Time Warping (DTW) and using bottleneck features of with different sets of phonemes, and the fusion of the results will neural networks as input [3], new distance normalization better describe the similarities between what is said in a query and techniques [4] and several approaches to system fusion and the searched audio. This makes our system a low-resource one. calibration [5]. Some attempts have been made to address All de-noised queries and audio files were run through the 5 complex query types that were introduced in QUESST 2014, by systems, extracting frame-wise state-level posterior probabilities segmenting the query in some way such as using a moving (with 3 states per phoneme) to be analyzed separately. window [6] or splitting the query in the middle [7]. Our approach is based on modifying the DTW algorithm to allow paths to be 2.3 Voice Activity Detection created in ways that conform to the complex types, which has Silence or noise segments are undesirable for a query search, shown success in improving overall results [8, 9]. and were cut on queries from all frames that had a high Using bottleneck features could be an important step to improve probability of corresponding to silence or noise, if the sum of the our systems, and although we did not consider them yet, we are 3 state posteriors of silence or noise phones is greater than a 50% certain that there are still improvements to be made not related to threshold for the average of the 5 languages. To account for the feature extractor. Nevertheless, we tried to improve the system queries that may still have significant noise, this threshold is feature-wise by implementing additional phonetic recognizers. incrementally raised if the previous cut is too severe (the obtained Compared to last year’s submission, we reduce severe background query having less than 500ms). noise in the audio by applying spectral subtraction, add two 2.4 Modified Dynamic Time Warping phonetic recognizers (five in total) from which to extract Every query must then be searched on every reference audio. posteriorgrams, remove all silence and noise frames from queries, We continue to use DTW as a way to find paths on the audio that improve and add modifications to Dynamic Time Warping (DTW) may be similar to the query. As in [11], the local distance is based for complex query search (filler inside a query being the novelty), on the dot product of query and audio posterior probability and implement better fusion and calibration methods. vectors, with a back-off of 10-4 and minus log applied to the dot Copyright is held by the author/owner(s). product, resulting in the local distance matrix of a query-audio MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany pair, ready for DTW to be applied. Before dot product, removing the posterior probabilities of silence and noise altogether, and 2.6 Processing Speed normalizing the probabilities of speech phones to 1 did not help. The hardware that processed our systems was the CRAY CX1 In addition to separate searches on distance matrices from Cluster, running windows server 2008 HPC, and using 16 of 56 posteriorgrams of 5 languages, we add a 6th “language”/sub- cores (7 nodes with double Intel Xeon 5520 2.27GHz quad-core system (called ML for multi-language) whose distance matrix is and 24GB RAM per node). the average of the 5 matrices. Approximately, the Indexing Speed Factor was 2.14, Searching We employ several modifications to the DTW algorithm to Speed Factor was 0.0034 per sec, and Peak Memory was 120MB. allow intricate paths to be constructed that can correspond logically to the complex match scenarios of query and audio. The 3. SUBMISSIONS AND RESULTS basic approach (named A1) outputs the best path (minimal We submitted 4 systems for evaluation: fusion of all approaches normalized distance) by using the complete query and allows 3 and languages with and without side-info; fusion of harmonic movements in the distance matrix with equal weight: horizontal, mean with and without side-info. Table 1 summarizes the results vertical and diagonal. As in previous work [8, 9], 4 modifications of the Cnxe metric for the 4 systems. are made that do not require repeating DTW with segmentations of the query: Table 1. Results of Cnxe (and MinCnxe) for development and (A2) Considering cuts at the end of query for lexical variations; evaluation datasets. (A3) Considering cuts at the beginning of the query; Fusion Systems Dev: Cnxe, MinCnxe Eval: Cnxe, MinCnxe (A4) Allowing one ‘jump’ for extra content in the audio; (A5) Allowing word-reordering, where an initial part of the All + side-info 0.7782, 0.7716 0.7866, 0.7809 query may be found ahead of the last. Hmean + side-info 0.7862, 0.7800 0.7842, 0.7786 Additionally, we designed an extra approach to address the case All, no side 0.7873, 0.7816 0.7930, 0.7875 of type 3 queries where small fillers or hesitations in the query may exist: Hmean, no side 0.7957, 0.7893 0.7914, 0.7865 (A6) Allowing one ‘jump’ along the query, of maximum 33% of query duration. It can be seen that the best result for the development set was Each distance is normalized per the number of movements our primary system that fused all languages and approaches plus (mean distance of the path). some side-info. As suspected, the weighted combination applied to the Eval set may be too over fitted to the Dev set, as the best 2.5 Fusion and Calibration result on Eval was using the Hmean of approaches. The At this stage, we have distance values for each audio-query pair considered side-info always helped. for 6 languages and 6 DTW strategies (36 vectors). First, Analyzing the best result on Eval per query type (All-0.7842, modifications are performed on the distribution per query per T1-0.7107, T2-0.8147, T3-0.8115), the exact matches of type 1 strategy. While deciding on a maximum distance value to give to are the easiest to detect compared to other types. unsearched cases (such as an audio being too short for a long Using only each individual DTW approach and fusing, the query), we found that drastically truncating large distances results on the Dev set are: A1: 0.8041, A2: 0.7978, A3: 0.8335, (lowering to the same value) improved results. Surprisingly, A4: 0.8137, A5: 0.8184, A6: 0.8460. The overall best performing changing all upper distance values (larger than the mean) to the strategy is allowing cuts at the end of the query (A2), which may mean of the distribution was the overall best. We reason that since help in all cases due to co-articulation or intonation. The new there are a lot of ground truth matches with very high distances strategy of allowing a jump in query (A6) performs badly and (false negatives), lowering these values improves the Cnxe metric should be reviewed. Actually, a filler in a query may be an more than lowering the value of true negatives worsens it. The extension of an existing phone, which leads to a straight path and next step is to normalize per query by subtracting the new mean not a jump. and dividing by the new standard deviation. Distances are Below, we also report the improvements of some steps of our transformed to figures-of-merit by taking the symmetrical value. system on the Dev set (although the comparison may not be to the To fuse results of different strategies and languages we have final approach). Using Spectral Subtraction resulted in 0.8130 two separate approaches/systems, both using weighted linear Cnxe from 0.8368. Fusing 5 languages - 0.7971 Cnxe, using only fusion and transformation trained with the Bosaris toolkit [12], the mean distance matrix ML - 0.8136, using 5 langs and ML - calibrating for the Cnxe metric by taking into account the prior 0.7873. Using per query truncation to the mean – 0. 7873 Cnxe, defined by the task: without truncation – 0.7939. - Fusion of all approaches and all languages (36 vectors). - Fusion of the Harmonic mean (Hmean) of the 6 strategies for 4. CONCLUSIONS a given language (6 vectors, one per language). This is done to Several steps were explored to improve the results of existing possibly counter overfitting to the training data from weighing 36 methods, and the main contributions came from: a careful Spectral vectors, and only languages are weighed. Subtraction to diminish background noise which greatly From a fusion, final result vectors with only one value per influences the output of phonetic recognizers; using the average audio-query pair are obtained for development and test data. distance matrix of all languages as a 6th sub-system for fusion; Additionally, we provide side-info based on query and audio, including side-info of query and audio; and per-query truncation added as extra vectors for all fusions. The 7 extra side-info vectors of large distances. are: mean of distances per query before truncation and Including a DTW strategy that considers gaps in query did not normalization from the best approach and language (the highest prove very successful. This may be due to its target cases being weighted from fusion of all); query size in frames and log of query too few in the dataset, and even some fillers in query being size; 4 vectors of SNR values (original SNR of query and of extensions and not unrelated hesitations. audio, post spectral subtraction SNR of query and of audio). 5. REFERENCES [6] P. Yang, et al., “The NNI Query-by-Example System for [1] I. Szöke, L.J. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. MediaEval 2014”, in Working Notes Proceedings of the Metze, J. Proença, M. Lojka, and X. Xiong, "Query by Mediaeval 2014 Workshop, Barcelona, Spain, October 16-17 Example Search on Speech at Mediaeval 2015", in Working [7] I. Szöke, M. Skácel and L. Burget, “BUT QUESST 2014 Notes Proceedings of the Mediaeval 2015 Workshop, system description”, in Working Notes Proceedings of the Wurzen, Germany, September 14-15 Mediaeval 2014 Workshop, Barcelona, Spain, October 16-17 [2] C. Gracia, X. Anguera, and X. Binefa, “Combining temporal [8] J. Proença, A. Veiga and F. Perdigão, “The SPL-IT Query by and spectral information for Query-by-Example Spoken Example Search on Speech system for MediaEval 2014”, in Term Detection,” in Proc. European Signal Processing Working Notes Proceedings of the Mediaeval 2014 Conference (EUSIPCO), Lisbon, Portugal, 2014, pp. 1487- Workshop, Barcelona, Spain, October 16-17 1491. [9] J. Proença, A. Veiga and F. Perdigão, "Query by Example [3] I. Szöke, L. Burget, F. Grezl, J.H. Cernocky, and L. Ondel, Search with Segmented Dynamic Time Warping for Non- “Calibration and fusion of query-by-example systems—BUT Exact Spoken Queries", Proc European Signal Processing SWS 2013,” in Proc IEEE International Conference on Conf. - EUSIPCO, Nice, France, August, 2015. Acoustics, Speech and Signal Processing (ICASSP), [10] Phoneme recognizer based on long temporal context, Brno Florence, Italy, 2014, pp. 7849-7853. University of Technology, FIT, http://speech.fit.vutbr.cz/ [4] L.J. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. software/phoneme-recognizer-based-long-temporal-context Bordel, and M. Diez, “High-performance query-by-example [11] T.J. Hazen, W.Shen and C.M. White, “Query-by-example spoken term detection on the SWS 2013 evaluation,” in Proc spoken term detection using phonetic posteriorgram IEEE International Conference on Acoustics, Speech and templates”, In ASRU 2009: 421-426. Signal Processing (ICASSP), Florence, Italy, 2014, pp. 7819-7823. [12] N. Brummer, and E. de Villiers, “The BOSARIS Toolkit User Guide: Theory, Algorithms and Code for Binary [5] A. Abad, L.J. Rodriguez Fuentes, M. Penagarikano, A. Classifer Score Processing,” Technical report, 2011. https:// Varona, M. Diez, and G. Bordel, “On the calibration and sites.google.com/site/bosaristoolkit/ fusion of heterogeneous spoken term detection systems,” in Proc. Interspeech 2013, Lyon, France, 2013, pp. 20-24.