Telefonica System for the Spoken Web Search Task at MediaEval 2011

Xavier Anguera
Telefonica Research
Torre Telefonica-Diagonal 00
08019 Barcelona, Spain
xanguera@tid.es

ABSTRACT
This working paper describes the system proposed by Telefonica Research for the Spoken Web Search task within the MediaEval 2011 benchmarking evaluation campaign. The proposed system is based exclusively on a pattern-matching approach, which is able to perform a query-by-example search with no prior knowledge of the acoustics or the language being spoken. The system's main contribution is the use of a novel method to obtain speaker-independent acoustic features, which are then matched with a DTW-like algorithm. The results obtained are promising and show, in our opinion, the potential of this class of techniques for the task.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Miscellaneous

General Terms
Algorithms, Performance, Experimentation

Keywords
Pattern matching, query-by-example, spoken query, search

1. INTRODUCTION
The objective of the Spoken Web Search task is to search for a given audio query within a set of audio content; for a detailed explanation refer to [4]. The audio content in this particular evaluation contains phone call excerpts recorded in 4 different languages within the World Wide Telecom Web project [3] conducted by IBM. The system we propose to tackle this task is based on audio pattern matching between the query and the audio content to retrieve putative matches.

2. SYSTEM DESCRIPTION
The proposed system can be split into two main blocks: the acoustic feature extraction and the query search. The goal of the acoustic feature extraction is to obtain features that contain information about what has been said while being speaker independent, so that the system is able to recognize two instances of the same spoken word even if they were spoken by different speakers. The query search module searches for every query over all the acoustic material to identify whether (and where) the query appears. Transversal to both modules, we apply a simple silence detection algorithm to eliminate long silences in the queries and in the audio content. Next we describe these three modules in more detail.

2.1 Silence Detection and Removal
Early on in our development we noticed that most queries were spoken in isolation. This means that the spoken query is always accompanied by some silence at the beginning and at the end. In addition, some phone call excerpts also contained unwanted long stretches of silence. In order to eliminate most silence regions without jeopardizing the non-silence ones, we applied a simple energy-based thresholding algorithm, individually for every file, as follows: first, we compute the average energy of the signal over windows of 200 ms, every 5 ms. Then we find the smallest energy value and the average of the top 1% highest energy values (we average several values rather than taking a single maximum in order to reduce the effect of outliers). Next we compute a threshold at 5% of the resulting dynamic range above the minimum energy value. Finally, we apply this threshold every 5 ms of the input signal to differentiate between speech and silence. To avoid fast changes between speech and silence, we apply a top-hat algorithm with a window of 100 ms to the binary output of the previous step, ensuring that no silence or speech segment shorter than 100 ms is output.
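To make this concrete, the following is a minimal sketch of such an energy-based detector. The 200 ms / 5 ms windowing, the averaging of the top 1% energies, the 5% threshold and the 100 ms minimum segment length come from the description above; the function and parameter names are ours, and the 100 ms top-hat cleanup is approximated here with a morphological opening followed by a closing (via SciPy), which removes speech and silence segments shorter than the window.

    import numpy as np
    from scipy.ndimage import binary_closing, binary_opening  # assumed library choice

    def detect_speech(signal, sample_rate, win=0.200, hop=0.005,
                      top_frac=0.01, thr_frac=0.05, min_len=0.100):
        """Return one speech/silence boolean per 5 ms hop (sketch of Section 2.1)."""
        signal = np.asarray(signal, dtype=float)
        win_n, hop_n = int(win * sample_rate), int(hop * sample_rate)
        # Average energy over 200 ms windows, computed every 5 ms.
        energy = np.array([np.mean(signal[i:i + win_n] ** 2)
                           for i in range(0, max(1, len(signal) - win_n), hop_n)])
        e_min = energy.min()
        # Average of the top 1% energies, to reduce the effect of outliers.
        n_top = max(1, int(len(energy) * top_frac))
        e_max = np.sort(energy)[-n_top:].mean()
        # Threshold at 5% of the dynamic range above the minimum energy.
        threshold = e_min + thr_frac * (e_max - e_min)
        speech = energy > threshold
        # Remove speech/silence segments shorter than 100 ms.
        structure = np.ones(int(min_len / hop), dtype=bool)
        return binary_closing(binary_opening(speech, structure), structure)

The segments labelled as silence are then removed from both the queries and the audio content.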
No information at all is used regarding the language the queries are spoken in or their content (i.e. the transcription).

2.2 Acoustic Feature Extraction
Most of our effort in this year's evaluation went into designing a good acoustic feature extraction module. Our goal was to extract from the audio signal features that retain all the acoustic information about what was said while being speaker and background independent. As a side objective, we also wanted to be as independent as possible of outside training data.

We based the design of our feature extractor on previous work that started with [1], which used phone posterior probabilities as features, and was later extended in [5] to the automatic word discovery task. Similarly to [5], for our main submission we train a Gaussian Mixture Model and store the Gaussian posterior probabilities (normalized to sum to 1) as our features. In our case we decided to use only the development data available for the SWS task, so no external data was used to train this model. In addition, once the GMM has been trained with the EM-ML algorithm, we perform a hard assignment of each frame to its most likely Gaussian and retrain each Gaussian's mean and variance to optimally model the frames assigned to it. This last step tries to solve a problem most EM-ML systems have: they optimize the Gaussian parameters to maximize the overall likelihood of the model on the input data, but not to discriminate between the different sounds in it. By performing the final assignment and retraining step we push the Gaussians apart from each other so that they better model individual groups of frames depending on their location and density.

Alternatively, we also submitted a contrastive system that binarizes the posterior probability vector of each frame. This is inspired by our recent developments in speaker verification [2], where we show that we can effectively build binary models to discriminate between speakers. Such representations are much smaller for storage purposes and can be processed much faster, as binary distances are usually very fast to compute. In this case, for every posterior probability vector we set to 1 the 20% highest probabilities and to 0 the rest. The chosen distance between two binary vectors x and y was defined as

    S_d(x, y) = \frac{\sum_{i=1}^{N} x[i] \wedge y[i]}{\sum_{i=1}^{N} x[i] \vee y[i]}    (1)

where \wedge indicates the boolean AND operator and \vee indicates the boolean OR operator.
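As an illustration of this feature extraction, here is a minimal sketch assuming scikit-learn for the initial EM-ML training; the library choice, the number of Gaussians and all function names are our own and not part of the submitted system.

    import numpy as np
    from sklearn.mixture import GaussianMixture  # assumed library choice

    def train_posteriorgram_gmm(frames, n_gaussians=128):
        """Train a GMM with EM, then hard-assign each frame to its most likely
        Gaussian and refit that Gaussian's mean and variance on those frames only
        (the retraining step of Section 2.2). `frames` is (n_frames, n_dims)."""
        gmm = GaussianMixture(n_components=n_gaussians, covariance_type="diag")
        gmm.fit(frames)                      # standard EM-ML training
        labels = gmm.predict(frames)         # hard assignment to the most likely Gaussian
        for k in range(n_gaussians):
            assigned = frames[labels == k]
            if len(assigned) > 1:
                gmm.means_[k] = assigned.mean(axis=0)
                gmm.covariances_[k] = assigned.var(axis=0) + 1e-6
        # Refresh the cached precisions so posteriors use the retrained parameters.
        gmm.precisions_cholesky_ = 1.0 / np.sqrt(gmm.covariances_)
        return gmm

    def posteriorgram(gmm, frames):
        """Per-frame Gaussian posterior probabilities, normalized to sum to 1."""
        return gmm.predict_proba(frames)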
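The binarization of the posteriorgram and the score of Eq. (1) can be sketched in the same spirit (again, the NumPy-based implementation and the names are ours):

    import numpy as np

    def binarize(posteriors, keep_frac=0.20):
        """Set to 1 the 20% largest posterior values of each frame, 0 the rest."""
        n_frames, n_dims = posteriors.shape
        n_keep = max(1, int(round(n_dims * keep_frac)))
        top = np.argsort(posteriors, axis=1)[:, -n_keep:]  # indices of the largest posteriors
        binary = np.zeros_like(posteriors, dtype=bool)
        binary[np.arange(n_frames)[:, None], top] = True
        return binary

    def eq1_score(x, y):
        """Eq. (1): number of dimensions active in both frames over the number
        active in either frame (boolean AND count over OR count)."""
        return np.logical_and(x, y).sum() / np.logical_or(x, y).sum()

Stored as bit vectors, these features are compact, and Eq. (1) reduces to two population counts, which is what makes the binary representation fast to compare.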
2.3 Query Search Algorithm
Given two sequences of posterior probabilities, X and Y, obtained respectively from the query and from a given phone recording, we compare them using a DTW-like algorithm. The standard DTW algorithm returns the optimum alignment between two sequences by finding the optimum path between their start (0, 0) and end (x_end, y_end) points. In our case we constrain the query signal to match between its start and end, but we allow the phone recording to start its alignment at any position (0, y) and to finish whenever the dynamic programming algorithm reaches x = x_end. Although we do not set any global constraints, the local constraints are set so that at most 2-times or 1/2-times warping is allowed, by choosing the path that minimizes the cost to reach position (i, j) as

    cost(i, j) = d(i, j) + \min \begin{cases}
        D(i-2, j-1) / (\#(i-2, j-1) + 3) \\
        D(i-2, j-2) / (\#(i-2, j-2) + 4) \\
        D(i-1, j-2) / (\#(i-1, j-2) + 3)
    \end{cases}    (2)

where D(i, j) is the accumulated (non-normalized) distance of the optimum path up to position (i, j), d(i, j) is the local distance between frames x_i and y_j of the two compared sequences, and #(i, j) is the number of jumps of the optimum path up to that point. Note that when normalizing the different possible paths we slightly favor the diagonal match.

3. RESULTS AND DISCUSSION
Table 1 shows the official results obtained with our systems for the primary (posteriorgram features) and contrastive (binarized features) submissions. In all cases we report the Term Weighted Maximum Value (TWMV) instead of the actual value, as during development we did not place much emphasis on finding an optimum decision threshold for our system. Still, we observed that for any given threshold the results remain similar both in dev-dev and in eval-eval.

Table 1: Term Weighted Maximum Value for the submitted systems

    Dataset-termlist   Posteriorgrams   Binary features
    dev-dev            0.156            0.205
    dev-eval           0.019            0.022
    eval-dev           0.000            0.000
    eval-eval          0.173            0.222

In general, we find the results for dev-dev and eval-eval to be very acceptable. On the other hand, we were surprised to see that our system does not work nearly as well for the cross conditions. We have observed that channel mismatch might have played a major role in these results, as in several cases the development files contain recordings with a much poorer signal quality than the evaluation files. We consider that we have achieved reasonable speaker independence with our features, but we still need to apply ways to compensate for differences in the channel.

Comparing the two submissions, we observe that the binary features always outperform the standard posteriorgrams. In our view this is a very interesting finding that can be used in the near future to speed up spoken word search and automatic pattern discovery systems, and, together with the proposed novel way to compute the GMM model, can achieve fast and quite accurate results.

4. REFERENCES
[1] G. Aradilla, J. Vepa, and H. Bourlard. Using posterior-based features in template matching for speech recognition. In Proc. ICSLP, 2006.
[2] J.-F. Bonastre, X. Anguera, G. H. Sierra, and P.-M. Bousquet. Speaker modeling using local binary decisions. In Proc. Interspeech, 2011.
[3] A. Kumar, N. Rajput, D. Chakraborty, S. K. Agarwal, and A. A. Nanavati. WWTW: The World Wide Telecom Web. In Proc. NSDR 2007 (SIGCOMM workshop), Kyoto, Japan, August 2007.
[4] N. Rajput and F. Metze. Spoken web search. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[5] Y. Zhang and J. Glass. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In Proc. ASRU, pages 398-403, Merano, Italy, December 2009.