The L2F Spoken Web Search system for Mediaeval 2013

Alberto Abad
INESC-ID Lisboa / Instituto Superior Técnico
alberto@l2f.inesc-id.pt

Ramón F. Astudillo
INESC-ID Lisboa
ramon@l2f.inesc-id.pt

Isabel Trancoso
INESC-ID Lisboa / Instituto Superior Técnico
imt@l2f.inesc-id.pt

ABSTRACT
The primary system developed by the INESC-ID's Spoken Language Systems Laboratory (L2F) for the Spoken Web Search task of the Mediaeval 2013 evaluation campaign consists of the fusion of six individual sub-systems exploiting three different language-dependent phonetic classifiers. For each phonetic classifier, an acoustic keyword spotting (AKWS) sub-system based on connectionist speech recognition and a dynamic time warping (DTW) based sub-system have been developed. The diversity in terms of phonetic classifiers and methods, together with the efficient fusion and calibration approach applied for heterogeneous sub-systems, are the key elements of the L2F submission. Besides the primary submission, two additional systems based on the fusion of only the AKWS and only the DTW sub-systems have been developed for comparison purposes. A final multi-site system formed by the fusion of the L2F and the GTTS primary submissions has also been submitted to explore the potential of the fusion approach for very heterogeneous systems.

Copyright is held by the author/owner(s).
Mediaeval 2013 Workshop, October 18-19 2013, Barcelona, Spain

1. INTRODUCTION
This document introduces the Spoken Web Search systems developed by the INESC-ID's Spoken Language Systems Laboratory (L2F) for the Mediaeval 2013 campaign. The targeted task in this challenge is query-by-example spoken term detection. Detailed information about the task and the data used can be found in the evaluation plan [5]. One primary and three contrastive systems (one of them in collaboration with another participating group) have been submitted. The primary system consists of the fusion of six individual sub-systems. The proposed systems present three main novelties with respect to the systems developed for the previous year's evaluation campaign [1]: 1) the number of language-dependent phonetic networks has been limited to three; 2) DTW-based sub-systems exploiting log-posterior features have been incorporated; and 3) a recently proposed method for discriminative calibration and fusion of heterogeneous spoken term detection systems [4] has been applied.

2. THE L2F SWS SYSTEM DESCRIPTION
Six sub-systems form the core of the L2F SWS system, exploiting three different language-dependent phonetic networks trained for European Portuguese (pt), Brazilian Portuguese (br) and European Spanish (es). The phonetic networks are used either as acoustic models in acoustic KWS based on hybrid connectionist methods or as a feature extraction component for DTW based term detection.

2.1 Phonetic network classifiers
L2F systems exploit multi-layer perceptron (MLP) networks that are part of our in-house hybrid connectionist ASR system. The phonetic class posterior probabilities are in fact the result of the combination of four MLP outputs trained with Perceptual Linear Prediction features (PLP, 13 static + first derivative), PLP with log-RelAtive SpecTrAl speech processing features (PLP-RASTA, 13 static + first derivative), Modulation SpectroGram features (MSG, 28 static) and Advanced Front-End from ETSI features (ETSI, 13 static + first and second derivatives). The language-dependent MLP networks were trained using different amounts of annotated data [2]. Each MLP network is characterized by the size of its input layer, which depends on the particular parametrization and the frame context size (13 for PLP, PLP-RASTA and ETSI; 15 for MSG), the number of units of the two hidden layers (500), and the size of the output layer. In this case, only monophone units are modelled, resulting in MLP networks of 39 soft-max outputs for pt (38 phonemes + 1 silence), 40 for br (39 phonemes + 1 silence) and 30 for es (29 phonemes + 1 silence).

2.2 Acoustic KWS systems
AKWS sub-systems exploit the phonetic networks as acoustic models for both phonetic tokenization and query search based on hybrid ANN/HMM approaches for ASR [6]. The decoder used is based on a weighted finite-state transducer (WFST) approach to large vocabulary speech recognition. First, the phonetic transcription of each spoken query is obtained for every sub-system using a phone-loop grammar. Simple 1-best phoneme chain output has been used. Then, search is carried out with a sliding window of 5 seconds (2.5 seconds time shift) using an equally-likely 1-gram language model formed by the target query and a competing speech background model. On the one hand, keyword/query models are described by the sequence of phonetic units obtained in the tokenization. On the other hand, the likelihood of a background speech unit representing "general speech" is estimated based on the other phonetic classes [3]. The output score for each candidate detection is computed as the average of the phonetic log-likelihood ratios that form the detected query term. More details can be found in [1].

2.3 Dynamic Time Warping systems
DTW sub-systems use the language-dependent phonetic networks to extract log-posterior features. The silence class of the phonetic network is also used for voice activity detection. To this end, the segments identified as silence at the beginning and end of each query and document are removed. For each query-document pair, N Euclidean distance based DTWs are run on N starting candidate positions of the document. To select the candidate positions, the query-document Euclidean distance matrix of the DTW is used. The minimum of each column of the matrix represents the minimum distance among all query feature vectors to a given document feature vector. The average of these minima on a sliding window of query size is used as an approximation of DTW without the warping constraints, from which the best N candidates are selected. The number of candidates N was made equal to the length of the document in feature vectors divided by 100, with a minimum of 100 candidates. In a second stage, DTWs of the size of the query are evaluated at each one of the N candidate positions, and the three candidates with the lowest normalized cumulative distance, separated by at least 0.5 seconds, are kept. The reduction of the search space to N candidates as explained above reduced the search time by a factor of around 5, while having a minimal impact on the performance. It should be noted that the DTW, including the distance matrix, was computed using the R programming language, while the candidate selection and remaining tasks were implemented in Python¹. This framework benefited particularly from the candidate selection scheme proposed.

¹ https://www.l2f.inesc-id.pt/wiki/index.php/DTW

2.4 Discriminative calibration and fusion
The combination of systems is based on a recently proposed method for discriminative calibration/fusion of heterogeneous spoken term detection (STD) systems [4]². Under this approach, missing scores for systems that do not detect a given candidate are hypothesized based on heuristics. In this way, the original problem of several unaligned detection candidates is converted into a verification task. As for other verification tasks, system weights and offsets are then estimated through linear logistic regression. As a result, the combined scores are well calibrated, and the detection threshold is automatically given by application parameters (priors and costs). The method permits easy integration with majority voting schemes and it is convenient if scores from heterogeneous systems are in the same ranges (we apply a per-query zero-mean and unit-variance normalization, q-norm [1]). Moreover, the maximum number of detection candidates for a certain query provided by any sub-system was limited to 200 before score normalization and fusion.

² https://www.l2f.inesc-id.pt/wiki/index.php/STDfusion

3. SUBMITTED SYSTEMS AND RESULTS
One primary and two contrastive "on-time" systems were submitted. The primary system consists of the fusion of the six sub-systems previously described, while the contrastive1 and contrastive2 submissions correspond to the fusion of only the DTW and only the AKWS sub-systems, respectively. Additionally, a "late" contrastive3 system based on the fusion of the primary systems of the L2F and GTTS [8] teams was also submitted. All the submitted systems are expected to generate well-calibrated log-likelihood ratios, such that the theoretical minimum expected cost Bayes threshold can be used (θBayes = log β, see [4] for more details).

Table 1: L2F SWS2013 performance scores

                    dev                 eval
System          mtwv     atwv       mtwv     atwv
primary         0.3905   0.3883     0.3420   0.3376
contrastive1    0.3205   0.3071     0.2515   0.2364
contrastive2    0.2753   0.2743     0.2463   0.2459
contrastive3    0.4865   0.4850     0.4658   0.4639

Table 1 shows the actual and maximum TWV official scores obtained by the L2F SWS systems for the two query sets: dev and eval. Notice that the theoretical Bayes threshold has been used in both dev and eval experiments. It is worth noticing the remarkable performance improvements when very heterogeneous systems (from different sites) are combined, as in the case of the contrastive3 system.

Regarding the amount of processing resources, we have used a cluster of machines with 90 nodes. The estimated cost figures [7] are pessimistic since the cluster was not exclusively used for the challenge. For each AKWS sub-system, the indexing speed factor (ISF), searching speed factor (SSF), maximum memory indexing (MMI) and maximum memory searching (MMS) values are 0.75, 77.33, 0.17 GBytes and 0.073 GBytes, respectively. For the DTW based sub-systems, the ISF, SSF, MMI and MMS are 0.17, 193.34, 0.18 GBytes and 0.43 GBytes, respectively. Considering these values, the total processing load (PL) is 239.76, that is, three times the sum of the PL of one AKWS sub-system (5.09) and one DTW sub-system (74.83).

4. ACKNOWLEDGEMENTS
This work was partially funded by the DIRHA European project (FP7-ICT-2011-7-288121) and the Portuguese Foundation for Science and Technology (FCT), through the projects PEst-OE/EEI/LA0021/2013 and PTDC/EIA-CCO/122542/2010, and the grant number SFRH/BPD/68428/2010.

5. REFERENCES
[1] A. Abad and R. F. Astudillo. The L2F Spoken Web Search system for Mediaeval 2012. In MediaEval 2012 Workshop, Pisa, Italy, October 4-5 2012.
[2] A. Abad, J. Luque, and I. Trancoso. Parallel Transformation Network features for Speaker Recognition. In ICASSP, May 2011.
[3] A. Abad, A. Pompili, A. Costa, and I. Trancoso. Automatic word naming recognition for treatment and assessment of aphasia. In Interspeech 2012, Sep 2012.
[4] A. Abad, L. J. Rodriguez Fuentes, M. Penagarikano, A. Varona, M. Diez, and G. Bordel. On the Calibration and Fusion of Heterogeneous Spoken Term Detection Systems. In Interspeech 2013, August 25-29 2013.
[5] X. Anguera, F. Metze, A. Buso, I. Szoke, and L. J. Rodriguez-Fuentes. The Spoken Web Search Task. In MediaEval 2013 Workshop, October 18-19 2013.
[6] N. Morgan and H. Bourlard. An introduction to hybrid HMM/connectionist continuous speech recognition. IEEE Signal Processing Magazine, 12(3):25–42, 1995.
[7] L. Rodriguez-Fuentes and M. Penagarikano. MediaEval 2013 Spoken Web Search Task: System Performance Measures. Technical report, 2013.
[8] L. J. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, and M. Diez. GTTS Systems for the SWS Task at MediaEval 2013. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19 2013.
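
Illustrative sketch of the candidate pre-selection (Section 2.3). The first stage of the DTW sub-systems (column minima of the query-document distance matrix, averaged over a window of query length, keeping the N best start positions) could look roughly like the following. This is a minimal sketch under stated assumptions and not the authors' code: the paper reports computing the distance matrix in R, whereas NumPy is used here, and the names select_candidates, min_candidates and frames_per_candidate are hypothetical. The second-stage DTW on each candidate is not shown.

```python
import numpy as np


def select_candidates(query, document, min_candidates=100, frames_per_candidate=100):
    """Rank document start frames by an approximate, warping-free DTW cost.

    query:    array of shape (Tq, D), e.g. log-posterior features
    document: array of shape (Td, D)
    Returns (candidate start frames sorted by increasing cost, cost per start frame).
    """
    tq, td = len(query), len(document)
    if tq == 0 or tq > td:
        return np.array([], dtype=int), np.array([])

    # Euclidean distance matrix of shape (Tq, Td).
    # Note: the broadcast below is memory hungry for long documents.
    diff = query[:, None, :] - document[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Minimum over query frames for every document frame (column minima).
    col_min = dist.min(axis=0)

    # Average of the column minima over a sliding window of query length,
    # computed with a cumulative sum; one value per possible start frame.
    csum = np.concatenate(([0.0], np.cumsum(col_min)))
    window_avg = (csum[tq:] - csum[:-tq]) / tq

    # N = document length / 100, with a minimum of 100 candidates
    # (capped by the number of available start positions).
    n = max(min_candidates, td // frames_per_candidate)
    n = min(n, len(window_avg))

    candidates = np.argsort(window_avg, kind="stable")[:n]
    return candidates, window_avg
```

When the query is an exact excerpt of the document, the excerpt's start frame has zero cost and is ranked first; in practice the ranking only narrows the search, and the full DTW of the second stage decides which detections are kept.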
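
Illustrative sketch of q-norm and the Bayes threshold (Sections 2.4 and 3). The per-query normalization applied to sub-system scores before fusion, and the decision rule applied to the fused, calibrated log-likelihood ratios, can be sketched as below. Several points are assumptions rather than statements from the paper: the population standard deviation is used for the variance normalization, β is taken as the usual cost-and-prior ratio C_fa (1 - P_target) / (C_miss P_target), and the function names are hypothetical. The logistic-regression fusion itself is described in [4] and is not reproduced here.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def q_norm(detections):
    """Per-query zero-mean, unit-variance normalization of raw scores.

    detections: list of (query_id, score) pairs from one sub-system.
    Returns the same pairs, in the same order, with normalized scores.
    """
    by_query = defaultdict(list)
    for query_id, score in detections:
        by_query[query_id].append(score)

    stats = {}
    for query_id, scores in by_query.items():
        sigma = pstdev(scores)  # assumption: population standard deviation
        # Guard against a single detection or identical scores.
        stats[query_id] = (mean(scores), sigma if sigma > 0.0 else 1.0)

    return [(q, (s - stats[q][0]) / stats[q][1]) for q, s in detections]


def bayes_threshold(p_target, c_miss, c_fa):
    """Minimum expected cost threshold on a log-likelihood ratio: log(beta)."""
    beta = (c_fa * (1.0 - p_target)) / (c_miss * p_target)  # assumed definition
    return math.log(beta)


def accept(llr, p_target, c_miss, c_fa):
    """Accept a detection when its calibrated LLR exceeds the Bayes threshold."""
    return llr > bayes_threshold(p_target, c_miss, c_fa)
```

The point of the design is that, once the fused scores behave as log-likelihood ratios, the threshold follows from the application parameters alone and does not need to be tuned on development data.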