Irisa MediaEval 2011 Spoken Web Search System

Armando Muscariello (Irisa/Inria, Rennes, France), amuscari@irisa.fr
Guillaume Gravier (Irisa/CNRS, Rennes, France), ggravier@irisa.fr

ABSTRACT
These working notes describe the main aspects of the IRISA submission for the Spoken Web Search task at the MediaEval 2011 campaign. We test a language-independent, audio-only system based on a combination of template matching techniques. A brief overview of the main components of the architecture is followed by a report on the evaluation on the development and test data provided by the organizers.

Categories and Subject Descriptors
H.3.3 [Information Systems Applications]: Spoken Term Detection—zero-resource speech processing, template matching, posteriorgrams

1. MOTIVATION
In [1] we recently proposed a zero-resource, audio-only system for spoken term detection (STD), i.e. a system for performing keyword spotting at the acoustic level, in the absence of any language- or domain-specific knowledge, training speech data or models. The main motivation behind our participation in the campaign is the opportunity to benchmark the system on a different, more challenging data set [2], and to learn of alternative solutions and their respective performance.

[Figure 1: Example of combined use of DTW and SSM-based comparisons for similarity scoring of templates. Panels: distance matrix (query vs. utterance) with the DTW best path; SSM of the query; SSM of the matching subsegment in the utterance; SSM comparison.]

2. ARCHITECTURE OF THE SYSTEM
The STD system relies on two main components: the acoustic features that represent queries and utterances, and the pattern matching techniques that identify occurrences of the queries within the utterances and provide the respective measure of (dis)similarity.

2.1 Acoustic features
We have experimented with different types of speech parametrization, namely MFCC features and several types of posteriorgrams: 1) posteriors estimated from a Gaussian mixture model (GMM) trained in an unsupervised fashion on the development data provided [2], and 2) posteriors output by the language-specific BUT phoneme recognizers [3], independently trained on Czech, Hungarian and Russian 8 kHz telephone data. We have used the Euclidean distance to compute the pairwise distance between MFCC frames, and −log(p · q) as a distance-like measure of closeness between two posterior vectors p and q.

2.2 Pattern matching combination
The search for an occurrence of the query within the utterance is performed directly on the feature sequences by a cascade of two different pattern matching techniques. A segmental variant of DTW, named segmental locally-normalized dynamic time warping (SLNDTW), is responsible for selecting the subsegment of the utterance most similar to the searched query, according to a DTW score D_DTW. This score can be used directly to decide upon the similarity of the two segments, or refined by additional scores. In our system, the two candidate keyword occurrences are further subjected to a comparison of their respective self-similarity matrices (SSMs), and the two SSM scores, D'_SSM and D''_SSM, resulting from this comparison are then combined with D_DTW to obtain a unique dissimilarity score S (see figure 1). The global score S is computed as

    S = α_DTW · D_DTW/th_DTW + α'_SSM · D'_SSM/th'_SSM + α''_SSM · D''_SSM/th''_SSM    (1)

so that S < 1 implies the detection of a match.
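For concreteness, the following is a minimal Python sketch of the two frame-level distances of section 2.1 and the score fusion of equation (1). The function names and the eps guard against log(0) are illustrative choices of ours; the weights follow section 3, and the thresholds are assumed to be supplied by the external tuning described there and in [1].

```python
import numpy as np

def mfcc_frame_distance(x, y):
    # Euclidean distance between two MFCC feature frames (section 2.1)
    return np.linalg.norm(x - y)

def posterior_frame_distance(p, q, eps=1e-12):
    # -log(p . q) distance-like measure between two posterior vectors;
    # eps is an illustrative guard against log(0) for disjoint supports
    return -np.log(np.dot(p, q) + eps)

def global_score(d_dtw, d_ssm1, d_ssm2,
                 th_dtw, th_ssm1, th_ssm2,
                 a_dtw=0.50, a_ssm1=0.20, a_ssm2=0.30):
    # Dissimilarity score S of equation (1): each component score is
    # normalized by its decision threshold, then linearly weighted
    return (a_dtw * d_dtw / th_dtw
            + a_ssm1 * d_ssm1 / th_ssm1
            + a_ssm2 * d_ssm2 / th_ssm2)

def is_match(s):
    # A candidate segment is declared a match when S < 1
    return s < 1.0
```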
3. SYSTEM TUNING
The data set described in [2] is particularly challenging for such a system: it is of 8 kHz telephone quality, presents portions of silence in the queries, and exhibits a large pronunciation variability due to non-native English speakers. We preliminarily removed silences from both the development and evaluation queries with a speech detector. The thresholds th_DTW, th'_SSM, th''_SSM were tuned on word samples from a different data set (see [1]), and the pattern matching weights were set to α_DTW = 0.50, α'_SSM = 0.20, α''_SSM = 0.30, following [1]. Despite the availability of the ground truth for the development data set, reliable tuning of the thresholds on this data was not successful, as many true hits exhibit a dissimilarity score higher than that of false alarms. This highlights the poor discriminative properties of the employed features on this task.

The results for the different features are shown in table 1 for the system jointly employing the DTW and SSM-based comparisons, using the following metrics: P(FA), the average false alarm rate; P(Mis), the average false rejection rate; the average weighted term value AWTV (the primary performance indicator); and the mean average precision MAP. The posterior features estimated by the BUT recognizers are the least performing according to the AWTV, as their P(FA), weighted by a factor β = 1000, is greater by orders of magnitude than the P(FA) for the MFCC and GMM features. Gaussian posteriorgrams yield the highest MAP value among the features tested, although a very disappointing one compared to the values reported for this same system and features in the evaluation conducted in [1]. While yielding the highest miss detection rate P(Mis), the raw MFCC features report the best AWTV, as no false alarms were collected. According to this metric, the MFCC-based system was selected as the primary one.

Table 1: Development queries on development utterances: results for the DTW+SSM system.

            MFCC     GMM      HU       CZ       RU
  P(FA)     0        0.0003   0.02     0.02     0.016
  P(Mis)    0.82     0.77     0.66     0.66     0.70
  AWTV      0.18     -0.10    -19.9    -20.7    -15.9
  MAP (%)   0.26     6.61     1.10     0.82     0.72

4. RESULTS ON EVALUATION DATA
The results of the evaluation of the system on the test data are summarized in table 2 for the primary runs and in table 3 for the secondary runs, where Gaussian posteriorgrams were used. Not surprisingly, the figures substantially reflect the poor results of the experiments on the development data set. The system operates in a completely unsupervised fashion and knowledge of the performance on the development data is not exploited in any way, and therefore bears no impact on the results: indeed, the only parameters that needed tuning were estimated on a different data set. It is worth noting that searching for the evaluation queries in the evaluation utterances performs better than conducting cross-dataset spoken term detection, which is likely due to the limited variability among patterns extracted from the same set.

Table 2: Evaluation runs: primary system (DTW+SSM).

            DEV-EVAL   EVAL-EVAL   EVAL-DEV
  P(FA)     0.0003     0.00007     0.00006
  P(Mis)    0.999      0.831       0.962
  AWTV      -0.29      0.10        -0.022

Table 3: Evaluation runs: secondary system (DTW+SSM).

            DEV-EVAL   EVAL-EVAL   EVAL-DEV
  P(FA)     0.00019    0.00013     0.00017
  P(Mis)    0.97       0.788       0.97
  AWTV      -0.17      -0.10       -0.14
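As a rough sanity check of the figures above, the sketch below assumes the standard term-weighted value form suggested by the β weighting mentioned in section 3, AWTV = 1 − P(Mis) − β · P(FA) with β = 1000; the official scoring averages per term rather than over global rates, so this reconstruction is only approximate.

```python
def awtv(p_mis, p_fa, beta=1000.0):
    # Assumed average weighted term value: misses and (heavily
    # weighted) false alarms are both penalized; 1.0 is a perfect score
    return 1.0 - p_mis - beta * p_fa

# Table 1, MFCC column: no false alarms, P(Mis) = 0.82
print(awtv(0.82, 0.0))   # 0.18, matching the reported AWTV
# Table 1, HU column: beta makes P(FA) = 0.02 dominate the score
print(awtv(0.66, 0.02))  # -19.66, close to the reported -19.9
```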
5. CONCLUSION
The IRISA architecture for spoken term detection, presented in [1], was evaluated on the data set provided by the MediaEval 2011 Spoken Web Search task. This data set has proven extremely challenging for the system in its current form, yielding poor results for all types of acoustic features employed. For this particular data set, given the presence of many English keywords, training a phone recognizer based on English phone models would likely have improved performance, although our team did not have such training data at its disposal (indeed, this is one of the reasons why pursuing research on zero-resource systems would benefit the community). One possible idea is to combine posteriors from different recognizers to increase robustness to multiple languages, although in this specific case the results for the Hungarian-, Czech- and Russian-based posteriorgrams were bad enough to prevent any satisfying application of this solution. Also, the Gaussian posteriors were only estimated from models trained on the development utterances; performance could have been improved, at least slightly, by training the GMM on the combined development-evaluation data set, in particular for the cross-data detection that yielded the poorest results.

6. REFERENCES
[1] A. Muscariello, G. Gravier, and F. Bimbot. Zero-resource audio-only spoken term detection based on a combination of template matching techniques. In Interspeech, 2011.
[2] N. Rajput and F. Metze. Spoken web search. In MediaEval Workshop, 2011.
[3] P. Schwarz, P. Matějka, and J. Černocký. Towards lower error rates in phoneme recognition. In International Conference on Text, Speech and Dialogue, 2004.