=Paper=
{{Paper
|id=None
|storemode=property
|title=BUT-HCTLab approaches for Spoken Web Search - MediaEval 2011
|pdfUrl=https://ceur-ws.org/Vol-807/Szoeke_BUT_SWS_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SzokeTFC11
}}
==BUT-HCTLab approaches for Spoken Web Search - MediaEval 2011==
Igor Szöke (Speech@FIT, Brno University of Technology, Czech Republic), szoke@fit.vutbr.cz
Javier Tejedor (HCTLab, Universidad Autónoma de Madrid, Spain), javier.tejedor@uam.es
Michal Fapšo (Speech@FIT, Brno University of Technology, Czech Republic), ifapso@fit.vutbr.cz
José Colás (HCTLab, Universidad Autónoma de Madrid, Spain), jose.colas@uam.es

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

ABSTRACT

We present the three approaches submitted to the Spoken Web Search task. Two of them rely on Acoustic Keyword Spotting (AKWS) while the other relies on Dynamic Time Warping. Features are 3-state phone posteriors. Results suggest that applying a Karhunen-Loeve transform to the log-phone posteriors representing the query to build a GMM/HMM for each query, with a subsequent AKWS system, performs the best.

Categories and Subject Descriptors

H.3.3 [Information systems]: Information Storage and Retrieval, Information Search and Retrieval, Search process

General Terms

Experimentation

Keywords

query-by-example spoken term detection, acoustic keyword spotting, dynamic time warping, spoken web search

1. MOTIVATION

The Spoken Web Search (SWS) task aims at building a language-independent query-by-example spoken term detection system without any knowledge of the target language and without query transcriptions. In so doing, our approaches are based on the combination of as many language-dependent "recognizers" as possible.¹

¹ Part of this work was done while JT was a visiting researcher at Speech@FIT, BUT.

2. FEATURE EXTRACTION

Our feature extractor outputs 3-state phone posteriors as features [2]. The phone posterior estimator [5] contains a Neural Network (NN) classifier with a hierarchical structure called a bottle-neck universal context network. It consists of a context network, trained as a 5-layer NN, and a merger which employs 5 context net outputs. The relevant parameters are in Table 1 and more details are in [5].

Table 1: Language parameter specifications: number of phones, |UC| (the size of the hidden layer in the universal context NN), |Mer| (the size of the hidden layers in the merger with 3-state phone posterior output), and |Out| (the size of the 3-state phone posterior output layer).

 Language    Phones   |UC|   |Mer|   |Out|
 Czech         37     1373    495     114
 English       44     1298    488     135
 Hungarian     64     1128    470     193
 Levantine     32     1432    500      99
 Polish        33     1408    498     105
 Russian       49     1228    481     157
 Slovak        41     1305    489     133

3. APPROACHES

3.1 Parallel Acoustic Keyword Spotting (PAKWS)

We combined the decisions from 6 language-dependent Acoustic Keyword Spotters (AKWS) (all the languages in Table 1 except Polish, due to its worse performance on the dev data). One AKWS consists of two steps: query recognition done by a phone recognizer and query detection done by AKWS; they differ only in the decoder. Features (3-state phone posteriors) extracted from the audio are fed into a phone decoder – an unrestricted phone loop without any phone insertion penalty. AKWS filler model-based recognition networks [4] are built according to the detected phone string for each query. The filler/background models are represented by a phone loop. Each phone model is represented by a 3-state HMM tied to the 3-state phone posteriors. The output of the AKWS is a set of putative hits. The score is the logarithm of the likelihood ratio normalized by the length of the detection. These detections are converted into a matrix for each utterance, whose size is #queries × #frames. Next, the matrices for all 6 languages are "log-added". Finally, the combined matrix is converted back to a list of detections, and a detection on which all 6 detectors agree has a higher score.
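The matrix combination step can be illustrated with a minimal Python sketch. This is only an illustration under our own assumptions: the use of NumPy, the function and array names, and the thresholded read-out of detections from the combined matrix are not specified in the paper.

<pre>
import numpy as np

def log_add(a, b):
    """Element-wise, numerically stable log(exp(a) + exp(b))."""
    m = np.maximum(a, b)
    return m + np.log1p(np.exp(-np.abs(a - b)))

def combine_languages(score_matrices):
    """Combine per-language score matrices of shape (#queries, #frames) by
    log-adding them, so hits supported by all 6 languages score higher."""
    combined = score_matrices[0]
    for m in score_matrices[1:]:
        combined = log_add(combined, m)
    return combined

def matrix_to_detections(combined, threshold=0.0):
    """Hypothetical read-out: turn matrix cells above a threshold back into
    (query index, frame index, score) detections."""
    q_idx, f_idx = np.nonzero(combined > threshold)
    return [(int(q), int(f), float(combined[q, f])) for q, f in zip(q_idx, f_idx)]
</pre>

The log-add keeps the combination in the log-likelihood-ratio domain, so a detection supported by several language-dependent detectors accumulates a higher combined score than one seen by a single detector, which is the behaviour described above.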
3.2 GMM/HMM term modeling

Inspired by the previous approach, this one relies on a single AKWS for query detection, with these differences: (1) the background model is a GMM/HMM with 1 state modeled with 10 GMM components, (2) the query model is represented by a GMM/HMM whose number of states is 3 times the number of phones given by the phone recognition, with 1 GMM component per state, and (3) all the languages in Table 1 have been employed to produce the final feature super-vector. Queries represented by a single phone have been modeled with 6 states, as if the query contained 2 phones. We used the number of phones output by the Slovak recognizer due to its best performance in terms of the Upper-bound Term Weighted Value (UBTWV) metric [3]. The features used for background and query modeling were obtained as follows: (1) a Karhunen-Loeve transform (KLT) is applied to the log-phone posteriors obtained from the feature extractor for each individual language, (2) we keep the features that explain up to 95% of the variance after KLT for each individual language, and (3) we build a 152-dimensional feature super-vector from them. The KLT statistics have been computed from the dev data and then applied to both the dev/eval queries and the dev/eval data.

3.3 Dynamic Time Warping (DTW)

A similarity matrix built from phonetic posteriorgrams [5] stores the similarity between each query frame and each utterance frame, with the cosine distance as the similarity function, and a DTW search hypothesises putative hits. DTW is run iteratively, starting at every frame of the utterance and ending at a frame of the utterance [5]. The features that represent the phonetic posteriorgrams are the concatenation of the 3-state phone posteriors corresponding to every language in Table 1.

4. FILTERING AND CALIBRATION

To deal with score calibration and with problematic query lengths under certain approaches, detections were post-processed in the following steps:

1] "Filtering" detections according to their length difference from the "average length". The average length of a query is calculated as the average length of speech (phones) across the 6 phone recognizers used in the PAKWS approach. It was applied to all the approaches except the DTW, as follows:

 ScF(det) = Sc(det) − (L^Q_min − L(det)) / L^Q_min   if L(det) < L^Q_min,
 ScF(det) = Sc(det) − (L(det) − L^Q_max) / L^Q_max   if L(det) > L^Q_max,      (1)
 ScF(det) = Sc(det)                                   otherwise,

where Q identifies the query to which the detection belongs, Sc(det) is the original score, ScF(det) is the "filtered" score, L(det) is the length of the detection in frames, L^Q_min = 0.8 L^Q_aver is 80% of the average query length, and L^Q_max = 1.4 L^Q_aver is 140% of the average query length. The detection score remains the same if the detection length is longer than 80% and shorter than 140% of the average query length. Otherwise, the score is lowered the shorter/longer the detection is with respect to the original query.

2] Calibration, applied only in our PAKWS approach, produces the final score of each detection as follows:

 ScC(det) = ScF(det) + A1 + A2 * Occ(Q),      (2)

where Occ(Q) is the number of detection occurrences of the query in the data, and A1 = −1.0807 and A2 = −0.0001 are calibration parameters. These were estimated from the best thresholds (UBTWV) on the dev data using linear regression.
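To make the post-processing concrete, here is a minimal Python sketch of the two steps. The function and argument names are ours; the 0.8/1.4 length factors and the calibration constants A1 and A2 are taken from Equations (1) and (2).

<pre>
def filter_score(score, det_len, avg_len):
    """Length filtering, Eq. (1): penalize detections much shorter or much
    longer than the average query length (all lengths in frames)."""
    l_min = 0.8 * avg_len  # 80% of the average query length
    l_max = 1.4 * avg_len  # 140% of the average query length
    if det_len < l_min:
        return score - (l_min - det_len) / l_min
    if det_len > l_max:
        return score - (det_len - l_max) / l_max
    return score

def calibrate_score(filtered_score, occurrences, a1=-1.0807, a2=-0.0001):
    """Score calibration, Eq. (2), applied only in the PAKWS approach;
    occurrences is Occ(Q), the number of detections of the query in the data."""
    return filtered_score + a1 + a2 * occurrences
</pre>

Applying filter_score before calibrate_score mirrors the order described above: length filtering for all approaches except DTW, then calibration for the PAKWS approach only.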
5. RESULTS AND DISCUSSION

Results for the required runs [1] are given in Table 2. The PAKWS approach has two versions (with and without score calibration). We clearly see that the GMM/HMM term modeling approach outperforms the other two to a great extent for unseen queries/data, even with score calibration applied to the PAKWS approach. We consider this is due to: (1) the KLT statistics, computed from the dev data and applied to queries and data, play the role of "adaptation" towards the target domain, which differs from the domain used to train the phone estimators from which the 3-state phone posteriors are computed; (2) the use of a single example to train the query model is more robust against uncertainties than the set of features itself (used in the DTW approach) and the phone transcription obtained from phone decoding (used in the PAKWS approach); (3) the GMM/HMM approach relies on a prior combination of the most relevant features after KLT, as opposed to the PAKWS approach, which is based on a posterior combination of the detections obtained from each individual AKWS system; and (4) comparing the MTWV (pooled) and UBTWV (non-pooled) values for PAKWS (Qdev-Ddev: 0.133 and 0.253, Qeval-Ddev: 0.002 and 0.056, Qdev-Deval: 0.030 and 0.157, and Qeval-Deval: 0.033 and 0.223, respectively) suggests that the PAKWS system is the most sensitive to data mismatch. The GMM/HMM-based term modeling approach is less sensitive to data mismatch, with the following values: Qdev-Ddev: 0.103 and 0.238, Qeval-Ddev: 0.019 and 0.035, Qdev-Deval: 0.010 and 0.179, and Qeval-Deval: 0.131 and 0.267, respectively. For the DTW, a similar pattern to that of the GMM/HMM is observed: Qdev-Ddev: 0.020 and 0.106, Qeval-Ddev: 0 and 0.011, Qdev-Deval: 0 and 0.099, and Qeval-Deval: 0.014 and 0.055.

Table 2: ATWV results for the approaches. "PAKWS-cal" denotes the PAKWS approach with score calibration and "PAKWS-nocal" the PAKWS approach without score calibration. "Qx-Dy" denotes the set of "x" queries searched on the set of "y" data.

 Approach      Qdev-Ddev   Qeval-Ddev   Qdev-Deval   Qeval-Deval
 PAKWS-cal       0.133      −0.221       −1.141       −0.307
 PAKWS-nocal     0.093      −0.359        0.009       −0.110
 GMM/HMM         0.103       0.008        0.024        0.101
 DTW             0.020      −0.005       −0.032       −0.115

6. CONCLUSIONS

Our GMM/HMM-based term modeling approach achieves the best performance, whereas the other two, PAKWS and DTW, fail due to the unreliable phone transcription derived in the former and the "meaningless" phone posteriors used by themselves in the latter when facing the language-independence issue. Future work will investigate new features to enhance the performance of the best approach.

7. REFERENCES

[1] N. Rajput and F. Metze. Spoken web search. In MediaEval 2011 Workshop, Pisa, Italy, 2011.
[2] P. Schwarz et al. Towards lower error rates in phoneme recognition. In Proc. of TSD, pages 465–472, 2004.
[3] I. Szöke. Hybrid word-subword spoken term detection. PhD thesis, 2010.
[4] I. Szöke et al. Phoneme based acoustics keyword spotting in informal continuous speech. LNAI, 3658:302–309, 2005.
[5] J. Tejedor et al. Novel methods for query selection and query combination in query-by-example spoken term detection. In Proc. of SSCS, pages 15–20, Florence, Italy, 2010.