=Paper=
{{Paper
|id=Vol-1263/paper73
|storemode=property
|title=CUHK System for QUESST Task of MediaEval 2014
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_73.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/WangL14
}}
==CUHK System for QUESST Task of MediaEval 2014==
Haipeng Wang, Tan Lee
DSP-STL, Dept. of EE, The Chinese University of Hong Kong
{hpwang,tanlee}@ee.cuhk.edu.hk

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT

This paper describes a spoken keyword search system developed at the Chinese University of Hong Kong (CUHK) for the query by example search on speech (QUESST) task of MediaEval 2014. The system utilizes posterior features and dynamic time warping (DTW) for keyword matching. Multiple types of posterior features are generated with different tokenizers, and then fused by a linear combination of the DTW distance matrices. The main contribution of this year's system is a multiview segment clustering (MSC) approach for unsupervised ASM tokenizer construction. The Cnxe and ATWV of our submitted results on the Evaluation set are 0.682 and 0.412, respectively.

1. INTRODUCTION

The query by example search on speech (QUESST) task aims at detecting keyword occurrences in an unlabeled speech collection using spoken queries in a language-independent fashion. In this year's QUESST dataset, the speech collection involves about 23 hours of speech data from 6 languages, and the query set includes 560 development queries and 555 evaluation queries. The average duration of the queries is about 0.9 second after voice activity detection (VAD). More details about the task can be found in [2].

Our system was designed only for type 1 query matching. It followed the posteriorgram-based template matching framework [3], in which speech tokenizers are used to generate posteriorgrams, and DTW is applied for keyword detection. The tokenizers were either built from the searching speech collection given in the task, or developed from resource-rich languages. In order to exploit the complementary information of multiple tokenizers, the DTW matrix combination method [7] was used. Raw DTW detection scores were then normalized to zero mean and unit variance. On the evaluation set, the Cnxe and ATWV of our submission are 0.682 and 0.412. If only type 1 query matching is considered, the Cnxe and ATWV are 0.526 and 0.611.

2. SYSTEM DESCRIPTION

2.1 System Overview

In this year's evaluation, our system employs a similar framework as our previous system for the spoken web search task in 2012 [5]. The system involves seven tokenizers, including a GMM tokenizer, five phoneme recognizers, and an ASM tokenizer [8]. Using these tokenizers, the query examples and test utterances are converted into frame-level posteriorgrams. Different tokenizers may use different algorithms to generate posteriorgrams. Let Q_i denote the query posteriorgram generated by the ith tokenizer, and let T_i denote the corresponding test posteriorgram. The distance matrix D_i was computed as the inner product [3],

    D_i = −log(Q_i^T × T_i),  i = 1, 2, ..., 7.    (1)

To exploit the complementary information from different tokenizers, the distance matrices were combined linearly to give a new distance matrix D,

    D = Σ_{i=1}^{7} w_i D_i,    (2)

where w_i denotes the weighting coefficient for D_i and was simply set to 1/7.

Subsequently, DTW detection was applied to the combined distance matrix D to locate the top matching regions. DTW detection was performed with a sliding window with a window shift of 5 frames. The adjustment window constraint was imposed on the DTW alignment path. Let d_{q,t} denote the normalized DTW alignment distance between the qth query and the tth hit region. The raw detection score was computed by

    s_{q,t} = exp(−d_{q,t} / β),    (3)

where the scaling factor β was set to 0.6. To calibrate the score distributions of different queries, a 0/1 normalization was used,

    ŝ_{q,t} = (s_{q,t} − μ_q) / δ_q,    (4)

where ŝ_{q,t} is the calibrated score, and μ_q and δ_q² are the mean and variance of the raw scores of the qth query.

2.2 GMM Tokenizer

The GMM tokenizer was trained from the given searching speech collection. It contained 1024 Gaussian components. The input of the GMM tokenizer was a 39-dimensional MFCC feature vector. The MFCC features were processed with VAD and utterance-based mean and variance normalization (MVN). Vocal tract length normalization (VTLN) was then applied to the MFCC features to alleviate the influence of speaker variation.

The warping factors of VTLN were estimated iteratively as proposed in [9]. The iteration started with training a GMM from the unwarped MFCC features. Then the warping factors were estimated with a maximum-likelihood grid search using the GMM. A new GMM was trained using the warped features, and new warping factors were then re-estimated. This process was iterated four times in our implementation. The usefulness of VTLN for this task was experimentally demonstrated in our previous paper [8].

2.3 Phoneme Recognizers

Our system involved five phoneme recognizers, namely Czech, Hungarian, Russian, English and Mandarin phoneme recognizers. All of them used the split temporal context network structure [4]. The Czech, Hungarian and Russian phoneme recognizers were developed at Brno University of Technology (BUT) and released in [1]. The English phoneme recognizer was trained on about 15 hours of speech data from the Fisher corpus and the Switchboard Cellular corpus. The Mandarin phoneme recognizer was trained on about 15 hours of speech data from the CallHome corpus and the CallFriend corpus. These phoneme recognizers were used to generate mono-phone state-level posteriorgrams without any language model constraint.

2.4 ASM Tokenizer

Acoustic segment modeling (ASM) is a way to build an HMM-based speech tokenizer from unlabeled speech data. It consists of three steps, namely initial segmentation, segment labeling, and iterative training and decoding. Initial segmentation searches for acoustic discontinuities and partitions speech utterances into short-time speech segments. In our implementation, we simply used the one-best recognition results of the Hungarian phoneme recognizer to get the hypothesised segment boundaries.

Segment labeling assigns a label to each short-time speech segment. We used a multiview segment clustering (MSC) approach for segment labeling. The MSC approach takes in multiple segment-level posterior features, computes the similarity matrix and Laplacian matrix of the speech segments for each type of posterior feature, and makes a linear combination of the Laplacian matrices. With the combined Laplacian matrix, eigen-decomposition was performed to derive the spectral embedding representations, and k-means was applied to find 100 clusters. Details of the MSC approach are described in [6].

The cluster labels were used as initializations for iterative training and decoding, in which HMM training and decoding were performed iteratively until convergence.

3. RESULTS

Table 1 shows the results obtained by our system on the evaluation queries. Based on our previous experience with TWV values, we only submitted the small portion of the scores which were higher than a threshold. This gives the results of System 1. However, if the scores of all the trials are considered, we obtain the results of System 2, which gives obvious reductions in the Cnxe values. Similar observations can be made when only type 1 query matching is considered; the corresponding results are shown in Table 2. The difference between the Cnxe and TWV metrics needs to be carefully examined in the future.

Table 1: System performances on all the queries. System 1 corresponds to the submitted results.

    System No. | actCnxe | minCnxe | ATWV  | MTWV
    1          | 0.682   | 0.659   | 0.412 | 0.413
    2          | 0.638   | 0.585   | 0.412 | 0.413

Table 2: System performances on the type 1 queries. System 1 corresponds to the submitted results.

    System No. | actCnxe | minCnxe | ATWV  | MTWV
    1          | 0.526   | 0.486   | 0.611 | 0.613
    2          | 0.508   | 0.420   | 0.611 | 0.613

To run the experiments, we used a computer with an Intel i7-3770K CPU (3.50GHz, 4 cores), 32GB RAM and a 1TB hard drive. In the online searching process, all the posteriorgrams were stored in memory. This caused a very high memory cost (>10GB). The computation cost in the searching process was mainly caused by DTW detection. The searching speed factor of our system was about 0.021. The slow searching speed is one main drawback of our system and needs to be improved.

4. CONCLUSION

We have described an overview of the CUHK system submitted to the MediaEval 2014 QUESST task along with the evaluation results. Our system involves seven tokenizers and uses DTW matrix combination for fusion. Only type 1 query matching was considered in the system development. The main highlight of our system lies in the MSC approach to ASM tokenizer construction. In general we think the performance for type 1 query matching is acceptable, but the slow searching speed and high memory cost need to be substantially improved.

5. REFERENCES

[1] http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context.
[2] X. Anguera, L. J. Rodriguez-Fuentes, A. Buzo, I. Szoke, and F. Metze. Query by example search on speech at MediaEval 2014. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[3] T. Hazen, W. Shen, and C. White. Query-by-example spoken term detection using phonetic posteriorgram templates. In ASRU, pages 421-426, 2009.
[4] P. Schwarz. Phoneme recognition based on long temporal context. PhD thesis, 2009.
[5] H. Wang and T. Lee. CUHK system for the spoken web search task at MediaEval 2012. In Working Notes Proceedings of the MediaEval 2012 Workshop, Pisa, Italy, October 4-5, 2012.
[6] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li. Acoustic segment modeling with spectral clustering methods. In submission to IEEE/ACM TASLP.
[7] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li. Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection. In ICASSP, 2013.
[8] H. Wang, C.-C. Leung, T. Lee, B. Ma, and H. Li. An acoustic segment modeling approach to query-by-example spoken term detection. In ICASSP, 2012.
[9] S. Wegmann, D. McAllaster, J. Orloff, and B. Peskin. Speaker normalization on conversational telephone speech. In ICASSP, 1996.
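As an illustrative sketch (not the authors' code), the inner-product distance of Eq. (1) and the uniform-weight fusion of Eq. (2) in Section 2.1 fit in a few lines of numpy. The toy posteriorgram sizes and the two "tokenizers" below are made up for the example:

```python
import numpy as np

def distance_matrix(Q, T, eps=1e-10):
    # Eq. (1): D_i = -log(Q_i^T x T_i); eps guards against log(0)
    return -np.log(Q.T @ T + eps)

def fuse_distances(D_list):
    # Eq. (2): D = sum_i w_i D_i with uniform weights (w_i = 1/7 in the paper)
    return sum(D_list) / len(D_list)

def random_posteriorgram(rng, dims, frames):
    # columns are posterior distributions over tokenizer units (sum to 1)
    P = rng.random((dims, frames))
    return P / P.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
# two hypothetical tokenizers with different numbers of units
Q1, T1 = random_posteriorgram(rng, 5, 4), random_posteriorgram(rng, 5, 6)
Q2, T2 = random_posteriorgram(rng, 8, 4), random_posteriorgram(rng, 8, 6)
D = fuse_distances([distance_matrix(Q1, T1), distance_matrix(Q2, T2)])
print(D.shape)  # one distance per (query frame, test frame) pair: (4, 6)
```

Note that the fused matrix has the same shape regardless of each tokenizer's unit count, which is what makes the linear combination across heterogeneous tokenizers possible.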
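The sliding-window DTW detection and score calibration of Eqs. (3)-(4) can be sketched as follows. This is a plain length-normalized DTW recursion without the adjustment window (band) constraint the paper imposes, and the window length and test matrix here are arbitrary examples:

```python
import numpy as np

def dtw_distance(D):
    # length-normalized DTW alignment distance through a local distance
    # matrix D (query frames x window frames), standard 3-step recursion
    n, m = D.shape
    acc = np.full((n, m), np.inf)
    steps = np.zeros((n, m))
    acc[0, 0], steps[0, 0] = D[0, 0], 1
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = []
            if i > 0:
                prev.append((acc[i - 1, j], steps[i - 1, j]))
            if j > 0:
                prev.append((acc[i, j - 1], steps[i, j - 1]))
            if i > 0 and j > 0:
                prev.append((acc[i - 1, j - 1], steps[i - 1, j - 1]))
            cost, length = min(prev, key=lambda p: p[0])
            acc[i, j] = cost + D[i, j]
            steps[i, j] = length + 1
    return acc[-1, -1] / steps[-1, -1]

def detect(D, win, shift=5, beta=0.6):
    # Eq. (3): slide a window along the test axis, score each hit region
    return np.array([np.exp(-dtw_distance(D[:, s:s + win]) / beta)
                     for s in range(0, D.shape[1] - win + 1, shift)])

def znorm(scores):
    # Eq. (4): per-query 0/1 normalization (zero mean, unit variance)
    return (scores - scores.mean()) / scores.std()

rng = np.random.default_rng(1)
D = rng.random((10, 50))       # fused distance matrix for one query
scores = detect(D, win=12)     # one raw score per hit region
calibrated = znorm(scores)
```

The 0/1 normalization makes scores of different queries comparable before a single detection threshold is applied, which is exactly what the submission threshold in Section 3 operates on.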
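For the GMM tokenizer of Section 2.2, the posteriorgram is simply the per-frame component posteriors of the trained GMM. A minimal sketch for a diagonal-covariance GMM; the paper's model has 1024 components, while the tiny two-component model below is only for illustration:

```python
import numpy as np

def gmm_posteriorgram(X, means, variances, weights):
    # Frame-level GMM posteriorgram: column t holds p(component | x_t).
    # X: (frames, dim); means, variances: (components, dim); weights: (components,)
    log_det = np.log(variances).sum(axis=1)
    diff2 = ((X[:, None, :] - means[None]) ** 2 / variances[None]).sum(axis=-1)
    log_lik = -0.5 * (diff2 + log_det + X.shape[1] * np.log(2 * np.pi))
    log_post = np.log(weights) + log_lik             # unnormalized log posteriors
    log_post -= log_post.max(axis=1, keepdims=True)  # stabilize the exp
    post = np.exp(log_post)
    return (post / post.sum(axis=1, keepdims=True)).T   # (components, frames)

# toy 2-component model; frames placed near each mean give near-one-hot columns
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))
weights = np.array([0.5, 0.5])
X = np.array([[0.1, -0.1], [5.0, 4.9]])
P = gmm_posteriorgram(X, means, variances, weights)
```

Each column of P is a posterior distribution over Gaussian components, so the output plugs directly into the Eq. (1) inner-product distance.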
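The MSC segment-labeling step of Section 2.4 amounts to multiview spectral clustering: a similarity matrix and Laplacian per view, a linear combination of the Laplacians, eigen-decomposition for the spectral embedding, then k-means. A minimal numpy sketch under stated assumptions: a Gaussian similarity kernel, a symmetric normalized Laplacian, uniform view weights, a simple farthest-point-initialized k-means, and synthetic two-class "segments" standing in for real segment-level posterior features (the paper uses 100 clusters; details are in [6]):

```python
import numpy as np

def view_laplacian(F, sigma=1.0):
    # one view: Gaussian similarity over segments, then the symmetric
    # normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
    sq = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(F)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def kmeans(X, k, iters=50):
    # tiny deterministic k-means with farthest-point initialization
    centers = [X[0]]
    for _ in range(k - 1):
        dist = ((X[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(axis=1)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def msc_labels(views, n_clusters):
    # combine per-view Laplacians with uniform weights, embed the segments
    # with the eigenvectors of the smallest eigenvalues, cluster with k-means
    L = sum(view_laplacian(F) for F in views) / len(views)
    _, vecs = np.linalg.eigh(L)      # eigh orders eigenvalues ascending
    embedding = vecs[:, :n_clusters]
    return kmeans(embedding, n_clusters)

# synthetic example: 10 "segments", two views, two underlying classes
rng = np.random.default_rng(2)
view1 = np.vstack([rng.normal(0, 0.05, (5, 3)), rng.normal(1, 0.05, (5, 3))])
view2 = np.vstack([rng.normal(0, 0.05, (5, 4)), rng.normal(1, 0.05, (5, 4))])
labels = msc_labels([view1, view2], 2)
```

Combining at the Laplacian level (rather than concatenating features) lets views with different dimensionalities and scales contribute on an equal footing, which mirrors the motivation for combining the seven tokenizers' evidence.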