=Paper=
{{Paper
|id=None
|storemode=property
|title=The CUHK Spoken Web Search System for MediaEval 2013
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_68.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/WangL13
}}
==The CUHK Spoken Web Search System for MediaEval 2013==
Haipeng Wang, Tan Lee
DSP-STL, Dept. of EE, The Chinese University of Hong Kong
{hpwang,tanlee}@ee.cuhk.edu.hk

ABSTRACT

This paper describes an audio keyword detection system developed at the Chinese University of Hong Kong (CUHK) for the spoken web search (SWS) task of MediaEval 2013. The system was built only on the provided unlabeled data, and each query term was represented by only one query example (from the basic set for required runs). The system was designed following the posteriorgram-based template matching framework, which uses a tokenizer to convert the speech data into posteriorgrams and then applies dynamic time warping (DTW) for keyword detection. The main features of the system are: 1) a new approach to tokenizer construction based on Gaussian component clustering (GCC), and 2) query expansion based on the technique called pitch synchronous overlap and add (PSOLA). The MTWV and ATWV of our system on the SWS2013 Evaluation set are 0.306 and 0.304.

1. INTRODUCTION

The spoken web search (SWS) task of MediaEval 2013 aims at detecting keyword occurrences in a set of spoken documents using audio keyword queries in a language-independent fashion. The spoken documents involve about 20 hours of unlabeled speech data from 9 languages. More details about the task description can be found in [1]. The focus of our work was on a completely unsupervised setting, i.e., only the unlabeled data from the spoken documents was used in the system development. For each query term, only one audio example was used in our system.

Our system follows the posteriorgram-based template matching framework [2]. New methods have been developed for tokenizer construction and query expansion. In addition, it was found that score normalization brought significant improvement.

2. SYSTEM DESCRIPTION

2.1 System Overview

Fig. 1 gives the overall architecture of our system. It involves an offline process and an online process. The offline process (marked by the dashed window in Fig. 1) builds the system from the spoken documents. It is divided into the stages of feature extraction, tokenizer construction, and posteriorgram generation. The offline process results in a speech tokenizer and the posteriorgrams of the spoken documents.

[Figure 1: System Framework]

The online process performs the detection task given an input query. It involves query expansion, query posteriorgram generation, DTW detection, and score normalization. The query expansion is based on the PSOLA [3] technique, which modifies the duration of the original query example and generates a number of query examples of different lengths. We refer to the original query examples and the generated query examples as the expanded query set. After converting the expanded query set into posteriorgrams, DTW detection is applied to get the raw scores. DTW is performed with a sliding window on the log-inner-product distance matrix of the posteriorgrams of the query set and the spoken documents. Details of the DTW detection in our system can be found in [5]. Lastly, mean and variance normalization is applied to the raw scores to obtain the final detection scores.

In practice, when the query example was very short, the returned hits would contain many false alarms. A duration threshold of 0.35 seconds was therefore applied to the input queries. If the duration of a query example (after silence removal) was less than the threshold, the system rejected this query example and did not return any results.

2.2 Feature Extraction

Our system used 39-dimensional MFCC features. The MFCC features were processed with voice activity detection (VAD) and mean and variance normalization (MVN) at the utterance level. Vocal tract length normalization (VTLN) was then used to alleviate the influence of speaker variation. The warping factors were determined with a maximum-likelihood grid search using a GMM with 256 components. The usefulness of VTLN for this task was experimentally demonstrated in our previous paper [6].

2.3 Tokenizer Construction

The tokenizer was used to generate posteriorgrams. It was trained on the unlabeled data of the spoken documents. We used a new Gaussian component clustering (GCC) approach to find phoneme-like units, and modeled the corresponding context-dependent states with a 5-layer neural network. The posteriorgrams were composed of the state posterior probabilities produced by the neural network.

The GCC approach involved 4 steps. First, a GMM with 4096 components was estimated. Second, unsupervised phoneme segmentation was performed on the spoken documents. Third, each speech segment was represented by a Gaussian posterior vector, computed by averaging the frame-level Gaussian posterior probabilities. Stacking the Gaussian posterior vectors, we obtained a Gaussian-by-segment data matrix, denoted by X. Finally, we computed the similarity matrix W of the Gaussian components as W = XX^T, and applied spectral clustering on the similarity matrix to find 150 clusters of Gaussian components. Details of the GCC approach can be found in [4].

Each cluster of Gaussian components was viewed as the acoustic model of a discovered unit. These acoustic models were refined by an iterative process [6], and updated to context-dependent models with 1198 states. These states were then modeled by a deep neural network, which had 3 hidden layers with 1000 units per layer. The input layer corresponds to a context window of 9 successive frames. The outputs of the neural network are the state posterior probabilities, which were used to construct the posteriorgrams.

2.4 Query Expansion

Query expansion aimed at generating variable-length examples, so as to cover larger duration variation of the query term. The PSOLA technique was implemented for this purpose. PSOLA is able to perform time-scale modifications while preserving the spectral characteristics as much as possible. The implementation involved three steps. First, pitch epochs were detected by an autocorrelation method. Second, the periodic waveform cycles identified by the pitch marks were duplicated or eliminated according to the time-scaling factors. Finally, the overlap-and-add algorithm was used to synthesize the new speech example.

In the system, two time-scaling factors were used: 0.7 and 1.3. For a query example with duration L, we had one generated example with duration 0.7×L and another with duration 1.3×L. Therefore the expanded query set had three examples for each term. Given a query term and an utterance in the spoken documents, the detection score was the maximum value among the scores provided by the three examples.

2.5 Score Normalization

Let d_{q,t} denote the DTW alignment distance between the qth query and the tth hit region. The corresponding raw detection score was computed as

s_{q,t} = exp(−d_{q,t}/β),   (1)

where the scaling factor β was set to 5. To calibrate the scores of different query terms, a simple 0/1 normalization was used. The normalization was performed as

ŝ_{q,t} = (s_{q,t} − µ_q)/δ_q,   (2)

where µ_q and δ_q² are the mean and variance of the top 400 raw scores for the qth query.

3. PERFORMANCE AND ANALYSIS

Table 1 lists the performances of our systems with different configurations. System No. 3 is our submitted system for this task. All three systems belong to the required run condition defined in [1]. From Table 1, we observed severe performance degradation (≥ 5%) from the Dev query set to the Eval query set. This may be due to the mismatch between the Dev set and the Eval set. Another observation is that the use of query expansion indeed brings improvements (≥ 2%) for both the Dev set and the Eval set. We think this improvement is quite encouraging, and more experiments and analysis will be done to confirm the usefulness of query expansion in future work. The final observation is that the use of score normalization brings two considerable benefits. First, it brings an MTWV gain of about 7.7% on the Dev set and 7.0% on the Eval set. This is different from our observation in the previous work [5]. We suspect this is related to the nonlinear transformation in (1) and the large size of the spoken documents. Second, score normalization seemed to make the decision threshold quite stable, so that the gap between MTWV and ATWV on the Eval set becomes very small.

Table 1: System configurations and performances. The basic system is without query expansion and score normalization.

 System No.              |   1   |   2   |   3
 Basic System            |   √   |   √   |   √
 Query Expansion         |       |   √   |   √
 Score Normalization     |       |       |   √
 Dev Query Set (MTWV)    | 0.263 | 0.290 | 0.367
 Eval Query Set (MTWV)   | 0.216 | 0.236 | 0.306
 Dev Query Set (ATWV)    |   –   |   –   | 0.367
 Eval Query Set (ATWV)   |   –   |   –   | 0.304

4. HARDWARE, MEMORY, AND CPU TIME

All the experiments were performed on a computer with an Intel i7-3770K CPU (3.50GHz, 4 cores), 32GB RAM and a 1TB hard drive. In the online process, all the posteriorgrams of the spoken documents were stored in memory. This accelerated the online detection, but incurred a very high memory cost (>10GB). The computation cost in the online process was mainly caused by DTW detection. The searching speed factor of system No. 3 was about 0.018.

5. REFERENCES

[1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. Rodriguez-Fuentes. The spoken web search task. In MediaEval 2013 Workshop, 2013.
[2] T. Hazen, W. Shen, and C. White. Query-by-example spoken term detection using phonetic posteriorgram templates. In ASRU, 2009.
[3] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 1990.
[4] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li. Unsupervised mining of acoustic subword units with segment-level Gaussian posteriorgrams. In Interspeech, 2013.
[5] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li. Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection. In ICASSP, 2013.
[6] H. Wang, C.-C. Leung, T. Lee, B. Ma, and H. Li. An acoustic segment modeling approach to query-by-example spoken term detection. In ICASSP, 2012.

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.
This research is partially supported by the CUHK MoE-Microsoft Key Laboratory and CUHK-PKU Joint Centre for Intelligence Engineering.
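The DTW detection of Section 2.1 — matching a query posteriorgram against a spoken document on a log-inner-product distance matrix — can be sketched as a minimal subsequence DTW. This is an illustration only, not the system's actual implementation (the sliding-window and path constraints of [5] are more elaborate); the toy posteriorgrams are hypothetical data.

```python
import numpy as np

def log_inner_product_dist(query, doc):
    """Distance matrix D[i, j] = -log(q_i . x_j) between two posteriorgrams
    (rows are frames, columns are posterior probabilities summing to 1)."""
    eps = 1e-10                      # avoid log(0) for non-overlapping posteriors
    return -np.log(query @ doc.T + eps)

def subsequence_dtw(query, doc):
    """Minimal subsequence DTW: the query may start and end anywhere in the
    document. Returns the best alignment distance, normalized by query length."""
    D = log_inner_product_dist(query, doc)
    n, m = D.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = D[0, :]              # free start anywhere in the document
    for i in range(1, n):
        for j in range(m):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = D[i, j] + best_prev
    return acc[-1].min() / n         # free end; length-normalized distance

# Toy 3-dimensional posteriorgrams (hypothetical data): a query embedded in
# a longer document should score a smaller distance than random speech.
rng = np.random.default_rng(0)
query = rng.dirichlet(np.ones(3), size=5)
doc = np.vstack([rng.dirichlet(np.ones(3), size=10), query,
                 rng.dirichlet(np.ones(3), size=10)])
d_match = subsequence_dtw(query, doc)
d_rand = subsequence_dtw(query, rng.dirichlet(np.ones(3), size=25))
```

In the real system this distance feeds Eq. (1); here it is simply returned so the alignment step can be inspected in isolation.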
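The VTLN step in Section 2.2 determines each warping factor by a maximum-likelihood grid search against a GMM. The sketch below assumes a diagonal-covariance GMM and a caller-supplied `warp_fn` (a hypothetical stand-in for the actual feature-warping front end); it is not the paper's implementation, only the shape of the search.

```python
import numpy as np

def gmm_loglik(frames, means, variances, weights):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    diff = frames[:, None, :] - means[None]                       # (T, M, D)
    ll = -0.5 * (np.log(2 * np.pi * variances)[None]
                 + diff ** 2 / variances[None]).sum(-1)           # (T, M)
    ll += np.log(weights)[None]
    return np.logaddexp.reduce(ll, axis=1).mean()

def pick_warp_factor(warp_fn, utterance, gmm,
                     grid=np.arange(0.88, 1.13, 0.02)):
    """ML grid search over VTLN warping factors: warp, score, keep the best."""
    scores = [gmm_loglik(warp_fn(utterance, a), *gmm) for a in grid]
    return grid[int(np.argmax(scores))]

# Toy check: with a GMM centred at 1 and a trivial "warp" that scales the
# features, the unwarped factor (1.0) should win the search.
gmm = (np.ones((1, 2)), np.ones((1, 2)), np.array([1.0]))
best = pick_warp_factor(lambda u, a: u * a, np.ones((10, 2)), gmm)
```

The grid range and step are illustrative; the paper specifies only that a 256-component GMM was used.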
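The GCC pipeline of Section 2.3 reduces to: stack segment-level Gaussian posterior vectors into X, form W = XX^T, and spectrally cluster the Gaussian components. A minimal NumPy sketch follows, with toy dimensions instead of the paper's 4096 components and 150 clusters, and a bare-bones k-means on the Laplacian embedding standing in for a production spectral-clustering routine.

```python
import numpy as np

def spectral_cluster(W, k, n_iter=50, seed=0):
    """Cluster items given a similarity matrix W: embed into the k smallest
    eigenvectors of the normalized graph Laplacian, then run plain k-means."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)            # ascending eigenvalues
    emb = eigvecs[:, :k]
    emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), k, replace=False)]
    for _ in range(n_iter):                       # Lloyd iterations
        labels = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = emb[labels == c].mean(axis=0)
    return labels

# Toy stand-in for the paper's setup: G Gaussian components, S segments.
# X[g, s] = average posterior of Gaussian g over the frames of segment s.
rng = np.random.default_rng(1)
G, S, K = 30, 200, 3
block = rng.integers(0, K, size=G)                # hidden "true" unit per Gaussian
X = 0.1 * rng.random((G, S))
for s in range(S):
    unit = rng.integers(0, K)                     # each segment excites one unit
    X[block == unit, s] += 1.0

W = X @ X.T                                       # Gaussian-by-Gaussian similarity
labels = spectral_cluster(W, K)
```

Each resulting cluster of Gaussian components would then serve as the acoustic model of one discovered unit, as described in the paper.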
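The three PSOLA steps of Section 2.4 (pitch epochs from autocorrelation, duplicating or dropping cycles per the time-scaling factor, overlap-add synthesis) can be sketched as below. This is a deliberately crude TD-PSOLA-style toy — evenly spaced epochs from a single autocorrelation peak, tested on a synthetic tone — not the system's actual front end.

```python
import numpy as np

def estimate_period(x, fs, fmin=60.0, fmax=400.0):
    """Crude pitch-period estimate: the autocorrelation peak in a lag range."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return lo + int(np.argmax(ac[lo:hi]))

def psola_stretch(x, fs, factor):
    """Toy TD-PSOLA time-scale modification: take Hann-windowed two-period
    grains at evenly spaced analysis epochs, overlap-add at rescaled epochs."""
    T = estimate_period(x, fs)
    ana = np.arange(T, len(x) - T, T)             # analysis epochs
    syn = (ana * factor).astype(int)              # synthesis epochs
    out = np.zeros(int(len(x) * factor) + 2 * T)
    win = np.hanning(2 * T)
    for a, s in zip(ana, syn):
        out[s:s + 2 * T] += x[a - T:a + T] * win
    return out[:int(len(x) * factor)]

# Hypothetical "query example": a 200 Hz tone at 8 kHz, 0.5 s, expanded with
# the paper's two time-scaling factors.
fs = 8000
t = np.arange(int(0.5 * fs)) / fs
x = np.sin(2 * np.pi * 200 * t)
compressed = psola_stretch(x, fs, 0.7)
stretched = psola_stretch(x, fs, 1.3)
```

Together with the original example, `compressed` and `stretched` form the three-member expanded query set per term; at detection time the system takes the maximum of the three scores.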
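The score normalization of Section 2.5 can be illustrated directly from Eqs. (1) and (2): map DTW distances to raw scores with the exponential transform, then 0/1-normalize each query with the mean and standard deviation of its top-N raw scores. Plain-Python sketch with hypothetical hit-region distances; the paper's N is 400, a smaller N is used here only so the toy list suffices.

```python
import math

BETA = 5.0        # scaling factor beta in Eq. (1)
TOP_N = 400       # statistics computed over the top-N raw scores per query

def raw_scores(dtw_distances, beta=BETA):
    """Eq. (1): map DTW alignment distances d to raw scores s = exp(-d / beta)."""
    return [math.exp(-d / beta) for d in dtw_distances]

def normalize(scores, top_n=TOP_N):
    """Eq. (2): 0/1-normalize with mean/std of this query's top-N raw scores."""
    top = sorted(scores, reverse=True)[:top_n]
    mu = sum(top) / len(top)
    var = sum((s - mu) ** 2 for s in top) / len(top)
    sigma = math.sqrt(var) or 1.0     # guard against zero variance
    return [(s - mu) / sigma for s in scores]

# Hypothetical DTW distances for one query's hit regions.
distances = [2.1, 3.5, 8.0, 1.2, 6.4]
s = raw_scores(distances)
s_hat = normalize(s, top_n=4)
```

Because the transform in Eq. (1) is monotone decreasing, the ranking of hits is preserved; normalization only recalibrates the score scale across queries, which is what stabilizes the decision threshold noted in Section 3.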