<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jozef Vavrek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Viszlay</string-name>
          <email>Peter.Viszlay@tuke.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Lojka</string-name>
          <email>Martin.Lojka@tuke.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matúš Pleva</string-name>
          <email>Matus.Pleva@tuke.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jozef Juhár</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory of Speech Technologies in Telecommunications @ Technical University of Košice Park Komenského 13</institution>
          ,
          <addr-line>041 20 Košice</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper presents two approaches to a Query-by-Example (QbE) retrieval system, proposed by the Technical University of Košice (TUKE) for the Query by Example Search on Speech Task (QUESST). Our main interest was in building a QbE system able to retrieve all given queries both with and without the use of external speech resources. We therefore developed a posteriorgram-based keyword matching system that utilizes a novel weighted fast sequential variant of DTW (WFS-DTW) to detect occurrences of each query within a particular utterance file, using two GMM-based approaches to acoustic unit modeling. The first, referred to as the low-resource approach, employs language-dependent phonetic decoders to convert queries and utterances into posteriorgrams. The second, defined as the zero-resource approach, implements a combination of unsupervised segmentation and clustering techniques using only the provided utterance files.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>MOTIVATION</title>
      <p>The motivation for developing our system was to assess
the ability of the proposed WFS-DTW algorithm to detect
various spoken query terms by implementing low-resource and
zero-resource posteriorgram-based matching approaches.</p>
    </sec>
    <sec id="sec-2">
      <title>WFS-DTW SEARCHING ALGORITHM</title>
      <p>
        The searching algorithm for the QUESST task follows the one
used in our paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The proposed solution is a modification of the
segmental DTW algorithm we applied in the Spoken Web Search
task last year [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. There are three main contributions of this
algorithm: 1) a one-step-forward moving strategy, in which each
DTW search is carried out sequentially, block by block, with
block size equal to the length of the query; 2) a linear time-aligned
accumulated distance for speeding up the sequential DTW without
considerable loss in retrieval performance; 3) optimization of
the global minimum over the set of alignment paths by
implementing a weighted cumulative distance (WCD) parameter.
      </p>
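      <p>To make contribution 1) concrete, the following is a minimal sketch of a block-wise sequential DTW search over posteriorgrams. It is an illustration under our own assumptions (a negative-log dot-product local distance, length-normalized path cost, and a one-frame step), not the authors' WFS-DTW implementation, which additionally applies the linear time-aligned accumulated distance and the WCD optimization.</p>
      <preformat>
import numpy as np

def dtw_block(query, block):
    """Plain DTW between a query posteriorgram and one utterance block.
    Local distance: negative log of the frame dot product (assumed)."""
    n, m = len(query), len(block)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = -np.log(np.dot(query[i - 1], block[j - 1]) + 1e-10)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalized path cost

def sequential_search(query, utterance, step=1):
    """One-step-forward strategy: slide a query-sized block over the
    utterance posteriorgram and score every position by DTW."""
    q = len(query)
    scores = [dtw_block(query, utterance[s:s + q])
              for s in range(0, len(utterance) - q + 1, step)]
    return np.asarray(scores)  # local minima mark candidate detections
      </preformat>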
    </sec>
    <sec id="sec-3">
      <title>LOW-RESOURCE APPROACH</title>
      <p>
        The low-resource approach includes four language-dependent
subsystems, each represented by a GMM-based acoustic model.
The acoustic models were trained beforehand using four
databases: two SpeechDat databases (Slovak, 66 h and Czech, 89 h) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
Slovak ParDat1 (40 h) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and English TIMIT (10 h) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>The well-trained models were intended to generate
time-aligned and labelled segments for each utterance through
Viterbi decoding. The phonetic decoder employed a
phone-level vocabulary and a phone network. We found that the
phoneme insertion log probability p in the Viterbi segmentation
has a significant impact on the time alignment. Since the best
results were obtained with p = 0, we used this value in the
whole setup. The time alignments were used to train a new
GMM-based acoustic model using the development data. This
means that each language-dependent model was replaced by
its refined version, which was finally used to generate the
posteriorgrams for utterances and queries.</p>
      <p>Note that we used 39-dimensional MFCC (Mel-Frequency
Cepstral Coefficients) features for the Viterbi segmentation and
GMM training. In the low-resource approach we did not need
any voice activity detector (VAD) because the silent parts of
the audio stream were identified during the Viterbi segmentation.</p>
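      <p>For illustration, a posteriorgram can be generated from a set of per-unit GMMs roughly as follows. This is a sketch assuming uniform unit priors and using scikit-learn's GaussianMixture as a stand-in for the paper's acoustic models; it is not the authors' code.</p>
      <preformat>
import numpy as np
from sklearn.mixture import GaussianMixture  # stand-in GMM implementation

def posteriorgram(mfcc, unit_gmms):
    """mfcc: (T, 39) MFCC matrix; unit_gmms: one fitted GaussianMixture
    per acoustic unit. Returns a (T, U) matrix of unit posteriors."""
    # Per-frame log-likelihood of every acoustic unit.
    loglik = np.stack([g.score_samples(mfcc) for g in unit_gmms], axis=1)
    loglik -= loglik.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(loglik)
    return post / post.sum(axis=1, keepdims=True)  # normalize per frame
      </preformat>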
    </sec>
    <sec id="sec-4">
      <title>ZERO-RESOURCE APPROACH</title>
      <p>In keeping with the zero-resource approach, we did not
assume any prior knowledge of the acoustic units or the
pronunciation lexicon. In order to train the acoustic models, it
was first necessary to identify the acoustic speech units in
the audio data automatically. In this work, we utilized four
different zero-resource approaches to address this problem.</p>
      <p>
        Type 1: This approach uses a PCA-based VAD to discriminate
the voice-active segments from the silent ones [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The
initial feature selection, based on simple PCA (principal
component analysis) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], is carried out after extracting the first 13
MFCCs. Only those speech-active feature vectors are
selected whose variance achieves values greater than 90% along
the first principal component. Then, K-means clustering
with K = 75 clusters and a correlation distance metric is
computed on the reduced data. The clustering starts by
selecting K points uniformly. Finally, speech segmentation
is performed by computing the squared Euclidean distance
between the feature vectors and the K mean vectors, where the
label of the mean vector with minimum distance is assigned
in collaboration with the VAD.
      </p>
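      <p>A rough sketch of this Type 1 pipeline is given below. It reflects our reading of the steps and simplifies two details: scikit-learn's KMeans clusters with Euclidean distance rather than the correlation metric named above, and the 90% variance criterion is approximated by a percentile threshold on the first principal component.</p>
      <preformat>
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def type1_segmentation(mfcc13, k=75, quantile=90):
    """mfcc13: (T, 13) matrix of the first 13 MFCCs per frame."""
    # Project frames onto the first principal component.
    pc1 = PCA(n_components=1).fit_transform(mfcc13).ravel()
    # Keep high-variance frames as speech-active (assumed VAD criterion).
    active = np.abs(pc1) ** 2 > np.percentile(np.abs(pc1) ** 2, quantile)
    # K-means with K = 75, started from uniformly chosen data points.
    km = KMeans(n_clusters=k, init="random", n_init=1).fit(mfcc13[active])
    # Label every frame by the nearest mean (squared Euclidean distance).
    d2 = ((mfcc13[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1), active  # frame labels and VAD mask
      </preformat>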
      <p>Type 2: The Type 2 approach follows directly from
Type 1 and is further extended by Viterbi segmentation and
new GMM training. These two steps are identical to those
already described in Section 3. The main difference is that
the acoustic model from Type 1 is used to generate the
time alignments through Viterbi segmentation.</p>
      <p>
        Type 3: The third approach is based on the well-known
flat-start training procedure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It does not need any
segmentation or clustering because the utterances are uniformly
segmented using Baum-Welch embedded re-estimation.
Therefore, an alternative GMM initialization strategy is
applied, where all phone models are initialized identically, with
state means and variances equal to the global mean and
variance. The phone models are then moved straight to
embedded training and simultaneously updated and expanded to
higher-order GMs (Gaussian Mixtures) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The key element in
flat-start training is the phone-level transcription, obtained
from phone-based recognition using the acoustic model
acquired from the Type 1 zero-resource approach.
      </p>
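      <p>The flat-start initialization itself is compact; a minimal sketch with illustrative names, covering only the single-Gaussian starting point before mixture expansion:</p>
      <preformat>
import numpy as np

def flat_start(features, phone_list, n_states=3):
    """HTK-style flat start: every state of every phone model starts
    with the global mean and variance of the training features;
    embedded Baum-Welch re-estimation then differentiates the models
    using only a phone-level transcription."""
    mu, var = features.mean(axis=0), features.var(axis=0)
    return {phone: [{"mean": mu.copy(), "var": var.copy()}
                    for _ in range(n_states)]
            for phone in phone_list}
      </preformat>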
      <p>Type 4: The Type 4 approach implements GMM-based
segmentation and ergodic HMM (EHMM) training. First, an
unsupervised GMM training is performed on the whole database,
where each acoustic unit is represented by one GM. Each
GM is then associated with one of the 64 states in the EHMM,
and new GMs for each acoustic unit are trained iteratively.</p>
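      <p>One way to approximate this procedure with an off-the-shelf toolkit is sketched below; the paper does not name its implementation, so hmmlearn and the mixture size are our assumptions.</p>
      <preformat>
import numpy as np
from hmmlearn.hmm import GMMHMM  # assumed toolkit, not the authors' code

def train_ehmm(mfcc, lengths, n_units=64, n_mix=4, n_iter=10):
    """mfcc: concatenated (sum(lengths), 39) frames of all utterances;
    lengths: per-utterance frame counts. A fully connected (ergodic)
    HMM with one GMM state per acoustic unit; EM iteratively refines
    the per-unit mixtures and the transitions."""
    ehmm = GMMHMM(n_components=n_units, n_mix=n_mix,
                  covariance_type="diag", n_iter=n_iter)
    ehmm.fit(mfcc, lengths)
    return ehmm  # ehmm.predict(mfcc) yields per-frame unit labels
      </preformat>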
      <p>Note that we used conventional 39-dimensional MFCCs
for all of the zero-resource processing (except Type 1). We
did not use any VAD here (except in Type 1) because the
&lt;sil&gt; labels were available from the Viterbi segmentation.</p>
    </sec>
    <sec id="sec-5">
      <title>POST-PROCESSING: SCORE NORMAL</title>
    </sec>
    <sec id="sec-6">
      <title>IZATION AND FUSION</title>
      <p>
        The score parameter was represented by the WCD, normalized by
a 0/1 scaling factor, similarly to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This step
helped us to unify the score ranges for the first 500 detection
candidates per query. Then the score fusion of the four
different subsystems was carried out, employing a simple
max-score merging strategy, similar to that of Anguera et al.
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Detection candidates from each individual subsystem
were merged together, keeping the one with the highest score
in case of overlap. The merged candidates for each query were
subsequently normalized by z-normalization and sorted
according to the score value. The final set was obtained by
keeping the first 45-150 candidates, according to the length of the
query (the shorter the query, the lower the number of candidates).
      </p>
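      <p>A minimal sketch of the merging and normalization steps, under our reading of the procedure (the per-query truncation to 45-150 candidates is omitted; names are illustrative):</p>
      <preformat>
import numpy as np

def max_score_fusion(candidates):
    """candidates: list of (utt_id, start, end, score) tuples pooled
    from all subsystems. Overlapping detections within the same
    utterance keep only the highest-scoring one; the survivors are
    then z-normalized and sorted by score."""
    merged = []
    for c in sorted(candidates, key=lambda c: -c[3]):  # best first
        overlaps = any(c[0] == m[0] and m[2] > c[1] and c[2] > m[1]
                       for m in merged)
        if not overlaps:
            merged.append(c)
    scores = np.array([m[3] for m in merged])
    z = (scores - scores.mean()) / (scores.std() + 1e-10)  # z-norm
    order = np.argsort(-z)
    return [merged[i] for i in order]
      </preformat>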
    </sec>
    <sec id="sec-7">
      <title>RESULTS AND CONCLUSION</title>
      <p>
        We submitted four runs obtained from the low-resource
(primary) and zero-resource (general) systems for the QUESST 2014
task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The primary systems employ language-dependent
acoustic modeling using Viterbi segmentation with 128 GMs
(ParDat1, TIMIT) and 256 GMs (SpeechDat SK, CZ). The
general systems use 32 GMs for Types 1, 2 and 3 and 64 GMs for
Type 4. A best-one-wins strategy was used for the first (on-time)
runs: only the subsystem with the best performance
was submitted, namely p-low using SpeechDat SK and the
g-zero Type 2 subsystem. The late submissions include a
max-score merging fusion of the four subsystems for both the primary
and general approaches. The results in Tab. 1 show that there
are still big differences in performance between the p-low and
g-zero approaches, even when the score fusion technique is
applied. Moreover, there is also a considerable gap between
the actual and minimum Cnxe, despite the fact that the actual and
maximum TWV are perfectly calibrated. Therefore, improved
calibration/fusion models based on affine transformation and
linear regression will be investigated in the future.
      </p>
      <p>The indexing was done using two IBM x3650 servers (Intel E5530 @
2.4 GHz, 8 cores), 28 GB RAM, under Debian OS. The searching
algorithm was run on a 52-node IBM dx360 M3 cluster (Intel
E5645 @ 2.4 GHz, 624 cores), 48 GB RAM per node, running
Scientific Linux 6 and Torque (see Tab. 2).</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>This publication is the result of the Project
implementation: University Science Park TECHNICOM for Innovation
Applications Supported by Knowledge Technology, ITMS:
26220220182, supported by the Research &amp; Development
Operational Programme funded by the ERDF (100%).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          et al.
          <article-title>The Telefonica Research Spoken Web Search System for MediaEval 2013</article-title>
          .
          <source>In Working Notes Proc. of the MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          .
          <article-title>Query by Example Search on Speech at Mediaeval 2014</article-title>
          .
          <source>In Working Notes Proc. of the MediaEval 2014 Workshop</source>
          , Barcelona, Spain, 16-17 October
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Darjaa</surname>
          </string-name>
          et al.
          <article-title>Rule-based Triphone Mapping for Acoustic Modeling in Automatic Speech Recognition</article-title>
          .
          <source>In Proc. of the 14th Intl. Conf. on Text, Speech and Dialogue</source>
          ,
          <source>TSD'11</source>
          , pages
          <fpage>268</fpage>
          -
          <lpage>275</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Garofolo</surname>
          </string-name>
          et al.
          <source>TIMIT Acoustic-Phonetic Continuous Speech Corpus</source>
          ,
          <year>1993</year>
          . Linguistic Data Consortium, Philadelphia.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Juhar</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Viszlay</surname>
          </string-name>
          .
          <article-title>Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition</article-title>
          .
          <source>In Modern Speech Recognition Approaches with Case Studies</source>
          , pages
          <fpage>131</fpage>
          -
          <lpage>154</lpage>
          . InTech Open Access,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>van den Heuvel</surname>
          </string-name>
          et al.
          <article-title>SpeechDat-E: Five Eastern European Speech Databases for Voice-Operated Teleservices Completed</article-title>
          .
          <source>In Proc. of INTERSPEECH</source>
          , pages
          <fpage>2059</fpage>
          -
          <lpage>2062</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vavrek</surname>
          </string-name>
          et al.
          <article-title>TUKE at MediaEval 2013 Spoken Web Search Task</article-title>
          .
          <source>In Working Notes Proc. of the MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vavrek</surname>
          </string-name>
          et al.
          <article-title>Query-by-Example Retrieval via Fast Sequential Dynamic Time Warping Algorithm</article-title>
          .
          <source>In TSP 2014</source>
          , Berlin, DE, pages
          <fpage>469</fpage>
          -
          <lpage>473</lpage>
          . IEEE,
          <year>July 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Young</surname>
          </string-name>
          et al.
          <source>The HTK Book (for HTK Version 3.4)</source>
          . Cambridge University,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>