=Paper=
{{Paper
|id=Vol-1263/paper80
|storemode=property
|title=TUKE System for MediaEval 2014 QUESST
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_80.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/VavrekVLPJ14
}}
==TUKE System for MediaEval 2014 QUESST==
Jozef Vavrek, Peter Viszlay, Martin Lojka, Matúš Pleva, and Jozef Juhár
Laboratory of Speech Technologies in Telecommunications @ Technical University of Košice
Park Komenského 13, 041 20 Košice, Slovakia
{Jozef.Vavrek, Peter.Viszlay, Martin.Lojka, Matus.Pleva, Jozef.Juhar}@tuke.sk
Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT
Two approaches to a QbE (Query-by-Example) retrieval system, proposed by the Technical University of Košice (TUKE) for the Query by Example Search on Speech Task (QUESST), are presented in this paper. Our main interest was focused on building a QbE system that is able to retrieve all given queries both with and without the use of external speech resources. We therefore developed a posteriorgram-based keyword matching system that utilizes a novel weighted fast sequential variant of DTW (WFS-DTW) to detect occurrences of each query within a particular utterance file, using two GMM-based acoustic unit modeling approaches. The first, referred to as the low-resource approach, employs language-dependent phonetic decoders to convert queries and utterances into posteriorgrams. The second, defined as the zero-resource approach, implements a combination of unsupervised segmentation and clustering techniques using only the provided utterance files.

1. MOTIVATION
The motivation for developing our system was to assess the ability of the proposed WFS-DTW algorithm to detect various spoken query terms by implementing low- and zero-resource posteriorgram-based matching approaches.

2. WFS-DTW SEARCHING ALGORITHM
The searching algorithm for the QUESST task follows the one used in our paper [8]. The proposed solution is a modification of the segmental DTW algorithm we applied in the Spoken Web Search task last year [7]. There are three main contributions to this algorithm: 1) a one-step-forward moving strategy, in which each DTW search is carried out sequentially, block by block, with the block size equal to the length of the query; 2) a linear time-aligned accumulated distance for speeding up sequential DTW without considerable loss in retrieval performance; 3) optimization of the global minimum over the set of alignment paths by implementing a weighted cumulative distance (WCD) parameter.

3. LOW-RESOURCE APPROACH
The low-resource approach includes four language-dependent subsystems, each represented by a GMM-based acoustic model. The acoustic models were trained beforehand on four databases: 2× SpeechDat (Slovak, 66 h, and Czech, 89 h) [6], Slovak ParDat1 (40 h) [3], and English TIMIT (10 h) [4]. The well-trained models were intended to generate time-aligned and labelled segments for each utterance through Viterbi decoding. The phonetic decoder employed a phone-level vocabulary and a phone network. We found that the phoneme insertion log probability p in Viterbi segmentation has a significant impact on the time alignment. Since the best results were obtained with p = 0, we used this value in the whole setup. The time alignments were used to train a new GMM-based acoustic model on the development data. This means that each language-dependent model was replaced by its refined version, which was finally used to generate the posteriorgrams for utterances and queries.
Note that we used 39-dimensional MFCC (Mel-Frequency Cepstral Coefficients) features for Viterbi segmentation and GMM training. In the low-resource approach we did not need any voice activity detector (VAD), because the silent parts of the audio stream were identified during Viterbi segmentation.

4. ZERO-RESOURCE APPROACH
In keeping with the zero-resource approach, we did not assume any prior knowledge of the acoustic units or the pronunciation lexicon. In order to train the acoustic models, it was first necessary to identify the acoustic speech units in the audio data automatically. In this work, we utilized four different zero-resource approaches to address this problem.
Type 1: This approach uses a PCA-based VAD to discriminate the voice-active segments from the silent ones [8]. The initial feature selection, based on simple PCA (principal component analysis) [5], is carried out after extracting the first 13 MFCCs. Only those speech-active feature vectors are selected whose variance on the first principal component exceeds 90%. Then, K-means clustering with K = 75 clusters and a correlation distance metric is computed on the reduced data. The clustering starts by selecting K points uniformly. Finally, speech segmentation is performed by computing the squared Euclidean distance between the feature vectors and the K mean vectors, where the label of the mean vector with the minimum distance is assigned in collaboration with the VAD.
Type 2: The Type 2 approach derives directly from Type 1 and is further extended by Viterbi segmentation and new GMM training. These two steps are identical to those already described in Section 3. The main difference is that the acoustic model from Type 1 is used to generate the time alignments through Viterbi segmentation.
Type 3: The third approach is based on the well-known flat start training procedure [9]. It does not need any seg-
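The conversion of MFCC frames into a posteriorgram can be illustrated with a minimal numpy sketch. This is not the authors' implementation: for brevity it assumes a single diagonal-covariance Gaussian per acoustic unit (the actual systems use GMMs with up to 256 mixtures), and simply normalizes the per-frame unit likelihoods into a posterior vector per frame.

```python
import numpy as np

def gauss_loglik(frames, mean, var):
    """Log-likelihood of each frame under a diagonal-covariance Gaussian."""
    d = frames - mean
    return -0.5 * (np.sum(np.log(2 * np.pi * var)) + np.sum(d * d / var, axis=1))

def posteriorgram(frames, models):
    """frames: (T, 39) MFCC matrix; models: list of (mean, var) per acoustic unit.
    Returns a (T, N) matrix whose rows are per-frame posteriors over the N units."""
    loglik = np.stack([gauss_loglik(frames, m, v) for m, v in models], axis=1)
    loglik -= loglik.max(axis=1, keepdims=True)      # stabilize the softmax
    post = np.exp(loglik)
    return post / post.sum(axis=1, keepdims=True)
```

Each row of the result sums to one, so the cosine-style local distances used later by the DTW search are well defined.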
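The block-wise search strategy can be sketched as follows. This is a plain, unweighted DTW with a cosine local distance and a query-sized sliding block, a rough stand-in for the actual WFS-DTW of [8]: the WCD weighting and the linear time-aligned accumulated distance are omitted, and the function names are illustrative.

```python
import numpy as np

def dtw_cost(Q, B):
    """Length-normalized cumulative DTW cost between query Q (m, d) and
    block B (n, d), using cosine distance between posterior vectors."""
    qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    dist = 1.0 - qn @ bn.T                       # (m, n) local distances
    m, n = dist.shape
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = dist[i-1, j-1] + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    return D[m, n] / (m + n)

def block_search(Q, U, step=1):
    """One-step-forward moving strategy: slide a query-sized block over
    utterance U and return (best_cost, start_frame)."""
    m = len(Q)
    scores = [(dtw_cost(Q, U[s:s+m]), s) for s in range(0, len(U) - m + 1, step)]
    return min(scores)
```

Planting a query inside an utterance and calling `block_search` recovers its start frame with near-zero cost, since the diagonal alignment path then has zero local distance everywhere.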
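The clustering and labelling steps of the Type 1 approach can be sketched with a small numpy loop, since common K-means implementations (e.g. scikit-learn's) do not expose a correlation metric. This is a hedged sketch only: the PCA-based VAD and its coupling with the labels are omitted, and the usage example shrinks K from the paper's 75 to a toy value.

```python
import numpy as np

def corr_dist(X, C):
    """1 - Pearson correlation between each row of X and each row of C."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Cc = C - C.mean(axis=1, keepdims=True)
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    Cn = Cc / np.linalg.norm(Cc, axis=1, keepdims=True)
    return 1.0 - Xn @ Cn.T

def kmeans_labels(X, K, iters=20):
    """K-means with correlation distance and uniform initial point selection;
    final labels assigned by squared Euclidean distance, as in the Type 1 setup."""
    C = X[np.linspace(0, len(X) - 1, K).astype(int)].copy()  # uniform init
    for _ in range(iters):
        lab = corr_dist(X, C).argmin(axis=1)
        for k in range(K):
            if np.any(lab == k):                 # skip empty clusters
                C[k] = X[lab == k].mean(axis=0)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # squared Euclidean
    return d2.argmin(axis=1)
```

In the full system the resulting frame labels would additionally be gated by the VAD decisions before being used as acoustic-unit segments.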
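The Viterbi segmentation used both here and in the low-resource pipeline, including the insertion log probability p mentioned in Section 3, can be illustrated with a toy decoder. This is a strong simplification (no phone network or lexicon, just free self-transitions and unit switches penalized by p), so treat it as a sketch of the idea rather than the actual HTK-based decoder.

```python
import numpy as np

def viterbi_segment(loglik, p=0.0):
    """loglik: (T, N) per-frame log-likelihoods of N acoustic units.
    Staying in a unit costs nothing extra; switching units adds log-penalty p.
    Returns the best unit label per frame (a crude time alignment)."""
    T, N = loglik.shape
    delta = loglik[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        switch = delta.max() + p                 # best score if we change unit
        stay = delta                             # score if we keep the same unit
        back[t] = np.where(stay >= switch, np.arange(N), delta.argmax())
        delta = np.maximum(stay, switch) + loglik[t]
    labels = np.empty(T, dtype=int)
    labels[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        labels[t-1] = back[t, labels[t]]
    return labels
```

With p = 0, as in the paper's setup, switches are unpenalized and the alignment follows the locally best unit; a strongly negative p would merge short spurious segments into longer ones.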
mentation or clustering, because the utterances are uniformly segmented using Baum-Welch embedded re-estimation. Therefore, an alternative GMM initialization strategy is applied, in which all phone models are initialized identically, with state means and variances equal to the global mean and variance. The phone models are then moved straight to embedded training and are simultaneously updated and expanded to higher GMs (Gaussian Mixtures) [9]. The key element in flat start training is the phone-level transcription, obtained from phone-based recognition using the acoustic model acquired from the first (Type 1) zero-resource approach.
Type 4: The Type 4 approach implements GMM-based segmentation and ergodic HMM (EHMM) training. First, unsupervised GMM training is performed on the whole database, where each acoustic unit is represented by one GM. Each GM is then associated with one of the 64 states in the EHMM, and new GMs for each acoustic unit are trained iteratively.
Note that we used conventional 39-dimensional MFCCs for all zero-resource processing (except Type 1). We did not use any VAD here (except in Type 1), because the labels were available from the Viterbi segmentation.

5. POST-PROCESSING: SCORE NORMALIZATION AND FUSION
The score parameter was represented by the WCD, normalized to the 0-1 range by a scaling factor, similarly as in [8]. This step helped us to unify the score ranges for the first 500 detection candidates per query. The score fusion of the four different subsystems was then carried out, employing a simple max-score merging strategy, similarly as Anguera et al. did in [1]. Detection candidates from each individual subsystem were merged together, keeping the one with the highest score in case of overlap. The merged candidates for each query were subsequently normalized by z-normalization and sorted according to the score value. The final set was obtained by keeping the first 45-150 candidates, depending on the length of the query (the shorter the query, the fewer candidates kept).

6. RESULTS AND CONCLUSION
We submitted four runs obtained from the low-resource (primary) and zero-resource (general) systems for the QUESST 2014 task [2]. The primary systems employ language-dependent acoustic modeling using Viterbi segmentation with 128 GMs (ParDat1, TIMIT) and 256 GMs (SpeechDat SK, CZ). The general systems use 32 GMs for Types 1-3 and 64 GMs for Type 4. The best-one-wins strategy was used for the first (on-time) runs; thus, only the subsystem with the best performance was submitted, namely p-low using SpeechDat SK and g-zero using the Type 2 subsystem. The late submissions include a max-score merging fusion of all four subsystems for both the primary and general approaches. The results in Tab. 1 show that there are still large differences in performance between the p-low and g-zero approaches, even when the score fusion technique is applied. Moreover, there is also a considerable gap between the act and min Cnxe, despite the fact that the act and max TWV are perfectly calibrated. Therefore, improved calibration/fusion models based on affine transformation and linear regression will be investigated in the future.
The indexing was done on 2× IBM x3650 servers (Intel E5530 @ 2.4 GHz, 8 cores), 28 GB RAM, under Debian OS. The searching algorithm ran on a 52× IBM dx360 M3 cluster (Intel E5645 @ 2.4 GHz, 624 cores), 48 GB RAM per node, running Scientific Linux 6 and Torque (see Tab. 2).

Table 1: Evaluation of primary low-resource (p-low) and general zero-resource (g-zero) systems (* indicates late submission)

                   eval                            dev
system    Cnxe (act/min)  TWV (act/max)   Cnxe (act/min)  TWV (act/max)
p-low     0.959/0.891     0.154/0.154     0.960/0.892     0.161/0.162
g-zero    0.973/0.934     0.075/0.077     0.974/0.934     0.091/0.091
p-low*    0.947/0.853     0.168/0.169     0.948/0.854     0.191/0.191
g-zero*   0.970/0.921     0.102/0.103     0.971/0.922     0.106/0.107

Table 2: Processing resource measures

system        ISF    SSF     PMUI   PMUS   PL
p-low (dev)   0.61   0.0034  0.05   2.46   0.0106
g-zero (dev)  1.5    0.0042  1.4    3.92   0.225

7. ACKNOWLEDGMENTS
This publication is the result of the project implementation: University Science Park TECHNICOM for Innovation Applications Supported by Knowledge Technology, ITMS: 26220220182, supported by the Research & Development Operational Programme funded by the ERDF (100%).

8. REFERENCES
[1] X. Anguera et al. The Telefonica Research Spoken Web Search System for MediaEval 2013. In Working Notes Proc. of MediaEval 2013, 2013.
[2] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes. Query by Example Search on Speech at MediaEval 2014. In Working Notes Proc. of the MediaEval 2014 Workshop, Barcelona, Spain, 16-17 October 2014.
[3] S. Darjaa et al. Rule-based Triphone Mapping for Acoustic Modeling in Automatic Speech Recognition. In Proc. of the 14th Intl. Conf. on Text, Speech and Dialogue, TSD'11, pages 268–275, 2011.
[4] J. S. Garofolo et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993. Linguistic Data Consortium, Philadelphia.
[5] J. Juhár and P. Viszlay. Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition. In Modern Speech Recognition Approaches with Case Studies, pages 131–154. InTech Open Access, 2012.
[6] H. van den Heuvel et al. SpeechDat-E: Five Eastern European Speech Databases for Voice-Operated Teleservices Completed. In Proc. of INTERSPEECH, pages 2059–2062, 2001.
[7] J. Vavrek et al. TUKE at MediaEval 2013 Spoken Web Search Task. In Working Notes Proc. of MediaEval 2013, 2013.
[8] J. Vavrek et al. Query-by-Example Retrieval via Fast Sequential Dynamic Time Warping Algorithm. In TSP 2014, Berlin, DE, pages 469–473. IEEE, July 2014.
[9] S. Young et al. The HTK Book (for HTK Version 3.4). Cambridge University, 2006.
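The normalization and max-score merging described above can be sketched as follows. The candidate tuple format and the interval-overlap test are assumptions made for illustration, not the paper's exact bookkeeping.

```python
import numpy as np

def znorm(scores):
    """Z-normalization of a query's candidate scores."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / s.std()

def max_score_merge(subsystem_hits):
    """Merge per-subsystem candidate lists; on overlap keep the highest score.
    Each hit is (utterance_id, start, end, score); overlap is judged per
    utterance by interval intersection."""
    merged = []
    for hits in subsystem_hits:
        for utt, s, e, score in hits:
            for i, (mu, ms, me, msc) in enumerate(merged):
                if utt == mu and s < me and ms < e:      # overlapping detections
                    if score > msc:
                        merged[i] = (utt, s, e, score)
                    break
            else:
                merged.append((utt, s, e, score))
    return sorted(merged, key=lambda h: h[3], reverse=True)
```

Truncating the sorted list to the first 45-150 entries per query would then yield the final candidate set.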