                  TUKE System for MediaEval 2014 QUESST

                Jozef Vavrek, Peter Viszlay, Martin Lojka, Matúš Pleva, and Jozef Juhár
            Laboratory of Speech Technologies in Telecommunications @ Technical University of Košice
                                 Park Komenského 13, 041 20 Košice, Slovakia
            {Jozef.Vavrek, Peter.Viszlay, Martin.Lojka, Matus.Pleva, Jozef.Juhar}@tuke.sk



ABSTRACT
Two approaches to a QbE (Query-by-Example) retrieval system,
proposed by the Technical University of Košice (TUKE) for the
Query by Example Search on Speech Task (QUESST), are presented
in this paper. Our main interest was focused on building a QbE
system able to retrieve all given queries both with and without
the use of external speech resources. We therefore developed a
posteriorgram-based keyword matching system that utilizes a novel
weighted fast sequential variant of DTW (WFS-DTW) to detect
occurrences of each query within a particular utterance file,
using two GMM-based acoustic unit modeling approaches. The first,
referred to as the low-resource approach, employs
language-dependent phonetic decoders to convert queries and
utterances into posteriorgrams. The second, defined as the
zero-resource approach, implements a combination of unsupervised
segmentation and clustering techniques using only the provided
utterance files.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain


1.  MOTIVATION
The motivation for developing our system was to assess the
ability of the proposed WFS-DTW algorithm to detect various
spoken query terms by implementing low-resource and
zero-resource posteriorgram-based matching approaches.

2.  WFS-DTW SEARCHING ALGORITHM
The searching algorithm for the QUESST task follows the one used
in our paper [8]. The proposed solution is a modification of the
segmental DTW algorithm we applied in the Spoken Web Search task
last year [7]. There are three main contributions of this
algorithm: 1) a one-step-forward moving strategy, in which each
DTW search is carried out sequentially, block by block, with the
block size equal to the length of the query; 2) a linear
time-aligned accumulated distance for speeding up sequential DTW
without considerable loss in retrieval performance; 3)
optimization of the global minimum over the set of alignment
paths by implementing a weighted cumulative distance (WCD)
parameter.
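
The following minimal Python sketch illustrates only the
block-wise sequential matching idea; it is not the authors'
WFS-DTW implementation. The -log dot-product frame distance, the
doubled block width, and all function names are our assumptions,
and the WCD weighting and the linear time-aligned speed-up are
omitted.

    import numpy as np

    def dtw_cost(Q, U):
        """Plain DTW between a query posteriorgram Q (n x d) and an
        utterance block U (m x d); the local distance is the -log
        inner product of the posterior vectors."""
        n, m = len(Q), len(U)
        local = -np.log(np.clip(Q @ U.T, 1e-10, None))
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i, j] = local[i - 1, j - 1] + min(
                    D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m] / (n + m)  # path-length normalised

    def sequential_search(Q, U):
        """Move through the utterance block by block (block size =
        query length) and keep the minimum normalised DTW cost."""
        n = len(Q)
        best = (None, np.inf)
        for start in range(0, max(len(U) - n, 1), n):
            cost = dtw_cost(Q, U[start:start + 2 * n])
            if cost < best[1]:
                best = (start, cost)
        return best  # (block start frame, score)

In the actual system the per-block alignment costs would be
weighted into the WCD parameter and thresholded per query.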

3.  LOW-RESOURCE APPROACH
The low-resource approach includes four language-dependent
subsystems, each represented by a GMM-based acoustic model. The
acoustic models were trained previously using four databases:
2× SpeechDat (Slovak, 66 h, and Czech, 89 h) [6], Slovak ParDat1
(40 h) [3], and English TIMIT (10 h) [4]. The well-trained models
were intended to generate time-aligned and labelled segments for
each utterance through Viterbi decoding. The phonetic decoder
employed a phone-level vocabulary and a phone network. We found
that the phoneme insertion log probability p in Viterbi
segmentation has a significant impact on the time-alignment.
Since the best results were obtained with p = 0, we used this
value throughout the whole setup. The time-alignments were used
to train a new GMM-based acoustic model on the development data.
This means that each language-dependent model was replaced by its
refined version, which was finally used to generate the
posteriorgrams for the utterances and queries.

Note that we used 39-dimensional MFCC (Mel-Frequency Cepstral
Coefficients) features for Viterbi segmentation and GMM training.
In the low-resource approach we did not need any voice activity
detector (VAD), because the silent parts of the audio stream were
identified during Viterbi segmentation.
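
As an illustration of the posteriorgram generation step, the
sketch below fits one diagonal-covariance GMM per phone on its
Viterbi-aligned MFCC frames and turns frame-level GMM
log-likelihoods into phone posteriors. The scikit-learn API, the
softmax over phone likelihoods, and the implicit uniform phone
prior are our assumptions, not the exact TUKE implementation.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_phone_gmms(frames_by_phone, n_components=128):
        """Fit one GMM per phone on its time-aligned MFCC frames."""
        return {ph: GaussianMixture(n_components,
                                    covariance_type="diag",
                                    max_iter=50).fit(X)
                for ph, X in frames_by_phone.items()}

    def posteriorgram(gmms, mfcc):
        """Per-frame posterior over phones: softmax of the per-phone
        GMM log-likelihoods (uniform phone prior assumed)."""
        phones = sorted(gmms)
        loglik = np.stack([gmms[ph].score_samples(mfcc)
                           for ph in phones], axis=1)
        loglik -= loglik.max(axis=1, keepdims=True)  # stability
        post = np.exp(loglik)
        return post / post.sum(axis=1, keepdims=True), phones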

4.  ZERO-RESOURCE APPROACH
In keeping with the zero-resource approach, we did not assume any
prior knowledge of the acoustic units or a pronunciation lexicon.
In order to train the acoustic models, it was first necessary to
identify the acoustic speech units in the audio data
automatically. In this work, we utilized four different
zero-resource approaches to address this problem.

Type 1: This approach uses a PCA-based VAD to discriminate the
voice-active segments from the silent ones [8]. The initial
feature selection, based on simple PCA (principal component
analysis) [5], is carried out after extracting the first 13
MFCCs. Only those speech-active feature vectors are selected
whose variance exceeds 90% at the first principal component.
Then, K-means clustering with K = 75 clusters and the correlation
distance metric is computed on the reduced data. The clustering
starts by selecting K points uniformly. Finally, speech
segmentation is performed by computing the squared Euclidean
distance between the feature vectors and the K mean vectors,
where the label of the mean vector with minimum distance is
assigned in collaboration with the VAD.

Type 2: The Type 2 approach derives directly from Type 1 and is
further extended by Viterbi segmentation and new GMM training.
These two steps are identical to those already described in
Section 3. The main difference is that the acoustic model from
Type 1 is used to generate the time-alignments through Viterbi
segmentation.
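
Before moving on to Types 3 and 4, here is a minimal sketch of
the Type 1 projection and clustering steps. The SVD-based PCA,
the Lloyd iteration with SciPy's correlation distance, and the
uniform centroid initialization follow the description above,
while the variance-threshold selection rule and the coupling with
the VAD labels are simplified assumptions.

    import numpy as np
    from scipy.spatial.distance import cdist

    def first_principal_component(X):
        """Project mean-centred MFCC frames onto the first
        principal axis (PCA via SVD)."""
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[0]

    def kmeans_correlation(X, k=75, n_iter=20, seed=0):
        """Lloyd's K-means with the correlation distance; initial
        centroids are picked uniformly from the data."""
        rng = np.random.default_rng(seed)
        C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(n_iter):
            labels = cdist(X, C, metric="correlation").argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    C[j] = X[labels == j].mean(axis=0)
        return labels, C

    # Illustrative use (the exact selection threshold is assumed):
    # pc = first_principal_component(mfcc)
    # selected = mfcc[np.abs(pc) > np.quantile(np.abs(pc), 0.1)]
    # labels, centroids = kmeans_correlation(selected)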

Type 3: The third approach is based on the well-known flat start
training procedure [9]. It does not need any segmentation or
clustering, because the utterances are uniformly segmented using
Baum-Welch embedded re-estimation. Therefore, an alternative GMM
initialization strategy is applied, in which all phone models are
initialized identically with state means and variances equal to
the global mean and variance. The phone models then proceed
straight to embedded training, where they are simultaneously
updated and expanded to higher-order GMs (Gaussian Mixtures) [9].
The key element in flat start training is the phone-level
transcription, obtained from phone-based recognition using the
acoustic model acquired from the Type 1 zero-resource approach.

Type 4: The Type 4 approach implements GMM-based segmentation and
ergodic HMM (EHMM) training. First, unsupervised GMM training is
performed on the whole database, where each acoustic unit is
represented by one GM. Each GM is then associated with one of the
64 states of the EHMM, and new GMs for each acoustic unit are
trained iteratively.

Note that we used conventional 39-dimensional MFCCs for all
zero-resource processing (except Type 1). We did not use any VAD
here (except in Type 1), because the labels were available from
the Viterbi segmentation.
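
The flat start initialization used in Type 3 amounts to giving
every phone model the same starting parameters. The sketch below
shows this global mean/variance tying; the dictionary layout and
the three-state topology are illustrative assumptions only, and
the subsequent Baum-Welch embedded re-estimation is left to the
training toolkit (in HTK, HCompV performs this step).

    import numpy as np

    def flat_start_init(frames, phones, n_states=3):
        """Flat start: every phone model is initialised identically
        with the global mean and (diagonal) variance of all
        training frames."""
        mu = frames.mean(axis=0)
        var = frames.var(axis=0)
        return {ph: {"means": np.tile(mu, (n_states, 1)),
                     "vars": np.tile(var, (n_states, 1))}
                for ph in phones}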
                                                                     Rodriguez-Fuentes. Query by Example Search on
5.   POST-PROCESSING: SCORE NORMAL-                                  Speech at Mediaeval 2014. In Working Notes Proc. of
                                                                     the MediaEval 2014 Workshop, Barcelona, Spain, 16-17
     IZATION AND FUSION                                              October 2014.
  Score parameter was represented by WCD, normalized by          [3] S. Darjaa et al. Rule-based Triphone Mapping for
scaling factor 0/1, similarly as we used in [8]. This step           Acoustic Modeling in Automatic Speech Recognition.
helped us to unified score ranges for the first 500 detection        In Proc. of the 14th Intl. Conf. on Text, Speech and
candidates per each query. Then the score fusion for four            Dialogue, TSD’11, pages 268–275, 2011.
different subsystems was carried out, employing a simple         [4] J. S. Garofolo et al. TIMIT Acoustic-Phonetic
max-score merging strategy, similarly as Anguera et al. did          Continuous Speech Corpus, 1993. Linguistic Data
in [1]. Detection candidates from each individual subsystem          Consortium, Philadelphia.
were merged together, keeping the one with the highest score     [5] J. Juhár and P. Viszlay. Linear Feature
in case of overlap. Merged candidates for each query were            Transformations in Slovak Phoneme-Based Continuous
subsequently normalized by z-normalization and aligned ac-           Speech Recognition. In Modern Speech Recognition
cording to the score value. The final set was obtained by            Approaches with Case Studies, pages 131–154. InTech
keeping first 45-150 candidates, according to the length of          Open Access, 2012.
query (the shorter query the lower number of candidates).        [6] H. van den Heuvel et al. SpeechDat-E: five eastern
                                                                     european speech databases for voice-operated
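
A hedged sketch of this post-processing chain follows: per-query
z-normalization and max-score merging of overlapping detections
across subsystems. The candidate tuple layout and the overlap
test are our assumptions and follow the description above rather
than the exact TUKE implementation.

    import numpy as np

    def z_normalize(scores):
        """Z-normalize the detection scores of one query."""
        s = np.asarray(scores, dtype=float)
        return (s - s.mean()) / (s.std() + 1e-10)

    def max_score_merge(candidates):
        """Merge detections from several subsystems, keeping only
        the highest-scoring candidate among overlapping ones.
        candidates: list of (utterance, start, end, score)."""
        merged = []
        for cand in sorted(candidates, key=lambda c: -c[3]):
            utt, start, end, _ = cand
            overlaps = any(u == utt and start < e and end > s
                           for u, s, e, _ in merged)
            if not overlaps:
                merged.append(cand)
        return merged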

6.  RESULTS AND CONCLUSION
We submitted four runs obtained from the low-resource (primary)
and zero-resource (general) systems for the QUESST 2014 task [2].
The primary systems employ language-dependent acoustic modeling
using Viterbi segmentation with 128 GMs (ParDat1, TIMIT) and
256 GMs (SpeechDat SK, CZ). The general systems use 32 GMs for
Types 1-3 and 64 GMs for Type 4. The best-one-wins strategy was
used for the first (on-time) runs; thus, only the best-performing
subsystem was submitted, namely p-low using SpeechDat SK and
g-zero using the Type 2 subsystem. The late submissions include a
max-score merging fusion of all four subsystems for both the
primary and general approaches.

Table 1: Evaluation of primary low-resource (p-low) and general
zero-resource (g-zero) systems (* indicates late submission)

                    eval                         dev
 system    Cnxe          TWV           Cnxe          TWV
           (act/min)     (act/max)     (act/min)     (act/max)
 p-low     0.959/0.891   0.154/0.154   0.960/0.892   0.161/0.162
 g-zero    0.973/0.934   0.075/0.077   0.974/0.934   0.091/0.091
 p-low*    0.947/0.853   0.168/0.169   0.948/0.854   0.191/0.191
 g-zero*   0.970/0.921   0.102/0.103   0.971/0.922   0.106/0.107

Table 2: Processing resource measures

 system         ISF    SSF      PMUI   PMUS   PL
 p-low (dev)    0.61   0.0034   0.05   2.46   0.0106
 g-zero (dev)   1.5    0.0042   1.4    3.92   0.225

The results in Tab. 1 show that there are still large differences
in performance between the p-low and g-zero approaches, even when
the score fusion technique is applied. Moreover, there is also a
considerable gap between the act and min Cnxe, despite the fact
that the act and max TWV are perfectly calibrated. Therefore,
improved calibration/fusion models based on affine transformation
and linear regression will be investigated in the future.

The indexing was done using 2× IBM x3650 servers (Intel E5530 @
2.4 GHz, 8 cores), 28 GB RAM, under Debian OS. The searching
algorithm was run on a 52-node IBM dx360 M3 cluster (Intel E5645
@ 2.4 GHz, 624 cores), 48 GB RAM per node, running Scientific
Linux 6 and Torque (see Tab. 2).

7.  ACKNOWLEDGMENTS
This publication is the result of the project implementation:
University Science Park TECHNICOM for Innovation Applications
Supported by Knowledge Technology, ITMS: 26220220182, supported
by the Research & Development Operational Programme funded by the
ERDF (100%).

8.  REFERENCES
[1] X. Anguera et al. The Telefonica Research Spoken Web Search
    System for MediaEval 2013. In Working Notes Proc. of the
    MediaEval 2013 Workshop, 2013.
[2] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J.
    Rodriguez-Fuentes. Query by Example Search on Speech at
    MediaEval 2014. In Working Notes Proc. of the MediaEval 2014
    Workshop, Barcelona, Spain, 16-17 October 2014.
[3] S. Darjaa et al. Rule-based Triphone Mapping for Acoustic
    Modeling in Automatic Speech Recognition. In Proc. of the
    14th Intl. Conf. on Text, Speech and Dialogue, TSD'11, pages
    268–275, 2011.
[4] J. S. Garofolo et al. TIMIT Acoustic-Phonetic Continuous
    Speech Corpus, 1993. Linguistic Data Consortium,
    Philadelphia.
[5] J. Juhár and P. Viszlay. Linear Feature Transformations in
    Slovak Phoneme-Based Continuous Speech Recognition. In Modern
    Speech Recognition Approaches with Case Studies, pages
    131–154. InTech Open Access, 2012.
[6] H. van den Heuvel et al. SpeechDat-E: Five Eastern European
    Speech Databases for Voice-Operated Teleservices Completed.
    In Proc. of INTERSPEECH, pages 2059–2062, 2001.
[7] J. Vavrek et al. TUKE at MediaEval 2013 Spoken Web Search
    Task. In Working Notes Proc. of the MediaEval 2013 Workshop,
    2013.
[8] J. Vavrek et al. Query-by-Example Retrieval via Fast
    Sequential Dynamic Time Warping Algorithm. In Proc. of TSP
    2014, Berlin, Germany, pages 469–473. IEEE, July 2014.
[9] S. Young et al. The HTK Book (for HTK Version 3.4).
    Cambridge University, 2006.