          CUHK System for QUESST Task of MediaEval 2014

                                                     Haipeng Wang, Tan Lee
                                                      DSP-STL, Dept. of EE
                                                The Chinese University of Hong Kong
                                              {hpwang,tanlee}@ee.cuhk.edu.hk



ABSTRACT
This paper describes a spoken keyword search system developed at the Chinese University of Hong Kong (CUHK) for the query by example search on speech (QUESST) task of MediaEval 2014. The system utilizes posterior features and dynamic time warping (DTW) for keyword matching. Multiple types of posterior features are generated with different tokenizers, and then fused by a linear combination of the DTW distance matrices. The main contribution of this year's system is a multiview segment clustering (MSC) approach for unsupervised ASM tokenizer construction. The Cnxe and ATWV of our submitted results on the evaluation set are 0.682 and 0.412, respectively.
1. INTRODUCTION
   The query by example search on speech (QUESST) task aims at detecting keyword occurrences in an unlabeled speech collection using spoken queries, in a language-independent fashion. In this year's QUESST dataset, the speech collection contains about 23 hours of speech data from 6 languages, and the query set includes 560 development queries and 555 evaluation queries. The average duration of the queries is about 0.9 seconds after voice activity detection (VAD). More details about the task can be found in [2].
   Our system was designed only for type 1 query matching. It followed the posteriorgram-based template matching framework [3], in which speech tokenizers were used to generate posteriorgrams, and DTW was applied for keyword detection. The tokenizers were either built from the searching speech collection given in the task, or developed from resource-rich languages. In order to exploit the complementary information of multiple tokenizers, the DTW matrix combination method [7] was used. Raw DTW detection scores were then normalized to zero mean and unit variance. On the evaluation set, the Cnxe and ATWV of our submission are 0.682 and 0.412. When only type 1 query matching is considered, the Cnxe and ATWV are 0.526 and 0.611.
2. SYSTEM DESCRIPTION

2.1 System Overview
   In this year's evaluation, our system employs a framework similar to our previous system for the spoken web search task in 2012 [5]. The system involves seven tokenizers, including a GMM tokenizer, five phoneme recognizers, and an ASM tokenizer [8]. Using these tokenizers, the query examples and test utterances are converted into frame-level posteriorgrams. Different tokenizers may use different algorithms to generate posteriorgrams. Let $Q_i$ denote the query posteriorgram generated by the $i$th tokenizer, and let $T_i$ denote the corresponding test posteriorgram. The distance matrix $D_i$ was computed as the inner product [3],

\[
D_i = -\log(Q_i^T \times T_i), \qquad i = 1, 2, \dots, 7. \tag{1}
\]
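As a concrete illustration, Eq. (1) amounts to one matrix product per tokenizer. The following minimal NumPy sketch assumes posteriorgrams are stored with one column per frame and adds a small flooring constant to avoid log(0); both choices are our assumptions, not details from the paper.

```python
import numpy as np

def distance_matrix(Q, T, eps=1e-10):
    """Eq. (1): D_i = -log(Q_i^T x T_i).

    Q: query posteriorgram, shape (n_states, n_query_frames).
    T: test posteriorgram, shape (n_states, n_test_frames).
    Returns an (n_query_frames, n_test_frames) distance matrix.
    The eps floor (our assumption) guards against log(0).
    """
    return -np.log(np.maximum(Q.T @ T, eps))
```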
   To exploit the complementary information from different tokenizers, the distance matrices were combined linearly to give a new distance matrix $D$,

\[
D = \sum_{i=1}^{7} w_i D_i, \tag{2}
\]

where $w_i$ denotes the weighting coefficient for $D_i$ and was simply set to $1/7$.
   Subsequently, DTW detection was applied to the combined distance matrix $D$ to locate the top matching regions. DTW detection was performed with a sliding window, using a window shift of 5 frames. The adjustment window constraint was imposed on the DTW alignment path.
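The sliding-window DTW detection can be sketched as below. This is a simplified reading under our own assumptions: the window width is tied to the query length, a symmetric step pattern (1,1), (1,0), (0,1) is used, the alignment cost is normalized by the summed path dimensions, and the adjustment window constraint on the alignment path is omitted for brevity.

```python
import numpy as np

def dtw_cost(D):
    """DTW alignment cost over a local distance matrix D
    (query frames x window frames), normalized by (n + m)."""
    n, m = D.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = D[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return acc[n, m] / (n + m)

def sliding_window_search(D, shift=5):
    """Slide a query-length window along the test axis of the
    combined matrix D (Eq. (2)), with a shift of 5 frames, and
    return (start_frame, normalized_cost) for each hit region."""
    n_q, n_t = D.shape
    hits = []
    for start in range(0, max(n_t - n_q, 0) + 1, shift):
        hits.append((start, dtw_cost(D[:, start:start + n_q])))
    return hits
```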

   Let $d_{q,t}$ denote the normalized DTW alignment distance of the $q$th query on the $t$th hit region. The raw detection score was computed by

\[
s_{q,t} = \exp(-d_{q,t}/\beta), \tag{3}
\]

where the scaling factor $\beta$ was set to 0.6. To calibrate the score distributions of different queries, a 0/1 (zero-mean, unit-variance) normalization was used,

\[
\hat{s}_{q,t} = (s_{q,t} - \mu_q)/\delta_q, \tag{4}
\]

where $\hat{s}_{q,t}$ is the calibrated score, and $\mu_q$ and $\delta_q^2$ are the mean and variance of the raw scores of the $q$th query.
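In code, Eqs. (3) and (4) reduce to an exponential rescaling followed by a per-query z-normalization; a brief sketch, with the array layout as our assumption:

```python
import numpy as np

def calibrate_scores(d, beta=0.6):
    """d: 1-D array of normalized DTW distances d_{q,t} for one query.
    Returns the calibrated scores of Eq. (4)."""
    s = np.exp(-d / beta)            # Eq. (3): raw detection scores
    return (s - s.mean()) / s.std()  # Eq. (4): zero mean, unit variance
```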
2.2 GMM Tokenizer
   The GMM tokenizer was trained on the given searching speech collection. It contained 1024 Gaussian components. The input to the GMM tokenizer was a 39-dimensional MFCC feature vector. The MFCC features were processed with VAD and utterance-based mean and variance normalization (MVN). Vocal tract length normalization (VTLN) was then applied to the MFCC features to alleviate the influence of speaker variation.
   The warping factors of VTLN were estimated iteratively, as proposed in [9]. The iteration started with training a GMM on the unwarped MFCC features. The warping factors were then estimated by a maximum-likelihood grid search using this GMM. A new GMM was trained on the warped features, and the warping factors were re-estimated. This process was iterated four times in our implementation. The usefulness of VTLN for this task was experimentally demonstrated in our previous paper [8].
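The iterative estimation can be sketched as follows. Here the GMM is scikit-learn's GaussianMixture, warp_mfcc(utt, alpha) is a hypothetical frequency-warping feature extractor, and the grid of candidate warp factors is our assumption; the paper only specifies the maximum-likelihood grid search and the four iterations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

GRID = np.arange(0.88, 1.13, 0.02)  # candidate warp factors (assumed grid)

def estimate_warp_factors(utts, warp_mfcc, n_iter=4):
    """utts: list of utterances; warp_mfcc(utt, alpha) -> (frames, 39)
    warped MFCC matrix (hypothetical helper). Returns one factor per utterance."""
    alphas = np.ones(len(utts))  # iteration starts from unwarped features
    for _ in range(n_iter):
        # Train a GMM on features warped with the current factors.
        feats = np.vstack([warp_mfcc(u, a) for u, a in zip(utts, alphas)])
        gmm = GaussianMixture(n_components=1024,
                              covariance_type='diag').fit(feats)
        # Per-utterance maximum-likelihood grid search.
        for k, u in enumerate(utts):
            lls = [gmm.score(warp_mfcc(u, a)) for a in GRID]
            alphas[k] = GRID[int(np.argmax(lls))]
    return alphas
```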
2.3 Phoneme Recognizers
   Our system involved five phoneme recognizers: Czech, Hungarian, Russian, English and Mandarin. All of these phoneme recognizers used the split temporal context network structure [4]. The Czech, Hungarian and Russian phoneme recognizers were developed at Brno University of Technology (BUT) and released in [1]. The English phoneme recognizer was trained on about 15 hours of speech data from the Fisher corpus and the Switchboard Cellular corpus. The Mandarin phoneme recognizer was trained on about 15 hours of speech data from the CallHome corpus and the CallFriend corpus. These phoneme recognizers were used to generate mono-phone state-level posteriorgrams without any language model constraint.
2.4 ASM Tokenizer
   Acoustic segment modeling (ASM) is a way to build an HMM-based speech tokenizer from unlabeled speech data. It consists of three steps: initial segmentation, segment labeling, and iterative training and decoding. Initial segmentation searches for acoustic discontinuities and partitions the speech utterances into short-time segments. In our implementation, we simply used the one-best recognition results of the Hungarian phoneme recognizer to obtain the hypothesised segment boundaries.
   Segment labeling assigns a label to each short-time speech segment. We used a multiview segment clustering (MSC) approach for this purpose. The MSC approach takes in multiple segment-level posterior features, computes a similarity matrix and a Laplacian matrix over the speech segments for each type of posterior feature, and linearly combines the Laplacian matrices. Eigen-decomposition of the combined Laplacian matrix yields the spectral embedding representations, to which k-means is applied to find 100 clusters (see the sketch below). Details of the MSC approach are described in [6].
   The cluster labels were used as the initialization for iterative training and decoding, in which HMM training and decoding were performed iteratively until convergence.
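A minimal sketch of the MSC-style labeling follows. The cosine similarity measure, the symmetric normalized Laplacian, equal view weights, and the row-normalized spectral embedding are our assumptions; [6] may make different choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def msc_labels(views, n_clusters=100):
    """views: list of (n_segments, dim) segment-level posterior feature
    matrices, one per tokenizer view. Returns a cluster label per segment."""
    n = views[0].shape[0]
    L = np.zeros((n, n))
    for X in views:
        # Cosine similarity matrix for this view (assumed measure).
        Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-10)
        W = np.clip(Xn @ Xn.T, 0.0, None)
        # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
        dinv = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-10))
        L += np.eye(n) - (dinv[:, None] * W) * dinv[None, :]
    L /= len(views)  # equal-weight linear combination of Laplacians
    # Spectral embedding from the eigenvectors of the smallest eigenvalues.
    _, vecs = np.linalg.eigh(L)
    emb = vecs[:, :n_clusters]
    emb /= np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-10)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)
```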
3. RESULTS
   Table 1 shows the results obtained by our system on the evaluation queries. Based on our previous experience with the TWV metrics, we submitted only the small portion of scores that were higher than a threshold; this gives the results of System 1. If the scores of all trials are considered instead, we obtain the results of System 2, which shows clear reductions in the Cnxe values. Similar observations can be made when only type 1 query matching is considered; the corresponding results are shown in Table 2. The difference between the Cnxe and TWV metrics needs to be carefully examined in the future.

Table 1: System performances on all the queries. System 1 corresponds to the submitted results.

   System No.   actCnxe   minCnxe   ATWV    MTWV
       1         0.682     0.659    0.412   0.413
       2         0.638     0.585    0.412   0.413

Table 2: System performances on the type 1 queries. System 1 corresponds to the submitted results.

   System No.   actCnxe   minCnxe   ATWV    MTWV
       1         0.526     0.486    0.611   0.613
       2         0.508     0.420    0.611   0.613

   To run the experiments, we used a computer with an Intel i7-3770K CPU (3.50 GHz, 4 cores), 32 GB of RAM and a 1 TB hard drive. In the online searching process, all the posteriorgrams were stored in memory, which caused a very high memory cost (>10 GB). The computation cost of the searching process was dominated by DTW detection. The searching speed factor of our system was about 0.021. The slow searching speed is one main drawback of our system and needs to be improved.

4. CONCLUSION
   We have given an overview of the CUHK system submitted to the MediaEval 2014 QUESST task, along with the evaluation results. Our system involves seven tokenizers and uses DTW matrix combination for fusion. Only type 1 query matching was considered in the system development. The main highlight of our system lies in the MSC approach to ASM tokenizer construction. In general we consider the performance for type 1 query matching acceptable, but the slow searching speed and high memory cost need to be substantially improved.

5. REFERENCES
[1] http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context.
[2] X. Anguera, L. Rodriguez-Fuentes, A. Buzo, I. Szoke, and F. Metze. Query by example search on speech at MediaEval 2014. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[3] T. Hazen, W. Shen, and C. White. Query-by-example spoken term detection using phonetic posteriorgram templates. In ASRU, pages 421-426, 2009.
[4] P. Schwarz. Phoneme recognition based on long temporal context. PhD thesis, Brno University of Technology, 2009.
[5] H. Wang and T. Lee. CUHK system for the spoken web search task at MediaEval 2012. In Working Notes Proceedings of the MediaEval 2012 Workshop, Pisa, Italy, October 4-5, 2012.
[6] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li. Acoustic segment modeling with spectral clustering methods. In submission to IEEE/ACM TASLP.
[7] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li. Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection. In ICASSP, 2013.
[8] H. Wang, C.-C. Leung, T. Lee, B. Ma, and H. Li. An acoustic segment modeling approach to query-by-example spoken term detection. In ICASSP, 2012.
[9] S. Wegmann, D. McAllaster, J. Orloff, and B. Peskin. Speaker normalization on conversational telephone speech. In ICASSP, 1996.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.