=Paper= {{Paper |id=None |storemode=property |title=The CUHK Spoken Web Search System for MediaEval 2013 |pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_68.pdf |volume=Vol-1043 |dblpUrl=https://dblp.org/rec/conf/mediaeval/WangL13 }} ==The CUHK Spoken Web Search System for MediaEval 2013== https://ceur-ws.org/Vol-1043/mediaeval2013_submission_68.pdf
The CUHK Spoken Web Search System for MediaEval 2013

                                                      Haipeng Wang, Tan Lee
                                                       DSP-STL, Dept. of EE
                                                 The Chinese University of Hong Kong
                                                {hpwang,tanlee}@ee.cuhk.edu.hk



ABSTRACT
This paper describes an audio keyword detection system developed at the Chinese University of Hong Kong (CUHK) for the spoken web search (SWS) task of MediaEval 2013. The system was built only on the provided unlabeled data, and each query term was represented by only one query example (from the basic set for required runs). This system was designed following the posteriorgram-based template matching framework, which used a tokenizer to convert the speech data into posteriorgrams, and then applied dynamic time warping (DTW) for keyword detection. The main features of the system are: 1) a new approach of tokenizer construction based on Gaussian component clustering (GCC) and 2) query expansion based on the technique called pitch synchronous overlap and add (PSOLA). The MTWV and ATWV of our system on the SWS2013 Evaluation set are 0.306 and 0.304.
1.    INTRODUCTION
   The spoken web search (SWS) task of MediaEval 2013 aims at detecting keyword occurrences in a set of spoken documents using audio keyword queries in a language-independent fashion. The spoken documents involve about 20 hours of unlabeled speech data from 9 languages. More details about the task description can be found in [1]. The focus of our work was on a completely unsupervised setting, i.e., only the unlabeled data from the spoken documents was used in the system development. For each query term, only one audio example was used in our system.
   Our system follows the posteriorgram-based template matching framework [2]. New methods have been developed for tokenizer construction and query expansion. In addition, it was found that score normalization brought significant improvement.
2.    SYSTEM DESCRIPTION

2.1    System Overview
   Fig. 1 gives the overall architecture of our system. It involves an offline process and an online process. The offline process (marked by the dashed window in Fig. 1) builds the system from the spoken documents. It is divided into the stages of feature extraction, tokenizer construction, and posteriorgram generation. The offline process results in a speech tokenizer and the posteriorgrams of the spoken documents.
   The online process performs the detection task given an input query. It involves query expansion, query posteriorgram generation, DTW detection and score normalization. The query expansion is based on the PSOLA [3] technique, which modifies the duration of the original query example and generates a number of query examples of different lengths. We refer to the original query example and the generated query examples as the expanded query set. After converting the expanded query set into posteriorgrams, DTW detection is applied to get the raw scores. DTW is performed with a sliding window on the log-inner-product distance matrix of the posteriorgrams of the query set and the spoken documents. Details of the DTW detection in our system can be found in [5]. Lastly, mean and variance normalization is applied to the raw scores to obtain the final detection scores.
   In practice, when the query example was very short, the returned hits would contain many false alarms. A duration threshold of 0.35 second was therefore applied to the input queries. If the duration of a query example (after silence removal) was less than the threshold, the system rejected this query example and did not return any results.

2.2    Feature Extraction
   Our system used 39-dimensional MFCC features. The MFCC features were processed with voice activity detection (VAD) and mean and variance normalization (MVN) at the utterance level. Vocal tract length normalization (VTLN) was then used to alleviate the influence of speaker variation. The warping factors were determined with a maximum-likelihood grid search using a GMM with 256 components. The usefulness of VTLN for this task was experimentally demonstrated in our previous paper [6].

2.3    Tokenizer Construction
   The tokenizer was used to generate posteriorgrams. It was trained on the unlabeled data of the spoken documents. We used a new Gaussian component clustering (GCC) approach to find phoneme-like units, and modeled the corresponding context-dependent states with a 5-layer neural network. The posteriorgrams were composed of the state posterior probabilities produced by the neural network.
   The GCC approach involved 4 steps. First, a GMM with 4096 components was estimated. Second, unsupervised phoneme segmentation was performed on the spoken documents. Third, each speech segment was represented by a Gaussian posterior vector, computed by averaging the frame-level Gaussian posterior probabilities. Stacking the Gaussian posterior vectors, we obtained a Gaussian-by-segment data matrix, denoted by X. Finally, we computed the similarity matrix W of the Gaussian components as W = XX^T, and applied spectral clustering on the similarity matrix to find 150 clusters of Gaussian components. Details of the GCC approach can be found in [4]. Each cluster of Gaussian components was viewed as the acoustic model of a discovered unit.
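The clustering step of GCC can be sketched in a few lines of numpy. This is a toy illustration under stated assumptions, not the authors' implementation: the dimensions are shrunk (the paper uses 4096 Gaussian components and 150 clusters), the data matrix is synthetic with two planted groups, and spectral bisection via the Fiedler vector of the normalized Laplacian stands in for the full 150-way spectral clustering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Gaussian-by-segment matrix X: rows = Gaussian components (the paper
# uses 4096), columns = segment-level Gaussian posterior vectors.  Two
# groups of components are planted so the clustering has structure to find.
n_gauss, n_seg = 20, 100
X = np.abs(rng.normal(0.1, 0.05, size=(n_gauss, n_seg)))
X[:10, :50] += 1.0   # first 10 components co-activate on the first 50 segments
X[10:, 50:] += 1.0   # last 10 components co-activate on the remaining segments

# Step 4 of GCC: similarity between Gaussian components, W = X X^T
W = X @ X.T

# Spectral bisection with the symmetric normalized Laplacian
# L = I - D^{-1/2} W D^{-1/2}; the sign pattern of the second-smallest
# eigenvector (the Fiedler vector) splits the components into two clusters.
d = W.sum(axis=1)
L = np.eye(n_gauss) - W / np.sqrt(np.outer(d, d))
_, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
labels = (eigvecs[:, 1] > 0).astype(int)

print(labels)   # the two planted groups of components end up in different clusters
```

With the real 4096-component similarity matrix, a k-way spectral clustering (embedding on the first k eigenvectors followed by k-means) would replace the two-way sign split used here.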
Figure 1: System Framework

   These acoustic models were refined by an iterative process [6], and updated to context-dependent models with 1198 states. These states were then modeled by a deep neural network, which had 3 hidden layers with 1000 units per layer. The input layer corresponds to a context window of 9 successive frames. The outputs of the neural network were the state posterior probabilities, which were used to construct the posteriorgrams.

2.4    Query Expansion
   Query expansion aimed at generating examples of variable length, so as to cover larger duration variation of the query term. The PSOLA technique was implemented for this purpose. PSOLA is able to perform time-scale modification while preserving the spectral characteristics as much as possible. The implementation involved three steps. First, pitch epochs were detected by an autocorrelation method. Second, the periodic waveform cycles identified by the pitch marks were duplicated or eliminated according to the time-scaling factors. Finally, the overlap-and-add algorithm was used to synthesize the new speech example. In the system, two time-scaling factors were used: 0.7 and 1.3. For a query example with duration L, we had one generated example with duration 0.7×L and another with duration 1.3×L. The expanded query set therefore had three examples for each term. Given a query term and an utterance in the spoken documents, the detection score was the maximum value among the scores provided by the three examples.
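The duplicate-or-eliminate logic of the second and third steps can be sketched as follows. This is a much-simplified stand-in for PSOLA: the pitch marks are assumed given and exactly periodic, and whole cycles are concatenated without the windowed overlap-add, so it only illustrates how duplicating or dropping pitch cycles realizes the 0.7 and 1.3 time-scaling factors.

```python
import numpy as np

def stretch_by_cycles(signal, pitch_marks, factor):
    """Duplicate or drop whole pitch cycles so that the output duration is
    roughly factor * input duration.  Real PSOLA overlap-adds windowed
    cycles around each pitch epoch; plain concatenation keeps this short."""
    # Cut the signal into cycles delimited by consecutive pitch marks.
    cycles = [signal[a:b] for a, b in zip(pitch_marks[:-1], pitch_marks[1:])]
    out, consumed, produced = [], 0, 0
    for cyc in cycles:
        consumed += len(cyc)
        # Emit this cycle 0, 1 or more times to keep pace with the target
        # output length factor * consumed.
        while produced < factor * consumed:
            out.append(cyc)
            produced += len(cyc)
    return np.concatenate(out)

# Toy "voiced" query example: 100 pitch periods of 80 samples each.
period, n_periods = 80, 100
t = np.arange(period * n_periods)
sig = np.sin(2 * np.pi * t / period)
marks = np.arange(0, len(sig) + 1, period)   # ideal pitch epochs

short = stretch_by_cycles(sig, marks, 0.7)
long_ = stretch_by_cycles(sig, marks, 1.3)
print(len(short) / len(sig), len(long_) / len(sig))   # close to 0.7 and 1.3
```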
2.5    Score Normalization
   Let d_{q,t} denote the DTW alignment distance of the qth query on the tth hit region. The corresponding raw detection score was computed as

                      s_{q,t} = exp(−d_{q,t}/β),                     (1)

where the scaling factor β was set to 5. To calibrate the scores of different query terms, a simple 0/1 normalization was used. The normalization was performed as

                      ŝ_{q,t} = (s_{q,t} − µ_q)/δ_q,                 (2)

where µ_q and δ_q² are the mean and variance of the top 400 raw scores for the qth query.
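The two formulas above can be sketched directly in numpy. A minimal illustration, assuming a per-query score list far shorter than the real top-400 pool; the function and variable names are ours, not from the system code.

```python
import numpy as np

BETA = 5.0     # scaling factor β in Eq. (1)
TOP_N = 400    # number of top raw scores used for µ_q and δ_q in Eq. (2)

def normalize_scores(dtw_distances, top_n=TOP_N):
    """Map the DTW alignment distances of one query to calibrated scores."""
    raw = np.exp(-np.asarray(dtw_distances) / BETA)   # Eq. (1): s = exp(-d/β)
    top = np.sort(raw)[::-1][:top_n]                  # top raw scores of this query
    mu, sigma = top.mean(), top.std()                 # µ_q and δ_q
    return (raw - mu) / sigma                         # Eq. (2): 0/1 normalization

# Toy usage: smaller DTW distances yield larger normalized detection scores.
d = np.array([2.0, 3.5, 8.0, 12.0, 20.0])
scores = normalize_scores(d)
print(scores)
```

Because Eq. (2) is an order-preserving affine map per query, it changes no within-query ranking; its effect is to put the scores of different queries on a comparable scale so that one global decision threshold works.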
3.    PERFORMANCE AND ANALYSIS
   Table 1 lists the performances of our systems with different configurations. System No. 3 is our submitted system for this task. All three systems belong to the required run condition defined in [1]. From Table 1, we observe severe performance degradation (≥ 5%) from the Dev query set to the Eval query set. This may be due to the mismatch between the Dev set and the Eval set. Another observation is that the use of query expansion indeed brings improvements (≥ 2%) on both the Dev set and the Eval set. We find this improvement quite encouraging; more experiments and analysis will be carried out in future work to confirm the usefulness of query expansion. The final observation is that the use of score normalization brings two considerable benefits. First, it brings about 7.7% MTWV gain on the Dev set and 7.0% on the Eval set. This is different from our observation in previous work [5]; we suspect it is related to the nonlinear transformation in (1) and the large size of the spoken documents. Second, score normalization seemed to make the decision threshold quite stable, so that the gap between MTWV and ATWV on the Eval set becomes very small.

Table 1: System configurations and performances. The basic system is without query expansion and score normalization.

         System No.               1      2      3
         Basic System             √      √      √
         Query Expansion                 √      √
         Score Normalization                    √
         Dev Query Set (MTWV)   0.263  0.290  0.367
         Eval Query Set (MTWV)  0.216  0.236  0.306
         Dev Query Set (ATWV)     –      –    0.367
         Eval Query Set (ATWV)    –      –    0.304

4.    HARDWARE, MEMORY, AND CPU TIME
   All the experiments were performed on a computer with an Intel i7-3770K CPU (3.50 GHz, 4 cores), 32 GB RAM and a 1 TB hard drive. In the online process, all the posteriorgrams of the spoken documents were stored in memory. This accelerated the online detection, but incurred a very high memory cost (>10 GB). The computational cost of the online process was mainly caused by DTW detection. The search speed factor of system No. 3 was about 0.018.

5.    REFERENCES
[1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. Rodriguez-Fuentes. The spoken web search task. In MediaEval 2013 Workshop, 2013.
[2] T. Hazen, W. Shen, and C. White. Query-by-example spoken term detection using phonetic posteriorgram templates. In ASRU, 2009.
[3] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 1990.
[4] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li. Unsupervised mining of acoustic subword units with segment-level Gaussian posteriorgrams. In Interspeech, 2013.
[5] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li. Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection. In ICASSP, 2013.
[6] H. Wang, C.-C. Leung, T. Lee, B. Ma, and H. Li. An acoustic segment modeling approach to query-by-example spoken term detection. In ICASSP, 2012.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.
This research is partially supported by the CUHK MoE-Microsoft Key Laboratory and CUHK-PKU Joint Centre for Intelligence Engineering.