<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The CUHK Spoken Web Search System for MediaEval 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haipeng Wang</string-name>
          <email>hpwang@ee.cuhk.edu.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Tan Lee</string-name>
          <email>tanlee@ee.cuhk.edu.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>DSP-STL, Dept. of EE, The Chinese University of Hong Kong</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes an audio keyword detection system developed at the Chinese University of Hong Kong (CUHK) for the spoken web search (SWS) task of MediaEval 2013. The system was built only on the provided unlabeled data, and each query term was represented by only one query example (from the basic set for the required runs). The system was designed following the posteriorgram-based template matching framework, which uses a tokenizer to convert the speech data into posteriorgrams and then applies dynamic time warping (DTW) for keyword detection. The main features of the system are: 1) a new approach to tokenizer construction based on Gaussian component clustering (GCC), and 2) query expansion based on the pitch synchronous overlap and add (PSOLA) technique. The MTWV and ATWV of our system on the SWS2013 Evaluation set are 0.306 and 0.304, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
The spoken web search (SWS) task of MediaEval 2013 aims at
detecting keyword occurrences in a set of spoken documents
using audio keyword queries in a language-independent fashion.
The spoken documents involve about 20 hours of unlabeled speech
data from 9 languages. More details about the task can
be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our work focused on a completely
unsupervised setting, i.e., only the unlabeled data from the spoken
documents was used in the system development. For each query
term, only one audio example was used in our system.
      </p>
      <p>
        Our system follows the posteriorgram-based template matching
framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. New methods were developed for tokenizer
construction and query expansion. In addition, score normalization
was found to bring a significant improvement.
      </p>
    </sec>
    <sec id="sec-2">
      <title>SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>System Overview</title>
<p>Fig. 1 gives the overall architecture of our system. It involves an
offline process and an online process. The offline process (marked by
the dashed window in Fig. 1) is to build the system from the
spoken documents. It is divided into the stages of feature extraction,
tokenizer construction, and posteriorgram generation. The offline
process results in a speech tokenizer and the posteriorgrams of the
spoken documents.</p>
      <p>
        The online process is to perform the detection task given an input
query. It involves query expansion, query posteriorgram
generation, DTW detection and score normalization. The query expansion
is based on the PSOLA [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] technique, which modifies the duration
of the original query example and generates a number of query
examples of different lengths. We refer to the original query examples
and the generated query examples as the expanded query set.
After converting the expanded query set into posteriorgrams, DTW
detection is applied to get the raw scores. DTW is performed with
a sliding window on the log-inner-product distance matrix of the
posteriorgrams of the query set and the spoken documents. Details
of the DTW detection in our system can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
]. Lastly,
mean and variance normalization is applied to the raw scores to
obtain the final detection scores.
      </p>
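      <p>As a concrete illustration of this matching step, the following minimal Python sketch computes the log-inner-product distance matrix and runs a subsequence DTW with free start and end points over the document, which plays the role of the sliding-window search described above. The function names, the epsilon flooring, and the normalization by query length are our own assumptions; the exact algorithm is the one given in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].</p>
      <preformat><![CDATA[
import numpy as np

def log_inner_product_dist(query_post, doc_post, eps=1e-10):
    """Frame-level distance matrix: D[i, j] = -log(q_i . d_j), the
    log-inner-product distance between posterior vectors."""
    return -np.log(query_post @ doc_post.T + eps)

def subsequence_dtw(dist):
    """Align the whole query against any contiguous region of the
    document (the match may start and end at any document frame) and
    return the best average per-frame alignment distance."""
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]                # free starting point
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # vertical
                                         acc[i, j - 1],      # horizontal
                                         acc[i - 1, j - 1])  # diagonal
    return acc[-1].min() / n              # free end point, length-normalized

# Hypothetical usage: Q and D are (frames x states) posteriorgrams.
# d = subsequence_dtw(log_inner_product_dist(Q, D))
]]></preformat>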
      <p>In practice, when the query example was very short, the returned
hits would contain many false alarms. A duration threshold of 0.35
seconds was therefore applied to the input queries. If the duration of a
query example (after silence removal) was less than the threshold, the
system rejected the query example and did not return any results.</p>
    </sec>
    <sec id="sec-4">
      <title>Feature Extraction</title>
      <p>
Our system used 39-dimensional MFCC features. The MFCC
features were processed with voice activity detection (VAD) and
utterance-level mean and variance normalization (MVN). Vocal
tract length normalization (VTLN) was then applied to alleviate the
influence of speaker variation. The warping factors were
determined by a maximum-likelihood grid search using a GMM with
256 components. The usefulness of VTLN for this task was
experimentally demonstrated in our previous work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
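      <p>The following Python sketch shows what such a maximum-likelihood warping-factor search could look like. The helper extract_warped_mfcc and the grid of candidate warping factors are hypothetical; only the 256-component GMM and the utterance-level MVN come from the description above.</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.mixture import GaussianMixture

def utterance_mvn(feats):
    """Utterance-level mean and variance normalization (MVN)."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)

def pick_warp_factor(wav, gmm, extract_warped_mfcc,
                     warps=np.arange(0.88, 1.13, 0.02)):
    """Maximum-likelihood grid search over VTLN warping factors.
    extract_warped_mfcc(wav, warp) is a hypothetical helper that
    computes 39-dim MFCCs with the mel filterbank warped by `warp`;
    the grid of candidate factors is our assumption, not the paper's."""
    best_warp, best_ll = 1.0, -np.inf
    for w in warps:
        feats = utterance_mvn(extract_warped_mfcc(wav, w))
        ll = gmm.score(feats)   # mean per-frame log-likelihood under the GMM
        if ll > best_ll:
            best_warp, best_ll = w, ll
    return best_warp

# The 256-component reference GMM would be trained on pooled features:
# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(feats_all)
]]></preformat>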
    </sec>
    <sec id="sec-5">
      <title>Tokenizer Construction</title>
      <p>The tokenizer was used to generate posteriorgrams. It was trained
from the unlabeled data of the spoken documents. We used a new
Gaussian component clustering (GCC) approach to find
phoneme-like units, and modeled the corresponding context-dependent states
by a 5-layer neural network. The posteriorgrams were composed
of the state posterior probabilities produced by the neural network.</p>
      <p>
The GCC approach involved four steps. First, a GMM with 4096
components was estimated. Second, unsupervised phoneme
segmentation was performed on the spoken documents. Third, each
speech segment was represented by a Gaussian posterior vector,
computed by averaging the frame-level Gaussian
posterior probabilities. Stacking the Gaussian posterior vectors, we
obtained a Gaussian-by-segment data matrix, denoted by X.
Finally, we computed the similarity matrix W of the Gaussian
components as W = XX<sup>T</sup>, and applied spectral clustering to the
similarity matrix to find 150 clusters of Gaussian components. Details
of the GCC approach can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
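      <p>A minimal sketch of steps 3 and 4 follows, assuming the frame-level Gaussian posteriors (one 4096-dimensional vector per frame) and the phoneme segmentation are already available; the helper names are ours, and scikit-learn's spectral clustering stands in for whatever implementation was actually used.</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.cluster import SpectralClustering

def segment_posterior_matrix(frame_post, segments):
    """Step 3: each segment (start, end) is represented by the average
    of its frame-level Gaussian posterior vectors; stacking one column
    per segment gives the Gaussian-by-segment matrix X."""
    cols = [frame_post[s:e].mean(axis=0) for (s, e) in segments]
    return np.stack(cols, axis=1)   # shape: (n_gaussians, n_segments)

def cluster_gaussians(X, n_clusters=150):
    """Step 4: similarity between Gaussian components, W = X X^T,
    followed by spectral clustering on the precomputed affinity."""
    W = X @ X.T                     # nonnegative, symmetric similarity
    sc = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return sc.fit_predict(W)        # cluster label per Gaussian component
]]></preformat>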
      <p>
        Each cluster of Gaussian components was viewed as the acoustic
model of a discovered unit. These acoustic models were refined by
an iterative process [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and updated to context-dependent models
with 1198 states. These states were then modeled by a deep
neural network, which had 3 hidden layers with 1000 units per layer.
The input layer corresponds to a context window of 9 successive
frames. The outputs of the neural network were the state posterior
probabilities, which were used to construct the posteriorgrams.
      </p>
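      <p>The stated topology could be realized, for example, by the PyTorch sketch below. The sigmoid hidden units and the context-window framing code are our assumptions, and the training procedure (with targets from the refined acoustic models) is omitted.</p>
      <preformat><![CDATA[
import torch
import torch.nn as nn

class PosteriorgramNet(nn.Module):
    """3 hidden layers of 1000 units; input is a context window of
    9 successive 39-dim MFCC frames; output is a posterior over the
    1198 context-dependent states."""
    def __init__(self, feat_dim=39, context=9, n_states=1198):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * context, 1000), nn.Sigmoid(),
            nn.Linear(1000, 1000), nn.Sigmoid(),
            nn.Linear(1000, 1000), nn.Sigmoid(),
            nn.Linear(1000, n_states),
        )

    def forward(self, x):
        # log-softmax for training; exponentiate to get posteriors
        return torch.log_softmax(self.net(x), dim=-1)

def to_posteriorgram(model, frames):
    """Stack each frame with its +/-4 neighbours (a 9-frame window,
    edges padded by repetition) and collect the state posteriors."""
    T, _ = frames.shape             # frames: float tensor (T, 39)
    pad = torch.cat([frames[:1].repeat(4, 1), frames,
                     frames[-1:].repeat(4, 1)])
    windows = torch.stack([pad[t:t + 9].reshape(-1) for t in range(T)])
    with torch.no_grad():
        return model(windows).exp() # (T, 1198) posteriorgram
]]></preformat>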
    </sec>
    <sec id="sec-6">
      <title>Query Expansion</title>
      <p>Query expansion aimed at generating variable-length examples,
so as to cover a larger duration variation of the query term. The
PSOLA technique was implemented for this purpose. PSOLA is
able to perform time-scale modification while preserving the
spectral characteristics as much as possible. The implementation
involved three steps. First, pitch epochs were detected by an
autocorrelation method. Second, the periodic waveform cycles identified
by the pitch marks were duplicated or eliminated according to the
time-scaling factor. Finally, the overlap-and-add algorithm was
used to synthesize the new speech example. In the system, two
time-scaling factors were used: 0.7 and 1.3. For a query
example with duration L, we had one generated example with duration
0.7L and another with duration 1.3L. Therefore the expanded
query set had three examples for each term. Given a query
term and an utterance in the spoken documents, the detection score
was the maximum value among the scores produced by the three
examples.</p>
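      <p>The sketch below illustrates steps 2 and 3 under the simplifying assumption of a constant pitch period (real PSOLA tracks time-varying periods, and the autocorrelation-based epoch detector of step 1 is not shown). It is an illustration of the technique, not the system's actual implementation.</p>
      <preformat><![CDATA[
import numpy as np

def psola_time_scale(x, marks, alpha):
    """Time-scale x by factor alpha (0.7 shortens, 1.3 lengthens)
    without changing pitch: two-period Hann-windowed grains centred
    on the pitch marks are duplicated or eliminated, then recombined
    by overlap-add. `marks` holds pitch-epoch sample indices (several
    epochs are required)."""
    x = np.asarray(x, dtype=float)
    marks = np.asarray(marks)
    p = max(int(np.median(np.diff(marks))), 1)  # crude constant period
    out_len = int(len(x) * alpha)
    y = np.zeros(out_len)
    norm = np.zeros(out_len)                    # window-sum normalization
    t = marks[0] * alpha
    while t < out_len:
        # pick the input mark whose scaled position is nearest to t;
        # this implicitly duplicates (alpha > 1) or drops (alpha < 1)
        # waveform cycles
        k = int(np.argmin(np.abs(marks * alpha - t)))
        lo, hi = marks[k] - p, marks[k] + p
        if lo >= 0 and hi <= len(x):
            grain = x[lo:hi] * np.hanning(2 * p)
            o = int(t) - p                      # grain start in the output
            a, b = max(o, 0), min(o + 2 * p, out_len)
            y[a:b] += grain[a - o:b - o]
            norm[a:b] += np.hanning(2 * p)[a - o:b - o]
        t += p                                  # next output pitch epoch
    return y / np.maximum(norm, 1e-8)

# expanded = [q, psola_time_scale(q, marks, 0.7), psola_time_scale(q, marks, 1.3)]
]]></preformat>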
    </sec>
    <sec id="sec-7">
      <title>Score Normalization</title>
      <p>Let d<sub>q,t</sub> denote the DTW alignment distance between the q-th
query and the t-th hit region. The corresponding raw detection score
was computed as
s<sub>q,t</sub> = exp(−d<sub>q,t</sub>/γ), (1)
where the scaling factor γ was set to 5. To calibrate the scores of
different query terms, a simple 0/1 normalization was used. The
normalization was performed as
ŝ<sub>q,t</sub> = (s<sub>q,t</sub> − μ<sub>q</sub>)/σ<sub>q</sub>, (2)
where μ<sub>q</sub> and σ<sub>q</sub><sup>2</sup> are the mean and variance of the top 400 raw
scores for the q-th query.</p>
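      <p>In code, equations (1) and (2) amount to the following; the function names are hypothetical, and the raw DTW distances are assumed to be collected per query.</p>
      <preformat><![CDATA[
import numpy as np

def raw_scores(dtw_dists, gamma=5.0):
    """Eq. (1): map DTW alignment distances d_{q,t} to raw scores
    s_{q,t} = exp(-d_{q,t} / gamma), with scaling factor gamma = 5."""
    return np.exp(-np.asarray(dtw_dists) / gamma)

def normalize_scores(scores, top_n=400):
    """Eq. (2): 0/1 normalization per query; mu_q and sigma_q^2 are
    the mean and variance of the top `top_n` raw scores for the query."""
    s = np.asarray(scores, dtype=float)
    top = np.sort(s)[-top_n:]
    return (s - top.mean()) / (top.std() + 1e-10)
]]></preformat>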
    </sec>
    <sec id="sec-8">
      <title>HARDWARE, MEMORY, AND CPU TIME</title>
      <p>All the experiments were performed on a computer with an Intel
i7-3770K CPU (3.50 GHz, 4 cores), 32 GB RAM and a 1 TB hard drive. In
the online process, all the posteriorgrams of the spoken documents
were stored in memory. This accelerated the online detection,
but incurred a very high memory cost (&gt;10 GB). The computational
cost of the online process was dominated by the DTW detection.
The searching speed factor of system No. 3 was about 0.018.</p>
    </sec>
    <sec id="sec-9">
      <title>PERFORMANCE AND ANALYSIS</title>
      <p>We consider this improvement quite encouraging, and more
experiments and analysis will be carried out in future work to
confirm the usefulness of the query expansion. The final observation is
that score normalization brings two considerable benefits. First, it
yields an MTWV gain of about 7.7% on the Dev set and 7.0% on the Eval
set, which differs from our observation in previous work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We suspect this is related to the nonlinear transformation in (1)
and the large size of the spoken documents. Second, score normalization
seemed to make the decision threshold quite stable, so that the gap
between MTWV and ATWV on the Eval set becomes very small.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          .
          <article-title>The spoken web search task</article-title>
          .
          <source>In MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hazen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>White</surname>
          </string-name>
          .
          <article-title>Query-by-example spoken term detection using phonetic posteriorgram templates</article-title>
          .
          <source>In ASRU</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Moulines</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Charpentier</surname>
          </string-name>
          .
          <article-title>Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones</article-title>
          .
          <source>Speech communication</source>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
<given-names>C.-C.</given-names>
            <surname>Leung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Unsupervised mining of acoustic subword units with segment-level Gaussian posteriorgrams</article-title>
          .
          <source>In Interspeech</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Leung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection</article-title>
          .
          <source>In ICASSP</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Leung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>An acoustic segment modeling approach to query-by-example spoken term detection</article-title>
          .
          <source>In ICASSP</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>