<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The NNI Query-by-Example System for MediaEval 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jingyong Hou</string-name>
          <email>jyhou@nwpu-aslp.org</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Van Tung Pham</string-name>
          <email>VANTUNG001@e.ntu.edu.sg</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheung-Chi Leung</string-name>
          <email>ccleung@i2r.a-star.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haihua Xu</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hang Lv</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Xie</string-name>
          <email>lxie@nwpu.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhonghua Fu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chongjia Ni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiong Xiao</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongjie Chen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shaofei Zhang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sining Sun</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yougen Yuan</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pengcheng Li</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tin Lay Nwe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sunil Sivadas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bin Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eng Siong Chng</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haizhou Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Infocomm Research (I</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>STAR</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer Science, Northwestern Polytechnical University (NWPU)</institution>
          ,
          <addr-line>Xi'an</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the system developed by the NNI team for the Query-by-Example Search on Speech Task (QUESST) in the MediaEval 2015 evaluation. Our submitted system mainly used bottleneck features/stacked bottleneck features (BNF/SBNF) trained from various resources. We investigated noise robustness techniques to deal with this year's noisy data. The submitted system obtained an actual normalized cross entropy (actCnxe) of 0.761 and an actual Term-Weighted Value (actTWV) of 0.270 on all types of queries of the evaluation data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        This year's data is more challenging in terms of acoustic
and noise conditions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Noise robustness techniques,
including adding noise to the training data of tokenizers and a
speech enhancement method, were investigated to deal with
the noisy data. Our submitted system involves dynamic
time warping (DTW) and symbolic search (SS) based
approaches, as last year. This year, the final submitted system
was obtained by fusing 66 systems from our 3 groups,
including 15 DTW systems (selected from 26 original systems
using FoCal toolkit [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) from NWPU, 39 DTW systems from
I2R, and 8 DTW and 4 SS systems from NTU. Moreover,
various voice activity detection (VAD) methods were used
in the DTW systems.
      </p>
      <p>This work was partially supported by the National Natural
Science Foundation of China (61175018 and 61571363).</p>
    </sec>
    <sec id="sec-2">
      <title>2. ADDING NOISE TO TRAINING DATA</title>
      <p>
        To reduce the mismatch problem between the training
data of tokenizers and this year's development and test data,
noise was added to the training data. We used two
methods to obtain two sets of noise from the development data.
The method used to obtain the first set of noise (noise1) is
summarized as follows [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ] (a code sketch follows the steps):
      </p>
      <p>Perform voiced/unvoiced detection on the development
data and obtain segments of noise from each utterance.
Estimate the noise power spectrum of each utterance,
generate a minimum phase signal according to the
power spectrum of each sentence, and design the
minimum phase filter.</p>
      <p>Use the EM algorithm to estimate the parameters of the
noise amplitude distribution (we empirically selected a
Gaussian distribution and set the number of Gaussian
mixtures to 2).</p>
      <p>Generate random white noise with the target noise
amplitude distribution.</p>
      <p>Filter the random white noise using the minimum phase
filter.</p>
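      <p>As an illustration, the noise1 pipeline can be sketched as follows. This is a minimal sketch assuming numpy, scipy and scikit-learn; the voiced/unvoiced detector and the exact filter design of [3, 4, 5] are not reproduced, and the function names are ours.</p>
      <preformat>
import numpy as np
from scipy.signal import welch, lfilter
from sklearn.mixture import GaussianMixture

def min_phase_fir(power_spectrum, n_taps=512):
    # Real-cepstrum (homomorphic) construction of a minimum-phase FIR
    # filter whose magnitude response follows sqrt(power_spectrum).
    mag = np.sqrt(np.maximum(power_spectrum, 1e-12))
    spec = np.concatenate([mag, mag[-2:0:-1]])      # full symmetric spectrum
    cep = np.fft.ifft(np.log(spec)).real
    n = len(cep)
    w = np.zeros(n)
    w[0] = 1.0
    w[1:n // 2] = 2.0                               # fold to minimum phase
    w[n // 2] = 1.0
    h = np.fft.ifft(np.exp(np.fft.fft(w * cep))).real
    return h[:n_taps]

def make_noise1(noise_segments, n_samples):
    samples = np.concatenate(noise_segments)
    # Noise power spectrum from the detected non-speech segments.
    _, psd = welch(samples, nperseg=512)
    h = min_phase_fir(psd)
    # 2-component Gaussian mixture on the noise amplitudes, as in the text.
    gmm = GaussianMixture(n_components=2).fit(samples.reshape(-1, 1))
    white, _ = gmm.sample(n_samples)
    white = np.random.permutation(white.ravel())    # remove sample ordering
    # Shape the white noise with the minimum phase filter.
    return lfilter(h, 1.0, white)
      </preformat>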
      <p>
        The second set of noise (noise2) was also estimated from
the development data by using a method in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The time
domain noise was reconstructed by inverse short-time
Fourier transform of the estimated instantaneous noise spectrum.
Please refer to [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] for details.
      </p>
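      <p>A minimal sketch of the noise2 reconstruction, assuming the instantaneous noise spectrum (a complex STFT-domain estimate obtained with the method of [6], which we do not reproduce here) is already available:</p>
      <preformat>
import numpy as np
from scipy.signal import istft

def reconstruct_noise2(noise_stft, fs, nperseg=512, noverlap=384):
    # Rebuild a time-domain noise signal by inverse short-time Fourier
    # transform of the estimated instantaneous noise spectrum.
    _, noise = istft(noise_stft, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return noise
      </preformat>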
      <p>When noise was added, we ensured that the
signal-to-noise ratio (SNR) distribution of the resultant training
data was similar to that of this year's development data.
Moreover, since not all the utterances in this year's data were
highly noisy or reverberant, we added noise to only a randomly
selected 50% of the training data.</p>
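      <p>A sketch of the mixing step, under our own simplifying assumptions: target SNRs are drawn from an empirical sample of development-data SNRs (dev_snrs, assumed given), and the noise is looped or trimmed to the utterance length.</p>
      <preformat>
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    # Scale the noise so the mixture has the requested SNR, then add it.
    noise = np.resize(noise, speech.shape)          # loop/trim to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def corrupt_half(utterances, noise, dev_snrs, seed=0):
    # Add noise to a randomly selected 50% of the training utterances,
    # with target SNRs drawn from the development-data SNR sample.
    rng = np.random.default_rng(seed)
    out = []
    for u in utterances:
        if rng.random() > 0.5:
            out.append(add_noise_at_snr(u, noise, rng.choice(dev_snrs)))
        else:
            out.append(u)
    return out
      </preformat>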
    </sec>
    <sec id="sec-3">
      <title>3. SPEECH ENHANCEMENT</title>
      <p>
        A Wiener filter [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] was used to reduce the noise in the data.
The noise was reduced in the time domain and the enhanced
data was used for VAD and feature extraction. Initial results
(detailed in section 8) showed that the enhanced data led to
better DTW performance for some tokenizers.
      </p>
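      <p>The submitted system used the time-domain Wiener filter of [9]. As a stand-in illustration only, the sketch below applies the common frequency-domain Wiener gain G = SNR / (1 + SNR), with the noise PSD estimated from the first few frames; frame sizes are our own assumptions.</p>
      <preformat>
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(x, fs, n_noise_frames=10, nperseg=512):
    # Wiener gain G = SNR / (1 + SNR) applied per time-frequency bin;
    # the noise PSD is estimated from the first n_noise_frames frames.
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    noise_psd = np.mean(np.abs(X[:, :n_noise_frames]) ** 2,
                        axis=1, keepdims=True)
    snr = np.maximum(np.abs(X) ** 2 / (noise_psd + 1e-12) - 1.0, 0.0)
    gain = snr / (1.0 + snr)
    _, y = istft(gain * X, fs=fs, nperseg=nperseg)
    return y
      </preformat>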
    </sec>
    <sec id="sec-4">
      <title>4. VOICE ACTIVITY DETECTION</title>
      <p>
        For exact matching DTW systems, we used two voice
activity detectors (VADs), including a frequency band energy
based VAD [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (VAD1) and a statistical model based VAD
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] (VAD2), because we found that they performed the best
on different types of queries. For phoneme-sequence based
approximate matching DTW systems (detailed in section 5)
with phoneme posterior features, we used their single-best
decoding hypotheses to perform VAD and obtain phoneme
boundary information. For phoneme-sequence
approximate matching DTW systems with SBNF, we simply
borrowed the single-best decoding hypothesis of a phoneme
recognizer to perform VAD and obtain the phoneme
boundary information.
      </p>
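      <p>As a toy illustration of energy-based voice activity detection in the spirit of VAD1 (the actual detectors are described in [10] and [11]; the threshold, margin and frame sizes below are our own assumptions):</p>
      <preformat>
import numpy as np

def energy_vad(x, frame_len=400, hop=160, margin_db=9.0):
    # Frames whose log energy exceeds the utterance's rough noise floor
    # by margin_db are marked as speech (assumes len(x) >= frame_len).
    n_frames = max((len(x) - frame_len) // hop + 1, 1)
    frames = np.stack([x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    log_e = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    floor = np.percentile(log_e, 10)                # rough noise floor
    return log_e > floor + margin_db                # boolean speech mask
      </preformat>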
    </sec>
    <sec id="sec-4b">
      <title>5. DTW SEARCH</title>
      <p>
        Exact matching and approximate matching DTW systems
were developed to deal with di erent types of queries. An
exact matching system matched each query with a
subsequence of a test utterance using DTW [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. It found a
path on the cosine distance matrix between the speech features of
the query and the test utterance. The system output the
similarity score between the query and the matched
subsequence of the test utterance.
      </p>
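      <p>A minimal numpy sketch of the exact matching step: subsequence DTW over the cosine distance matrix, with free start and end points along the utterance axis. The length normalization and the negated-cost similarity are our own choices for illustration.</p>
      <preformat>
import numpy as np

def subsequence_dtw(query, utt):
    # Subsequence DTW on the cosine-distance matrix of query features
    # (Q x D) against utterance features (T x D). The warping path may
    # start and end anywhere along the utterance axis.
    qn = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-12)
    un = utt / (np.linalg.norm(utt, axis=1, keepdims=True) + 1e-12)
    dist = 1.0 - qn @ un.T                          # cosine distance matrix
    Q, T = dist.shape
    acc = np.full((Q, T), np.inf)
    acc[0] = dist[0]                                # free start in the utterance
    for i in range(1, Q):
        acc[i, 0] = dist[i, 0] + acc[i - 1, 0]
        for j in range(1, T):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],
                                         acc[i, j - 1],
                                         acc[i - 1, j - 1])
    # Free end: best length-normalized cost, negated as a similarity score.
    return -np.min(acc[-1]) / Q
      </preformat>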
      <p>
        We used two different kinds of approximate matching DTW
systems in total, fixed-window [
        <xref ref-type="bibr" rid="ref12 ref14">12, 14</xref>
        ] and
phoneme-sequence [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] approximate matching systems, to deal with
type 2 and type 3 queries. In fixed-window approximate
matching systems, as the window was shifted, the
corresponding segment of the query was matched against a test
utterance. The highest similarity score over all query segments
for a test utterance was used as the score
of the query-utterance pair. The window sizes
were set between 70 and 90 frames and the window shifts
were set between 5 and 10 frames. In phoneme-sequence
approximate matching systems, the size of the window was
determined by the phoneme boundary information derived
from phoneme recognizers. The window size was set to 8
phonemes, as it provided the best results on the development
data.
      </p>
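      <p>A sketch of fixed-window approximate matching, reusing the subsequence_dtw function from the previous sketch; the window size (80 frames) and shift (8 frames) are illustrative values picked from the ranges reported above.</p>
      <preformat>
# Reuses the subsequence_dtw function from the previous sketch.

def fixed_window_score(query, utt, win=80, shift=8):
    # Slide a fixed window over the query; match each query segment
    # against the utterance and keep the best segment score as the
    # score of the query-utterance pair.
    scores = []
    for start in range(0, max(len(query) - win + 1, 1), shift):
        scores.append(subsequence_dtw(query[start : start + win], utt))
    return max(scores)
      </preformat>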
    </sec>
    <sec id="sec-5">
      <title>6. SYMBOLIC SEARCH</title>
      <p>
        Weighted finite state transducer (WFST) based
symbolic search systems were used, as last year [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
Phoneme-sequence approximate matching [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] was used to facilitate
type 2 and type 3 queries, and to reduce the miss rate. A
sequence length of 6 phonemes was chosen, as it provided the
best matching results on the development data.
      </p>
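      <p>The submitted systems compiled queries and utterance hypotheses into WFSTs [12]; the sketch below only illustrates the 6-phoneme partial matching idea on single-best phoneme strings, scored by edit distance. It is a simplification for exposition, not the actual WFST search.</p>
      <preformat>
def edit_distance(a, b):
    # Standard Levenshtein distance between two phoneme sequences.
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cost = 0 if pa == pb else 1
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def symbolic_score(query_phones, utt_phones, n=6):
    # Partial matching with length-6 phoneme sequences: slide a window
    # over the query's phoneme string and find its closest substring in
    # the utterance; a smaller edit distance means a better match.
    best = float("inf")
    for i in range(max(len(query_phones) - n + 1, 1)):
        piece = query_phones[i : i + n]
        for j in range(max(len(utt_phones) - n + 1, 1)):
            best = min(best, edit_distance(piece, utt_phones[j : j + n]))
    return -best
      </preformat>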
    </sec>
    <sec id="sec-6">
      <title>7. TOKENIZERS AND SYSTEMS</title>
      <p>Spectral features, phoneme-state posterior features and
BNF/SBNF were used in our DTW systems.</p>
      <p>
        NWPU extracted truncated PLP [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] (a1), posterior
features from 3 BUT phoneme recognizers [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] (Czech,
Hungarian and Russian; a2-a4), 3 sets of SBNF (one
monophone-state set trained on the original training data, and two
triphone-state sets with noise1 and noise2 added to the training data,
respectively; a5-a7) trained from the English Switchboard corpus
(SWBD), and 1 set of triphone-state SBNF (a8) trained
from the SEAME corpus [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        I2R extracted 4 sets of BNF (b1-b4) and 4 sets of
SBNF (b5-b8) trained from four LDC corpora (SWBD,
Fisher Spanish, HKUST Mandarin and CallHome Egyptian),
and 5 sets of BNF (b9-b13) (4 language-dependent and 1
language-independent [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]) trained from 4 development
languages in the OpenKWS evaluation [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        NTU extracted 3 sets of BNF (c1-c3) from SWBD (one
triphone-state set trained on the original training data, and two
triphone-state sets with Noisex92 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] noise added to the training data once
and twice, respectively), and 1 set of BNF (c4)
trained from the 6 development languages in the OpenKWS
evaluation.
      </p>
      <p>NWPU's 26 DTW systems consisted of 9 exact
matching systems (using a1-a8, c4) and 4 phoneme-sequence
approximate matching systems (using a2-a4, a6). The remaining 13
systems were exactly the same as the previous 13 systems,
except that the enhanced data was used for VAD and feature
extraction.
I2R's 39 DTW systems consisted of 13 exact matching
systems (using b1-b13) and 13 fixed-window approximate
matching systems (using b1-b13) with VAD1, and 13 exact
matching systems (using b1-b13) with VAD2.</p>
      <p>
        NTU's 12 systems consisted of 4 exact matching (using
c1-c4) and 4 fixed-window approximate matching (using c1-c4)
DTW systems with VAD1, and 4 phoneme-sequence
approximate matching SS systems with 4 acoustic models trained
from SWBD and a Malay speech corpus [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        The scores of all systems in each group were fused into a
single system internally, and the 3 resultant systems were
further fused to obtain the final submitted system. In each
fusion step, scores were first normalized to zero mean and
unit variance, and then fused with the FoCal toolkit [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
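      <p>A sketch of one fusion step under the stated scheme: each system's scores are normalized to zero mean and unit variance, then combined linearly. FoCal [2] learns the combination weights and offset by logistic regression on the development data, so they are assumed given here.</p>
      <preformat>
import numpy as np

def znorm(scores):
    # Normalize one system's detection scores to zero mean, unit variance.
    return (scores - np.mean(scores)) / (np.std(scores) + 1e-12)

def fuse(system_scores, weights, offset=0.0):
    # Linear score-level fusion of K systems over the same N trials:
    # system_scores is a list of K length-N arrays, weights has length K.
    normed = np.stack([znorm(s) for s in system_scores])
    return offset + np.dot(weights, normed)
      </preformat>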
    </sec>
    <sec id="sec-7">
      <title>8. RESULTS AND CONCLUSION</title>
      <p>Table 1 shows the performance gain of an exact matching
DTW system on the development set when noise1 and noise2
were added to the SWBD data for training triphone SBNF.
The results show that adding the noise to the training data
gives 1.8% relative improvement on all query types and 3.8%
relative improvement on type 1 queries in minCnxe.</p>
      <p>When the enhanced data was used to extract SWBD
monophone SBNF and BUT Czech and Hungarian phoneme-state
posterior features for our DTW systems, we observed
relative improvements of 1.9-3.1% on all query types and relative
improvements of 2.7-6.3% on type 1 queries in minCnxe.</p>
      <p>Table 2 shows the performance of our final submitted
system on this year's data. In the intra-group fusion, each
group obtained performance gains by fusing exact
matching and approximate matching systems, and by fusing systems
using different speech preprocessing techniques and different
tokenizers. Compared with our single best exact matching
DTW system (s2 in Table 1), system fusion brings around
13.5% relative improvement in minCnxe on the development
data (all query types).</p>
      <p>The peak memory usage (PMU) of all DTW systems is
1.45GB when 1 set of 30-dimensional SBNF is loaded, and
the searching speed factor (SSF) is around 0.0044 for each
DTW system. The PMU of all SS systems is 45GB, and the
SSF is around 0.0012 for each SS system.</p>
      <p>We adopted noise robustness techniques to deal with the
noise conditions of the data, which led to better search
performance. We also obtained performance gains by fusing
systems using different tokenizers, different VADs and
different search algorithms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Proenca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lojka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , "
          <article-title>Query by example search on speech at mediaeval 2015,"</article-title>
          <source>Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          , Sept.
          <fpage>14</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>2015</year>
          , Wurzen, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brummer</surname>
          </string-name>
          , "
          <article-title>FoCal: Toolkit for Evaluation, Fusion and Calibration of statistical pattern recognizers,"</article-title>
          https://sites.google.com/site/nikobrummer/focal.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yao</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Yao</surname>
          </string-name>
          , "
          <article-title>Analyzing classical spectral estimation by MATLAB,"</article-title>
          <source>Journal of Huazhong University of Science and Technology</source>
          , vol.
          <volume>4</volume>
          , p.
          <fpage>021</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Oppenheim</surname>
          </string-name>
          , "
          <article-title>Signal reconstruction from phase or magnitude,"</article-title>
          <source>IEEE Transactions on Acoustics, Speech and Signal Processing</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>672</fpage>
          -
          <lpage>680</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Gruber</surname>
          </string-name>
          , "
          <article-title>Statistical digital signal processing and modeling,"</article-title>
          <source>Technometrics</source>
          , vol.
          <volume>39</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>336</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benesty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Doclo</surname>
          </string-name>
          , "
          <article-title>New insights into the noise reduction Wiener filter,"</article-title>
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          , vol.
          <volume>14</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>1218</fpage>
          -
          <lpage>1234</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Benesty</surname>
          </string-name>
          , "
          <article-title>Filtering techniques for noise reduction and speech enhancement,"</article-title>
          <source>in Adaptive Signal Processing</source>
          . Springer,
          <year>2003</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Diethorn</surname>
          </string-name>
          , "
          <article-title>Subband noise reduction methods for speech enhancement," in Audio Signal Processing for Next-Generation Multimedia Communication Systems</article-title>
          . Springer,
          <year>2004</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benesty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and T. Gaensler, "
          <article-title>On single-channel noise reduction in the time domain,"</article-title>
          <source>in Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <source>2011 IEEE International Conference on. IEEE</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>277</fpage>
          -
          <lpage>280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Cornu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sheikhzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Brennan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Abutalebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Tam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Iles</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Wong</surname>
          </string-name>
          , "
          <article-title>ETSI AMR-2 VAD: evaluation and ultra low-resource implementation,"</article-title>
          <source>in Multimedia and Expo, 2003 (ICME '03), 2003 International Conference on</source>
          , vol.
          <volume>2</volume>
          . IEEE,
          <year>2003</year>
          , pp. II-841.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Huijbregts</surname>
          </string-name>
          and F. de Jong, "
          <article-title>Robust speech/non-speech classification in heterogeneous multimedia content,"</article-title>
          <source>Speech Communication</source>
          , vol.
          <volume>53</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>153</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-C. Leung</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S. J.</given-names>
          </string-name>
          <string-name>
            <surname>Leow</surname>
          </string-name>
          et al., "
          <article-title>The NNI query-by-example system for MediaEval 2014,"</article-title>
          <source>Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Spain, Oct.
          <fpage>16</fpage>
          -
          <lpage>17</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Muscariello</surname>
          </string-name>
          , G. Gravier, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Bimbot</surname>
          </string-name>
          , "
          <article-title>Audio keyword extraction by unsupervised word discovery,"</article-title>
          <source>in INTERSPEECH 2009: 10th Annual Conference of the International Speech Communication Association</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-C. Leung</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S. J.</given-names>
          </string-name>
          <string-name>
            <surname>Leow</surname>
          </string-name>
          et al., "
          <article-title>Language independent query-by-example spoken term detection using n-best phone sequences and partial matching,"</article-title>
          <source>in Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <source>2015 IEEE International Conference on. IEEE</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>5191</fpage>
          -
          <lpage>5195</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-C. Leung</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>E. S.</given-names>
          </string-name>
          <string-name>
            <surname>Chng</surname>
            , and
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          , "
          <article-title>Spoken term detection technology based on DTW (to be published),"</article-title>
          <source>Journal of Tsinghua University (Sci and Tech)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dupoux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goldwater</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , S. Khudanpur,
          <string-name>
            <given-names>K.</given-names>
            <surname>Church</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hermansky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rose</surname>
          </string-name>
          , "
          <article-title>A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition,"</article-title>
          <source>in Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <source>2013 IEEE International Conference on. IEEE</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>8111</fpage>
          -
          <lpage>8115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Matejka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Cernocky</surname>
          </string-name>
          , "
          <article-title>Hierarchical structures of neural networks for phoneme recognition,"</article-title>
          <source>in Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <source>2006 IEEE International Conference on. IEEE</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , "
          <article-title>SEAME: a Mandarin-English code-switching speech corpus in South-East Asia."</article-title>
          <source>INTERSPEECH 2010: 11th Annual Conference of the International Speech Communication Association</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Vesely</surname>
          </string-name>
          , M. Karafiat, F. Grezl,
          <string-name>
            <given-names>M.</given-names>
            <surname>Janda</surname>
          </string-name>
          , and E. Egorova, "
          <article-title>The language-independent bottleneck features,"</article-title>
          <source>in Spoken Language Technology Workshop (SLT), 2012 IEEE</source>
          . IEEE,
          <year>2012</year>
          , pp.
          <fpage>336</fpage>
          -
          <lpage>341</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] "
          <article-title>Open keyword search 2015 evaluation,"</article-title>
          http://www.nist.gov/itl/iad/mig/openkws15.cfm.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Varga</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Steeneken</surname>
          </string-name>
          , "
          <article-title>Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,"</article-title>
          <source>Speech Communication</source>
          , vol.
          <volume>12</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>251</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Chng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , "
          <article-title>MASS: A Malay language LVCSR corpus resource,"</article-title>
          <source>in Speech Database and Assessments</source>
          ,
          <source>2009 Oriental COCOSDA International Conference on. IEEE</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>