<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The IIT-B Query-by-Example System for MediaEval 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hitesh Tulsiani</string-name>
          <email>hitesh26@ee.iitb.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Preeti Rao</string-name>
          <email>prao@ee.iitb.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical Engineering, Indian Institute of Technology Bombay</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the system developed at I.I.T. Bombay for the Query-by-Example Search on Speech Task (QUESST) within the MediaEval 2015 evaluation framework. Our system preprocesses the data to remove noise and performs subsequence DTW on posterior/bottleneck features obtained using four phone recognition systems to detect the queries. Scores from each of these subsystems are fused to get a single score per query-utterance pair, which is then calibrated with respect to the cross entropy evaluation metric.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The goal of the QUESST task within the MediaEval 2015
framework is to determine the presence of a spoken query
in an unlabeled speech data set by building a language
independent system. In this year's QUESST task, the data
consisted of about 18 hours of noisy audio from 7 different
languages. More details about the task can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        To minimize the effect of noise, we preprocess our data
(both the queries and utterances) and follow it with speech
activity detection to remove silence frames. Our approach
to the task is inspired by Hazen et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A
block diagram of our system is shown in Figure 1 and is inspired
by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Preprocessing - Noise Removal</title>
      <p>
        We use spectral subtraction to remove noise from the
audio. Power spectral density (PSD) of noise is estimated
using the minimum statistics technique described by R. Martin
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The technique used to estimate noise PSD makes the
assumption that during speech pauses or within brief periods
in between words the speech energy is close to zero. Thus,
by tracking the minimum power within a finite window large
enough to bridge high power speech segments, the noise floor
can be estimated. We next remove the silence at the start
and end of an utterance using a simple energy based speech
activity detector.
      </p>
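      <p>The minimum-tracking idea can be illustrated with a minimal sketch. It assumes a per-frame power track for a single frequency bin and omits the optimal smoothing and bias compensation of [<xref ref-type="bibr" rid="ref4">4</xref>]; the function names, the window length and the <monospace>spectral_floor</monospace> parameter are hypothetical choices for illustration, not the settings of our system.</p>
      <preformat><![CDATA[
```python
def min_statistics_floor(power, window):
    """Track the minimum of `power` over the last `window` frames."""
    floor = []
    for t in range(len(power)):
        start = max(0, t - window + 1)
        floor.append(min(power[start:t + 1]))
    return floor


def spectral_subtract(power, window, spectral_floor=0.01):
    """Subtract the tracked noise floor, keeping a small positive residue."""
    noise = min_statistics_floor(power, window)
    return [max(p - n, spectral_floor * p) for p, n in zip(power, noise)]


# Toy power track for one frequency bin (assumed values):
# noise near 1.0 with a three-frame speech burst in the middle.
power = [1.0, 1.1, 0.9, 6.0, 7.5, 6.8, 1.0, 1.2]
noise = min_statistics_floor(power, window=5)
clean = spectral_subtract(power, window=5)
```
]]></preformat>
      <p>In this toy track the speech burst spans three frames, so a five-frame window always contains at least one noise-only frame and the tracked minimum stays near the noise level.</p>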
    </sec>
    <sec id="sec-4">
      <title>2.2 Subsystems</title>
      <p>
        We make use of 4 subsystems:
        <list list-type="order">
          <list-item>
            <p>Two DNN based phone recognisers (Hungarian and Russian) trained on the SpeechDat-E corpus by Brno University of Technology (BUT) [<xref ref-type="bibr" rid="ref5">5</xref>]. These are used to extract posterior and bottleneck features.</p>
          </list-item>
          <list-item>
            <p>A phone recogniser trained on a Hindi database [<xref ref-type="bibr" rid="ref6">6</xref>] (referred to as the TIFR phone recogniser from here on). The TIFR phone recogniser is MLP based and is trained using 39 dimensional MFCC features. It has a single hidden layer with 700 neurons and 36 output neurons. We extract phone posteriors using the TIFR phone recogniser.</p>
          </list-item>
          <list-item>
            <p>A 64-GMM system trained in an unsupervised manner on the QUESST-2015 database using 36 dimensional MFCC features [<xref ref-type="bibr" rid="ref7">7</xref>] (the energy of the audio was not used as a feature because large energy variations were observed across utterances). We used this system to extract Gaussian posteriorgrams.</p>
          </list-item>
        </list>
      </p>
      <p>2.3 DTW</p>
      <p>
        We use the standard subsequence DTW as implemented
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The query is allowed to start at any frame of the test
utterance and the locally optimal detection is the one that
has the smallest accumulated distance. Also, to avoid a
preference for shorter paths, accumulated distances are
normalized by the corresponding detected path lengths. As the
distance measure, we have used Pearson product-moment
correlation for bottleneck features (BUT - Hungarian and
Russian) and inner product for posteriors (BUT -
Hungarian and Russian, TIFR, 64-GMM). A filtering step is then
applied to remove detected candidates which are much longer
or much shorter in duration than the query.
      </p>
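      <p>The search described above can be sketched as follows. This is a minimal illustration, not the implementation of [<xref ref-type="bibr" rid="ref3">3</xref>]: it assumes the three standard step directions, picks each predecessor by raw accumulated distance, and uses an absolute difference on scalar toy features as a stand-in for the correlation and inner-product distances. The function name and the toy data are hypothetical.</p>
      <preformat><![CDATA[
```python
def subsequence_dtw(query, utterance, dist):
    """Return (normalised_cost, start, end) of the best match of `query`
    inside `utterance`; the match may start at any utterance frame."""
    n, m = len(query), len(utterance)
    inf = float("inf")
    # cost[i][j]: accumulated distance of the best path ending at (i, j)
    # length[i][j]: number of cells on that path
    # start[i][j]: utterance frame where that path begins
    cost = [[inf] * m for _ in range(n)]
    length = [[0] * m for _ in range(n)]
    start = [[0] * m for _ in range(n)]
    for j in range(m):
        cost[0][j] = dist(query[0], utterance[j])  # free start at any frame
        length[0][j] = 1
        start[0][j] = j
    for i in range(1, n):
        for j in range(m):
            # Allowed predecessors: vertical, then horizontal and diagonal.
            preds = [(i - 1, j)]
            if j > 0:
                preds += [(i, j - 1), (i - 1, j - 1)]
            pi, pj = min(preds, key=lambda p: cost[p[0]][p[1]])
            cost[i][j] = cost[pi][pj] + dist(query[i], utterance[j])
            length[i][j] = length[pi][pj] + 1
            start[i][j] = start[pi][pj]
    # Best end frame: smallest accumulated distance normalised by path length.
    end = min(range(m), key=lambda j: cost[n - 1][j] / length[n - 1][j])
    return cost[n - 1][end] / length[n - 1][end], start[n - 1][end], end


query = [1.0, 2.0, 3.0]
utterance = [7.0, 7.0, 1.0, 2.0, 3.0, 7.0]
result = subsequence_dtw(query, utterance, lambda a, b: abs(a - b))
```
]]></preformat>
      <p>For the toy data the query occurs verbatim in frames 2 to 4 of the utterance, so the normalised distance of the best path is zero. A duration filter such as the one described above would then compare the detected span (end minus start plus one) against the query length.</p>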
    </sec>
    <sec id="sec-5">
      <title>2.4 Fusion and Calibration</title>
      <p>
        Our approach is most similar to the discriminative fusion
approach proposed by A. Abad et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Scores are first
normalized to zero mean and unit variance per query to allow
for use of a single threshold. Then the detections are aligned
and only those detections for which at least half the systems
show overlap in time are retained (majority voting) to
reduce false alarms. This leaves us with multiple
detections of a query in an utterance. So for each query-utterance
pair we get multiple score vectors (a score vector is a
collection of scores from all the subsystems for a possible
detection of a query in an utterance). Our score vector has six
elements (BUT Hungarian-Posterior and Bottleneck, BUT
Russian-Posterior and Bottleneck, TIFR - Posterior, GMM
- Posterior).
      </p>
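      <p>The normalization and voting steps can be sketched as below. The exact alignment rule is not specified in this paper, so the sketch assumes that two detections agree when their frame intervals share at least one frame, and that a detection's own subsystem counts as one vote; all names and the toy detections are hypothetical.</p>
      <preformat><![CDATA[
```python
from statistics import mean, pstdev


def znorm(scores):
    """Normalise one query's scores to zero mean and unit variance.
    Assumes the scores are not all identical (non-zero deviation)."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]


def overlaps(a, b):
    """True if two (start, end) frame intervals share at least one frame."""
    return a[0] <= b[1] and b[0] <= a[1]


def majority_vote(detections, n_systems):
    """Keep detections that overlap in time with detections from at least
    half of the subsystems (the detection's own subsystem is counted)."""
    kept = []
    for sys_id, span in detections:
        voters = {s for s, other in detections if overlaps(span, other)}
        if 2 * len(voters) >= n_systems:
            kept.append((sys_id, span))
    return kept


# Toy detections as (subsystem id, (start frame, end frame)).
detections = [(0, (10, 50)), (1, (12, 48)), (2, (15, 55)), (3, (200, 240))]
kept = majority_vote(detections, n_systems=4)
z = znorm([1.0, 2.0, 3.0])
```
]]></preformat>
      <p>Here three of the four subsystems agree on a region around frames 10 to 55, so those detections are retained, while the isolated detection from subsystem 3 is discarded as a likely false alarm.</p>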
      <p>
        Since the task requires only one score per
query-utterance pair, we determine the best score vector per
query-utterance pair using a two-step procedure:
        <list list-type="order">
          <list-item>
            <p>The first step is inspired by Hazen et al. [<xref ref-type="bibr" rid="ref2">2</xref>]. Scores from the various subsystems, S(X|K<sub>i</sub>), are combined according to the equation:
              <disp-formula><label>(1)</label><tex-math><![CDATA[S(X \mid K_1 K_2 \ldots K_N) = \frac{1}{\alpha} \log\left( \frac{1}{N} \sum_{i=1}^{N} \exp\bigl(\alpha\, S(X \mid K_i)\bigr) \right)]]></tex-math></disp-formula>
            where varying α between 0 and 1 changes the averaging function from geometric mean to arithmetic mean (we have used α = 1).</p>
          </list-item>
          <list-item>
            <p>In the second step, we make use of the combined score obtained in the first step to determine the best candidate for an utterance. We retain the individual scores of the subsystems along with the combined score (obtained using equation 1) corresponding to the best detected candidate, thus giving us one score vector per query-utterance pair.</p>
          </list-item>
        </list>
      </p>
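      <p>The two-step selection can be sketched as follows, assuming Equation 1 has the form (1/α) log of the mean of exp(α S) over the subsystem scores, with higher scores meaning better matches. The function name, candidate labels and score values are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def combine_scores(scores, alpha=1.0):
    """(1 / alpha) * log(mean(exp(alpha * s))) over the subsystem scores."""
    n = len(scores)
    return math.log(sum(math.exp(alpha * s) for s in scores) / n) / alpha


# Two candidate detections for one query-utterance pair, six scores each.
candidates = {
    "cand_a": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # moderate score everywhere
    "cand_b": [4.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # one subsystem very confident
}

# Step 1: combined score per candidate. Step 2: keep the best candidate's
# individual scores together with its combined score as the score vector.
best = max(candidates, key=lambda c: combine_scores(candidates[c]))
score_vector = candidates[best] + [combine_scores(candidates[best])]

near_mean = combine_scores(candidates["cand_b"], alpha=0.01)
```
]]></preformat>
      <p>With α = 1 the candidate with a single high score wins even though its plain average (about 0.67) is lower than that of the other candidate (1.0); with a small α such as 0.01 the combined score moves close to that plain average.</p>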
      <p>
        All of these score vectors (corresponding to different
query-utterance pairs) are then used to train a binary logistic
classifier [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] which gives us the fused score representative of
the query-utterance pair. The fused scores are then calibrated
with respect to the cross entropy evaluation metric to give us
a log-likelihood score.
      </p>
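      <p>A binary logistic classifier of this kind can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it uses plain batch gradient descent, two-dimensional toy score vectors instead of the full score vector, and hypothetical function names, learning rate and labels; it is not the toolkit or configuration used for our submission.</p>
      <preformat><![CDATA[
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fused_score(x, w, b):
    """Linear fusion of one score vector; the value is a log-odds."""
    return b + sum(wi * xi for wi, xi in zip(w, x))


def train_logistic(vectors, labels, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on cross entropy."""
    dim = len(vectors[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = sigmoid(fused_score(x, w, b)) - y
            grad_b += err
            for k in range(dim):
                grad_w[k] += err * x[k]
        b -= lr * grad_b / len(vectors)
        w = [wi - lr * g / len(vectors) for wi, g in zip(w, grad_w)]
    return w, b


# Toy training set: label 1 means the query occurs in the utterance.
vectors = [[2.0, 1.5], [1.8, 2.2], [-1.0, -0.5], [-1.5, -2.0]]
labels = [1, 1, 0, 0]
w, b = train_logistic(vectors, labels)
match_score = fused_score(vectors[0], w, b)
miss_score = fused_score(vectors[2], w, b)
```
]]></preformat>
      <p>Because the model is trained on the cross entropy loss, the fused score is already on a log-odds scale, which is what makes a single threshold meaningful across query-utterance pairs.</p>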
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>Table 1 shows our results for development and evaluation
queries. Probably due to the high amount of noise (and
reverberation) in the dataset, the overall cross-entropy score
is poor even after noise removal. If we look at the scores
for each query type, clearly our system works best for the
T1 query type. This can be attributed to the fact that we
didn't take any special steps to counter T2 and T3 query
types like word level reordering (for T2 queries) and partial
matching (for T3 queries). Also, we didn't calibrate our
score for Term Weighted Values (TWV) resulting in very
low ATWV/MTWV scores.</p>
      <p>We observed that after subsequence DTW many possible
detections (candidates) were found for a query in an
utterance. This suggests that the posterior and bottleneck
features used were not robust enough for the given noisy
and multilingual data. Also, we rely heavily on our first
step of fusion which, since α = 1, is the log of the arithmetic
mean of the exponentiated scores from the various subsystems, to detect the
best candidate for a given query-utterance pair. So a high
score from even one of the subsystems can make the
combined score (obtained after Step 1 of fusion) biased towards
it, leading to the selection of that candidate over other
candidates with moderate scores from all the systems.</p>
      <p>Our experiments were done on a computer with an Intel
i7-4790 CPU (3.60GHz, 8 cores) and 16GB RAM. For searching,
all the posteriorgrams for a query-utterance pair were loaded
in memory. This caused high memory usage for longer
utterances (peak memory usage of around 15GB). It took us
around 80 hours to search approximately 475 seconds of
queries in the 18 hours of audio database per subsystem,
leading to a searching speed factor (SSF) of 0.0093 per sec.</p>
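      <p>The reported SSF can be checked with a short calculation. The sketch assumes that SSF is the total search time divided by the product of the audio duration and the query duration, which reproduces the figure above from the rounded durations; the variable names are hypothetical.</p>
      <preformat><![CDATA[
```python
search_time_s = 80 * 3600   # about 80 hours of searching
audio_s = 18 * 3600         # about 18 hours of audio
query_s = 475               # about 475 seconds of queries

# Searching speed factor, in units of 1/sec.
ssf = search_time_s / (audio_s * query_s)
print(f"SSF = {ssf:.5f} per sec")
```
]]></preformat>
      <p>Since all three durations are approximate, the result agrees with the reported value only to about two significant digits.</p>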
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION</title>
      <p>We have described the system developed at IIT-B for the
QUESST task. To combat the effect of noise in the data, we used
spectral subtraction. Spectral subtraction reduces noise but
is also known to create artifacts in speech, and so the
posterior/bottleneck features were not robust enough for the given
noisy and multilingual data. It would be interesting to study
the performance of our system without noise suppression.
The main novelty of our work was a two-step fusion
approach in which the first step decides the best candidate
for a query-utterance pair and the second step trains
a logistic regression classifier. The effect of the first step of
fusion for different values of α on the cross entropy score
needs to be investigated.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Igor</given-names>
            <surname>Szőke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luis J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andi</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Florian</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Proenca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Lojka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiao</given-names>
            <surname>Xiong</surname>
          </string-name>
          .
          <article-title>Query by example search on speech at MediaEval 2015</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          ,
          14-15 September
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hazen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>White</surname>
          </string-name>
          .
          <article-title>Query-by-example spoken term detection using phonetic posteriorgram templates</article-title>
          .
          <source>In Proc. ASRU</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
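          <string-name>
            <given-names>Igor</given-names>
            <surname>Szőke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lukas</given-names>
            <surname>Burget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Frantisek</given-names>
            <surname>Grezl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan</given-names>
            <surname>Cernocky</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Lucas</given-names>
            <surname>Ondel</surname>
          </string-name>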
          .
          <article-title>Calibration and fusion of query-by-example systems - BUT SWS 2013</article-title>
          .
          <source>In Proc. ICASSP</source>
          , pages
          <fpage>7899</fpage>
          -
          <lpage>7903</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Martin</surname>
          </string-name>
          .
          <article-title>Noise power spectral density estimation based on optimal smoothing and minimum statistics</article-title>
          .
          <source>IEEE Transactions on Speech and Audio Processing</source>
          ,
          <volume>9</volume>
          (
          <issue>5</issue>
          ):
          <fpage>504</fpage>
          -
          <lpage>512</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Grezl</surname>
          </string-name>
          , M. Karafiat, S. Kontar, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Cernocky</surname>
          </string-name>
          .
          <article-title>Probabilistic and bottle-neck features for LVCSR of meetings</article-title>
          .
          <source>In Proc. ICASSP</source>
          , pages
          <fpage>757</fpage>
          -
          <lpage>760</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Chourasia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Samudravijaya</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Chandwani</surname>
          </string-name>
          .
          <article-title>Phonetically rich Hindi sentence corpus for creation of speech database</article-title>
          .
          <source>In Proc. O-Cocosda</source>
          , pages
          <fpage>132</fpage>
          -
          <lpage>137</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          .
          <article-title>Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams</article-title>
          .
          <source>In Proc. ASRU</source>
          , pages
          <fpage>398</fpage>
          -
          <lpage>403</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodríguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bordel</surname>
          </string-name>
          .
          <article-title>On the calibration and fusion of heterogeneous spoken term detection systems</article-title>
          .
          <source>In Proc. INTERSPEECH</source>
          , pages
          <fpage>20</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>