<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SWS task: Articulatory Phonetic Units and Sliding DTW</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gautam Varma Mantena</string-name>
          <email>gautam.mantena@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bajibabu B</string-name>
          <email>bajibabu.b@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kishore Prahallad</string-name>
          <email>kishore@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Speech and Vision Lab, International Institute of, Information Technology</institution>
          ,
          <addr-line>Hyderabad, Andhra Pradesh</addr-line>
          ,
          <country country="IN">India</country>
          ,
          <addr-line>bajibabu.b</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Speech and Vision Lab, International Institute of, Information Technology</institution>
          ,
          <addr-line>Hyderabad, Andhra Pradesh</addr-line>
          ,
          <country country="IN">India</country>
          ,
          <addr-line>gautam.mantena</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Speech and Vision Lab, International Institute of, Information Technology</institution>
          ,
          <addr-line>Hyderabad, Andhra Pradesh</addr-line>
          ,
          <country>India, kishore</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>This paper describes the experiments conducted for the spoken web search task at the MediaEval 2011 evaluations. The task consists of searching for audio segments within audio content using an audio query. The current approach uses broad articulatory phonetic units to index the audio files and to obtain candidate audio segments. A sliding DTW search is then applied on these segments to determine the time stamps of the query.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The proposed approach aims at identifying audio
segments within audio content using an audio query.
Language independence is one of the primary constraints for
the spoken web search task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We have implemented a two-stage
process for obtaining the most likely audio segments.
First, all the audio files are indexed based on their
corresponding articulatory phonetic units. The input audio query
is decoded into its corresponding articulatory phonetic units,
and the audio segments that contain a similar sequence
are selected. A sliding-window search based on the dynamic
time warping (DTW) algorithm is then used to determine the
time stamps within each audio segment. This approach is
described in more detail in section 2.
      </p>
    </sec>
    <sec id="sec-2">
      <title>TASK DESCRIPTION</title>
      <p>A variant of the DTW search is used to identify the audio
segments within the audio content. The DTW algorithm is time
consuming, so it is necessary for the system to select the
candidate segments beforehand. The system therefore includes
an indexing step, which improves the retrieval time by
selecting the required audio segments within the audio
content. The approach is a two-level process that prunes the
appropriate segments at each level. The two levels
implemented are as follows:
1. First level: Indexing the audio data in terms of its
articulatory phonetic units and using them to obtain
the most likely segments for the input audio query.
2. Second level: Applying a sliding DTW search on the
selected segments to obtain the time stamps of the query.</p>
      <p>The procedure for the audio search task is described in
sections 2.1 and 2.2.</p>
    </sec>
    <sec id="sec-3">
      <title>Indexing using Articulatory Phonetic Units</title>
      <p>The primary motivation for this approach is to use speech
specific features rather than language specific features such
as phone models. The advantage is that well chosen
articulatory phonetic units can represent a broader set of
languages. This enables us to build articulatory units
from one language and use them for other languages.</p>
      <p>The selected articulatory units are shown in table 1. For
example, the articulatory unit CON VEL UN is a consonant
velar unvoiced sound. A more detailed description of the
tags is given in table 2.</p>
      <p>
        The audio content is decoded into its corresponding
articulatory units using HMM models with 64 Gaussian mixture
components, built using the HTK Tool Kit [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The models were trained
using 15 hours of telephone Telugu data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] consisting of
200 speakers. Trigrams of the decoded articulatory phonetic
output were used for indexing. The audio query was also
decoded, and an audio segment was selected if any of its
trigrams matched one of the trigrams from the audio query.
Let t<sub>start</sub> and t<sub>end</sub> be the start and end
time stamps of a trigram in the audio content that matches one
of the trigrams from the audio query. The likely segment from
the audio content then spans from (t<sub>start</sub> − audio
query length) to (t<sub>end</sub> + audio query length). This
padding enables capturing speech segments with varying
speaking rates.
      </p>
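      <p>A minimal sketch of this indexing step, assuming the decoder
emits the articulatory units as (unit, start time, end time)
tuples; all function and variable names here are illustrative,
not part of the described system:</p>
      <preformat>
from collections import defaultdict

def build_trigram_index(decoded):
    """Map each trigram of articulatory units to its (t_start, t_end) spans.

    `decoded` is a list of (unit, t_start, t_end) tuples, i.e. a decoded
    label sequence with time stamps (format assumed for illustration).
    """
    index = defaultdict(list)
    for i in range(len(decoded) - 2):
        trigram = tuple(unit for unit, _, _ in decoded[i:i + 3])
        index[trigram].append((decoded[i][1], decoded[i + 2][2]))
    return index

def candidate_segments(index, query_decoded, query_len):
    """Return padded candidate segments for every matching query trigram."""
    segments = []
    for i in range(len(query_decoded) - 2):
        trigram = tuple(unit for unit, _, _ in query_decoded[i:i + 3])
        for t_start, t_end in index.get(trigram, []):
            # Pad by one query length on each side so that speaking-rate
            # differences between the query and the content are tolerated.
            segments.append((max(0.0, t_start - query_len), t_end + query_len))
    return segments
      </preformat>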
      <p>These time stamps provide the audio segments that are
likely to contain the audio query. A sliding DTW search is
then applied on these audio segments to obtain the
appropriate time stamps for the query, as explained in detail
in section 2.2.</p>
    </sec>
    <sec id="sec-4">
      <title>Sliding Window DTW Search</title>
      <p>In the regular DTW algorithm, the two audio segments are
assumed to differ only in timing, and the algorithm performs
time normalization, i.e., it fixes the beginning and the end of
the audio segments. In spoken term detection we also need
to identify the right audio segment within an audio file, with
appropriate time stamps.</p>
      <p>We propose an approach in which an audio content segment
of length twice that of the audio query is considered and a
DTW search is performed. After a segment has been compared,
the window is moved by one feature shift and the DTW search
is computed again. MFCC features, with a window length of
20 ms and a window shift of 10 ms, are used to represent the
speech signal. Consider an audio content segment S and an
audio query Q. Construct a substitution matrix M of size
q × s<sub>q</sub>, where q is the size of Q and
s<sub>q</sub> = 2q. We define M[i,j] as the node measuring the
optimal alignment of the segments Q[1:i] and S[1:j].</p>
      <p>
        During the DTW search, at some instants M[q,j] (j &lt; s<sub>q</sub>)
will be reached. The time instants from column j to
column s<sub>q</sub> are then the possible end points for the
audio segment. The Euclidean distance measure is used to
calculate the costs in the matrix M. The above procedure was
adapted from a similar approach described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
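      <p>A minimal sketch of the window-level DTW and the sliding
search, assuming the query and the content are arrays of MFCC
frames (e.g. NumPy arrays); the names are illustrative:</p>
      <preformat>
import numpy as np

def window_dtw_end_costs(query, window):
    """DTW between a query Q (q frames) and a window S (s_q = 2q frames).

    Fills the substitution matrix M with Euclidean frame distances and
    returns the accumulated costs along the last row M[q, :]; low values
    at column j mark possible end points of the query inside the window.
    """
    q, sq = len(query), len(window)
    M = np.full((q + 1, sq + 1), np.inf)
    M[0, 0] = 0.0  # the path is anchored at the start of the window
    for i in range(1, q + 1):
        for j in range(1, sq + 1):
            cost = np.linalg.norm(query[i - 1] - window[j - 1])
            M[i, j] = cost + min(M[i - 1, j], M[i, j - 1], M[i - 1, j - 1])
    return M[q, 1:]

def sliding_dtw(query, content):
    """Slide the window one feature frame at a time over the content and
    collect (start_frame, end_frame, score) candidates."""
    q = len(query)
    candidates = []
    for start in range(len(content) - 2 * q + 1):
        end_costs = window_dtw_end_costs(query, content[start:start + 2 * q])
        # For brevity this sketch keeps only the best end column per window;
        # the paper considers all possible end points in the last row.
        j = int(np.argmin(end_costs))
        candidates.append((start, start + j + 1, float(end_costs[j])))
    return candidates
      </preformat>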
      <p>The scores corresponding to all the possible end points are
clustered using k-means. For k = 3, the mean score of each
cluster is calculated, and the minimum cluster mean is used as
a threshold to select segments. Among overlapping segments
(an overlap of 70%), only the segment with the lowest score is
retained.</p>
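      <p>A minimal sketch of this selection step, assuming the
candidates are (start, end, score) tuples from the sliding DTW
search; scikit-learn's KMeans stands in for the clustering, and
the helper names are illustrative:</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def overlap_fraction(a, b):
    """Fraction of the shorter segment covered by the intersection."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0

def select_segments(candidates, k=3, max_overlap=0.7):
    """Keep candidates whose score falls below the minimum k-means
    cluster mean, then suppress segments overlapping by more than 70%."""
    scores = np.array([[c[2]] for c in candidates])
    centers = KMeans(n_clusters=k, n_init=10).fit(scores).cluster_centers_
    threshold = float(centers.min())   # minimum cluster mean
    kept = sorted((c for c in candidates if threshold >= c[2]),
                  key=lambda c: c[2])  # best (lowest) score first
    selected = []
    for cand in kept:
        if not any(overlap_fraction(cand, s) > max_overlap for s in selected):
            selected.append(cand)
    return selected
      </preformat>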
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <p>
        From the audio content and the audio query, speech and
non-speech segments are first detected; indexing and the DTW
search are then applied only on the speech segments. A
zero-frequency filtered signal is generated from the audio
signal using a zero-frequency resonator, and this signal is
used to detect voiced and unvoiced regions [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. If the duration of an unvoiced segment is
more than 300 ms, it is treated as a non-speech segment.
      </p>
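      <p>A minimal sketch of this pruning rule, assuming a per-frame
voiced/unvoiced decision (e.g. at a 10 ms frame shift) is
already available from the zero-frequency method of [5]; the
names are illustrative:</p>
      <preformat>
def non_speech_segments(voiced_flags, frame_shift=0.01, min_dur=0.3):
    """Mark unvoiced runs longer than 300 ms as non-speech.

    `voiced_flags` is a per-frame boolean sequence (True = voiced),
    assumed to come from a zero-frequency-resonator voicing detector.
    Returns (start_sec, end_sec) pairs of non-speech regions.
    """
    segments, run_start = [], None
    for i, voiced in enumerate(voiced_flags):
        if not voiced and run_start is None:
            run_start = i                  # an unvoiced run begins
        elif voiced and run_start is not None:
            if (i - run_start) * frame_shift > min_dur:
                segments.append((run_start * frame_shift, i * frame_shift))
            run_start = None               # the run has ended
    n = len(voiced_flags)
    if run_start is not None and (n - run_start) * frame_shift > min_dur:
        segments.append((run_start * frame_shift, n * frame_shift))
    return segments
      </preformat>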
      <p>
        The system was evaluated using the NIST spoken term
detection evaluation scheme [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. On the development data, the miss probability ranges
from 70% to 98% and the false alarm probability from 0.1% to
0.6%. On the evaluation data, the miss probability ranges from
96% to 98% and the false alarm probability from 0.1% to 0.2%.
      </p>
    </sec>
    <sec id="sec-6">
      <title>DISCUSSIONS</title>
      <p>For indexing the audio content, trigrams were used. The
approach can be extended to bigrams or four-grams. Using
bigrams would return a large number of segments, which is a
problem given the slow speed of the sliding DTW search. With
four-grams, the recall of the audio content files drops
drastically. Trigrams therefore seem to strike a balance
between bigram and four-gram indexing.</p>
      <p>For estimating the end point, k-means clustering was
applied to the DTW scores obtained from all the audio
segments. We suspect that this might be the reason certain
segments are lost: DTW scores obtained from similar
pronunciations might mask the scores from other pronunciation
variations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Arun</given-names>
            <surname>Kumar</surname>
          </string-name>
          , Nitendra Rajput, Dipanjan Chakraborty,
          <string-name>
            <surname>Sheetal K. Agarwal</surname>
          </string-name>
          , and Amit Anil Nanavati,
          <source>“WWTW: The World Wide Telecom Web,” in NSDR 2007 (SIGCOMM workshop)</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Steve</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Kershaw</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ollason</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Valtchev</surname>
            , and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Woodland</surname>
          </string-name>
          ,
          <source>The HTK Book Version 3.4</source>
          , Cambridge University Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Gopalakrishna</given-names>
            <surname>Anumanchipalli</surname>
          </string-name>
          , Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Pal Singh,
          <string-name>
            <given-names>R.N.V.</given-names>
            <surname>Sitaram</surname>
          </string-name>
          , and
          <string-name>
            <surname>S P Kishore</surname>
          </string-name>
          , “
          <article-title>Development of indian language speech databases for large vocabulary speech recognition systems,”</article-title>
          <source>in Proc. SPECOM</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kishore</given-names>
            <surname>Prahallad</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Black</surname>
          </string-name>
          , “
          <article-title>Segmentation of monologues in audio books for building synthetic voices,”</article-title>
          <source>in IEEE Transactions on Audio, Speech and Language Processing</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dhananjaya</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Yegnanarayana</surname>
          </string-name>
          , “
          <article-title>Voiced/nonvoiced detection based on robustness of voiced epochs,”</article-title>
          <source>in IEEE Signal Processing Letters</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fiscus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ajot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Garofolo</surname>
          </string-name>
          , and G. Doddington, “
          <article-title>Results of the 2006 spoken term detection evaluation,”</article-title>
          <source>in Proc. of the ACM SIGIR</source>
          <year>2007</year>
          , Workshop in Searching Spontaneous Conversational Speech,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>