<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UNIZA System for the Spoken Web Search Task at MediaEval2013</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          Roman Jarina, Michal Kuba, Róbert Gubka, Michal Chmulik, Martin Paralič
          <institution>Audiolab, Department of Telecommunications and Multimedia, University of Žilina</institution>
          ,
          <addr-line>Univerzitná 1, 010 26 Žilina</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>In this paper, we present an approach to detecting spoken keywords according to a given query, as part of the MediaEval benchmark. The proposed approach is based on the concept of modelling the speech query as a concatenation of language-independent quasi-phoneme models, which are derived by unsupervised clustering of various audio data. Since only an initial version of the system is presented, issues concerning further system improvements are also discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The developed system generates stochastic models of “elementary
sounds” (ES) derived from the provided speech data in various
languages. These ESs are used as building blocks for speech
modelling instead of conventional phoneme-based models. We
recently adopted this approach for the task of generic sound
modelling and retrieval [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with promising results. The approach
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is built on the assumption that, in general, many types of
generic sounds can be modelled as sequences of ES units picked
from a sufficiently large (much greater than the number of speech
units), though finite, inventory. Due to the diverse nature of
generic sounds, it is infeasible to specify the elementary units in
an acoustical form; instead, the ESs can be defined only by their
stochastic models.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. PROPOSED APPROACH</title>
      <p>
        The system layout is depicted in Figure 1. Each utterance (query) is
modelled by an HMM-based statistical model. Since no information
about the language or linguistic structure of the utterances is
available, models of phonemes (or any other linguistic units)
obviously could not be built in advance. Instead, we propose to
build models of language-independent, quasi-phoneme-like ES units
with the aid of unsupervised clustering. The HMM for each query is
then built by concatenating such ES models (or fillers). Searching
for the query over the database is performed by Viterbi decoding [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The decoder outputs
cumulative probabilities (confidences) together with the starting
positions from which the cumulative probabilities were computed.
Candidates for the search results are obtained by thresholding the
confidence curve. Only the provided SWS2013 development data
were used during system development.
      </p>
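      <p>The thresholding step can be sketched as follows (a minimal numpy illustration; the function name and the toy confidence values are ours, not part of the submitted system):</p>

```python
import numpy as np

def candidates_from_confidence(confidence, threshold):
    """Return (start, end) frame-index pairs of contiguous runs where
    the confidence curve stays at or above the threshold."""
    above = confidence >= threshold
    edges = np.diff(above.astype(int))     # +1 rising edge, -1 falling edge
    starts = list(np.where(edges == 1)[0] + 1)
    ends = list(np.where(edges == -1)[0] + 1)
    if above[0]:
        starts.insert(0, 0)
    if above[-1]:
        ends.append(len(confidence))
    return list(zip(starts, ends))

conf_curve = np.array([0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.7, 0.1])
print(candidates_from_confidence(conf_curve, 0.5))   # [(2, 4), (5, 7)]
```

      <p>Runs of frames whose confidence stays at or above the threshold become candidate detections; their first indices correspond to the starting positions reported by the decoder.</p>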
    </sec>
    <sec id="sec-3">
      <title>2.1 Feature extraction</title>
      <p>We applied a 30 ms sliding window with a 10 ms shift to
compute MFCCs (Mel Frequency Cepstral Coefficients), which may
be considered the standard in speech recognition. Each frame is
represented by a 39-dimensional feature vector composed of 13
MFCCs (including the 0th cepstral coefficient) along with their
delta and delta-delta derivatives, followed by Cepstral Mean
Normalisation. The feature vectors form the observations for
subsequent acoustical modelling by HMMs.</p>
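      <p>The framing and normalisation steps can be illustrated as follows (a numpy sketch with our own function names; the MFCC/filterbank computation itself is omitted):</p>

```python
import numpy as np

def frame_signal(x, sr, win_ms=30, hop_ms=10):
    """Slice a waveform into overlapping analysis frames
    (30 ms window, 10 ms shift, as used above)."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(x) - win) // hop   # assumes len(x) >= win
    return np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])

def cepstral_mean_normalisation(feats):
    """Subtract the per-utterance mean of every coefficient (the CMN
    step applied to the 39-dim MFCC+delta+delta-delta vectors)."""
    return feats - feats.mean(axis=0, keepdims=True)

sr = 8000
x = np.random.default_rng(0).standard_normal(sr)   # 1 s of dummy audio
frames = frame_signal(x, sr)                       # shape (98, 240)
```

      <p>At an 8 kHz sampling rate the 30 ms/10 ms setting yields 240-sample frames every 80 samples; CMN then removes the per-utterance mean of each coefficient, reducing channel variability.</p>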
    </sec>
    <sec id="sec-4">
      <title>2.2 Concept of quasi phoneme-like models</title>
      <p>The overall performance of our proposed approach to the SWS
task relies on how precisely the developed ES unit models mimic
real phoneme-based units. They should therefore preserve as much
linguistic information as possible, while non-linguistic features
(related to speaker/gender variability, emotion, background noise,
channel characteristics, etc.) should be suppressed. This is a very
challenging problem and can be solved only to a certain extent. We
are aware that the developed ES units are only a rough
approximation of linguistic units when no information about
language structure is taken into account.</p>
      <p>
        In these initial experiments, we developed the ES models by the
following procedure:
1) All 20 hours of audio of the SWS2013 dataset were
parameterized as sequences of MFCC vectors.
2) The means and variances of the Gaussian components of a
semicontinuous Gaussian mixture PDF were estimated by
unsupervised clustering of all MFCC vectors from the dataset.
The clustering was performed by the K-variable K-means
procedure [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. During the clustering, small clusters (containing
fewer than 30 vectors) were discarded. This iterative procedure
converged to 257 clusters, whose centroids defined the means of
the PDF’s Gaussian components.
3) Next, the sequence of MFCC vectors computed from the
SWS2013 dataset was divided into segments of 20 vectors with
50% overlap, each spanning 0.1 s of audio, which we initially
consider the average duration of an ES unit. A 1-state
semi-continuous density HMM (SD-HMM) was then estimated for
each segment. Since all SD-HMMs share PDFs with the same
Gaussian components, only the weights of the Gaussians in the
mixtures had to be computed during ES model estimation. The
HMM states are thus defined by PDFs composed of Gaussian
mixtures described by 257-dimensional weight vectors. Such an
approach is much less computationally demanding than
conventional HMM training. Since it produces about 1.5 million
models, the number of weight vectors was massively reduced,
again by K-variable K-means clustering (only the 207 clusters with
the largest numbers of vectors were preserved). This procedure
yields 207 models of ES units.
      </p>
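      <p>The clustering step can be sketched roughly as follows: our simplified reading of the K-variable K-means of [4] as plain k-means that discards clusters attracting fewer than 30 vectors between iterations (the exact procedure in [4] may differ; the toy data and function name are ours):</p>

```python
import numpy as np

def k_variable_kmeans(X, k_init=16, min_size=30, iters=20, seed=0):
    """Plain k-means in which clusters attracting fewer than
    `min_size` vectors are discarded between iterations, so the
    number of clusters K varies during training (simplified)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k_init, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        counts = np.bincount(labels, minlength=len(centroids))
        keep = np.where(counts >= min_size)[0]   # discard small clusters
        centroids = np.stack([X[labels == j].mean(axis=0) for j in keep])
    return centroids

# toy data: three well-separated 13-dimensional blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.1, size=(200, 13)) for m in (0.0, 1.0, 2.0)])
C = k_variable_kmeans(X, k_init=10)
```

      <p>In the full system this is run on all MFCC vectors of the dataset, and the surviving centroids define the means of the shared Gaussian components.</p>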
      <p>[Figure 1. System layout: Audio data → Feature extraction → Cluster analysis (estimation of PDF Gaussian components) → Data stream segmentation → ES model estimation → Cluster analysis (reduction of the number of models).]</p>
      <p>
        The queries are represented by concatenations of these ES
models, as shown in Figure 1. For each query, the best sequence of
ES models was obtained by Viterbi decoding according to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It should be noted that during decoding, each
1-state HMM was replaced by a 10-equal-state HMM (obtained by
concatenating the same state), with the aim of avoiding the
detection of very short segments. This ensured that the decoded
sequence passed through the same state at least 10 times, i.e., the
process remained in the same state for at least 0.1 s.
      </p>
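      <p>The minimum-duration trick can be illustrated with a toy sketch (a strict left-to-right chain with synthetic zero log-emissions; this is a stand-in for, not a reproduction of, the decoder of [2]):</p>

```python
import numpy as np

def expand_states(log_emis, repeats=10):
    """Duplicate each row (model state) `repeats` times, so a
    left-to-right path must spend at least `repeats` frames per
    original state (0.1 s at a 10 ms frame shift)."""
    return np.repeat(log_emis, repeats, axis=0)

def viterbi_left_to_right(log_emis):
    """Best-path log score through a strict left-to-right chain
    (self-loop or advance by one state); `log_emis` is
    (n_states, n_frames)."""
    n_states, n_frames = log_emis.shape
    delta = np.full((n_states, n_frames), -np.inf)
    delta[0, 0] = log_emis[0, 0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = delta[s, t - 1]
            move = delta[s - 1, t - 1] if s > 0 else -np.inf
            delta[s, t] = max(stay, move) + log_emis[s, t]
    return delta[n_states - 1, n_frames - 1]

# two ES states expanded to 20 equal states: a 25-frame segment can be
# decoded, but a 15-frame segment cannot (the path needs 20+ frames)
long_enough = viterbi_left_to_right(expand_states(np.zeros((2, 25))))
too_short = viterbi_left_to_right(expand_states(np.zeros((2, 15))))
```

      <p>Because the expanded chain has 10 copies of each state and no skips, any segment shorter than 10 frames per model receives a score of minus infinity, which is exactly the intended suppression of very short detections.</p>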
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>
        The maximal confidence obtained by Viterbi decoding was chosen as
the score in the submission. Segments with a score above the given
threshold (tuned on the development queries) were labelled with a
YES decision. In an effort to decrease the high false alarm rate,
segments with duration D &gt; 2Q or D &lt; 0.5Q (where Q is the
duration of the query) were filtered out.
The performance of the submitted system was evaluated in terms
of several measures defined by the SWS2013 task [
        <xref ref-type="bibr" rid="ref1 ref5">1,5</xref>
        ]:
the Actual/Maximum Term-Weighted Values ATWV/MTWV
(weighted combinations of the miss and false alarm error rates) and
the normalized cross-entropy metric Cnxe. The official results (late
submission) are summarized in Table 1. In addition, the processing
resource requirements, namely the Indexing/Searching Speed
Factors (ISF/SSF) as defined in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], are estimated.
      </p>
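      <p>The duration filter can be written compactly (a sketch; the detection tuples and the query duration below are hypothetical examples, not actual system output):</p>

```python
def passes_duration_filter(d_dur, q_dur):
    """Keep a detection only if its duration D satisfies
    0.5*Q ≤ D ≤ 2*Q (the false-alarm filter described above)."""
    return (d_dur >= 0.5 * q_dur) and (2.0 * q_dur >= d_dur)

# hypothetical detections as (start, end, score), times in seconds
dets = [(0.4, 0.9, 0.12), (0.3, 2.4, 0.10)]
q = 0.8   # query duration in seconds
kept = [d for d in dets if passes_duration_filter(d[1] - d[0], q)]
# kept -> [(0.4, 0.9, 0.12)]
```

      <p>The second detection is dropped because its 2.1 s duration exceeds twice the 0.8 s query duration.</p>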
      <p>We believe the following bottlenecks caused the very low
performance of this first version of the submitted system:
- Since the ES models are obtained by unsupervised clustering, it
is essential that features belonging to the same linguistic unit be
grouped together in the feature vector space. The problem with
MFCC-based features is that they are suited to both speech
recognition and speaker discrimination, which is disadvantageous
in our case. A discriminant analysis transform of the MFCCs,
yielding speaker-independent features, might be applied prior to
clustering;
- We have not investigated the impact of the size of the ES
inventory on search performance. Due to lack of time, only a
simplified HMM training without re-estimation was applied for ES
model development;
- Inspecting the results, we noticed that ES models of noise and
silence appeared very often in the decoded sequences, and the
overall confidence was affected by these “non-linguistic” models.
Proper prior speech activity detection on both the queries and the
audio content (to avoid modelling non-speech events) would also
help.</p>
      <p>[Table 1. Official results (late submission) on the development (Devel.) and evaluation (Eval.) sets; only the column labels survive in the source.]</p>
      <sec id="sec-5-2">
        <p>The system was run on a workstation with a 32-core CPU. All
programming was done in Matlab and was not optimized for speed.
In the pre-processing (indexing) phase, only the MFCC features
were precomputed. This took about 1/60 of real time, i.e.
ISF = 1205 / (71839 + 696 s) = 0.017 for the evaluation data. Note
that the ES model computation is not included in the ISF. In the
submitted version of the system, audio content-based adaptation of
the ES models is not considered, so the development of the ES
models may be seen as a separate step. The processing time spent
on searching (recomputed for a single CPU) is approx.
SSF = 1.08 × 10⁶ / (71839 × 696 s) = 0.022. The peak memory
usage during search is very low because only the Viterbi algorithm
is performed; a rough estimate is 1–10 MB. Hence the proposed
system, if tuned (and compiled in a more suitable programming
environment), might be computationally very efficient.</p>
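        <p>The quoted speed factors can be reproduced directly from the numbers above (the variable names are ours; the denominators follow the formulas as quoted in the text):</p>

```python
# Reproduce the quoted speed factors (ISF/SSF as defined in [5]).
indexing_time = 1205.0               # seconds spent precomputing MFCCs
isf = indexing_time / (71839.0 + 696.0)

search_time = 1.08e6                 # single-CPU search time in seconds
ssf = search_time / (71839.0 * 696.0)

print(round(isf, 3), round(ssf, 3))   # 0.017 0.022
```
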
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>X.</given-names> <surname>Anguera</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Metze</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Buzo</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Szoke</surname></string-name>, and
          <string-name><given-names>L. J.</given-names> <surname>Rodriguez-Fuentes</surname></string-name>,
          “<article-title>The spoken web search task</article-title>,”
          MediaEval 2013 Workshop, Oct. 18-19, <year>2013</year>, Barcelona, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>J.</given-names> <surname>Nouza</surname></string-name> and
          <string-name><given-names>J.</given-names> <surname>Silovsky</surname></string-name>,
          “<article-title>Fast keyword spotting in telephone speech</article-title>,”
          <source>Radioengineering</source>, <volume>18</volume>(<issue>4</issue>), <year>2009</year>, <fpage>665</fpage>-<lpage>670</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>R.</given-names> <surname>Gubka</surname></string-name> and
          <string-name><given-names>M.</given-names> <surname>Kuba</surname></string-name>,
          “<article-title>Elementary sound based audio pattern searching</article-title>,”
          <source>23rd Int. Conf. Radioelektronika</source>, April 16-17, <year>2013</year>, <fpage>325</fpage>-<lpage>328</lpage>.
          DOI: http://dx.doi.org/10.1109/RadioElek.2013.6530940
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>M. J.</given-names> <surname>Reyes-Gomez</surname></string-name> and
          <string-name><given-names>D. P. W.</given-names> <surname>Ellis</surname></string-name>,
          “<article-title>Selection, parameter estimation, and discriminative training of hidden Markov models for general audio modeling</article-title>,”
          <source>Int. Conf. on Multimedia and Expo (ICME '03)</source>, July <year>2003</year>, <fpage>I-73</fpage>-<lpage>76</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>L. J.</given-names> <surname>Rodriguez-Fuentes</surname></string-name> and
          <string-name><given-names>M.</given-names> <surname>Penagarikano</surname></string-name>,
          “<article-title>MediaEval 2013 Spoken Web Search Task: System Performance Measures</article-title>,”
          <source>TR-2013-1</source>, Dept. of Electricity and Electronics, Univ. of the Basque Country, <year>2013</year>,
          http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>