<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BUT-HCTLab APPROACHES FOR SPOKEN WEB SEARCH - MEDIAEVAL 2011</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>José Colás</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier Tejedor</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Fapšo</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HCTLab, Universidad Autónoma de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HCTLab, Universidad Autónoma de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>We present the three approaches submitted to the Spoken Web Search task. Two of them rely on Acoustic Keyword Spotting (AKWS), while the third relies on Dynamic Time Warping (DTW). Features are 3-state phone posteriors. Results suggest that the best performance is obtained by applying a Karhunen-Loeve transform to the log-phone posteriors representing the query, building a GMM/HMM for each query, and then running an AKWS system.</p>
      </abstract>
      <kwd-group>
        <kwd>query-by-example spoken term detection</kwd>
        <kwd>acoustic keyword spotting</kwd>
        <kwd>dynamic time warping</kwd>
        <kwd>spoken web search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. MOTIVATION</title>
      <p>The Spoken Web Search (SWS) task aims at building a
language-independent query-by-example spoken term
detection system without any knowledge of the target language
or of the query transcriptions. To this end, our approaches
combine as many language-dependent
“recognizers” as possible.</p>
    </sec>
    <sec id="sec-3">
      <title>2. FEATURE EXTRACTION</title>
      <p>Our feature extractor outputs 3-state phone posteriors as
features [2]. The phone posterior estimator [5] contains a
Neural Network (NN) classifier with a hierarchical structure
called bottle-neck universal context network. It consists of
a context network, trained as a 5-layer NN, and a merger</p>
    </sec>
    <sec id="sec-4">
      <title>3. APPROACHES</title>
    </sec>
    <sec id="sec-5">
      <title>3.1 Parallel Acoustic Keyword Spotting (PAKWS)</title>
      <p>We combined the decisions of 6 language-dependent
Acoustic Keyword Spotters (AKWS), one for each language in
Table 1 except Polish, which was left out because of its worse performance on
dev data. Each AKWS works in two steps: query
recognition, done by a phone recognizer, and query detection, done by
the AKWS itself. The two steps differ only in the decoder. Features (3-state
phone posteriors) extracted from the audio are fed into a
phone decoder, an unrestricted phone loop without any phone
insertion penalty. AKWS filler model-based recognition
networks [4] are then built from the phone string detected for
each query. The filler/background models are represented by
a phone loop, and each phone model is represented by a 3-state
HMM tied to the 3-state phone posteriors. The output of the
AKWS is a set of putative hits, each scored with the logarithm of the
likelihood ratio normalized by the length of the detection.
For each utterance, these detections are converted into a matrix
of size #queries × #frames.
Next, the matrices of all 6 languages are “log added”. Finally,
the combined matrix is converted back into a list of
detections, so that a detection on which all 6 detectors agree receives a
higher score.</p>
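      <p>The combination step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tuple layout of a detection, the floor score given to frames without a detection (FLOOR), the thresholding used to turn the combined matrix back into detections, and all function names are assumptions made for illustration. Log-addition is taken here to mean log(sum(exp(x))) over the language-specific matrices, so frames on which several detectors agree accumulate a higher combined score.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np

FLOOR = -20.0  # assumed log-score for frames where a detector reports nothing


def detections_to_matrix(detections, n_queries, n_frames, floor=FLOOR):
    """Rasterise (query, start, end, score) hits into a queries x frames matrix."""
    mat = np.full((n_queries, n_frames), floor, dtype=float)
    for query, start, end, score in detections:
        # Keep the best score where detections of the same query overlap.
        mat[query, start:end] = np.maximum(mat[query, start:end], score)
    return mat


def log_add(matrices):
    """Combine per-language matrices element-wise as log(sum(exp(m)))."""
    return np.logaddexp.reduce(np.stack(matrices), axis=0)


def matrix_to_detections(mat, threshold):
    """Turn each run of consecutive frames above threshold into one detection."""
    hits = []
    for query, row in enumerate(mat):
        # Pad with False so that runs touching the borders are closed properly.
        active = np.concatenate(([False], row > threshold, [False]))
        edges = np.flatnonzero(np.diff(active.astype(int)))
        for start, end in zip(edges[::2], edges[1::2]):
            best = float(row[start:end].max())
            hits.append((query, int(start), int(end), best))
    return hits
```
]]></preformat>
      <p>In this sketch a frame reported by two detectors scores higher than a frame reported by one, and both score far above the floor, which is the behaviour the thresholding step relies on.</p>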
    </sec>
    <sec id="sec-6">
      <title>3.2 GMM/HMM term modeling</title>
      <p>
        Inspired by the previous approach, this one relies on a
single AKWS for query detection, with these differences: (1)
the background model is a 1-state GMM/HMM
modeled with 10 GMM components, (2) the query model is
a GMM/HMM whose number of states is 3
times the number of phones given by the phone
recognition, with 1 GMM component each, and (3) all the languages
in Table 1 have been employed to produce the final
feature supervector. Queries represented by a single phone
have been modeled with 6 states, as if the query contained
2 phones. We used the number of phones output by the
Slovak recognizer because it performed best in terms of the
Upper-bound Term Weighted Value (UBTWV) metric [3].
Features used for background and query modeling were obtained
as follows: (1) a Karhunen-Loeve transform (KLT) is applied to the
log-phone posteriors produced by the feature extractor, separately
for each individual language, (2) we keep the features that
explain up to 95% of the variance after KLT for each
individual language, and (3) we build a 152-dimensional feature
supervector from them. The KLT statistics have been computed
from the dev data and then applied to both the dev/eval
queries and the dev/eval data.
      </p>
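      <p>The feature pipeline can be sketched in Python as follows. The sketch assumes that the KLT is computed as an eigen-decomposition of the covariance of the log-phone posteriors (i.e. PCA) and that "up to 95% of the variance" means keeping the smallest number of leading components whose cumulative variance reaches 95%. Function names are hypothetical, and the 152 dimensions reported above depend on the authors' data and are not reproduced here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def fit_klt(log_posteriors, variance_kept=0.95):
    """Estimate a KLT basis from a (frames x dims) matrix of one language.

    Returns the mean vector and the leading eigenvectors whose cumulative
    variance reaches `variance_kept` (assumed reading of the 95% rule).
    """
    mean = log_posteriors.mean(axis=0)
    cov = np.cov(log_posteriors - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort to descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    n_keep = int(np.searchsorted(explained, variance_kept)) + 1
    n_keep = min(n_keep, len(eigvals))
    return mean, eigvecs[:, :n_keep]


def apply_klt(log_posteriors, klt):
    """Project frames onto a previously estimated KLT basis."""
    mean, basis = klt
    return (log_posteriors - mean) @ basis


def build_supervector(per_language_log_posteriors, klts):
    """Concatenate, frame by frame, the projected features of every language."""
    projected = [apply_klt(x, k)
                 for x, k in zip(per_language_log_posteriors, klts)]
    return np.hstack(projected)
```
]]></preformat>
      <p>Following the text, fit_klt would be called once per language on dev data, and the resulting transforms would then be reused unchanged for dev and eval queries and data.</p>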
    </sec>
    <sec id="sec-7">
      <title>3.3 Dynamic Time Warping (DTW)</title>
      <p>A similarity matrix built from phonetic posteriorgrams [5] stores
the similarity between each query frame and each utterance
frame, with the cosine distance as similarity function, and a
DTW search hypothesises putative hits. DTW is run
iteratively, starting at every frame of the utterance and ending
at any frame of the utterance [5]. The features that represent
the phonetic posteriorgrams are the concatenation of the
3-state phone posteriors corresponding to every language in
Table 1.</p>
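      <p>The DTW search can be sketched as follows. Instead of re-running DTW from every start frame as described above, this sketch uses a single-pass subsequence DTW in which the path may start and end at any utterance frame; this is a swapped-in simplification, not the authors' exact procedure. The cost (1 minus cosine similarity), the three-step local path constraint, the normalisation by query length, and the function names are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def cosine_similarity_matrix(query, utterance, eps=1e-10):
    """Cosine similarity between every query frame and every utterance frame."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + eps)
    u = utterance / (np.linalg.norm(utterance, axis=1, keepdims=True) + eps)
    return q @ u.T


def subsequence_dtw(query, utterance):
    """Find the utterance span that best matches the query.

    Returns (start, end, cost) with `end` exclusive and `cost` equal to the
    accumulated path cost divided by the number of query frames.
    """
    cost = 1.0 - cosine_similarity_matrix(query, utterance)
    n_q, n_u = cost.shape
    acc = np.full((n_q, n_u), np.inf)
    start = np.zeros((n_q, n_u), dtype=int)
    acc[0] = cost[0]               # a path may start at any utterance frame
    start[0] = np.arange(n_u)
    for i in range(1, n_q):
        for j in range(n_u):
            # Predecessors: vertical, then (if available) diagonal, horizontal.
            cands = [(acc[i - 1, j], start[i - 1, j])]
            if j > 0:
                cands.append((acc[i - 1, j - 1], start[i - 1, j - 1]))
                cands.append((acc[i, j - 1], start[i, j - 1]))
            best_cost, best_start = min(cands, key=lambda c: c[0])
            acc[i, j] = cost[i, j] + best_cost
            start[i, j] = best_start
    end = int(np.argmin(acc[-1]))  # a path may end at any utterance frame
    return int(start[-1, end]), end + 1, float(acc[-1, end] / n_q)
```
]]></preformat>
      <p>To hypothesise several putative hits per utterance, the last row of the accumulated cost matrix could be scanned for further local minima instead of taking only the global one.</p>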
    </sec>
    <sec id="sec-8">
      <title>4. FILTERING AND CALIBRATION</title>
      <p>To deal with score calibration and with the problematic
query lengths observed under certain approaches, detections
were post-processed in the following steps: 1] “Filtering” of
detections according to the difference between their length and the “average length” of the query.
The average length of a query is calculated as the average length
of speech (phones) across the 6 phone recognizers used in
the PAKWS approach. Filtering was applied to all the approaches
except DTW as follows:</p>
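      <p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math><![CDATA[
\mathrm{Sc}_F(\mathrm{det}) =
\begin{cases}
\mathrm{Sc}(\mathrm{det}) - \dfrac{L^{Q}_{\min} - L(\mathrm{det})}{L^{Q}_{\min}}, & L^{Q}_{\min} > L(\mathrm{det}) \\[1ex]
\mathrm{Sc}(\mathrm{det}) - \dfrac{L(\mathrm{det}) - L^{Q}_{\max}}{L^{Q}_{\max}}, & L(\mathrm{det}) > L^{Q}_{\max} \\[1ex]
\mathrm{Sc}(\mathrm{det}), & \text{otherwise}
\end{cases}
          ]]></tex-math>
        </disp-formula>
where Q identifies the query to which the detection
belongs, Sc(det) is the original score, Sc<sub>F</sub>(det) is the
“filtered” score, L(det) is the length of the detection in frames,
L<sup>Q</sup><sub>min</sub> = 0.8 L<sup>Q</sup><sub>aver</sub> is 80% of the average query length and
L<sup>Q</sup><sub>max</sub> = 1.4 L<sup>Q</sup><sub>aver</sub> is 140% of the average query length. The
detection score remains the same if the detection length is
longer than 80% and shorter than 140% of the average query
length. Otherwise, the score is lowered by the relative amount by which the
detection is shorter or longer than these limits.</p>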
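      <p>The filtering rule can be written as a short Python function. This is a sketch under the assumption that lengths are measured in frames and that the limits are 80% and 140% of the average query length, as stated in the text; the function and parameter names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def filtered_score(score, det_len, avg_query_len, low=0.8, high=1.4):
    """Lower the score of detections much shorter or longer than the query.

    `low` and `high` are the fractions of the average query length that
    define the accepted length range (80% and 140% in the text).
    """
    l_min = low * avg_query_len
    l_max = high * avg_query_len
    if det_len < l_min:
        # Too short: subtract the relative shortfall with respect to l_min.
        return score - (l_min - det_len) / l_min
    if det_len > l_max:
        # Too long: subtract the relative excess with respect to l_max.
        return score - (det_len - l_max) / l_max
    return score
```
]]></preformat>
      <p>For an average query length of 100 frames, a 40-frame detection loses (80 - 40) / 80 = 0.5 from its score, and a 210-frame detection loses (210 - 140) / 140 = 0.5.</p>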
      <p>
        2] Calibration, applied only in our PAKWS approach,
produces the final score of each detection as follows:
        <disp-formula id="eq-2">
          <label>(2)</label>
          <tex-math><![CDATA[\mathrm{Sc}_C(\mathrm{det}) = \mathrm{Sc}_F(\mathrm{det}) + A_1 + A_2 \cdot \mathrm{Occ}(Q)]]></tex-math>
        </disp-formula>
where Occ(Q) is the number of detections of query Q in
the data, and A<sub>1</sub> = −1.0807 and A<sub>2</sub> = −0.0001 are
calibration parameters. These were estimated from the best thresholds
(UBTWV) on dev data using linear regression.
      </p>
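      <p>A minimal Python sketch of the calibration step, using the values of A1 and A2 given above. The helper fit_calibration shows one plausible reading of how such parameters could be obtained, namely by regressing the best per-query threshold on Occ(Q) and negating the fitted line so that the calibrated decision threshold sits at zero. This reading, and all names, are assumptions rather than the authors' documented procedure.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np

A1 = -1.0807  # calibration offset reported in the text
A2 = -0.0001  # calibration slope reported in the text


def calibrated_score(filtered, n_occurrences, a1=A1, a2=A2):
    """Apply the calibration: filtered score + A1 + A2 * Occ(Q)."""
    return filtered + a1 + a2 * n_occurrences


def fit_calibration(occurrences, best_thresholds):
    """Assumed estimation: fit threshold = slope * Occ(Q) + intercept by
    least squares, then negate so that subtracting the predicted threshold
    from a score is expressed as adding A1 + A2 * Occ(Q)."""
    slope, intercept = np.polyfit(occurrences, best_thresholds, 1)
    return -float(intercept), -float(slope)
```
]]></preformat>
      <p>With the reported parameters, a query detected 100 times has 1.0807 + 0.01 = 1.0907 subtracted from each of its filtered scores.</p>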
      <p>
        Results for the required runs [1] are given in Table 2. The
PAKWS approach has two versions (with and without score
calibration). The GMM/HMM term
modeling approach clearly outperforms the other two by a wide
margin for unseen queries/data, even with score calibration
applied to the PAKWS approach. We attribute this
to the following: (1) the KLT statistics, computed from dev data and
applied to queries and data, play the role of an “adaptation”
towards the target domain, which differs from the domain used to
train the phone estimators from which the 3-state phone
posteriors are computed; (2) the use of a single example to
train the query model is more robust against uncertainties
than the set of features itself (used in the DTW approach)
and than the phone transcription obtained from phone decoding (used
in the PAKWS approach); (3) the GMM/HMM approach relies on a prior
combination of the most relevant features after KLT,
as opposed to the PAKWS approach, which is based on a
posterior combination of the detections obtained from each
individual AKWS system; and (4) the comparison of the MTWV
(pooled) and UBTWV (non-pooled) for PAKWS, which are respectively
0.133 and 0.253 for Qdev-Ddev, 0.002 and 0.056 for Qeval-Ddev,
0.030 and 0.157 for Qdev-Deval, and 0.033 and 0.223 for Qeval-Deval,
suggests that the PAKWS system is the most
sensitive to data mismatch.
      </p>
      <p>
        The GMM/HMM-based term
modeling approach is less sensitive to data mismatch, with the
following MTWV and UBTWV values: Qdev-Ddev 0.103 and 0.238, Qeval-Ddev
0.019 and 0.035, Qdev-Deval 0.010 and 0.179, and
Qeval-Deval 0.131 and 0.267, respectively. For DTW, a pattern similar
to that of GMM/HMM is observed: Qdev-Ddev
0.020 and 0.106, Qeval-Ddev 0 and 0.011, Qdev-Deval 0 and
0.099, and Qeval-Deval 0.014 and 0.055.
      </p>
    </sec>
    <sec id="sec-9">
      <title>6. CONCLUSIONS</title>
      <p>Our GMM/HMM-based term modeling approach achieves
the best performance, whereas the other two, PAKWS and
DTW, fail when facing the language-independence issue:
the former because of the unreliable phone transcriptions it derives,
and the latter because the phone posteriors, used by
themselves, are “meaningless”. Future work will investigate new
features to enhance the performance of the best approach.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajput</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <article-title>Spoken web search</article-title>
          . In MediaEval 2011 Workshop, Pisa, Italy,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          et al.
          <article-title>Towards lower error rates in phoneme recognition</article-title>
          .
          <source>In Proc. of TSD</source>
          , pages
          <fpage>465</fpage>
          -
          <lpage>472</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          .
          <article-title>Hybrid word-subword spoken term detection</article-title>
          .
          <source>PhD thesis</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          et al.
          <article-title>Phoneme based acoustics keyword spotting in informal continuous speech</article-title>
          .
          <source>LNAI</source>
          ,
          <volume>3658</volume>
          :
          <fpage>302</fpage>
          -
          <lpage>309</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tejedor</surname>
          </string-name>
          et al.
          <article-title>Novel methods for query selection and query combination in query-by-example spoken term detection</article-title>
          .
          <source>In Proc. of SSCS</source>
          , pages
          <fpage>15</fpage>
          -
          <lpage>20</lpage>
          , Florence, Italy,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>