<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The L2F Spoken Web Search system for MediaEval 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Abad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramón F. Astudillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabel Trancoso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INESC-ID Lisboa / Instituto Superior Técnico</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>The primary system developed by INESC-ID's Spoken Language Systems Laboratory (L2F) for the Spoken Web Search task of the MediaEval 2013 evaluation campaign consists of the fusion of six individual sub-systems exploiting three different language-dependent phonetic classifiers. For each phonetic classifier, an acoustic keyword spotting (AKWS) sub-system based on connectionist speech recognition and a dynamic time warping (DTW) based sub-system have been developed. The diversity in terms of phonetic classifiers and methods, together with the efficient fusion and calibration approach applied to the heterogeneous sub-systems, are the key elements of the L2F submission. Besides the primary submission, two additional systems based on the fusion of only the AKWS and only the DTW sub-systems have been developed for comparison purposes. A final multi-site system formed by the fusion of the L2F and GTTS primary submissions has also been submitted to explore the potential of the fusion approach for very heterogeneous systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        This document introduces the Spoken Web Search (SWS) systems
developed by INESC-ID's Spoken Language Systems
Laboratory (L2F) for the MediaEval 2013 campaign. The
task targeted in this challenge is query-by-example spoken
term detection. Detailed information about the task and
the data used can be found in the evaluation plan [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One
primary and three contrastive systems (one of them in
collaboration with another participating group) were
submitted. The primary system consists of the fusion of six
individual sub-systems. The proposed systems present three
main novelties with respect to the systems developed for
the previous year's evaluation campaign [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: 1) the number of
language-dependent phonetic networks has been limited to
three; 2) DTW-based sub-systems exploiting log-posterior
features have been incorporated; and 3) a recently proposed
method for discriminative calibration and fusion of
heterogeneous spoken term detection systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] has been applied.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. THE L2F SWS SYSTEM DESCRIPTION</title>
      <p>Six sub-systems form the core of the L2F SWS system,
exploiting three different language-dependent phonetic
networks trained for European Portuguese (pt), Brazilian
Portuguese (br) and European Spanish (es). The phonetic
networks are used either as acoustic models in acoustic KWS
based on hybrid connectionist methods or as a feature
extraction component for DTW-based term detection.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Phonetic network classifiers</title>
      <p>
        L2F systems exploit multi-layer perceptron (MLP) networks
that are part of our in-house hybrid connectionist ASR
system. The phonetic class posterior probabilities are in fact
the result of the combination of four MLP outputs trained
with Perceptual Linear Prediction features (PLP, 13 static
+ first derivative), PLP with log-RelAtive SpecTrAl speech
processing features (PLP-RASTA, 13 static + first
derivative), Modulation SpectroGram features (MSG, 28 static)
and ETSI Advanced Front-End features (ETSI, 13 static
+ first and second derivatives). The language-dependent
MLP networks were trained using different amounts of
annotated data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Each MLP network is characterized by
the size of its input layer that depends on the particular
parametrization and the frame context size (13 for PLP,
PLP-RASTA and ETSI; 15 for MSG), the number of units
of the two hidden layers (500), and the size of the output
layer. In this case, only monophone units are modelled,
resulting in MLP networks of 39 (38 phonemes + 1 silence)
soft-max outputs in the case of pt, 40 for br (39 phonemes
+ 1 silence) and 30 for es (29 phonemes + 1 silence).
      </p>
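The text does not specify how the four MLP output streams are merged into a single posterior estimate; a minimal sketch, assuming a simple geometric (log-domain) average of the per-frame phone posteriors (the function name and the combination rule are assumptions, not the authors' implementation):

```python
import numpy as np

def combine_posteriors(streams):
    """Merge per-frame phone posteriors from several MLPs into one estimate.

    streams: list of (num_frames, num_phones) arrays, one per feature set
    (e.g. PLP, PLP-RASTA, MSG, ETSI). Each row of each array sums to 1,
    as produced by a soft-max output layer.
    """
    # Average in the log domain (geometric mean), one common way to merge
    # probabilistic classifier outputs; arithmetic averaging also works.
    log_mean = np.mean([np.log(np.clip(s, 1e-12, 1.0)) for s in streams], axis=0)
    merged = np.exp(log_mean)
    # Renormalize so each frame is again a proper distribution.
    return merged / merged.sum(axis=1, keepdims=True)

# Toy usage: four streams, 5 frames, 39 classes (pt network: 38 phones + silence)
rng = np.random.default_rng(0)
streams = []
for _ in range(4):
    raw = rng.random((5, 39))
    streams.append(raw / raw.sum(axis=1, keepdims=True))
post = combine_posteriors(streams)
```
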
    </sec>
    <sec id="sec-4">
      <title>2.2 Acoustic KWS systems</title>
      <p>
        AKWS sub-systems exploit the phonetic networks as
acoustic models for both phonetic tokenization and query search
based on hybrid ANN/HMM approaches for ASR [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
decoder used is based on a weighted finite-state transducer
(WFST) approach to large vocabulary speech recognition.
First, the phonetic transcription of each spoken query is
obtained for every sub-system using a phone-loop grammar.
Simple 1-best phoneme chain output has been used. Then,
search is carried out with a sliding window of 5 seconds (2.5
seconds time shift) using an equally-likely 1-gram language
model formed by the target query and a competing speech
background model. On the one hand, keyword/query
models are described by the sequence of phonetic units obtained
in the tokenization. On the other hand, the likelihood of
a background speech unit representing "general speech" is
estimated based on the other phonetic classes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The
output score for each candidate detection is computed as the
average of the phonetic log-likelihood ratios that form the
detected query term. More details can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
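The scoring rule in the last sentence can be illustrated with a short sketch (the function is hypothetical, assuming per-frame log-likelihoods for the detected query's phone models and for the background speech model are already available from the decoder):

```python
import numpy as np

def detection_score(query_frame_loglik, background_frame_loglik):
    """Score a candidate detection as the average per-frame
    log-likelihood ratio between the query phone models and the
    competing background speech model, as described above."""
    llr = np.asarray(query_frame_loglik) - np.asarray(background_frame_loglik)
    return float(np.mean(llr))

# Toy usage: a 4-frame detected query term
score = detection_score([-1.0, -0.5, -0.8, -0.2], [-1.5, -1.0, -1.2, -0.9])
```
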
    </sec>
    <sec id="sec-5">
      <title>2.3 Dynamic Time Warping systems</title>
      <p>DTW sub-systems use the language-dependent phonetic
networks to extract log-posterior features. The silence class
of the phonetic network is also used for voice activity
detection. To this end, the segments identified as silence at
the beginning and end of each query and document are
removed. For each query-document pair, N Euclidean-distance
based DTWs are run from N candidate starting positions in
the document. To select the candidate positions, the
query-document Euclidean distance matrix of the DTW is used.
The minimum of each column of the matrix represents the
minimum distance among all query feature vectors to a given
document feature vector. The average of these minima on
a sliding window of query size is used as an approximation
of DTW without the warping constraints, from which the
best N candidates are selected. The number of candidates
N was made equal to the length of the document in
feature vectors divided by 100 with a minimum of 100
candidates. In a second stage, DTWs of the size of the query
are evaluated at each of the N candidate positions, and
the three candidates with the lowest normalized cumulative
distance, separated by at least 0.5 seconds, are kept. The
reduction of the search space to N candidates as explained
above provided a reduction of the search time by a factor
of around 5, while having a minimal impact on the
performance. It should be noted that the DTW, including the
distance matrix, was computed using the R programming
language, while the candidate selection and remaining tasks
were implemented in Python¹. This framework benefited
particularly from the proposed candidate selection scheme.</p>
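The candidate selection described above can be sketched in NumPy (a minimal illustration, not the authors' R/Python code; the function name, the use of np.convolve for the sliding average, and the toy sizes are assumptions):

```python
import numpy as np

def select_candidates(dist, query_len, min_candidates=100):
    """Pick starting positions for the full DTW runs.

    dist: (query_len, doc_len) Euclidean distance matrix between query
    and document feature vectors. For each document frame, take the
    minimum distance over all query frames (column minimum); average
    these minima over a query-sized sliding window as a warping-free
    approximation of DTW, and keep the N lowest-scoring window starts.
    """
    doc_len = dist.shape[1]
    col_min = dist.min(axis=0)                 # best match per document frame
    window = np.ones(query_len) / query_len
    # 'valid' mode yields one score per possible query-sized window start
    approx = np.convolve(col_min, window, mode="valid")
    # N = document length in frames divided by 100, floored at min_candidates
    n = max(min_candidates, doc_len // 100)
    n = min(n, approx.size)
    return np.argsort(approx)[:n]              # the N most promising starts

# Toy usage: a random 20-frame query against a 3000-frame document
rng = np.random.default_rng(0)
dist = rng.random((20, 3000))
starts = select_candidates(dist, query_len=20)
```

Full DTW alignments then need to be evaluated only at these N positions, which is the source of the roughly fivefold speed-up reported above.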
    </sec>
    <sec id="sec-6">
      <title>2.4 Discriminative calibration and fusion</title>
      <p>
        The combination of systems is based on a recently
proposed method for discriminative calibration/fusion of
heterogeneous spoken term detection (STD) systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]².
Under this approach, missing scores for systems that do not
detect a given candidate are hypothesized based on heuristics.
In this way, the original problem of several unaligned
detection candidates is converted into a verification task. As in
other verification tasks, system weights and offsets are then
estimated through linear logistic regression. As a result,
the combined scores are well calibrated, and the detection
threshold is automatically given by application parameters
(priors and costs). The method permits easy integration
with majority voting schemes, and it is convenient when scores
from heterogeneous systems lie in the same range (we
apply a per-query zero-mean and unit-variance normalization,
q-norm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Moreover, the maximum number of detection
candidates for a certain query provided by any sub-system
was limited to 200 before score normalization and fusion.
      </p>
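The per-query q-norm mentioned above can be sketched as follows (a minimal illustration; the dict-based interface is an assumption):

```python
import numpy as np

def q_norm(scores_by_query):
    """Per-query zero-mean, unit-variance score normalization (q-norm),
    applied to each sub-system's scores before fusion.

    scores_by_query: dict mapping a query id to the array of detection
    scores that one sub-system produced for that query.
    """
    normalized = {}
    for query, scores in scores_by_query.items():
        s = np.asarray(scores, dtype=float)
        std = s.std()
        if std == 0.0:
            std = 1.0  # degenerate case: all scores for the query identical
        normalized[query] = (s - s.mean()) / std
    return normalized

# Toy usage: two queries whose raw scores live on very different ranges
out = q_norm({"q1": [0.2, 0.4, 0.9], "q2": [10.0, 30.0, 50.0]})
```

After normalization, scores from sub-systems with very different dynamic ranges become directly comparable, which is what makes the linear fusion well behaved.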
    </sec>
    <sec id="sec-7">
      <title>3. SUBMITTED SYSTEMS AND RESULTS</title>
      <p>
        One primary and two contrastive "on-time" systems were
submitted. The primary system consists of the fusion of the
six sub-systems previously described, while the contrastive1
and contrastive2 submissions correspond to the fusion of
only the DTW and only the AKWS sub-systems,
respectively. Additionally, a "late" contrastive3 system based on
the fusion of the primary systems of the L2F and GTTS [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
teams was also submitted. All the submitted systems are
expected to generate well-calibrated log-likelihood ratios, such
that the theoretical minimum expected cost Bayes
threshold can be used (θ_Bayes = log[C_fa (1 - P_target) / (C_miss P_target)], see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for more details).
¹ https://www.l2f.inesc-id.pt/wiki/index.php/DTW
² https://www.l2f.inesc-id.pt/wiki/index.php/STDfusion
      </p>
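As a worked illustration of that threshold, assuming the standard minimum-expected-cost decision rule for well-calibrated log-likelihood-ratio scores (the parameter values below are purely illustrative, not the official task settings):

```python
import math

def bayes_threshold(p_target, c_miss, c_fa):
    """Minimum expected cost Bayes decision threshold for well-calibrated
    log-likelihood-ratio scores: accept a candidate detection when its
    LLR exceeds log(C_fa (1 - P_target) / (C_miss P_target)).
    Standard detection theory; no tuning on development data needed."""
    return math.log(c_fa * (1.0 - p_target) / (c_miss * p_target))

# Illustrative application parameters only
theta = bayes_threshold(p_target=0.01, c_miss=1.0, c_fa=0.1)
```

Because the fused scores are calibrated, this threshold follows directly from the priors and costs, with no tuning of an operating point on development data.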
    </sec>
    <sec id="sec-8">
      <title>4. ACKNOWLEDGEMENTS</title>
      <p>This work was partially funded by the DIRHA European
project (FP7-ICT-2011-7-288121) and the Portuguese
Foundation for Science and Technology (FCT), through the projects
PEst-OE/EEI/LA0021/2013 and PTDC/EIA-CCO/122542/
2010, and the grant number SFRH/BPD/68428/2010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abad</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Astudillo</surname>
          </string-name>
          .
          <article-title>The L2F Spoken Web Search system for Mediaeval 2012</article-title>
          . In MediaEval 2012 Workshop, Pisa, Italy, October 4-5
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luque</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Trancoso</surname>
          </string-name>
          .
          <article-title>Parallel Transformation Network features for Speaker Recognition</article-title>
          . In ICASSP, May
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pompili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Costa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Trancoso</surname>
          </string-name>
          .
          <article-title>Automatic word naming recognition for treatment and assessment of aphasia</article-title>
          .
          <source>In Interspeech</source>
          , September
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bordel</surname>
          </string-name>
          .
          <article-title>On the Calibration and Fusion of Heterogeneous Spoken Term Detection Systems</article-title>
          .
          <source>In Interspeech</source>
          , August 25-29
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          .
          <article-title>The Spoken Web Search Task</article-title>
          . In MediaEval 2013 Workshop, October 18-19
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Morgan</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Bourlard</surname>
          </string-name>
          .
          <article-title>An introduction to hybrid HMM/connectionist continuous speech recognition</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          ,
          <volume>12</volume>
          (
          <issue>3</issue>
          ):
          <fpage>25</fpage>
          -
          <lpage>42</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          .
          <article-title>MediaEval 2013 Spoken Web Search Task: System Performance Measures</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bordel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Diez</surname>
          </string-name>
          .
          <article-title>GTTS Systems for the SWS Task at MediaEval 2013</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>